Press "Enter" to skip to content

Curated SQL Posts

Broken References In SSISDB

Andy Leonard explains how broken environment references can come into being within the SSIS Catalog:

If the reference was broken after the SSIS package execution was scheduled, we may see an error similar to that shown below in the SQL Agent log for the job step that attempted to execute the SSIS package:

Failed to execute IS server package because of error 0x80131904. Server: vmSql16\Test, Package path: \SSISDB\Test\ParametersTest\SensitiveTest.dtsx, Environment reference Id: 35.  Description: The environment ‘env2’ does not exist or you have not been granted the appropriate permissions to access it.

Andy has an explanation of what these are, how you might find them, and how to fix them.

Comments closed

Sankey Custom Visual

Devin Knight looks at the Sankey visual in Power BI:

In this module you will learn how to use the Sankey Power BI Custom Visual.  The Sankey is a type of diagram that visualizes the flow of data between a source and destination.

Sankey diagrams are among the most information-dense diagrams out there.  They aren’t general-purpose diagrams, but for someone willing to take the time to unpack them, they can be quite informative.

Comments closed

R + Power Query

Ryan Wade makes his argument that R can be more powerful than M inside Power Query:

I want to leave you with two more things. If you look at the trade balance data set you will notice that it is not in a good format for data analysis. Here is a link to the file if you want to take a closer look. When you are doing data analysis you want your data to be in a “tidy” format. A “tidy” format means that each column represents a variable and each row represents an observation. To make this data set “tidy” you need to reformat the data into the following format: Country, Year, Trade Balance, Exports, and Imports.

This was an interesting example.

Comments closed

Build A Python App Which Connects To SQL Server

Steve Jones walks through a Python tutorial:

However, there were other errors, which I suspect are related to Python 2.7 v Pyhton 3.5. Rather than solve those, I went on to the columnstore demo. In this, you create a table with 5mm rows and then run a query against it from Python. I did that, then created the columnstore index, then ran it again. The results are below.

And within an hour or so of starting, Steve has hit the 2.x vs 3.x mess in Python.

Comments closed

Azure TCO Calculator

James Serra links to the Azure Total Cost of Ownership calculator:

For a long time clients would ask me how to determine the cost savings by migrating their applications and databases to Azure.  I never had a good answer until now: The Total Cost of Ownership (TCO) Calculator.  Now in preview, just provide a brief description of your on-premises environment to get an instant estimate of the cost savings you can realize by migrating your application workloads to Microsoft Azure

I think this is a useful start, though a big part of the value I see in Azure is moving certain things from IaaS to PaaS, like getting rid of the web servers and going to Azure Websites.

Comments closed

Constrained Delegation

Regis Baccaro shows how to allow non-domain admins to configure Kerberos Constrained Delegation:

Now I need to add some special permissions to computer objects, so I click Add again. Once again, I’ll select the DBA group, then I need to switch to Descendant Computer objects. I click Write and then scroll down until I see Validated write to service principal name. I’ll click the box to enable it, and then OK, OK, and OK.

The end result looks like below :

2 permissions for DBA group,

  • All descendants objects : Write all properties

  • Descendant computer objects : Validate write to Service Principal Name

Regis has the whole process documented well, so check it out.

Comments closed

Avoiding Statistical Mistakes

Adrian Sampson explains some common mistakes in statistical analysis, particularly in computer science papers:

It’s tempting to think, when p \ge \alphapα, that you’ve found the opposite thing from the p < \alphap<αcase: that you get to conclude that there is no statistically significant difference between the two averages. Don’t do that!

Simple statistical tests like the tt-test only tell you when averages are different; they can’t tell you when they’re the same. When they fail to find a difference, there are two possible explanations: either there is no difference or you haven’t collected enough data yet. So when a test fails, it could be your fault: if you had run a slightly larger experiment with a slightly larger NN, the test might have successfully found the difference. It’s always wrong to conclude that the difference does not exist.

It’s an interesting read.  H/T Emmanuelle Rieuf.

Comments closed

Log Aggregation With Kafka And Redis

Asaf Yigal has a two-part series on comparing Apache Kafka and Redis for moving log events into Elasticsearch.  Part 1 explains the technologies:

Redis is a bit different from Kafka in terms of its storage and various functionalities. At its core, Redis is an in-memory data store that can be used as a high-performance database, a cache, and a message broker. It is perfect for real-time data processing.

The various data structures supported by Redis are strings, hashes, lists, sets, and sorted sets. Redis also has various clients written in several languages which can be used to write custom programs for the insertion and retrieval of data. This is an advantage over Kafka since Kafka only has a Java client. The main similarity between the two is that they both provide a messaging service. But for the purpose of log aggregation, we can use Redis’ various data structures to do it more efficiently.

Part 2 compares the two technologies and explains which works better when:

Kafka heavily relies on the machine memory (RAM). As we see in the previous graph, utilizing the memory and storage is an optimal way to maintain a steady throughput. Its performance depends on the data consumption rate. In the case that consumers don’t consume data fast enough, Kafka will have to read from a disk and not from memory which will slow down its performance.

As you might expect, the answer for which technology to use is “it depends.”

Comments closed

Apache Ranger On ElasticMapReduce

Varun Rao explains role-based access control using Apache Ranger on Amazon ElasticMapReduce:

Using the HUE SQL Editor, execute the following query.

These queries use external tables, and Hive leverages EMRFS to access the data stored in S3. Because HiveServer2 (where Hue is submitting these queries) is checking with Ranger to grant or deny before accessing any data in S3, you can create fine-grained SQL-based permissions for users even though there is a single EC2 role specified for the cluster (which is used by all requests the cluster makes to S3). For more information, see Additional Features of Hive on Amazon EMR.

If your job includes securing a Hadoop cluster, this is a nice read, even if you don’t use EMR.

Comments closed