Custom Visuals: Chord

Devin Knight has part nine of his custom visualization series:

In this module you will learn how to use the Chord Power BI Custom Visual.  Chord diagrams show directed relationships among a group of entities using colored lines (chords); this allows for an easy representation of correlating data.

Chord diagrams, when done right, can be extremely informative.  The problem is that they’re also really confusing to the uninitiated.

Error Severity Levels Greater Than 18

Manoj Pandey was debugging an Informatica ETL and got back an uncommon error message:

So, to identify the cause I tried to execute the above MERGE statement directly and I got the same error:

EXEC spMergeTables 'STG.ABCtblXYZ','ABC.tblXYZ'

(0 row(s) affected)
Msg 2754, Level 16, State 1, Procedure spMergeTables, Line 107
Error severity levels greater than 18 can only be specified by members of the sysadmin role, using the WITH LOG option.

This is a case in which an immediate error obscured the actual error.

Stopping Integration Services Packages

Andy Leonard explains various methods of stopping SSIS packages in progress:

Once you have the operation_id value, simply plug it into the stop_operation stored procedure and execute:

exec SSISDB.[catalog].stop_operation @operation_id = 24

The stop_operation stored procedure runs for a few seconds (typically less than 15 seconds) and stops the execution of the SSIS package. SSIS packages that have been stopped are listed with “Canceled” status. You can see operation_id 19 was stopped in the screenshots shown above.

Read on for more.

Groups Of Basic Availability Groups

Russ Thomas looks into whether you can combine Basic Availability Groups in a way that mimics Enterprise Edition’s Availability Groups functionality:

For a recent project that required HA/DR but couldn’t justify Enterprise edition we decided to take the plunge on 2016’s Basic Availability Groups.

For a quick rundown of the watered down feature set – basically what you don’t get with a Basic Availability Group (BAG) – the major points are as follows:

  • You can only have 2 nodes.

  • Only one database can be in the group.

  • You can not have the secondary be in read only.

  • You can not take backups from the secondary.

The answer is “yes” but it’s not easy.  Read on for more.

Securing Kafka Streams

Michael Noll shows security features of Kafka Streams:

First, which security features are available in Apache Kafka, and thus in Kafka Streams?  Kafka Streams supports all the client-side security features in Apache Kafka.  In this short blog post we cannot cover these client-side security features in full detail, so I recommend reading the Kafka Security chapter in the Confluent Platform documentation and our previous blog post Apache Kafka Security 101 to familiarize yourself with the security features that are currently available in Apache Kafka.

That said, let me highlight a couple of important Kafka security features that are essential for implementing robust data infrastructures, whether these are used for building horizontal services at larger companies, for multi-tenant infrastructures (e.g. microservices), or for shared platforms such as in the Internet of Things.  Later on I will then demonstrate an example application where we use some of these security features in Kafka Streams.

It’s important to secure sensitive data, even in “transient” media like Kafka (though the transience of Kafka is user-definable, so “It’ll go away soon” isn’t really a good argument).

LatchBase Count

Ewald Cress looks at the Count member of the LatchBase class:

Here is the bit-level layout of Count to the level that I currently understand it. This has received some airplay by Bob Ward (thanks, Bob!), and I’ll be building on that. Count is a 64-bit integer broken into multiple bit fields; aside from more compact storage, the rationale for the bit packing is that the whole item can be subject to atomic updates without “external” locking, much as in the SOS_RWLock. Regarding the unlabelled bits, I know for a fact that bit 5 is used, but not yet sure of the semantics.

After spending several posts on the foundation structures, Ewald is moving up the layers of internals, getting closer to concepts we think about on a day-to-day basis.
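
To make the bit-packing idea concrete, here is a minimal Python sketch of reading and writing fields inside one 64-bit word. The field names, offsets, and widths are made up for illustration; they are not the actual LatchBase layout, which the excerpt only partly describes.

```python
MASK64 = (1 << 64) - 1

# Hypothetical layout for illustration only: name -> (bit offset, width).
# These are NOT the real LatchBase Count fields.
FIELDS = {
    "flag_a": (0, 1),
    "flag_b": (1, 1),
    "counter": (8, 16),
}


def get_field(word, name):
    """Extract one bit field from the packed 64-bit word."""
    offset, width = FIELDS[name]
    return (word >> offset) & ((1 << width) - 1)


def set_field(word, name, value):
    """Return a new 64-bit word with one field replaced."""
    offset, width = FIELDS[name]
    if value < 0 or value >> width:
        raise ValueError(f"{value} does not fit in {width} bits")
    mask = ((1 << width) - 1) << offset
    return ((word & ~mask) | (value << offset)) & MASK64


word = set_field(0, "counter", 3)
word = set_field(word, "flag_b", 1)
print(hex(word), get_field(word, "counter"), get_field(word, "flag_b"))
```

Because every field lives in the same integer, a real implementation can compute the new word and publish it with a single compare-and-swap, which is the "atomic updates without external locking" point in the excerpt. Python has no such primitive, so the sketch only shows the packing.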

Trace Flag 2389

Erin Stellato looks at using Trace Flag 2389 with the new cardinality estimator in SQL Server 2014:

To summarize, when using compatibility mode 110 or below, trace flag 2389 works like it always has.  But when using compatibility mode 120 or higher, and thus the new CE, the estimates are not the same compared to the old CE, and in this specific case, are not that different whether using the trace flag or not.

So what should you do?  Test, as always.  I haven’t found anything documented in MSDN that states that trace flag 2389 is not supported with compatibility mode 120 and higher, nor have I found anything that documents a change in behavior.  I do find it very interesting that the estimates are different (in this case much lower) with the new CE.  That could potentially be an issue, but there are multiple factors in play when it comes to estimates, and this was a very simple query (one table, one predicate).  In this case, the estimate is way off (4920 rows versus the 22,595 rows for the June 5 date).

I highly recommend reading this article.

Using The Spark-HBase Connector

Anunay Tiwari shows how to use the Spark-HBase connector in HDInsight:

The Spark-HBase Connector provides an easy way to store and access data from HBase clusters with Spark jobs. HBase is very successful at the highest levels of data scale. Thus, existing Spark customers should definitely explore this storage option. Similarly, if customers already have HDInsight HBase clusters and want to access their data from Spark jobs, there is no need to move the data to any other storage medium. In both cases, the connector will be extremely useful.

I’m not the biggest fan of HBase, but if it’s part of your environment, you should definitely look at this Spark connector.

Range And Variance

Mala Mahadevan looks at calculating range, variance, and standard deviation in R and T-SQL:

The first and most common measure of dispersion is called ‘Range’. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two and therefore how wide the dataset is in terms of its values. It is, however, quite misleading when you have outliers in the data. If you have one value that is very large or very small, it can skew the range, which then does not really mean you have values spanning the minimum to the maximum.

To reduce the effect of outliers, a second variation of the range, called the Inter-Quartile Range (IQR), is used. The IQR is calculated by sorting the values in ascending order and dividing the dataset into 4 equal parts. The maximum values of the first and third parts are taken and then subtracted from each other. The IQR therefore looks at the spread of the middle half of the data, so a single extreme value has little effect on it.

Just like her previous post, this one also includes an example built for SQL Server R Services.
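
The post works in R and T-SQL; as a rough equivalent, here is a small Python sketch of the same measures using only the standard library. The sample values are made up, and `statistics.quantiles` interpolates between neighboring values, so its quartiles can differ slightly from the "maximum of each part" description above.

```python
import statistics


def dispersion(values):
    """Return range, IQR, sample variance, and sample standard deviation."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "variance": statistics.variance(ordered),
        "stdev": statistics.stdev(ordered),
    }


# Made-up data with one obvious outlier (100).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
result = dispersion(sample)
print(result["range"], result["iqr"])
```

With the outlier in place, the range is 99 while the IQR stays at 5.0, which is exactly the contrast the excerpt describes.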

Storytelling With Data

Vik Paruchuri walks through exploratory data analysis using New York City schools data:

Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.

We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:

  • Group full by school district.

  • Compute the average of each column for each school district.

  • Convert the school_dist field to remove leading 0s, so we can match our geographic district data.

Also check out part 1 if you missed it.
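
The three steps in that list can be sketched in plain Python as follows. This is not the author's code: the original works on a data set called full, while the rows, the `sat_score` values, and the function name `district_averages` here are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def district_averages(rows, key="school_dist"):
    """Group rows by district, average every other column, strip leading 0s."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    averages = {}
    for district, members in groups.items():
        columns = [c for c in members[0] if c != key]
        # "02" -> "2" so the key matches district data without zero padding.
        label = district.lstrip("0") or "0"
        averages[label] = {c: mean(m[c] for m in members) for c in columns}
    return averages


# Made-up rows standing in for the real schools data.
rows = [
    {"school_dist": "01", "sat_score": 1200},
    {"school_dist": "01", "sat_score": 1400},
    {"school_dist": "02", "sat_score": 1100},
]
print(district_averages(rows))
```

The sketch assumes every column other than the district key is numeric; real data would need non-numeric columns filtered out first.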

