

Securing Kafka Streams

Michael Noll shows security features of Kafka Streams:

First, which security features are available in Apache Kafka, and thus in Kafka Streams?  Kafka Streams supports all the client-side security features in Apache Kafka.  In this short blog post we cannot cover these client-side security features in full detail, so I recommend reading the Kafka Security chapter in the Confluent Platform documentation and our previous blog post Apache Kafka Security 101 to familiarize yourself with the security features that are currently available in Apache Kafka.

That said, let me highlight a couple of important Kafka security features that are essential for implementing robust data infrastructures, whether these are used for building horizontal services at larger companies, for multi-tenant infrastructures (e.g. microservices), or for shared platforms such as in the Internet of Things.  Later on I will then demonstrate an example application where we use some of these security features in Kafka Streams.

It’s important to secure sensitive data, even in “transient” media like Kafka (though the transience of Kafka is user-definable, so “It’ll go away soon” isn’t really a good argument).
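To give a flavor of the client-side configuration, here’s a minimal sketch of an SSL-encrypted producer using the kafka-python client; the broker address and certificate paths are placeholders, and Kafka Streams itself (being a Java library) takes the equivalent settings as configuration properties:

```python
from kafka import KafkaProducer  # pip install kafka-python

# Placeholder broker and certificate paths; the broker must expose an
# SSL listener for this to work.
producer = KafkaProducer(
    bootstrap_servers="kafka1.example.com:9093",
    security_protocol="SSL",                  # encrypt client <-> broker traffic
    ssl_cafile="/etc/security/ca.pem",        # CA that signed the broker certs
    ssl_certfile="/etc/security/client.pem",  # client cert, for client authentication
    ssl_keyfile="/etc/security/client.key",
)
producer.send("secure-topic", b"hello")
producer.flush()
```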


LatchBase Count

Ewald Cress looks at the Count member of the LatchBase class:

Here is the bit-level layout of Count to the level that I currently understand it. This has received some airplay by Bob Ward (thanks, Bob!), and I’ll be building on that. Count is a 64-bit integer broken into multiple bit fields; aside from more compact storage, the rationale for the bit packing is that the whole item can be subject to atomic updates without “external” locking, much as in the SOS_RWLock. Regarding the unlabelled bits, I know for a fact that bit 5 is used, but I’m not yet sure of the semantics.

After spending several posts on the foundation structures, Ewald is moving up the layers of internals, getting closer to concepts we think about on a day-to-day basis.
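As a toy illustration of the bit-packing idea, here’s a hypothetical Python sketch; the real field boundaries of Count are only partially documented, so the layout below is made up:

```python
# Hypothetical layout: five low-order flag bits, with a count packed
# above them. The real LatchBase::Count field boundaries are not all
# publicly documented.
FLAG_BITS = 5
FLAG_MASK = (1 << FLAG_BITS) - 1

def pack(count, flags):
    return (count << FLAG_BITS) | (flags & FLAG_MASK)

def unpack(word):
    return word >> FLAG_BITS, word & FLAG_MASK

word = pack(count=3, flags=0b00101)
assert unpack(word) == (3, 0b00101)

# Because count and flags share one 64-bit word, SQL OS can update both
# together with a single interlocked compare-exchange rather than taking
# a separate lock -- the same trick as in the SOS_RWLock.
```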


Trace Flag 2389

Erin Stellato looks at using Trace Flag 2389 with the new cardinality estimator in SQL Server 2014:

To summarize, when using compatibility mode 110 or below, trace flag 2389 works like it always has.  But when using compatibility mode 120 or higher, and thus the new CE, the estimates are not the same compared to the old CE, and in this specific case, are not that different whether using the trace flag or not.

So what should you do?  Test, as always.  I haven’t found anything documented in MSDN that states that trace flag 2389 is not supported with compatibility mode 120 and higher, nor have I found anything that documents a change in behavior.  I do find it very interesting that the estimates are different (in this case much lower) with the new CE.  That could potentially be an issue, but there are multiple factors in play when it comes to estimates, and this was a very simple query (one table, one predicate).  In this case, the estimate is way off (4,920 rows versus the 22,595 rows for the June 5 date).

I highly recommend reading this article.
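If you want to run the comparison yourself, the per-query form of the trace flag is easy to script. Here’s a rough pyodbc sketch against a made-up table (note that QUERYTRACEON requires sysadmin rights, and that the estimates themselves live in the execution plans, not the query results):

```python
import pyodbc

# Made-up server, database, and table names.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=.;DATABASE=TestDB;Trusted_Connection=yes"
)
cur = conn.cursor()

query = "SELECT COUNT(*) FROM dbo.Orders WHERE OrderDate > '2016-06-05'"

# Run the query with and without the trace flag, then compare the
# estimated row counts in the two execution plans.
cur.execute(query)
print("without flag:", cur.fetchone()[0])

cur.execute(query + " OPTION (QUERYTRACEON 2389);")  # requires sysadmin
print("with flag:   ", cur.fetchone()[0])
```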


Using The Spark-HBase Connector

Anunay Tiwari shows how to use the Spark-HBase connector in HDInsight:

The Spark-HBase Connector provides an easy way to store and access data in HBase clusters from Spark jobs. HBase excels at the highest levels of data scale, so existing Spark customers should definitely explore this storage option. Similarly, customers who already have HDInsight HBase clusters and want to access their data from Spark jobs have no need to move the data to any other storage medium. In both cases, the connector will be extremely useful.

I’m not the biggest fan of HBase, but if it’s part of your environment, you should definitely look at this Spark connector.
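For a sense of the API, reading an HBase table into a Spark DataFrame looks roughly like this in PySpark, assuming the shc connector package is on the Spark classpath; the catalog below is a made-up mapping:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-read").getOrCreate()

# Hypothetical catalog mapping an HBase table to a DataFrame schema:
# one row key column plus one column from the "Personal" column family.
catalog = """{
  "table": {"namespace": "default", "name": "Contacts"},
  "rowkey": "key",
  "columns": {
    "rowkey": {"cf": "rowkey",   "col": "key",  "type": "string"},
    "name":   {"cf": "Personal", "col": "Name", "type": "string"}
  }
}"""

df = (spark.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())
df.show()
```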


Range And Variance

Mala Mahadevan looks at calculating range, variance, and standard deviation in R and T-SQL:

The first and most common measure of dispersion is called ‘Range’. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two, and therefore how wide the dataset is in terms of its values. It is, however, quite misleading when you have outliers in the data. A single value that is very large or very small can skew the Range, so a wide Range does not necessarily mean you have values spanning the minimum to the maximum.

To reduce this kind of issue with outliers, a second variation of the range, called the Inter-Quartile Range, or IQR, is used. The IQR is calculated by sorting the values in ascending order and dividing the dataset into four equal parts. The maximum values of the first and third parts are taken and subtracted from each other. The IQR ensures that you are looking at near-top and near-bottom values, so the number it gives more reliably spans the bulk of the data.
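Her examples are in R and T-SQL; as a quick illustration of the same measures, here’s a NumPy sketch showing how a single outlier inflates the range while barely moving the IQR:

```python
import numpy as np

data = np.array([3, 5, 7, 8, 9, 11, 13, 15, 102])  # 102 is an outlier

value_range = data.max() - data.min()    # 99 -- dominated by the outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                            # 6.0 -- barely affected by it
variance = data.var(ddof=1)              # sample variance
std_dev = data.std(ddof=1)               # sample standard deviation

print(value_range, iqr, variance, std_dev)
```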

Just like her previous post, this one also includes an example built for SQL Server R Services.


Storytelling With Data

Vik Paruchuri walks through exploratory data analysis using New York City schools data:

Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.

We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:

  • Group full by school district.

  • Compute the average of each column for each school district.

  • Convert the school_dist field to remove leading 0s, so we can match our geographic district data.
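A pandas sketch of those three steps might look like the following, with a toy stand-in for the article’s full DataFrame:

```python
import pandas as pd

# Toy stand-in for the article's `full` DataFrame.
full = pd.DataFrame({
    "school_dist": ["01", "01", "02", "10"],
    "sat_score": [1200.0, 1300.0, 1100.0, 1250.0],
})

# Group by school district and average each column.
districts = full.groupby("school_dist").mean().reset_index()

# Strip leading zeros so district numbers match the geographic data.
districts["school_dist"] = districts["school_dist"].str.lstrip("0")
print(districts)
```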

Also check out part 1 if you missed it.


Suppress Assembly Output

Richie Lee shows how to hide assembly load information in a Powershell script:

Recently I’ve been wondering how I can suppress the output in my PowerShell scripts when loading assemblies into them. I used to find that output handy as a way of telling me the script had gotten that far, but now I find it annoying, and it is no substitute for proper error handling.

There is more than one way to suppress output to the console, but for assembly loading, I prefer to use [void] because it looks neater than the alternatives.

I’d never used this technique; I always piped to $null.
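For what it’s worth, the same tension shows up in Python; here’s a hypothetical sketch of swallowing a chatty call’s output, the moral equivalent of [void] or piping to $null:

```python
import contextlib
import io

def chatty_load():
    print("Loading assembly ...")  # stand-in for noisy load output
    return 42

# Swallow the output but keep the return value.
with contextlib.redirect_stdout(io.StringIO()):
    result = chatty_load()

print(result)  # 42
```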


Using Lookup Components

Thomas LeBlanc explains how to use Lookup components as part of warehouse loading:

The “Fail component” option will stop processing of the import if no match is found in the lookup table. This is not a good option for loading data into a fact table. Most systems would want the import to continue for the customer surrogate keys that are found.

  1. Ignore Failure – NULL replaces the selected lookup values. Rows with no match are streamed to the normal flow in the package, with NULL in the selected match columns.

  2. Redirect rows to error output – the red-line output will show a failure but can pipe the data to any component. Rows with no match are not streamed to the normal flow in the package.

  3. Fail component – the package stops with a failure if no match exists; if all rows match, there is no failure.

  4. Redirect rows to no match output – the output can be piped to another component and processing continues. Rows with no match are not streamed to the normal flow in the package.
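To make the four behaviors concrete, here’s a hypothetical Python sketch of the same semantics, with a dict standing in for the dimension table (an illustration, not SSIS code):

```python
# Business key -> surrogate key, standing in for the dimension lookup.
customer_keys = {"C1": 101, "C2": 102}

def lookup(rows, mode):
    matched, no_match = [], []
    for row in rows:
        key = customer_keys.get(row["customer"])
        if key is not None:
            matched.append({**row, "customer_sk": key})
        elif mode == "ignore":
            # Option 1: NULL in the match column, row stays in the normal flow.
            matched.append({**row, "customer_sk": None})
        elif mode == "fail":
            # Option 3: the whole load stops on the first miss.
            raise ValueError(f"No match for {row['customer']}")
        else:
            # Options 2 and 4: the row leaves the normal flow through the
            # error output or the no-match output, respectively.
            no_match.append(row)
    return matched, no_match

ok, misses = lookup([{"customer": "C1"}, {"customer": "C9"}], mode="redirect")
print(ok, misses)
```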

Lookup components are one of the more powerful components in a data flow task.  Read the whole thing.


Restoring With Standby

Kenneth Fisher describes the WITH STANDBY option for database restorations:

It then leaves the database in a read-only state. So now you can restore the database back to a point in time (even mid log backup). Check the data. Restore a few minutes into the future. Check the data again. Over and over again until you are where you need to be. It’s still going to be tedious but better than doing the full restore over again each time you need to check, right?

On top of that, we can use the same idea and combine it with log shipping. Now you not only have a spare in case of a DR situation, but that spare can be read-only most of the time (except when actually restoring). You can use it to run reports, ad hoc queries, etc. (Don’t use it for CHECKDBs.)

Those are a few good uses of the WITH STANDBY option.
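To make the pattern concrete, here’s a rough sketch of driving those stepwise restores from Python with pyodbc; the database name, backup paths, and STOPAT time are all made up:

```python
import pyodbc

# RESTORE cannot run inside a user transaction, so open with autocommit.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# Step a few minutes into the log backup while leaving the database
# readable. Re-run with a later STOPAT to creep forward in time until
# the data looks right.
cur.execute("""
    RESTORE LOG MyDatabase
    FROM DISK = N'C:\\Backups\\MyDatabase_log.trn'
    WITH STANDBY = N'C:\\Backups\\MyDatabase_undo.dat',
         STOPAT = '2016-07-25 10:05:00';
""")
```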


CISL 1.3.0

Niko Neugebauer has released version 1.3.0 of his Columnstore script library:

I am extremely proud to share with everyone the news that the long-awaited and quite overdue release of the CISL – Columnstore Indexes Scripts Library – is finally public: the 1.3.0 version. The most important part of this release is the support of SQL Server 2016 & Azure SQL Database – all 3 scenarios (Nonclustered Columnstore, Disk-based Clustered Columnstore & the Memory-Optimised Clustered Columnstore Indexes) are included.
You will be able to explore all the important new architecture objects, such as the Deleted Buffer & Deleted Table, plus the scripts for every version support the new output results, even though there were no In-Memory tables in SQL Server 2012, for example.

If you use columnstore indexes, check this out.
