Spark Streaming Vs Kafka Streams

Mahesh Chand Kandpal contrasts Kafka Streams with Spark Streaming:

The low latency and an easy-to-use event time support also apply to Kafka Streams. It is a rather focused library, and it’s very well-suited for certain types of tasks. That’s also why some of its design can be so optimized for how Kafka works. You don’t need to set up any kind of special Kafka Streams cluster, and there is no cluster manager. And if you need to do a simple Kafka topic-to-topic transformation, count elements by key, enrich a stream with data from another topic, or run an aggregation or only real-time processing — Kafka Streams is for you.

If event time is not relevant and latencies in the seconds range are acceptable, Spark is the first choice. It is stable and almost any type of system can be easily integrated. In addition it comes with every Hadoop distribution. Furthermore, the code used for batch applications can also be used for the streaming applications as the API is the same.

Read on for more analysis.

Combinatorics With Joins

Dmitry Zaytsev explains the math behind why query plans can be so inefficient when dealing with a large number of joins:

Let’s talk about the sequence of table joins in detail. It is very important to understand that the possible number of table joins grows exponentially, not linearly. Fox example, there are only 2 possible methods to join 2 tables, and the number can reach 12 methods for 3 tables. Different join sequences can have different query cost, and SQL Server optimizer must select the most optimal method. But when the number of tables is high, it becomes a resource-intensive task. If SQL Server begins going over all possible variants, such query may never be executed. That is why, SQL Server never does it and always looks for a quite good plan, not the best plan. SQL Server always tries to reach compromise between execution time and plan quality.

There are ways you can help the optimizer, and one of my favorite query tuning books was all about table selection.

Interesting SQL Server GitHub Repos

Denis Gobo links to five interesting GitHub repos:


dbatools is a free PowerShell module with over 200 SQL Server administration, best practice and migration commands included.

GitHub Repo is here:

Also make sure to visit their website:

Read on for the other four.

Multi-Instance Powershell Queries

Jana Sattainathan shows an easy way of performing an operation against a number of SQL Server instances:

Today, I received a really simple request. There was an UPDATE statement that had to be run on quite a few databases in multiple instances.  This was for a vendor software we use and the statement was sent to me by a Helpdesk professional who manages the product.

You need to have the SQLPS (older) or the SQLSERVER (newer) PowerShell module installed.

Click through for the code.  If you want to extend this further, I’d look at using a Central Management Server to pull the instances and maybe looking at parallel processing if you have a large number of servers.

Rotating Tiles Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Rotating Tile Custom Visual.  The Rotating Tile gives you the ability to display multiple metrics on a single visual that rotates through each value you wish to display.  This allows you to save valuable space on your reports!

This feels like the type of thing that works on a dashboard but would get frustrating if you used it for time-sensitive data or data which required thoughtful analysis.

Reducing TempDB Usage With Memory-Optimized Objects

Mark Wilkinson shows how to replace temp tables (or table variables) with memory-optimized table variables to reduce tempdb latching:

If all worked, you should now see that we have contention on the sysschobjs table. Earlier we discussed using sp_help to get index details on system tables, if we do that now and look at index 2, we will see the lead column is nsclass which is a tinyint field. Using a tinyint as a lead column is typically a terrible idea since there is little selectivity on such a narrow field, and this is no exception.

This isn’t the only case of contention you might see with system objects related to temporary tables. We ran into a few different contention scenarios with tempdb:

  • Contention on sysschobjs again, but on index 3. This index leads with the name of the temporary table and is fairly narrow so you can fit a lot of records on a single index page. Because of this, if you are running lots of concurrent procedures that create temporary tables with the same or similar names, it creates a hot spot on a single page, leading to more contention.

  • Temporary table auto-stats. Statistics objects for all tables (including temporary tables) are stored in the sys.sysobjvalues table. If you get enough auto-stats generations on temporary tables you can see contention here.

Mark’s post reads like a book chapter and he does a great job of summing up the problem and the solution.

Database Code Analysis

William Brewer has an interesting article on performing code analysis on database objects:

In general, code analysis is not just a help to the individual developer but can be useful to the entire team. This is because it makes the state and purpose of the code more visible, so that it allows everyone who is responsible for delivery to get a better idea of progress and can alert them much earlier to potential tasks and issues further down the line. It also makes everyone more aware of whatever coding standards are agreed, and what operational, security and compliance constraints there are.

Database Code analysis is a slightly more complicated topic than static code analysis as used in Agile application development. It is more complicated because you have the extra choice of dynamic code analysis to supplement static code analysis, but also because databases have several different types of code that have different conventions and considerations. There is DML (Data Manipulation Language), DDL (Data Definition Language), DCL (Data Control Language) and TCL (Transaction Control Language).  They each require rather different analysis.

William goes on to include a set of good resources, though I think database code analysis, like database testing, is a difficult job in an under-served area.

Calculating Relative Risk In T-SQL

Mala Mahadevan explains how to calculate relative risk using T-SQL:

In this post we will explore a common statistical term – Relative Risk, otherwise called Risk Factor. Relative Risk is a term that is important to understand when you are doing comparative studies of two groups that are different in some specific way. The most common usage of this is in drug testing – with one group that has been exposed to medication and one group that has not. Or , in comparison of two different medications with two groups with each exposed to a different one.

Read on for an example of a statistical formula calculation which might actually be easier in T-SQL than R.

Random Forests In scikit-learn

Mark Needham shows how easy it is to create a random forest model in Python using scikit-learn:

As I mentioned in a blog post a couple of weeks ago, I’ve been playing around with the Kaggle House Prices competition and the most recent thing I tried was training a random forest regressor.

Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.

I wasn’t really sure how to work out if that theory was true or not, but by chance, I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!

There’s a nagging voice in my head saying “Principal Component Analysis” as I read this post.

Comparing Spark Streaming, Flink, And Kafka Streams

Shivangi Gupta contrasts three streaming technologies:

Flink and Spark are in-memory databases that do not persist their data to storage. They can write their data to permanent storage, but the whole point of streaming is to keep it in memory, to analyze current data. All of this lets programmers write big data programs with streaming data. They can take data in whatever format it is in, join different sets, reduce it to key-value pairs (map), and then run calculations on adjacent pairs to produce some final calculated value. They also can plug these data items into machine learning algorithms to make some projection (predictive models) or discover patterns (classification models).

Streaming has become the product-level battleground in the Hadoop ecosystem, and it’s interesting to see the different approaches that different groups have taken.


June 2017
« May