KSQL: Streaming SQL For Kafka

Neha Narkhede announces KSQL:

I’m really excited to announce KSQL, a streaming SQL engine for Apache Kafka™. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more.

Feasel’s Law wins again.  The syntax looks pretty similar to Spark Streaming and Stream Analytics, so if you get those, you’ll get this.

Monitoring Kafka Lag

Kevin Feasel



Bas Harenslak explains how to monitor consumer lag in Kafka:

So you’ve written e.g. a Spark ETL pipeline reading from a Kafka topic. There are several options for storing the topic offsets to keep track of which offset was last read. One of them is storing the offsets in Kafka itself, which will be stored in an internal topic __consumer_offsets. If you’re using the Kafka Consumer API (introduced in Kafka 0.9), your consumer will be managed in a consumer group, and you will be able to read the offsets with a Bash utility script supplied with the Kafka binaries.

The Prometheus mentioned in the article is an open-source monitoring solution.
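
As a rough illustration of what "consumer lag" means here, lag for a partition is the latest offset written to the log minus the offset the consumer group has committed. The sketch below is not from Bas's article: the function name, the sample offsets, and the choice to treat a partition with no committed offset as fully unread are all assumptions made for the example.

```python
from typing import Dict


def consumer_lag(log_end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: log end offset minus committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts as unread from offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
log_end = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1450, 1: 1200}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A monitoring setup would typically export the per-partition values rather than only the total, since one stuck partition can hide behind a healthy-looking sum.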

Power BI Report Server August Preview

Aaron Nelson gives us a happy report:

Power BI Report Server August 2017 Preview is now available. Think of this as a “v.Next” of Power BI Report Server [On-Premises], for all Data Sources.

Here’s an excerpt from the Power BI blog post from this weekend:

With this August 2017 preview, users can create Power BI reports in Power BI Desktop that connect to any data source, and publish their reports to Power BI Report Server. There’s no special configuration required to enable this functionality.

Read on for more information and a link to download the latest preview.  It had me as soon as I read “all data sources.”

Basics Of Azure SQL Data Warehouse

Minette Steynberg has an article introducing Azure SQL Data Warehouse:

Azure SQL DW is best used for analytical workloads that make use of large volumes of data and need to consolidate disparate data into a single location.

Azure SQL DW has been specifically designed to deal with very large volumes of data. In fact, if there is too little data it may perform poorly because the data is distributed. You can imagine that if you had only 10 rows per distribution, the cost of consolidating the data will be way more than the benefit gained by distributing it.

SQL DW is a good place to consolidate disparate data, transform, shape and aggregate it, and then perform analysis on it. It is ideal for running burst workloads, such as month end financial reporting etc.

Azure SQL DW should not be used when small row by row updates are expected as in OLTP workloads. It should only be used for large scale batch operations.

Azure SQL Data Warehouse is fantastic when you’ve got a setup like the one described above and are willing to pay a premium to make things faster.  And with appropriately distributed data, it certainly does get faster.
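
To put numbers on the "10 rows per distribution" remark in the excerpt, here is a small back-of-the-envelope sketch. The figure of 60 distributions is how Azure SQL DW spreads a table's data and is not stated in the excerpt itself. The function name and sample row counts are made up for illustration, and the arithmetic assumes rows are spread perfectly evenly, which real data rarely is.

```python
DISTRIBUTIONS = 60  # Azure SQL DW spreads each table's rows across 60 distributions


def rows_per_distribution(total_rows: int, distributions: int = DISTRIBUTIONS) -> float:
    """Average rows per distribution, assuming a perfectly even spread."""
    return total_rows / distributions


# Hypothetical table sizes: one tiny table, one warehouse-scale fact table.
small_table = rows_per_distribution(600)
large_table = rows_per_distribution(6_000_000_000)

print(f"600 rows           -> {small_table:,.0f} rows per distribution")
print(f"6,000,000,000 rows -> {large_table:,.0f} rows per distribution")
```

With only a handful of rows in each distribution, the fixed cost of coordinating 60 pieces of work and merging the results outweighs any gain from parallelism, which is the point the excerpt makes.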

Sizing Memory-Optimized Workloads

Prashanth Purnananda gives us a few notes regarding memory-optimized table sizes:

Recovering a database with memory-optimized tables involves hydrating the contents of checkpoint files (data/delta files) into memory and then replaying the tail of the log (see this link for more details). One of the important differences between disk based tables and memory-optimized store is the frequency of checkpoints. Automatic checkpointing for in-memory tables occurs every 1.5GB of log records unlike traditional or indirect checkpoints (where checkpointing is done more often) leading to longer tail of log for in-memory tables. The 1.5 GB log flush is chosen to strike the right balance between flooding the IO subsystem with too many small inefficient IO operations and too few large IOPs. In most scenarios observed by our CSS teams, long recovery times for memory optimized databases are caused by the long tail of log which needs to be recovered for in-memory tables in the database. For these scenarios, running a manual checkpoint before a restart can reduce recovery times as manual checkpoint forces the checkpoint for memory optimized tables in addition to disk based tables.

If you’re looking at creating memory-optimized tables, these are important administrative notes.

Azure SQL Database Compatibility Level Change

Joe Sack reports that new Azure SQL Databases will have a compatibility level of 140 by default:

Once this new database compatibility default goes into effect, if you still wish to use database compatibility level 130 (or lower), please follow the instructions detailed here: View or Change the Compatibility Level of a Database.  For example, you may wish to ensure that new databases created in Azure SQL Database use the same compatibility level as other databases in Azure SQL Database to ensure consistent query optimization behavior across development, QA and production versions of your databases. We recommend that database configuration scripts explicitly designate COMPATIBILITY_LEVEL rather than rely on the defaults, in order to ensure consistent application behavior.

For new databases supporting new applications, we recommend using the latest compatibility level (140).  For pre-existing databases running at lower compatibility levels, the recommended workflow for upgrading the query processor to a higher compatibility level is detailed in the article, Change the Database Compatibility Mode and Use the Query Store.  Note that this article refers to compatibility level 130 and SQL Server, but the same methodology applies for moves to 140 for SQL Server and Azure SQL DB.

It’s good to hear, and as Joe mentions, you have the ability to move back down to 130 if you need it.

Selecting Into Tables, Sans Identity

Kenneth Fisher shows a couple of ways to remove an identity property from a column when creating a new table:

A while back I did a post about creating an empty table using a SELECT statement. Basically doing something like this:

SELECT TOP 0 * INTO tableNameArchive FROM tableName

will create a new table with the exact same structure as the source table. It can be a really handy way to create an archive table, a temp table, etc. You don’t create any of the extra objects (indexes, triggers, constraints etc) but what you do end up with is every table property from the original table. This includes datatypes, nullability, and (as I’m sure you realized from the title) IDENTITY. Which if you are creating an archive table, a temp table, etc is probably not something you want. Fortunately, there are two easy ways to get rid of the identity.

Click through to see those two methods.

Early Thoughts On New AMD CPUs

Glenn Berry talks about the new AMD processor lines and how they might work with SQL Server:

AMD is really pushing the idea of a single-socket EPYC system as a better alternative to a two-socket Intel system for many server workloads. According to AMD, it will be much less expensive, yet will have plenty of cores, memory, and PCIe 3.0 lanes, along with no NUMA overhead. One key advantage AMD is touting is their Infinity Fabric modular interconnect technology, that works both within a single processor and between multiple processors.

For SQL Server 2016/2017 usage, you would still want the “top of the line” SKU for a given physical core count, to get the most performance for each physical core license that you buy. Unlike Intel, AMD does not increase the base clock speed in the lower core count models. These EPYC systems have a lot of PCIe 3.0 lanes and very high memory density, so they might work really well for large SQL Server DW/Reporting workloads. For OLTP workloads, the key will be how much single-threaded performance AMD is able to get from this first-generation of EPYC, and how they compare to Intel’s new Skylake-SP processors. Figure 3 shows the fastest EPYC processor at each core count, which is what you would want for SQL Server usage.

There aren’t too many hard numbers yet, but the worst case scenario is that they force Intel to improve their offerings.

