Containerized SQL Server 2016 Express Edition

The SQL Server team announces a Docker image for SQL Server 2016 Express Edition:

SQL Server 2016 in a Windows container would be ideal when you want to:

  1. Quickly create and start a set of SQL Server instances for development or testing.

  2. Maximize density in test or production environments, especially in microservice architectures.

  3. Isolate and control applications in a multi-tenant infrastructure.

Works on Windows only.  Given that SQL Server on Linux is going to be a thing, I could see general release through Docker there as well, but we’ll see.

Rebuild Or Reorganize?

Kendra Little answers the age-old question:

With very large indexes, rebuilds take longer, generate more log, impact performance more while they’re running.

If you’re using high availability features like Availability Groups or database mirroring that stream the log between replicas, generating a lot of log data very quickly can create problems.

Your replica/ mirrors may fall behind. Depending on the latency between replicas, the size of the indexes rebuilt, and other operations in the database, they may be out of your Recovery Point Objective / Recovery Time objective for a long time.

In this situation, it’s particularly attractive to drip changes into the log more slowly. One of the ways to do this is to use REORGANIZE for those large indexes.

There’s a lot of nuance here, so give it a read (or watch the video).

Spark Overview

Jen Underwood provides an overview of the Apache Spark project:

Spark provides a comprehensive framework to manage big data processing with a variety of data set types including text and graph data. It can also handle batch pipelines and real-time streaming data. Using Spark libraries, you can create big data analytics apps in Java, Scala, Clojure, and popular R and Python languages.

Spark brings analytics pros an improved MapReduce type query capability with more performant data processing in memory or on disk. It can be used with datasets that are larger than the aggregate memory in a cluster. Spark also has savvy lazy evaluation of big data queries which helps with workflow optimization and reuse of intermediate results in memory. TheSpark API is easy to learn.

One of my taglines is, Spark is not the future of Hadoop; Spark is the present of Hadoop.  If you want to get into this space, learn how to work with Spark.

Threads Need Memory Too

Arun Sirpal notices a bug with fn_dump_dblog():

Using this command creates more threads and hidden schedulers (these will only go after a restart). Depending on what version of SQL Server you are on and what Service Pack you may or may not have this issue. It was fixed in SQL 2012 SP2 onwards. So be on the cautious side when running these sorts of commands.

Also I noticed Memory bloat for the sqlservr.exe. Nothing else was running on this server, just my fn_dump_dblog script.  Threads need memory too.

It’s good advice.  Undocumented functions are probably more likely than documented functions to contain bugs.

Association Rules

Kevin Feasel



Tomaz Kastrun discusses product variants:

To sum up, association rules is a great and powerful algorithm for finding the correlations between items and the fact that you can use this straight from SSMS, it just gives me goosebumps. Currently just the performance is a bit of a drawback. Also comparing this algorithm to Analysis services (SSAS) association rules, there are many advantages on R side, because of maneuverability and extracting the data to T-SQL, but keep in mind, SSAS is still very awesome and powerful tool for statistical analysis and data predictions.

Figuring out variations after the fact is an all-too-common task, and this is a good way of getting some ideas on how to do that.

Kafka Enrichment

I have an article on enriching data stored in a Kafka topic:

We’re going a bunch of setup work here, so let’s take it from the top.  First, I declare a consumer group, which I’m calling “Airplane Enricher.”  Kafka uses the concept of consumer groups to allow consumers to work in parallel.  Imagine that we have ten separate servers available to process messages from the Flights topic.  Each flight message is independent, so it doesn’t matter which consumer gets it.  What does matter, though, is that multiple consumers don’t get the same message, as that’s a waste of resources and could lead to duplicate data processing, which would be bad.

The way Kafka works around this is to use consumer groups:  within a consumer group, only one consumer will get a particular message.  That way, I can have my ten servers processing messages “for real” and maybe have another consumer in a different consumer group just reading through the messages getting a count of how many records are in the topic.  Once you treat topics as logs rather than queues, consumer design changes significantly.

This is a fairly lengthy read, but directly business-applicable, so I think it’s well worth it.

Handling Large Data Modifications

Kevin Feasel



Jeff Mlakar shows how to insert, update, and delete large numbers of records with T-SQL:

Using T-SQL to insert, update, or delete large amounts of data from a table will results in some unexpected difficulties if you’ve never taken it to task.

Let’s say you have a table in which you want to delete millions of records. If the goal was to remove all then we could simply use TRUNCATE. However, if we want to remove records which meet some certain criteria then executing something like this will cause more trouble that it is worth.

I do like the delete process.  The update process is going to run tableRows/batchSize full scans, so I’m not as fond of that one.  Do read the whole thing.

T-SQL Tuesday #83 Roundup

Andy Mallon handles T-SQL Tuesday duties this month:

For this month’s T-SQL Tuesday, I asked people to blog about the same old problems we’ve been dealing with for years. There were some great posts, including a batch of first-time contributors. You’ll notice some overlapping themes as you read through these responses–I think those themes represent some of the biggest, most important problems we have (like being able to restore backups). Thanks for everyone who contributed this month!

This was an interesting one with quite a few contributors.

Control Flow Package Parts

Todd McDermid explains a feature new to Integration Services 2016:

The basic idea behind package parts makes complete sense to a coder – they’re macros.  You take code you’ve used in several places, put it in a separate file that you then include and “expand” in multiple other files.
If you have multiple packages with parts of the Control Flow that are identical – setting up a database in a certain way, sending emails, calling a set of stored procedures, … – then Control Flow Package Parts can help.
The assistance isn’t just limited to the initial coding, either.  Yes – creating a new package with the “duplicate” code is much easier.  But the real gain of Control Flow Package Parts is when your “standard” code needs changes.  Instead of having to edit multiple packages to address the modifications – you only have to alter the package part.  Deploying the project(s) that depend on this part automatically incorporates those improvements.

I’d be a lot more interested in this if Biml weren’t already a better option.  Read on for Todd’s rundown.

Processing 2016 Tabular From SSIS 2014

Meagan Longoria shows how to process a Tabular Model with a compatibility level of 1200 in SQL Server Integration Services 2014:

Attempting to use the AS Processing Task results in the following error: “[Analysis Services Execute DDL Task] Error: This command cannot be executed on database ‘MySSASDB’ because it has been defined with StorageEngineUsed set to TabularMetadata. For databases in this mode, you must use Tabular APIs to administer the database”

The reason for keeping SSAS processing in an SSIS package was because it kept consistent logging throughout their data refresh process. So we set out to find another solution.

Read on for the explanation and the solution.


October 2016
« Sep Nov »