Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available to Kafka and Spark Streams:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available.  Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.
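
To make the "callback versus chained get" distinction concrete, here is a minimal sketch in Python. It does not use a Kafka client at all: `append_to_log`, `on_ack`, and the in-memory `log` list are hypothetical stand-ins for the broker's commit log, and a single-worker thread pool plays the part of the asynchronous send.

```python
from concurrent.futures import Future, ThreadPoolExecutor

log = []   # hypothetical stand-in for the broker's on-disk commit log
acks = []  # offsets reported back through the callback


def append_to_log(message: str) -> int:
    """Pretend to commit a message and return its offset."""
    log.append(message)
    return len(log) - 1


def on_ack(future: Future) -> None:
    """Runs once the write has completed; records the acknowledged offset."""
    acks.append(future.result())


with ThreadPoolExecutor(max_workers=1) as executor:
    # Asynchronous style: fire the send and let the callback handle the ack.
    pending = executor.submit(append_to_log, "first")
    pending.add_done_callback(on_ack)

    # Forced synchronicity: block on the result right after sending.
    offset = executor.submit(append_to_log, "second").result()
```

The second call is the "cheating" version: the caller waits for every acknowledgement, which gives up the throughput the asynchronous path is designed for.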

Concurrency In Scala

Matthew Rathbone shows different concurrency options available in Scala:

Scala is a functional programming language that aims to avoid side effects by encouraging you to use immutable variables (called ‘values’), and data structures.

So by default in Scala when you build a list, array, string, or other object, that object is immutable and cannot be changed or updated.

This might seem unrelated, but think about a thread which has been given a list of strings to process, perhaps each string is a website that needs crawling.

In the Java model, this list might be updated by other threads at the same time (adding / removing websites), so you need to make sure you either have a thread-safe list, or you safeguard access to it with the synchronized keyword or a Mutex.

By default in Scala this list is immutable, so you can be sure that the list cannot be modified by other threads, because it cannot be modified at all.

While this does force you to program in different ways to work around the immutability, it does have the tremendous effect of simplifying thread-safety concerns. The value of this cannot be overstated; it’s a huge burden to worry about thread safety all the time, but in Scala much of that burden goes away.

Read the whole thing if you’re looking at writing Spark applications in Scala.  If you’re thinking about functional programming in .NET languages, F# is there for you.
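
The immutability argument is not specific to Scala. As a rough sketch of the same idea in Python (not the code from the linked post), a tuple can be handed to several threads without locking because nothing can add to or remove from it. The site names and the `crawl` and `try_to_add` helpers are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# A tuple is immutable, so every worker sees the same fixed set of sites.
sites = ("example.com", "example.org", "example.net")


def crawl(site: str) -> str:
    """Placeholder for real crawling work."""
    return f"crawled {site}"


def try_to_add(collection) -> bool:
    """Attempt to mutate the shared collection; report whether it worked."""
    try:
        collection.append("example.edu")
        return True
    except AttributeError:
        # Tuples have no append method, so no thread can change the data.
        return False


with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(crawl, sites))
```

No lock is needed around `sites` because the only operations available on it are reads.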

Linear Support Vector Machines

Ananda Das explains how linear Support Vector Machines work in classifying spam messages:

Linear SVM assumes that the two classes are linearly separable, that is, a hyperplane can separate the two classes and the data points from the two classes do not get mixed up. Of course this is not an ideal assumption, and we will discuss later how linear SVM handles the case of non-linear separability. But for a reader with some experience, here I pose a question: Linear SVM creates a discriminant function, but so does LDA. Yet both are different classifiers. Why? (Hint: LDA is based on Bayes’ Theorem while Linear SVM is based on the concept of margin. In the case of LDA, one has to make an assumption about the distribution of the data per class. For a newbie, please ignore the question. We will discuss this point in detail in some other post.)

This is a pretty math-heavy post, so get your coffee first. h/t R-Bloggers.
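
If the math feels abstract, the two ideas in the excerpt (a linear discriminant function and a margin) fit in a few lines of Python. This is a toy sketch, not the post's code: the four 2-D points, the weights `w`, and the bias `b` are hand-picked assumptions rather than the output of a trained SVM.

```python
import math


def decision(w, b, x):
    """Linear discriminant function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """The side of the hyperplane a point falls on: +1 or -1."""
    return 1 if decision(w, b, x) >= 0 else -1


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in zip(points, labels))


# Toy, linearly separable data (hypothetical values).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Hand-picked hyperplane x1 + x2 - 2 = 0.
w, b = (1.0, 1.0), -2.0

predictions = [classify(w, b, x) for x in points]
margin = geometric_margin(w, b, points, labels)
```

A linear SVM looks for the `w` and `b` that make this margin as large as possible; the points that sit exactly at the margin are the support vectors.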

SQL Client Aliases

Andrew Pruski explains how to use a lesser-known feature in SQL Server, client aliases:

One of the problems that we ran into when moving to using containers was how to get the applications to connect. Let me explain the situation.

The applications in our production environment use DNS CNAME aliases that reference the production SQL instance’s IP address. In our old QA environment, the applications and SQL instance lived on the same virtual server so the DNS aliases were overwritten by host file entries that would point to

This caused us a problem when moving to containers as the containers were on a separate server listening on a custom tcp port. Port numbers cannot be specified in DNS aliases or host file entries and we couldn’t update the application string (one of the pre-requisites of the project) so we were pretty stuck until we realised that we could use SQL client aliases.

This is definitely a place that you’d want to document changes thoroughly, as my experience is that relatively few DBAs would even think of looking there.

Database File Sizes In PowerShell

Rob Sewell has a nice post on checking database file sizes using dbatools in PowerShell:

As always, PowerShell uses the permissions of the account running the sessions to connect to the SQL Server unless you provide a separate credential for SQL Authentication. If you need to connect with a different Windows account, you will need to hold Shift down, right-click on the PowerShell icon, and click run as a different user.

Let’s get the information for a single database. The command has dynamic parameters which populate the database names to save you time and keystrokes.

It’s a great post, save for the donut chart…  Anyhow, this is recommended reading.

Azure SQL Database Premium RS

Kevin Feasel

Arun Sirpal describes a new pricing tier for Azure SQL Database:

What Microsoft classifies as IO intensive I am not so sure, personally I have not seen any sort of IOPS figure(s) for what we could expect from each service tier, it’s not like I can just run DiskSpeed and find out. Maybe the underlying storage for Premium RS databases is more geared to work with complex analytical queries, unfortunately I do not have the funds in my Azure account to start playing around with tests for Premium vs. Premium RS (I would love to).

Also and just as important, Premium RS databases run with fewer redundant copies than Premium or Standard databases, so if you get a service failure you may need to recover your database from a backup with up to a 5-minute lag. If you can tolerate 5 minute data loss and you are happy with a reduced number of redundant copies of your database then this is a serious option for you because the price is very different.

It’s a lot less expensive (just under 1/3 the cost of Premium in Arun’s example), so it could be worth checking out.

Splitting A Small Database

Brent Ozar explains why he recommended a client break out a small database:

Listen, I can explain. Really.

We had a client with a 5GB database, and they wanted it to be highly available. The data powered their web site, and that site needed to be up and running in short order even if they lost the server – or an entire data center – or a region of servers.

The first challenge: they didn’t want to pay a lot for this muffler database. They didn’t have a full time DBA, and they only had licensing for a small SQL Server Standard Edition.

Read on for the full explanation.  Given the constraints and expectations, it makes sense, and this is a good example of figuring out how expected future growth can change the bottom line for a DBA.

JSON Dates In SQL Server

Bert Wagner explains how to handle JSON datetime strings in SQL Server:

In SQL Server, datetime2’s format is defined as follows:

YYYY-MM-DD hh:mm:ss[.fractional seconds]

JSON date time strings are defined like:


Honestly, they look pretty similar. However, there are a few key differences:

  • JSON separates the date and time portion of the string with the letter T

  • The Z is optional and indicates that the datetime is in UTC (if the Z is left off, JavaScript defaults to UTC). You can also specify a different timezone by replacing the Z with a + or - along with HH:mm (i.e. -05:00 for Eastern Standard Time)

  • The precision of SQL’s datetime2 goes out to 7 decimal places, in JSON and JavaScript it only goes out to 3 places, so truncation may occur.

Read on for a few scripts handling datetime conversions between these types.
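
To see the three differences from the excerpt in action, here is a small Python sketch (not one of the scripts from the post). It assumes the JSON side looks like what JavaScript's `toJSON` emits, for example `2017-03-28T12:45:00.000Z`, and that values stored in datetime2 are in UTC. The function names are made up for illustration.

```python
from datetime import datetime, timezone


def json_to_datetime2(json_ts: str) -> str:
    """Convert a JSON/ISO 8601 timestamp to a datetime2-style string in UTC."""
    # Older Python versions reject a trailing 'Z', so spell out the offset.
    parsed = datetime.fromisoformat(json_ts.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: treat a timestamp without an offset as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    utc = parsed.astimezone(timezone.utc)
    # %f gives 6 fractional digits; pad one zero to reach datetime2's 7 places.
    return utc.strftime("%Y-%m-%d %H:%M:%S.%f") + "0"


def datetime2_to_json(sql_ts: str) -> str:
    """Convert a datetime2-style string (assumed UTC) to a JSON timestamp."""
    date_part, time_part = sql_ts.split(" ")
    whole, _, fraction = time_part.partition(".")
    # JSON keeps only 3 fractional digits, so anything beyond that is truncated.
    millis = (fraction + "000")[:3]
    return f"{date_part}T{whole}.{millis}Z"


from_utc = json_to_datetime2("2017-03-28T12:45:00.000Z")
from_est = json_to_datetime2("2017-03-28T07:45:00.000-05:00")
back_to_json = datetime2_to_json("2017-03-28 12:45:00.1234567")
```

The last call shows the truncation the excerpt warns about: the trailing digits of the seven-place fraction do not survive the trip to JSON.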

Changing SQL On Linux Port

Slava Murygin shows how to change the port on which a SQL on Linux instance listens, but notes that it introduces an issue:

All these problems are expected. Prior experience shows that changing SQL Server port makes reading Error Log file impossible.

Besides the inability to read the error log, all other functions work fine.

I’m thinking this bug will get fixed pretty soon.  Read the whole thing.

