Chaining Exactly-Once Operations With Kafka

Ben Stopford shows how you can use Kafka to chain together services while maintaining exactly-once guarantees:

Any service-based architecture is itself a distributed system, a field renowned for being difficult, particularly when things go wrong. We have thought experiments like The Two Generals Problemand proofs like FLP which highlight that these systems are difficult to work with.

In practice we make compromises. We rely on timeouts. If one service calls another service and gets an error, or no response at all, it retries that call in the knowledge that it will get there in the end.

The problem is that retries can result in duplicate processing—which can cause very real problems. Taking a payment, twice, from someone’s account will lead to an incorrect balance. Adding duplicate tweets to a user’s feed will lead to a poor user experience.  The list goes on.

I just had a discussion at SQL Saturday Albany about this exact thing, and the pain of rolling your own solutions.

Spark Data Structures

Kevin Feasel



Shubham Agarwal explains the difference between three Spark data structures:

DataFrame(DF) – 

DataFrame is an abstraction which gives a schema view of data. Which means it gives us a view of data as columns with column name and types info, We can think data in data frame like a table in the database.

Like RDD, execution in Dataframe too is lazy triggered.

Read on to learn more about Resilient Distributed Datasets, DataFrames, and DataSets.

Subscription Versus Assignment In Kafka

Kevin Feasel



Paolo Patierno explains why you shouldn’t mix subscribe() and assign() in Kafka:

Another great advantage of consumers grouping is the rebalancing feature. When a consumer joins a group, if there are still enough partitions available (i.e. we haven’t reached the limit of one consumer per partition), a re-balancing starts and the partitions will be reassigned to the current consumers, plus the new one. In the same way, if a consumer leaves a group, the partitions will be reassigned to the remaining consumers.

What I have told so far it’s really true using the subscribe() method provided by the KafkaConsumerAPI. This method forces you to assign the consumer to a consumer group, setting the property, because it’s needed for re-balancing. In any case, it’s not the consumer’s choice to decide the partitions it wants to read for. In general, the first consumer joins the group doing the assignment while other consumers join the group.

Read on to learn more.

Troubleshooting MSDTC

Jeff Mlakar has a troubleshooting guide for the Distributed Transaction Coordinator:

Microsoft Distributed Transaction Coordinator (MSDTC) is a Windows service that coordinates transactions spanning multiple databases. It is a transaction manager that allows applications to include several different sources of data in one transaction. MSDTC coordinates committing the distributed transaction across all servers enlisted in the transaction.

An example here would be a process on one machine calling a stored procedure on another machine which in requires data that changes together in a transaction.

The MSDTC service is running on each of the servers to manage the successful commit (or rollback) of the transaction across all servers enlisted in the transaction.

My MSDTC experience is generally negative, that this is more trouble than it’s worth.  But sometimes you’ve got to use it, and when you do, it’s nice to have the skinny on it.

Auto-Restarting Docker Containers

Andrew Pruski helps out those of us who can’t seem to let our containers go:

One of the problems that I’ve encountered since moving my Dev/QA departments to using SQL Server within containers is that the host machine is now a single point of failure.

Now there’s a whole bunch of posts I could write about this but the one point I want to bring up now is…having to start up all the containers after patching the host.

I know, I know…technically I shouldn’t bother. After patching and bouncing the host, the containers should all be blown away and new ones deployed. But this is the real world and sometimes people want to retain the container(s) that they’re working with.

It’s pretty easy to set up a restart policy, as Andrew shows.

Important Stored Procedures For Transactional Replication

Drew Furgiuele (who always starts off with the red ring) explains three important replication stored procedures:

We all have that one scenario where replication caused us some major headaches. It’s through these headaches, though, that the search for better ways to handle problems comes up. Through my time with replication, I’ve researched and used a lot of system tables, views, and stored procedures to diagnose or fix replication problems. There’s actually a lot of things you can get into if you’re so inclined. If you put me on the spot though, and asked me “Drew, what do you think are the most helpful replication internal objects to know?” I’d tell you I have three, and they’re all stored procedures.

Here then, in no particular order, are three system stored procedures that I feel any DBA that has to deal with transactional replication should at least be aware of. They each have a unique purpose, and each can be used to quickly monitor, fix, or even destroy replication forcibly, respectively.

Read the whole thing.

The Trivial Plan Problem With Columnstore

Niko Neugebauer shows that trivial columnstore plans can lead to poor performance, but SQL Server 2017 has a fix:

Do you remember one of the major problems in SQL Server 2014 using Columnstore Indexes ? It was the lack of the support for the Batch Execution Mode with just a single core. We would get wonderful, fast execution plans with MAXDOP >= 2, which will go terribly slow if there would not be enough memory to run the query with 2 or more cores, or if the internal query cost would be below the parallel execution threshold (cost threshold for parallelism)
OR if the execution plan would be dimmed as TRIVIAL by the Query Optimiser, thus getting a single core execution and running really slow.
Once we upgraded to SQL Server 2016, the problem of inability of the single core Batch Mode execution would fade away, but still, sometimes some queries would run terribly slow for some reason …
One of the reasons behind this are the trivial execution plans, which are running Columnstore Index Scan in the Row Execution Mode – also known as a VerySlowExecutionMode for the big amounts of data.

Read on to see the change in 2017, as well as a workaround for 2016.

Time Slicing T-SQL

Kevin Feasel



Bill Fellows has some fancy window function footwork to split data into time slices:

That looks like a lot, but it really isn’t. Starting from the first inner most query, we select the top 24 rows from sys.all_objects and use the ROW_NUMBER function to generate us a monotonically increasing set of values, thus 1…24. However, since the allowable range of hours is 0 to 23, I deduct one from this value (A). I repeat this pattern to generate minutes (B) except we get the top 60. Since I want 15 second intervals, for the seconds query, I only get the top 4 values. I deduct one so we have {0,1,2,3} and then multiply by 15 to get my increments (C). If you want different time slices, that’s how I would modify this pattern.

Only works on 2012 or later, but it’s a fancy way of slicing into 15-second (or however you define it) chunks.

The Great Custom Visual Move

Ginger Grant runs into a problem with some of the custom visuals (not) in the Office Store:

This week I decided to do a demo using the Aquarium custom visual.  As readers of my blog know, I have used the custom visual before, but it has been a while and I have changed PCs since then.  No worries I can always go download the visual from the store, right? Wrong. The aquarium visual is not available on the new store. Neither is Image Viewer, if one is looking to add that into your latest Power BI report it is not available. What happened?

Read on to learn why the aquarium is missing.  RIP aquarium (hopefully only temporarily).


July 2017
« Jun Aug »