Using Kafka To Go From Batch To Stream

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka.  Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

  • When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.

  • Every 24 hours, a batch job (could be Spark) would take all the new reviews and apply a spam filter to filter fraudulent reviews from legitimate ones.

  • New valid reviews are published to another database table (which contains all the historic valid reviews).

  • Another batch job or a SQL query computes new stats for courses. Stats include all-time average rating, all-time count of reviews, 90 days average rating, and 90 days count of reviews.

  • The website displays these metrics through a REST API when the user navigates a website.

Part two finishes up the story:

In the previous section, we learned about the early concepts of Kafka Streams, to take a stream and split it in two based on a spam evaluation function. Now, we need to perform some stateful computations such as aggregations, windowing in order to compute statistics on our stream of reviews.

Thankfully we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that gets new events every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.

This is a nice introduction to Kafka Streams using a realistic example.

Building Observability Tools At Scale

Kevin Lew and Sangeeta Narayanan give us some lessons learned from building logging and monitoring solutions at Netflix:

We started our tooling efforts with providing visibility into device and server logs, so that our users can go to one tool instead of having to use separate data-specific tools or logging into servers. Providing visibility into logs is valuable because log messages include important contextual information, especially when errors occur.

However, at some point in our business growth, storing device and server logs didn’t scale because the increasing volume of log data caused our storage cost to balloon and query times to increase. Besides reducing our storage retention time period, we addressed scalability by implementing a real-time stream processing platform called Mantis. Instead of saving all logs to persistent storage, Mantis enables our users to stream logs into memory, and keep only those logs that match SQL-like query criteria. Users also have the choice to transform and save matching logs to persistent storage. A query that retrieves a sample of playback start events for the Apple iPad is shown in the following screenshot:

It’s an interesting post.

JSON Output And SSIS

Stacia Varga works around an oddity in the way SSIS reads JSON outputs:

What happened? The T-SQL produces the correct results in SQL Server Management Studio (SSMS). However, in SSIS, the same T-SQL statement in an OLE DB Source in a Data Flow Task produces two rows of data which adds a line feed into the flat file and renders the JSON unusable.

The problem is visible even before sending output to the flat file.

Click the link to see how Stacia solves this problem.

Permissions Error Executing R Scripts

Niels Berglund walks through a permissions error on a new installation of SQL Server 2017 CU 7 with Machine Learning Services:

Cool, all is “A-OK”! A couple of days go by, and I see that there is a Cumulative Update (CU) for SQL Server 2017 – CU7. I install it and does not think much about it. I mean: “what can go wrong, how hard can it be?”. A couple of days later and I am busy writing the follow-up post to sp_execute_external_script and SQL Compute Context – I when I try to execute sp_execute_external_script, and it falls over!

Niels has a couple false starts that he walks us through, but then lands on a solid answer.

Trickle Insertion With Clustered Columnstore Indexes

Sunil Agarwal provides a pattern for trickle loading clustered columnstore indexes:

A traditional scenario of loading data into CCI is a nightly load from one or more data files containing millions of rows. Recommended technique is to load the data with batchsize >= 102400 as explained https://blogs.msdn.microsoft.com/sqlserverstorageengine/2014/07/27/clustered-column-store-index-bulk-loading-the-data/. However, we are seeing many scenarios where data source is parallel data stream (i.e. trickle insert) to be loaded to CCI for analytics, a typical IOT scenario. CCI allows concurrent data streams into the same delta rowgroup. However, you will see higher page latch contention as you increase the concurrency. You may wonder why this is so? Well, each delta RG is organized internally as a clustered btree index and the dataload follows the pattern of monotonically increasing clustered index key causing latch contention on the last page.

Check out Sunil’s post.  I also have an older post covering my experiences with CCI trickle loads and three ETL patterns which can work.

DAX: Caching And IF Statement Optimization

Marco Russo shows us a way of improving performance on conditional statements:

Consider the following measure.

1
2
3
4
5
Margin :=
IF (
    [Sales Amount] > 0 && [Total Cost] > 0,
    [Sales Amount] - [Total Cost]
)

The basic idea is that the difference between Sales Amount and Total Cost should be evaluated only whether both measures are greater than zero. In such a condition, the DAX engine produces a query plan that evaluates each measure twice. This is visible in the storage engine requests generated for the following query.

Read on to see how Marco avoids this performance issue.

Analytics Platform System V7 Released

Microsoft has released a new version of their Analytics Platform System:

Microsoft is pleased to announce that the Analytics Platform System (APS) appliance update 7 (AU7) is now generally available. APS is Microsoft’s scale-out Massively Parallel Processing (MPP) system based on SQL Server for data warehouse specific workloads on-premises.

Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on appliance update 6 (APS 2016) release as a foundation. Upgrading to APS appliance update 6 is a prerequisite to upgrade to appliance update 7.

This is useful for the six customers which can afford the licensing for APS.

Getting Wait Info From Extended Events

Grant Fritchey shows how to get wait information for particular sessions from the system_health extended event:

On your servers, any of them that are SQL Server 2008 or newer, right now, unless you’ve performed actions to prevent this, you’re running the Extended Events system_health session. It’s just happening, currently, on all your servers. Nothing you need to do about it at all. I’ll be a lot of you never even knew it was there.

If you follow the link you can see all the various types of information being gathered by the Extended Event system_health session. I won’t detail all of it here. Let me just provide a little context around how the session works. First and foremost, similar to the error log, this session consists of four files, each 5mb in size, rolling over as they get filled. For systems with a very high degree of activity, that means the information here may only be hours old. However, for most of us, this provides days, if not weeks worth of information about the behavior of your system.

The system_health extended event misses a lot of stuff, but it’s quite useful when you don’t have a purpose-built monitoring solution in place.

Categories

June 2018
MTWTFSS
« May  
 123
45678910
11121314151617
18192021222324
252627282930