Machine Learning and Delta Lake

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake:

As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to the data so that it can be fed into machine learning models, these data lakes can quickly become convoluted, unorganized messes that have given rise to the term “data swamps.”

Before a single data point has been transformed or analyzed, data engineers have already run into their first dilemma: how to bring together processing of historical (“batch”) data, and real-time streaming data. Traditionally, one might use a lambda architecture to bridge this gap, but that presents problems of its own stemming from lambda’s complexity, as well as its tendency to cause data loss or corruption.

Read the whole thing.

Cloudera Stream Processing

Dinesh Chandrasekhar announces the new iteration of Cloudera’s streaming data processor:

Cloudera Stream Processing (CSP) is a product within the Cloudera DataFlow platform that packs Kafka along with some key streaming components that empower enterprises to handle some of the most complex and sophisticated streaming use cases. CSP provides advanced messaging, real-time processing and analytics on real-time streaming data using Apache Kafka. CSP also supports key management and monitoring capabilities powered by Cloudera Streams Management (CSM).

Sounds like they’re taking the Kafka route over Spark Streaming, Flink, Airflow, etc.

Reading SQL Server Error Logs

Thomas Rushton has a script for us:

Why Script This? What’s Wrong With SSMS’s GUI?
Well, although SSMS does allow you to look at the error logs, it’s not very helpful for filtering – you can only filter for items that match, rather than exclude items. There are a few other filters as well – I guess the whole thing is just a wrapper around xp_readerrorlog below…

But Thomas has a better way for us.

Rounding to Intervals in Various Languages

Dave Mason doesn’t like rounding to intervals very much in T-SQL:

If I had to choose any of these options for production, I’d probably go with Query 1, just because I think it would make the most sense to any other developer that might encounter it (including future me, who probably wouldn’t remember writing the code). But I’m not really thrilled with any of the options. The one thing T-SQL has going for it though, is the relative ease for truncating the TIME off of a DATETIME by casting to DATE and back to DATETIME. I couldn’t find anything like this in DAX.

Dave also shows how to do this in DAX and Powershell.

Another alternative that Dave doesn’t mention is to invert the problem: if you have a fixed set of intervals you care about (e.g., 15-minute, 30-minute, hour, 4-hour, etc.), you can create a time table. This is like a date table but contains times of the day where you’ve precalculated the intervals. Then you join to the time table and have your results right there. If you do go this route, I’d try to keep the grain of the time table as shallow as possible, maybe using DATETIME2(0) instead of DATETIME2(7).

The Folly of Aggregation, Wait Stats Edition

Erik Darling explains why simply looking at wait stats isn’t enough:

I’m going to be honest with you, dear readers. I’ve fallen out of like with wait stats.

Unless there’s a massive resource or blocking problem, they’re a bit bland.

Tuning a single query, I don’t look at them at all.

Sure, they can be a good detail when you’re painting an overall picture of a server, but they’re misleading a lot.

Erik makes a sound point. It’s usually a sound point when it reminds me of the Hayek quotation, “Mr. Keynes’s aggregates conceal the most fundamental mechanisms of change.”

Power Query and the Benefits of Immutability

Chris Webb explains why immutable expressions can be faster to run multiple times than mutable processes:

Instead of taking the value #”Sorted Rows”[Column2]{0} and storing it in the variable Column2 then adding Column2 four times, I’m  adding the expression #”Sorted Rows”[Column2]{0} together four times. The query returns the same number as the previous query. However this query takes 20 seconds to run! Why?

Read on for the explanation.

Optimize For Sequential Key

Pam Lahoud explains the context behind a new option you can add to indexes in SQL Server 2019 CTP 3.1 and later:

With last page insert contention, as the number of insert threads increases, the queue for the page latch increases which in turn increases latency. Throughput will also decrease, but if something slows down one of the threads that is holding the latch, this can trigger a convoy and throughput suddenly falls off a cliff. This typically happens when a page fills up and a new page must be added to the index (also known as a page split). The insert that triggers the new page will naturally have to hold the latch for longer than normal while the new page operation completes. This causes the queue to build up behind the latch. Adding a new page also requires an exclusive latch on the parent page, which can cause latch requests to queue at that level as well. At this point, throughput falls off a cliff.

OPTIMIZE_FOR_SEQUENTIAL_KEY aims to do two things – control the rate at which new threads are allowed to request the latch, and favor threads that are likely to keep the throughput high. These techniques will not prevent the contention or reduce latency, but they will help keep throughput consistent as concurrency increases.

Read on to learn more about the process and by implication some places where this won’t work well at all for you.

Merge Replication on Linux

Jignesh Raiyani shows how you can set up merge replication on SQL Server on Linux:

In this article, we will discuss to deploy SQL Server Merge Replication on Linux environment. Many SQL Server features are not available on Linux by Microsoft but the Replication feature exists for Linux Environment Edition. Before starting anything, let’s address this question – what is Merge Replication?

Merge Replication is a data synchronization process with one database (Publisher) to other databases (Subscriber) and vice versa. The data synchronization audit will be controlled by the distributor (Distribution database). The distributor database will manage the data synchronization between Publisher and Subscribers.

Merge replication is also an unending nightmare of pain, but you do what you gotta do.

Categories

August 2019
MTWTFSS
« Jul Sep »
 1234
567891011
12131415161718
19202122232425
262728293031