2020-02-27 – Curated SQL

Loading Data into Delta Lake

Published 2020-02-27 by Kevin Feasel

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.

This does look interesting.

Comments closed

How Apache Beam Runs on Top of Apache Flink

Published 2020-02-27 by Kevin Feasel

Maximilian Michels and Markos Sfikas explain why you might want to combine Apache Beam with Apache Flink:

Apache Flink and Apache Beam are open-source frameworks for parallel, distributed data processing at scale. Unlike Flink, Beam does not come with a full-blown execution engine of its own but plugs into other execution engines, such as Apache Flink, Apache Spark, or Google Cloud Dataflow. In this blog post we discuss the reasons to use Flink together with Beam for your batch and stream processing needs. We also take a closer look at how Beam works with Flink to provide an idea of the technical aspects of running Beam pipelines with Flink. We hope you find some useful information on how and why the two frameworks can be utilized in combination. For more information, you can refer to the corresponding documentation on the Beam website or contact the community through the Beam mailing list.

Read on for the full story. If you’re so inclined, you can also check out the full talk as a video.

Comments closed

A Metadata-Driven Framework for ADF Pipelines

Published 2020-02-27 by Kevin Feasel

Paul Andrew has started a series on metadata-driven Azure Data Factory pipelines:

The concept of having a processing framework to manage our Data Platform solutions isn’t a new one. However, overtime changes in the technology we use means the way we now deliver this orchestration has to change as well, especially in Azure. On that basis and using my favourite Azure orchestration service; Azure Data Factory (ADF) I’ve created an alpha metadata driven framework that could be used to execute all our platform processes. Furthermore, at various community events I’ve talked about bootstrapping solutions with Azure Data Factory so now as a technical exercise I’ve rolled my own simple processing framework. Mainly, to understand how easily we can make it with the latest cloud tools and fully exploiting just how dynamic you can get a set of generational pipelines.

This first post lays out some of the architectural decisions and prep work needed for the series.

Comments closed

When the Optimizer Can Use Batch Mode on Row Store

Published 2020-02-27 by Kevin Feasel

Erik Darling looks at some internals for us:

Things like Accelerated Database Recovery, Optimize For Sequential Key, and In-Memory Tempdb Metadata are cool, but they’re server tuning. I love’em, but they’re more helpful for tuning an entire workload than a specific query.
The thing with BMOR is that it’s not just one thing. Getting Batch Mode also allows Adaptive Joins and Memory Grant Feedback to kick in.
But they’re all separate heuristics.

Read on to see the extended events around batch mode to help you determine if it’s possible for the optimizer to use it for a given query.

Comments closed

Creating an Elastic Jobs Agent

Published 2020-02-27 by Kevin Feasel

Kate Smith continues a series on elastic jobs in Azure SQL Database:

Having laid the conceptual groundwork for Elastic Jobs in two previous postings (1, 2), I am now going to create an elastic job and associated credentials using PowerShell. For this scenario, I have one or more databases with a table ‘T’ and statistics ‘tStats’. I want to enforce an update for these statistics every day. To do this, I need to check that my stats have been updated in the past day, and if not, update them. The T-SQL to update statistics on a table “T” with stats named “tStats” is simple:

Click through for the Powershell script.

Comments closed

Understanding SOS_SCHEDULER_YIELD

Published 2020-02-27 by Kevin Feasel

David Fowler takes us through the SOS_SCHEDULER_YIELD wait type:

I decided to write this off the back of a conversation I was having the other day around the SOS_SCHEDULER_YIELD wait type.
The conversation went something along the lines of “but David, I’m seeing SOS_SCHEDULER_YIELD, we must have CPU issues”.
Yes this particular customer had been CPU bound recently but was that really their problem now, what is SOS_SCHEDULER_YIELD really mean?

It’s a good write-up of when it is and is not a problem.

Comments closed

M	T	W	T	F	S	S
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29

Day: February 27, 2020

Loading Data into Delta Lake

How Apache Beam Runs on Top of Apache Flink

A Metadata-Driven Framework for ADF Pipelines

When the Optimizer Can Use Batch Mode on Row Store

Creating an Elastic Jobs Agent

Understanding SOS_SCHEDULER_YIELD