Author: Kevin Feasel

Sort-Based Blocking Shuffle in Flink

Published 2021-11-05 by Kevin Feasel

Yingjie Cao and Daisy Tsang have a multi-part series on sort-based blocking shuffles in Apache Flink. Part 1 acts as an overview:

The hash-based blocking shuffle has been supported in Flink for a long time. However, compared to the sort-based approach, it can have several weaknesses:
1. Stability: For batch jobs with high parallelism (tens of thousands of subtasks), the hash-based approach opens many files concurrently while writing or reading data, which can give high pressure to the file system (i.e. maintenance of too many file metas, exhaustion of inodes or file descriptors). We have encountered many stability issues when running large-scale batch jobs via the hash-based blocking shuffle.
2. Performance: For large-scale batch jobs, the hash-based approach can produce too many small files: for each data shuffle (or connection), the number of output files is (producer parallelism) * (consumer parallelism) and the average size of each file is (shuffle data size) / (number of files). The random IO caused by writing/reading these fragmented files can influence the shuffle performance a lot, especially on spinning disks. See the benchmark results section for more information.
By introducing the sort-based blocking shuffle implementation, fewer data files will be created and opened, and more sequential reads are done. As a result, better stability and performance can be achieved.

Part 2 provides some design considerations:

As discussed above, the hash-based blocking shuffle would produce too many small files for large-scale batch jobs. Producing fewer files can help to improve both stability and performance. The sort-merge approach has been widely adopted to solve this problem. By first writing to the in-memory buffer and then sorting and spilling the data into a file after the in-memory buffer is full, the number of output files can be reduced, which becomes (total data size) / (in-memory buffer size). Then by merging the produced files together, the number of files can be further reduced and larger data blocks can provide better sequential reads.
Flink’s sort-based blocking shuffle adopts a similar logic. A core difference is that data spilling will always append data to the same file so only one file will be spilled for each output, which means fewer files are produced.

Check it out for a behind-the-scenes look at

Comments closed

Azure Synapse Data Explorer Pools

Published 2021-11-05 by Kevin Feasel

Manoj Raheja tries announces another pool type:

At Ignite, we announced the public preview of Azure Synapse data explorer that makes it possible to query huge amounts of structured, semi-structured, and free-text telemetry and time-series data. The following are some of the key capabilities that make this possible:
– Powerful distributed query engine that indexes all data including free text and semi-structured data. The data is automatically compressed, indexed, auto-optimized, and cached on local SSDs and persisted on storage. Compute and storage are decoupled that gives you full elasticity to auto scale in/out without a downtime.
– Intuitive Kusto Query Language (KQL) that is highly optimized for exploring raw telemetry and time series data using Synapse data explore’s best-in-class text indexing for efficient free-text search, regex, and parsing on traces\text data.
– Comprehensive JSON parsing capabilities for querying semi-structured data including arrays and nested structure.
– Native, advanced time series support for creation, manipulation, and analysis of multiple time series with in-engine Python and R execution support for model scoring.

Click through for a demonstration, showing that this is for more than just logs.

Comments closed

Restoring a Database from Backup in Docker

Published 2021-11-05 by Kevin Feasel

Chad Baldwin has a container in search of a database:

Yesterday, I was watching a Pluralsight course which provided a database .bak file to follow along with the examples. I generally like to use Docker when working with SQL Server locally…but as a somewhat novice user, I have found it to be a bit of a pain if you need to deal with restoring or attaching a database.
When I run into these scenarios, I usually spin up an AWS EC2 instance, install SQL server, and work with it that way. There’s probably a simpler way to do it using RDS or Azure, but I’m not familiar with those just yet. The other option is if I have a Linux machine at hand, I will use that with Docker and mapped volumes work great.
I do happen to have a Linux machine ready to use…but I was determined to figure out how to get this working on Windows.

Bonus points for using RESTORE DATABASE syntax. Every SQL Server user should know how to back up and restore a database using only T-SQL. That’s a skill which will definitely pay dividends.

Comments closed

Testing Azure SQL MI Premium

Published 2021-11-05 by Kevin Feasel

Joe Obbish reaches for the top shelf:

At Microsoft Ignite 2021, public preview for new “premium-series” hardware was announced for Azure SQL Managed Instances. There’s even a black friday sort of sale during this month where you can do testing on premium-series VMs without paying for the compute costs. As someone without free cloud bucks: sign me up!
I did some basic query benchmarking to get an idea of the performance difference between the new premium VMs and the standard gen 5 VMs. The test VMs aren’t identical in specs: the standard-series has 4 vCore with 20.4 GB of memory and the premium-series has 8 vCore with 56 GB of memory. I will attempt to call out any situations where that spec difference had a measurable impact.

Read on for Joe’s findings.

Comments closed

Combining Multiple Header Rows with Power Query

Published 2021-11-05 by Kevin Feasel

Soheil Bakhshi has (sometimes) three header rows for the price of one:

Easy!
Hmm… Not really. Especially when we have multiple other Excel files; some have two, some have three rows of comments, and some have even more. So not only is the task time-consuming, but it also is pretty boring and prone to human errors. What if there is a custom function that I can invoke to get the job done?

Click through for one method, followed by the proper method.

Comments closed

The Cause of Resource Semaphore Waits

Published 2021-11-05 by Kevin Feasel

Chad Callihan has an analogy for us:

Have you ever spent Black Friday shopping, filled up your car, and then ran out of space at the end for a big purchase? Your vehicle is already full but that oversized exercise equipment is too big of a deal to pass up! You’re going to have to wait until you can unload at home first before there’s room for that new clothes rack…um…I mean exercise equipment.
That’s kind of the same idea as RESOURCE_SEMAPHORE waits. SQL Server has a large query (that exercise equipment) but not enough memory to execute (purchase).

Read on for Chad’s explanation of how you can deal with RESOURCE_SEMAPHORE waits, but as someone who drives a Miata, of course I have stories. My wife and I went to the mall one day, a few years back, and we bought a space heater for the room above our garage. Well, that thing was just a little too big for the trunk and so, on a 40-something degree day (approximately -15,000 in Celsius, I’m pretty sure), here we are driving down the highway in a Miata with the top down and my wife’s arms wrapped around this box (and mind you, the box was bigger around than she is, so those arms don’t quite get all the way around that box) sitting in the passenger’s seat. Good times.

Now, in fairness to that car, you can fit a lot of stuff in a Miata trunk. Another time, we’d bought a large area rug for our dining room, as well as a lengthy floor runner. When we made the purchase, I thought it was going to be delivered, but nope. I did learn that day, however, just how compressible a rug is, as they proceeded to bind that thing so together that it fit comfortably into the trunk, where by “comfortably” I mean “barely but I’m trying to sell up how large this trunk is.” As for the floor runner, we had to cram it into the space between the headrests on our seats and the raised top, where it just barely fit. Sure, I couldn’t see behind me, but winners never look back.

Comments closed

Troubleshooting with sp_HumanEvents

Published 2021-11-04 by Kevin Feasel

Erik Darling shows off sp_HumanEvents:

With shorter procs you can probably just collect actual execution plans and slam F5 like a tall glass of gin at 6am.
But you don’t wanna do that with the larger procs, for a few practical reasons:
– Lots of little queries run quickly, and we don’t care about those
– Navigating through lots of plans in SSMS is tedious
– There’s no differentiation when other procedures, etc. are invoked
– You introduce a lot of overhead retrieving and rendering all those plans
– The full query text might not be captured, which is a limitation in many places
Let’s save the day with sp_HumanEvents, my stored procedure that makes using Extended Events really easy.

Read on to see how this all works.

Comments closed

Wildcards and Data Type Precedence

Published 2021-11-04 by Kevin Feasel

Daniel Hutmacher has fun with implicit conversion:

Implicit conversions in SQL Server follow a specific, predictable order, called data type precedence. This means that if you compare or add/concatenate two values, a and b, with different data types, you can predict which one will be implicitly converted to the data type of the other one in order to be able to complete the operation.
I stumbled on an interesting exception to this rule the other day.

Click through for an example of date searching with LIKE.

Comments closed

Calculating the Adaptive Join Threshold

Published 2021-11-04 by Kevin Feasel

Paul White breaks out the math books:

One thing I want you to bear in mind throughout this piece is an adaptive join always starts executing as a batch mode hash join. This is true even if the execution plan indicates the adaptive join expects to run as a row mode apply.
Like any hash join, an adaptive join reads all rows available on its build input and copies the required data into a hash table. The batch mode flavour of hash join stores these rows in an optimized format, and partitions them using one or more hash functions. Once the build input has been consumed, the hash table is fully populated and partitioned, ready for the hash join to start checking probe-side rows for matches.
This is the point where an adaptive join makes the decision to proceed with the batch mode hash join or to transition to a row mode apply. If the number of rows in the hash table is less than the threshold value, the join switches to an apply; otherwise, the join continues as a hash join by starting to read rows from the probe input.

Read the whole thing and learn more about the cutoffs for adaptive joins.

Comments closed

Ignite Announcements

Published 2021-11-04 by Kevin Feasel

James Serra has a round-up of Ignite 2022 announcements:

Azure Managed Instance for Apache Cassandra: Is now GA. Cassandra is an open source, column family store NoSQL database. The Azure Cassandra service includes an automatic synchronization feature that can sync data between with customers’ own Cassandra instances, on-premises and elsewhere. More info

Wolfgang Strasser has some thoughts as well on what Ignite has shown us so far:

As you might have noticed, Azure Purview is one of my newest friends in Azure Data town. During Ignite, the support for Amazon RDS (Relational Database Service), the Data Lake Data Asset Access Governance, and Microsoft Defender for Cloud Integration with Azure Purview was announced.
What I really look forward to test is the Data Asset Access Governance for Data Lake storages. Imagine a world that allows you to define permission on a central place and those permissions are brought to a storage account / system of your choice..

Read both of them for two different perspectives.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30