2021-01-28 – Curated SQL

Joins in Synapse Analytics Spark

Published 2021-01-28 by Kevin Feasel

This is a bit of a longer one: a look at how to do all the different joins. The exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi, oooooooh).

Click through for lots of examples.
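
For readers who have not met the two extra join types, here is a minimal sketch of their semantics in plain Python rather than Spark. The tables, column names, and helper functions are made up for illustration and are not taken from the linked post. In PySpark, the same behavior should be available by passing "left_semi" or "left_anti" as the join type to DataFrame.join; in T-SQL, the usual way to get it is EXISTS or NOT EXISTS.

```python
# Plain-Python sketch of semi and anti semi join semantics.
# The rows and column names below are hypothetical illustration data.

customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cy"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]


def left_semi_join(left, right, key):
    """Keep left rows that have at least one match on the right.

    Only left-side columns come back, and a left row appears once
    even when several right rows match it.
    """
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def left_anti_join(left, right, key):
    """Keep left rows that have no match on the right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


semi = left_semi_join(customers, orders, "customer_id")
anti = left_anti_join(customers, orders, "customer_id")

print([row["name"] for row in semi])  # ['Ann', 'Cy']
print([row["name"] for row in anti])  # ['Bob']
```

Note that Ann has two orders but shows up once in the semi join result, which is the main difference from an inner join.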


Rule 42 Software

Published 2021-01-28 by Kevin Feasel

John Mount describes a software development anti-pattern:

As software changes, it often accretes features and drifts away from its design, if it even started with one, and many defaults and settings become undesirable. New users are blamed for not moving parameter settings away from the defaults to the “obvious” acceptable values.

Click through for the origin of the name and more info on how to avoid it.


Parquet 1.x Footer Content

Published 2021-01-28 by Kevin Feasel

Dmitry Tolpeko shows us what the footer of a Parquet 1.x file looks like:

Every Parquet file has a footer that contains metadata: schema, row groups and column statistics. The footer is located at the end of the file.
A Parquet file's content starts and ends with the 4-byte PAR1 “magic” string. Right before the ending PAR1 there is a 4-byte footer length (little-endian encoding):

Click through for more details, as well as one downside to Parquet 1.x.
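
The layout described in the quote can be checked with a few lines of Python. This is a minimal sketch, not Dmitry's code: the function names are hypothetical, the input is a synthetic byte string standing in for a real file, and the footer metadata itself is sliced out but not decoded.

```python
import struct

MAGIC = b"PAR1"


def read_footer_length(data: bytes) -> int:
    """Return the footer length stored just before the trailing PAR1."""
    # Smallest possible layout: leading magic + length field + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at start or end")
    # The 4 bytes before the trailing magic hold the footer length,
    # encoded as a little-endian unsigned integer.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length is larger than the file allows")
    return footer_len


def footer_bytes(data: bytes) -> bytes:
    """Slice out the raw footer, which sits right before the length field."""
    footer_len = read_footer_length(data)
    end = len(data) - 8
    return data[end - footer_len:end]


# Synthetic stand-in for a file: magic, some data, footer, length, magic.
fake_footer = b"FAKE-METADATA"
fake_file = (
    MAGIC
    + b"row-group-bytes"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)

print(read_footer_length(fake_file))  # 13
print(footer_bytes(fake_file))        # b'FAKE-METADATA'
```

For a real file, the same approach would work by seeking to the last 8 bytes instead of loading everything into memory.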


Automating Data Comparison using Biml

Published 2021-01-28 by Kevin Feasel

Ben Weissman gives us an example of running data comparison using Biml:

When you build an ETL Solution, at some point you will most likely feel the need to compare the data between your source and your staging (or data warehouse) database. There may be various reasons for them to be out of sync, such as delta loads, aggregations or added business logic, but one day your phone will ring and whoever is on the other end will tell you that the numbers are wrong. While this doesn’t necessarily have to be an issue within your ETL process, it might just as well be the report itself. In many cases, this is a good starting point to look at.
This article (formerly published on the PASS Community Blog) focusses on SSIS as your orchestrator, but the same principles could, obviously, also be applied to Azure Data Factory, for example. Also, as we want to solve this task in as lightweight a way as possible, we will, demonstrably, use Biml to implement it!

Click through to see how.


One Chart at a Time

Published 2021-01-28 by Kevin Feasel

Meagan Longoria has contributed to a series:

Jon Schwabish over at PolicyViz has created a great initiative called the One Chart at a Time Video Series. It’s an effort to expand readers’ graphic literacy through short videos explaining how to read and use different charts. Each video is from a different person in the data visualization industry. Participants include people I admire such as Andy Kirk, Ben Jones, and Cole Nussbaumer Knaflic. Jon releases a new video each weekday. The initiative started January 11 and will continue through mid-March.

Click through for Meagan’s contribution on waterfall charts.


Using Hints Instead of Trace Flags

Published 2021-01-28 by Kevin Feasel

Taiob Ali has a set of hints to use instead of setting trace flags:

The USE HINT option was introduced in SQL Server 2016 (13.x) SP1 and Azure SQL Database. At the time of writing this article, there are 23 of those hints. You can see an up-to-date list by querying the dynamic management view sys.dm_exec_valid_use_hints.
With the introduction of this feature, we can replace some of the trace flags with hints. I personally like hints over trace flags (if I have to choose between two devils). That way, someone reading the code has some idea what the hint is about instead of having to remember the trace flag numbers.

I agree with Taiob on that sentiment.
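
To make the readability argument concrete, here is a small Python sketch that turns trace flag numbers into the equivalent USE HINT clause. The mapping covers four pairings as I understand them from Microsoft's documentation; treat it as an assumption and check the names against sys.dm_exec_valid_use_hints on your own instance. The helper name and the dbo.Orders query are hypothetical, not from Taiob's post.

```python
# Assumed trace flag to USE HINT pairings; verify before relying on them.
TRACE_FLAG_TO_HINT = {
    9481: "FORCE_LEGACY_CARDINALITY_ESTIMATION",
    4199: "ENABLE_QUERY_OPTIMIZER_HOTFIXES",
    4136: "DISABLE_PARAMETER_SNIFFING",
    4138: "DISABLE_OPTIMIZER_ROWGOAL",
}


def use_hint_clause(*trace_flags: int) -> str:
    """Build an OPTION (USE HINT (...)) clause for the given trace flags.

    Raises KeyError for a trace flag that is not in the mapping above.
    """
    hints = [TRACE_FLAG_TO_HINT[flag] for flag in trace_flags]
    quoted = ", ".join(f"'{hint}'" for hint in hints)
    return f"OPTION (USE HINT ({quoted}))"


# Hypothetical query, shown both ways for comparison.
query = "SELECT COUNT(*) FROM dbo.Orders WHERE CustomerID = 42"

print(f"{query}\nOPTION (QUERYTRACEON 9481);")
print(f"{query}\n{use_hint_clause(9481)};")
```

The second statement says what it does on its face, while the first requires looking up what 9481 means.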


The Performance of sys.partitions

Published 2021-01-28 by Kevin Feasel

Aaron Bertrand tries to disentangle a query:

This question was posted to #sqlhelp by Jake Manske, and it was brought to my attention by Erik Darling.
I don’t recall ever having a performance issue with sys.partitions. My initial thought (echoed by Joey D’Antoni) was that a filter on the data_compression column should avoid the redundant scan, and reduce query runtime by about half. However, this predicate doesn’t get pushed down, and the reason why takes a bit of unpacking.

Click through for the results of Aaron’s research.


Threadpool Waits

Published 2021-01-28 by Kevin Feasel

Chad Callihan recommends not messing with worker threads:

Our story begins on a test server hosting a couple hundred databases. Over time, the server kept getting slower and slower when trying to navigate in SSMS. Query windows would take too much time to load, logging in would lag a bit, etc. When investigating with sp_Blitz, I found that the server had been experiencing Threadpool waits.

There are some cases where increasing thread count is important, especially when you’re working with database mirroring or availability groups. I worked with a customer with thousands of mirrored databases per server. None of the databases were particularly large or heavily-used, so it was on properly-sized hardware. As a result, to prevent the server from falling over due to threadpool waits, we had to scale thread counts to scary-high levels.



Day: January 28, 2021