Curated SQL – Page 653 – A Fine Slice Of SQL Server

Finding Duplicates in Type 2 SCDs

Published 2022-05-31 by Kevin Feasel

Dinesh Asanka wants to verify some Type 2 slowly changing dimension results:

As we discussed in a previous article, Implementing Slowly Changing Dimensions (SCDs) in Data Warehouses, there are three main types of slowly changing dimensions, such as Type 1, Type 2, and Type 3. Out of these Type 1 is the simple dimension where you will simply maintain only the latest version of the attribute. For example, if the employee got promoted to Senior Software Engineer from Software Engineer, you will simply overwrite the existing value to the new value so that the historical aspect is lost.
Type 2 Slowly Changing Dimensions are used to track historical data in a data warehouse. This is the most common approach in dimension. This article uses a sample database of AdventureworksDW which is the sample database for the data warehouse.

Click through for one way to compare, one which you could build using dynamic SQL.

Comments closed

When to Use a Map Visual

Published 2022-05-30 by Kevin Feasel

Mick Cisneros explains when to use map visuals:

That ubiquity has given all of us an increased familiarity with maps, as well as a deeper affinity for them. (Probably a dependence as well!) It’s natural, then, to want to use a map to visualize data that has a geographic dimension. Why not, right? There is an obvious upside: audiences are drawn to the way they look, as it’s a more memorable image than the same old bar chart or line graph. Not to mention: it’s fun to make maps!
The problem is that maps look interesting, but their very nature limits our options for visualizing data within them. Per a recent paper by Franconeri, Padilla, Shaw, et. al., here are a couple of the comparisons that people are very good at making, perceptually:

Read on for a comparison of good map versus bad map. Just because something has a geographical component doesn’t mean you should map it.

Comments closed

Model Deployment Options in Azure

Published 2022-05-30 by Kevin Feasel

Tori Tompkins enumerates ways to deploy machine learning models in Azure:

There are so many options to deploy models in Azure that is can get quite overwhelming. In this blog, we break down all the available options and consider the pros and cons of each tooling option.

Even with those, there are other approaches as well, like hosting Spark-based models in Azure Synapse Analytics, using SQL Server Machine Learning Services on an Azure SQL Managed Instance or VM running SQL Server, etc.

Comments closed

Obscure Changes in SQL Server 2022

Published 2022-05-30 by Kevin Feasel

Aaron Bertrand has a three-parter on obscure changes in SQL Server 2022. First up we have some new information:

You can get the marketing blitz from just about anywhere, and the What’s New documentation for the bigger hitters from the technical side.
But what about the changes that aren’t on the highlight reel at Build and aren’t getting all the attention from the media blitz? I’m a details person, so I get a lot of insight looking around at the little, non-headline-generating things that have changed. I’ve shown before how to sneak a peek under the hood, and I’m going to do it again today:

Then we have feature selection changes:

You will notice some changes in the Feature Selection screen. Some of the options have been consolidated; for example, you now get R, Python, and Java with MLS, instead of picking. One item has been added: SQL Server Extension for Azure. Unfortunately, this option is checked by default in the CTP 2.0 version of setup (click to enlarge):

Finally, we have execution plan changes:

As a former technical product manager for Plan Explorer, I can’t help but snoop around in what has changed in the XSD for showplan. Even though I am not the best person to actually analyze what those changes mean, and even though changes in XSD don’t necessarily reflect changes the engine can produce right now – these usually lay the groundwork for engine changes that will happen later.

There’s quite a lot of information available for those willing to dig, as Aaron shows.

Comments closed

Bounding Box Queries in Azure Data Explorer

Published 2022-05-30 by Kevin Feasel

David Giard draws boxes:

For our current project, we are capturing into ADX the location of vehicles over time. Our customer asked us to create a function that would return all vehicles that are within a given bounding box in a given time period. This is useful information when they want to know when a vehicle returns to a building, a neighborhood, or a city.
In this article, I will show how this can be accomplished using built-in functions, the limitations of those functions, and ways to overcome those limitations.

Read on for the naive approach as well as a very interesting one using S2 cells.

Comments closed

Thoughts on Contained Availability Groups

Published 2022-05-30 by Kevin Feasel

Allan Hirt is quite happy with contained Availability Groups:

All SQL Server availability features except Always On Failover Cluster Instances (FCIs) have a “problem”: when a secondary replica/warm standby/mirror (the term is different for each feature …) takes over as the new boss, some items such as SQL Server Agent jobs, instance level logins, etc., are not there. Going back to the early days of SQL Server when log shipping was not even in the product, this was always a manual process. There are multiple ways to approach this challenge and I am not going to detail them. This “problem” is a longstanding pain point with those who are responsible for managing SQL Server.
Contained AGs solve this issue by having their own master and msdb databases synchronized as part of the AG mechanism.

Click through to see how.

Comments closed

The Benefits of Parquet

Published 2022-05-30 by Kevin Feasel

Maria Zakourdaev explains why the Parquet file format is so useful:

Parquet files organize data in columns, while CSV files organize data in rows.

Columnar storage allows much better compression so Parquet data files need less storage, 1 TB of CSV files can be converted into 100GB of parquet files – which can be a huge money saver when cloud storage is used. This also means that scanning parquet file is much faster than scanning CSV files – fewer data would be scanned and there is no need to load unneeded columns into memory and aggregations will run faster. Parquet files contain both data and metadata, information about data schema and structure. When you load the file, having metadata helps the querying tool define proper data types.

Click through for an example of when Parquet makes sense. It’s not the best format for everything—it’s a columnar file format, so writes are typically slower than row-store formats like CSV or Avro—but it and ORC are outstanding for analytical processing, not least because of the metadata these formats contain.

Comments closed

IF Branches and Dynamic SQL

Published 2022-05-30 by Kevin Feasel

Erik Darling takes us through the scenic route:

I’m going to use the example from yesterday’s post to show you what you can do to further optimize queries like this.
To make the code fit in the post a little better, I’m going to skip the IF branch for the Posts table and go straight to Votes. Using dynamic SQL here will get you the same behavior at stored procedures, though.

Read on for more detail and a wrap-up of Erik’s series on conditional branching logic and performance tuning.

Comments closed

Applying Forecasts with Known Seasonal Behavior in Power BI

Published 2022-05-30 by Kevin Feasel

Imke Feldmann begins a series on seasonality in Power BI:

Please note, that the monthly variation I am using in this example is taken from an existing table and not derived by statistical methods like the Excel FORECAST.ETS.SEASONALITY function. Here, we simply apply an existing distribution to expected future sales, but I will cover forecasts using those statistical methods in upcoming blogposts.
Todays starting point is a request to calculate how many sales to expect until the end of the year and then distribute the expected sales according to a list of monthly %, which are all different, because there is an expected seasonality in the sales:

Click through to see how this all works.

Comments closed

OCR and Character Extraction with R

Published 2022-05-27 by Kevin Feasel

Benjamin Smith analyzes a text:

Since the text that I’m using has with two columns per page, the text will need to be cropped by columns before OCR is applied. Prior to that, the .pdf files will need to be converted to .png format.

Read on to see the code for the entire process, using the tidyverse, magick, and tesseract packages.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Curated SQL Posts