Press "Enter" to skip to content

Month: March 2021

Executing Parameterized Notebooks via Azure Data Studio

Julie Koesmarno takes us through three methods for executing parameterized notebooks in Azure Data Studio:

In Feb 2021 release, Azure Data Studio (v1.26.1) has added parameterized URI execution. See the “Preview of passing parameters through URI” section and the Parameterization of Notebooks in Azure Data Studio on Microsoft Docs.

So, in total there are three ways of executing parameterized notebook (from another notebook). Check out the demo files here:

Click through for the notebooks.

Comments closed

Getting Started with data.table

Gary Hutson has a primer on data.table:

This example uses the copy data frame we made and uses the organisation code by the type of attendances. I want to then summarise the mean admissions by type and organisation code.

Pivots can be implemented in data.table in the following way:

I’ve never been the biggest fan of the syntax for data.table but the performance is unquestionably there and that makes it worth learning. H/T R-bloggers.

Comments closed

Execution Plan Operator Timings

Paul White dives into a murky problem:

SQL Server 2014 SP2 and later produce runtime (“actual”) execution plans that can include elapsed time and CPU usage for each execution plan operator (see KB3170113 and this blog post by Pedro Lopes).

Interpreting these numbers is not always as straightforward as one might expect. There are important differences between row mode and batch mode execution, as well as tricky issues with row mode parallelism. SQL Server makes some timing adjustments in parallel plans to promote consistency, but they are not perfectly implemented. This can make it difficult to draw sound performance-tuning conclusions.

This article aims to help you understand where the timings come from in each case, and how they can be best interpreted in context.

My default advice is to read what Paul White writes, and this is no exception.

Comments closed

Getting the Stack Overflow Database for Demos

Brent Ozar has an update on getting the latest version of the Stack Overflow data set:

If you only have a limited amount of bandwidth, you don’t have to keep seeding the database after you get it – I’ve got it hosted on a handful of seedboxes around the world.

As with the original data dump, these are provided under cc-by-sa 4.0 license. That means you are free to share it and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not me):

Click through for those attribution details as well as links to get the demo database in various sizes.

Comments closed

The Unique Properties of DateTimeOffset

Rob Farley analyzes a special data type:

And as I have a unique index on this, it won’t let me insert 00:30 in UTC+11, because 00:00 in UTC+10:30 is already there. It tells me “Msg 2627, Level 14, State 1, Line 19. Violation of PRIMARY KEY constraint ‘pkTimesOffset’. Cannot insert duplicate key in object ‘dbo.TimesOffsets’. The duplicate key value is (2021-01-01 00:30:00.0000000 +11:00).”

My general rule is to store everything in SQL Server as UTC. If I did not do this, I would very strongly advocate for using DateTimeOffsets regardless of the extra data length. I’ve experienced the pain of mismatched date and time details one too many times for that.

Fun bonus fact: the same applies to .NET as well. If I control the system, I’m using DateTime.UtcNow for everything. If not, I’m leaning heavily toward DateTimeOffset by default. Again, too many times have I experienced that source system X has times marked in Pacific Standard Time pushing data to a server in Eastern Standard Time, and then mixing in a server based in Central Standard Time and having people confused because “the times are wrong.”

Comments closed

A Review of Numerical Data Types

Lina Kovacheva reviews SQL Server’s numerical data types:

SQL Server provides two dedicated data types for storing monetary values. You can think of MONEY and SMALLMONEY as DECIMAL – Money is effectively the same as DECIMAL(19,4) while SMALLMONEY is effectively the same as DECIMAL(10,4). If you are planning to use MONEY you should have in mind that performing division and multiplication can cause rounding errors that result in the unintentional loss of precision. The cause of the problem is that MONEY only saves information up to the 4th decimal place and if your multiplication or division results in an integer that goes to the 5th decimal place or more, MONEY will round it off, causing an accuracy error.

Click through for an analysis of these data types.

Comments closed

SQL Server Data Types: Bit vs the World

Kevin Chant is a fan of the bit type:

I decided to tweak it a bit for this post, to provide a humorous comparison between the bit data type and others that are available in SQL Server. In reality, this won’t cover every single one.

By the end of this post, you will some see pitfalls to using certain data types and some tips on how to avoid them. Plus, you will find out which data type is my least favourite to use.

Click through for a “haha-just-serious” take on an underappreciated datatype which nonetheless can’t decide if it’s a boolean or not.

Comments closed

Approximate Count Distinct

Deepthi Goguri looks at the APPROX_COUNT_DISTINCT() function in SQL Server 2019:

In the previous post, we learned about Scalar UDF Inliningfeature. In this post, let’s focus on the Approximate QP with Approx_Count_Distinct feature introduced as a part of SQL Server 2019 Intelligent Query Processing features. This feature is useful to get the approximate count of distinct values just like the  Count distinct function to get the distinct number of records but this new feature will take less amount of CPU and memory to process the query.

Click through for more info. The algorithm itself is also pretty interesting.

Comments closed

Spring Cleaning Shiny Projects

Mirai Solutions has some tips on cleaning up Shiny apps:

How to apply the spring cleaning principles and advanced programming to your Shiny App.

1. Deep breeze and allocate some time

Do not avoid spring cleaning simply because you don’t know where to start from. Prioritize some time for the task and get inspired by our following points.

Click through for advice on tools and processes to make this code easier to understand. H/T R-Bloggers

Comments closed

Hash Distributions, Shuffling, and Data Types

Reiss McSporran explains an issue in Azure Synapse Analytics dedicated SQL pools:

Oh dear. Shuffle. On the surface, shuffle is not necessarily a bad thing and may even be expected, but not in this scenario. So what is shuffle and why can it be bad?

Shuffle occurs when a part of a distributed table is moved to a different node during query execution. To do this a hash value is computed using the join columns, the node is then found that has that hash value and the row is then sent to that node for processing. If the tables you’re joining on have different hash distributions, you’d expect this to happen, (hence why it’s not always a bad thing!), but in this case I’ve used the same value to hash both of my tables, so what’s gone wrong?

Click through to see what’s gone wrong.

Comments closed