Press "Enter" to skip to content

Curated SQL Posts

Databricks UDF Performance Testing

Tristan Robinson shares some performance comps for different Azure Databricks scenarios:

I’ve recently been spending quite a bit of time on the Azure Databricks platform, and while learning decided it was worth using it to experiment with some common data warehousing tasks in the form of data cleansing. As Databricks provides us with a platform to run a Spark environment on, it offers options to use cross-platform APIs that allow us to write code in Scala, Python, R, and SQL within the same notebook. As with most things in life, not everything is equal and there are potential differences in performance between them. In this blog, I will explain the tests I produced with the aim of outlining best practice for Databricks implementations for UDFs of this nature.

Scala is the native language for Spark – and without going into too much detail here, it will compile down faster to the JVM for processing. Under the hood, Python on the other hand provides a wrapper around the code but in reality is a Scala program telling the cluster what to do, and being transformed by Scala code. Converting these objects into a form Python can read is called serialisation / deserialisation, and its expensive, especially over time and across a distributed dataset. This most expensive scenario occurs through UDFs (functions) – the runtime process for which can be seen below. The overhead here is in (4) and (5) to read the data and write into JVM memory.

Click through for the results.  Looks like Python barely beat out Scala for the #1 position, but Scala was a little faster than Python in-class (e.g., the Scala program with a Scala SQL UDF was a little bit faster than the Python equivalent).

Comments closed

Working With Dates And Times In T-SQL

Tomaz Kastrun walks us through various functions to work with dates and times in T-SQL:

Manipulating date and time in T-SQL is a daily and very common task that every DBA, SQL Developer, BI Developer and data scientist will come across. And over the years, I have accumulated many of the simple date or/and time manipulation combinations of different functions, that it is time, to put them together.

Don’t expect to find here anything you haven’t used or seen – especially, if you are a long time T-SQL developer. The point is to have a post, that will have a lot of examples on date and time manipulation on one place. And by no means, this is not the definite list, but should be quite substantial and the code on Github repository will be update.

The list will be updated on my Github, and therefore this blogpost might not include all. In all of the following examples I will be using function GETDATE() to get the current datetime, unless the examples will have stored dates. Therefore, some of the examples or screen-prints will be different from yours.

This mostly focuses on the DATETIME type rather than DATETIME2 or DATE, but there are a few TIME uses.  Check out Tomaz’s repo for more.

Comments closed

SSMS 17.9 Released

Alan Yu announces a new version of SQL Server Management Studio:

SSMS 17.9 provides support for almost all feature areas on SQL Server 2008 through the latest SQL Server 2017, which is now generally available.

In addition to enhancements and bug fixes, SSMS 17.9 comes with several new features:

  • ShowPlan improvements
  • Azure SQL support for vCore SKUs
  • Bug Fixes

View the Release Notes for more information.

It looks like the big push for this release was bug fixes, and there are quite a few of them.

Comments closed

Quartiles In DAX

Dustin Ryan shows us how to calculate quartiles using DAX:

To calculate the quartile, we’re going to use the PERCENTILEX.INC DAX function. The PERCENTILEX.INC function returns the number at the specified percentile. So for example, if I had numbers 0 and 100 in my data set, the 25th percentile value would be 25. The 50th percentile value would be 50 and the 75th percentile value would be 75, and you can figure out what the 100th percentile value would be.

Dustin shares an example with his NFL data set and also walks us through a couple of tricky situations.

Comments closed

SQL On Linux AD: Group Membership Issues

Dylan Gray and Tejas Shah continue their series on troubleshooting issues when connecting to SQL Server on Linux via an Active Directory account:

Imagine a scenario where after a successful AD login and running a couple queries, some users may see the error “Could not obtain information about Windows NT group/user ‘CONTOSO\user’.” This is due to a failure when searching for group memberships for the logged in user and can be easily fixed.

When a user logs in, their group memberships are looked up and used to determine if they have the privileges to login. Once the user is connected SQL Server must validate their group memberships in many scenarios, to make sure their effective access permissions have not changed. For example, if user CONTOSO\user1 was a member of CONTOSO\group1, and CONTOSO\group1 has login permission for the SQL Server instance, then CONTOSO\user1 can login successfully. However, if after CONTOSO\user1 logs in, they are removed from CONTOSO\group1 by a domain admin, then their access to SQL Server should be revoked.

Click through to see what SQL Server on Linux uses to check AD group information and what you can do if there’s a problem.

Comments closed

New dbatools Release

Chrissy LeMaire announces new changes to dbatools:

Marching onward to dbatools 1.0, a ton of commands have been renamed to align with our now mature naming scheme. These changes were made in today’s release, version 0.9.410 aka regularlegs, now available on GitHub and the PowerShell Gallery.

Here’s the general idea:

  • DbaDatabase has mostly been renamed to DbaDb with a couple exceptions

  • DbaSql has been changed to just Dba, no exceptions. Think of Dba as “Sql”.

  • DbaConfig has been renamed to DbatoolsConfig

  • TempDbConfiguration has been renamed to TempdbConfig

  • All Configuration commands are Config except SpConfigure

  • DbaDacpac has been renamed to DbaDacPackage. Dac is the prefix for our data-tier application commands.

  • DbaDbQueryStoreOptions has been renamed to DbaDbQueryStoreOption

Some of this is in preparation for breaking changes in dbatools 1.0.  There’s a lot of stuff in this release, so check it out.

Comments closed

Migrating Azure SQL Databases Between Resource Groups

Arun Sirpal shows us a method for moving Azure SQL Databases between resource groups:

In my mind there are a couple of ways to move a database across resource groups. They vary from scripting to just using the Azure portal. I am going to use the Azure portal and do the following.

  1. Export a database in resource group X to a storage account Z.
  2. Import the file from the storage account Z into a database that is in resource group Y.

It’s just like a “backup and restore” strategy, all with the assumption that you are working within the same subscription ID.

Read on for a step-by-step demonstration on how to do this.

Comments closed

Animating With gganimate

The folks at Jumping Rivers show how to use gganimate to, in this case, track goal differences for Premier League teams:

Now we can see not only when Arsenal picked up points, but when they dropped points as well. For example, on the 27th of August, they got beat by 4 goals as their goal difference shifted from 0 to -4.

We’re not done there! For the gif, we want to be able to display the current status of the team on each day i.e. Champions League (4th or above), Europa League (5th – 7th), Top Half (8th – 10th), Bottom Half (11th – 17th) or Relgations Zone (18th or below). To do this, on each day, we first need to retrieve the order of each team based on their points and goal difference

Click through to see the example.

Comments closed

Why Graph Text Should Be Horizontal

Stephanie Evergreen explains why you don’t want diagonal or vertical text in your charts:

In languages based on the Latin alphabet, we read horizontally, from left to right. Reading on a diagonal produces cramped necks. Reading vertical text is just not going to happen. So, as much as possible, the text in our graphs should be horizontal.

Let’s walk through a demo. I was trying to combat my sense of hopelessness about the world by exploring the latest dataviz related to the Sustainable Development Goals (don’t we all?). The vast majority of their viz is pretty awesome, especially given how complicated the data can be. I saw this graph about how few developing countries have representation on international development councils.

Watch as Stephanie starts with a column chart with diagonal (and cut-off) text and converts it to a visual which is much easier to read.

Comments closed

Range Locks On Multi-Table Indexed Views

Erik Darling looks at the kinds of locks taken when updating an indexed view:

So what causes Range Locks? Just ask Sunil. He knows everything (this assumes the serializable isolation level):

Equality Predicate

If the key value exists, then the range lock is only taken if the index is non-unique. In the non-unique index case, the ‘range’ lock is taken on the requested key and on the ‘next’ key.

If the ‘next’ key does not exist, then a range lock is taken on the ‘infinity’ value. If the index is unique then a regular S lock on the key.

If the key does not exist, then the ‘range’ lock is taken on the ‘next’ key both for unique and non-unique index.

If the ‘next’ key does not exist, then a range lock is taken on the ‘infinity’ value.

Range Predicate (key between the two values)

‘range lock on all the key values in the range when using ‘between’

‘range’ lock on the ‘next’ key that is outside the range. This is true both for unique and non-unique indexes. This is to ensure that no row can be inserted between the requested key and the one after that. If the ‘next’ key does not exist, then a range lock is taken on the ‘infinity’ value.

Erik has an interesting example and lets us see a potential concurrency problem with multi-table indexed views.

Comments closed