Press "Enter" to skip to content

Curated SQL Posts

Nested Loops, Hash, Or Merge: Which Is Best?

Grant Fritchey dodges the important questions:

The first response, also a joke, was the question in the title of this post:

What is the preferred operator when joining tables: Hash Match, Nested Loops or Merge?

While my immediate response to this question is, yes. Meaning, they’re all preferred, situationally. I decided to expand on that a bit.

I completely agree with Grant: there is no single best operator. If there were, database vendors wouldn’t offer multiple options. That said, in an ideal world, all joins would be merge joins; in our fallen world, nested loops and hash matches often turn out to be the better choices.
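If you want to see those situational preferences for yourself, you can force each physical join with a query hint and compare the resulting plans. This is just a sketch against hypothetical tables, and hints like these rarely belong in production code:

SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID
OPTION (MERGE JOIN);  -- swap in OPTION (LOOP JOIN) or OPTION (HASH JOIN) to compare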


COUNT And NULL Values

Kenneth Fisher digs into the COUNT() function and sees how it deals with NULL values:

Count the number of values:

SELECT COUNT(FieldName) FROM TableName;
-- or
SELECT COUNT(ALL FieldName) FROM TableName;

The ALL argument is the default and is unnecessary (I didn’t even know it existed until I started this post). Here you are counting the number of non-NULL values in FieldName, so in a column with (1, NULL, 1, 2, 3, NULL, 1) you’ll get a count of 5. You do get a nice warning (depending on your ANSI_WARNINGS setting) if there was a NULL value, though.
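A quick way to see this behavior, using the same values Kenneth describes:

DECLARE @t TABLE (FieldName INT);
INSERT INTO @t (FieldName) VALUES (1), (NULL), (1), (2), (3), (NULL), (1);

SELECT COUNT(FieldName) AS NonNullCount,           -- 5, plus the ANSI warning about NULLs
       COUNT(*) AS TotalRows,                      -- 7, counts rows regardless of NULLs
       COUNT(DISTINCT FieldName) AS DistinctNonNull -- 3
FROM @t;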

By the way, the ALL operator isn’t useful there, but it can be useful alongside its counterparts SOME and ANY.  I rarely keep them in mind, so I instead tend to write EXISTS and NOT EXISTS statements, which serve an equivalent function.
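For illustration, with hypothetical tables, these two queries ask the same question:

-- ANY (SOME is a synonym): is Price greater than at least one competitor price?
SELECT p.ProductID
FROM dbo.Products AS p
WHERE p.Price > ANY (SELECT c.Price FROM dbo.CompetitorPrices AS c);

-- The equivalent EXISTS formulation:
SELECT p.ProductID
FROM dbo.Products AS p
WHERE EXISTS (SELECT 1 FROM dbo.CompetitorPrices AS c WHERE p.Price > c.Price);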


New Features In Public Preview On Azure SQL Database

Microsoft has a round of announcements for public previews on Azure SQL Database.  First up is Kevin Farlee announcing approximate count distinct:

The new APPROX_COUNT_DISTINCT aggregate function returns the approximate number of unique non-null values in a group.

This function is designed for use in big data scenarios and is optimized for the following conditions:

  • Access of data sets that are millions of rows or higher AND
  • Aggregation of a column or columns that have a large number of distinct values

Assuming these conditions, the accuracy will be within 2% of the precise result for a majority of workloads.

I’m liking this change.  Sometimes I simply need an approximate number, but I want it fast.
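The function is a drop-in replacement for COUNT(DISTINCT), syntactically at least; table and column names here are hypothetical:

-- Exact, but memory-hungry when there are many distinct values:
SELECT COUNT(DISTINCT UserId) AS ExactCount FROM dbo.PageViews;

-- Approximate (within about 2% for most workloads) with a much smaller memory footprint:
SELECT APPROX_COUNT_DISTINCT(UserId) AS ApproxCount FROM dbo.PageViews;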

Shreya Verma announces MATCH support in the MERGE operator:

We will be further expanding the graph database capabilities with several new features. In this blog we will discuss one of these features that is now available for public preview in Azure SQL Database, MATCH support in MERGE DML for graph tables.

The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in a target table based on differences between the target table and the source table. Using MATCH predicates in a MERGE statement is now supported on Azure SQL Database. That is, it is now possible to merge your current graph data (node or edge tables) with new data using the MATCH predicates to specify graph relationships in a single statement, instead of separate INSERT/UPDATE/DELETE statements.

I’ll use that approximately the day they fix all of the bugs with the MERGE operator.
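For anyone braver than I am, the shape of the new syntax looks roughly like this; the Person and Restaurant node tables, the likes edge table, and their columns are all hypothetical, so treat it as a sketch rather than tested code:

DECLARE @PersonName NVARCHAR(100) = N'Alice',
        @RestaurantName NVARCHAR(100) = N'Taco Casa',
        @Rating INT = 5;

-- Insert the edge if it does not exist; update its attribute if it does.
MERGE INTO likes
USING ((SELECT @PersonName AS PersonName, @RestaurantName AS RestaurantName) AS T
        INNER JOIN Person ON Person.name = T.PersonName
        INNER JOIN Restaurant ON Restaurant.name = T.RestaurantName)
    ON MATCH (Person-(likes)->Restaurant)
WHEN MATCHED THEN
    UPDATE SET rating = @Rating
WHEN NOT MATCHED THEN
    INSERT ($from_id, $to_id, rating)
    VALUES (Person.$node_id, Restaurant.$node_id, @Rating);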

Joe Sack announces row mode memory grant feedback:

In Azure SQL Database, we are further expanding query processing capabilities with several new features under the Intelligent Query Processing (QP) feature family.  In this blog post we’ll discuss one of these Intelligent QP features that is now available in public preview, row mode memory grant feedback.  Row mode memory grant feedback expands on the memory grant feedback feature by adjusting memory grant sizes for both batch and row mode operators.

Key feature benefits:

  • Reduce wasted memory. For an excessive memory grant condition, if the granted memory is more than two times the size of the actual used memory, memory grant feedback will recalculate the memory grant. Consecutive executions will then request less memory.

  • Decrease spills to disk. For an insufficiently sized memory grant that results in a spill to disk, memory grant feedback will trigger a recalculation of the memory grant. Consecutive executions will then request more memory.

This was big for batch mode operators, and I’m happy to see it move to row mode operators as well.
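As I understand the preview, the feature rides along with database compatibility level 150, and there is a database-scoped switch to turn it off if a workload regresses; verify both details against current documentation before relying on them:

ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 150;

-- Disable just this feature for the database if it causes regressions:
ALTER DATABASE SCOPED CONFIGURATION SET ROW_MODE_MEMORY_GRANT_FEEDBACK = OFF;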

Finally, Joe also announces table variable deferred compilation:

In Azure SQL Database, we will be further expanding query processing capabilities with several new features under the Intelligent Query Processing (QP) feature family.  In this blog post we’ll discuss one of these Intelligent QP features that is now available in public preview in Azure SQL Database, table variable deferred compilation.

Table variable deferred compilation improves plan quality and overall performance for queries referencing table variables. During optimization and initial compilation, this feature will propagate cardinality estimates that are based on actual table variable row counts.  This accurate row count information will be used for optimizing downstream plan operations.

This one has the potential to be a pretty big performance improvement as well.
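Here is a minimal sketch of the pattern that benefits:

DECLARE @t TABLE (object_id INT PRIMARY KEY);

INSERT INTO @t (object_id)
SELECT object_id FROM sys.all_objects;  -- typically thousands of rows

-- Before this feature, the optimizer compiled the plan below assuming @t held one row,
-- which could lead to a poor join strategy. With deferred compilation, the estimate
-- reflects the actual row count at first execution.
SELECT o.[name]
FROM @t AS t
    INNER JOIN sys.all_objects AS o
        ON o.object_id = t.object_id;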


Creating The Ultimate Calendar Table

Daniel Hutmacher has started to create a comprehensive calendar table:

It’s a collection of inline table value functions that generate different types of calendars, with a number of properties that could be relevant for a calendar dimension. Each function has a unique date column, so you can join the functions you need together in a view or a procedure. The functions are:

  • Dates: a plain gregorian calendar.

  • Fiscal, annual: a gregorian, year-based calendar where you can define the start of a year, like a corporate fiscal calendar.

  • Fiscal, 4-4-5 or 52/53: a week-based calendar where each year comprises four quarters made up of 4+4+5 weeks.

  • Indian national calendar

  • Persian calendar

  • Thai calendar

  • Dates of Catholic and Orthodox easter

  • Lunar cycle

I was going to jokingly be shocked that this list didn’t include the Hebrew or Islamic calendars, but then Daniel had to ruin my fun by explaining why not.  Check it out, and when you’re ready to give it a try, head over to his downloads page.
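As a taste of the composition pattern, joining two of the calendars might look something like this; the function signatures and column names below are my guesses, not Daniel’s actual code:

SELECT d.[Date], f.FiscalYear, f.FiscalQuarter
FROM dbo.Dates('2019-01-01', '2019-12-31') AS d                -- hypothetical parameters
    INNER JOIN dbo.Fiscal('2019-01-01', '2019-12-31', 7) AS f  -- hypothetical: fiscal year starts in July
        ON f.[Date] = d.[Date];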


Testing Spatial Equilibrium Concepts With tidycensus

Ignacio Sarmiento Barbieri walks us through the concept of spatial equilibrium and tests it using data from the tidycensus package:

Let’s take the model to the data and reproduce figures 2.1 and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus is on two cities, Chicago and Boston, chosen because they differ in how easy it is to access their city centers: Chicago is fairly easy, while Boston is more complicated. Our model then implies that the gradients should reflect the differential costs of accessing the city centers.

So let’s begin. The first step is to get some data, and to do so I’m going to use the “tidycensus” package, which lets me pull data from the Census website through its API. We are also going to need the help of three other packages: “sf” to handle spatial data, “dplyr”, my go-to package for wrangling data, and “ggplot2” to plot the results.

require("tidycensus", quietly=TRUE)
require("sf", quietly=TRUE)
require("dplyr", quietly=TRUE)
require("ggplot2", quietly=TRUE)

In order to get access to the Census API, I need to supply a key, which can be obtained from http://api.census.gov/data/key_signup.html.

Read on for theory and a test.  H/T R-bloggers


Interacting With SQL Server From Pandas

Tomaz Kastrun shows how to use pyodbc to interact with a SQL Server database from Pandas:

In SQL Server Management Studio (SSMS), the ease of using the external procedure sp_execute_external_script has been (and will continue to be) discussed many times. But the reason for this short blog post is the fact that changing Python environments using Conda package/module management within Microsoft SQL Server (Services) is effectively impossible. In scenarios where you want to build a larger set of modules (packages) that cannot be made compatible with your SQL Server or Conda setup, you would need to set up a new virtual environment and use Python from there.

Communicating with the database to load data into a different Python environment should not be a problem. The Python Pandas module is an easy way to store a dataset in a table-like format called a dataframe. Pandas is a very powerful Python package for handling data structures and doing data analysis.

Click through for examples of reading and writing data.


HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING Wait Type

Chirag Shah explains what the HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING wait type really means:

Recently a customer reported an interesting issue: while querying against a recently added readable replica, a SELECT statement was shown as suspended and the session was shown as waiting on HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING.

[…]

Upon further investigation, it appeared to be waiting with the wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING.

The behavior is by design, as mentioned in the SQL Server product documentation, and applies to all versions of SQL Server that support availability groups.

Read on for the explanation.
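If you want to confirm that a suspended session is sitting on this wait, a quick DMV check does the job:

SELECT r.session_id, r.status, r.wait_type, r.wait_time AS wait_time_ms
FROM sys.dm_exec_requests AS r
WHERE r.wait_type = N'HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING';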


HASHBYTES On FOR JSON PATH Data

Greg Low walks us through a mechanism to check whether data has changed:

In a previous post, I wrote about how to determine if a set of incoming values for a row are different to all the existing values in the row, using T-SQL in SQL Server.

I later remembered that I’d seen a message by Adam Machanic a while back, talking about how FOR JSON PATH might be useful for this, so I did a little more playing around with it.

If you are using SQL Server 2016 or later, I suspect this is a really good option.

Click through to see an example using the WideWorldImporters database.
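The core of the trick, sketched against WideWorldImporters with an abbreviated column list:

SELECT c.CustomerID,
       HASHBYTES('SHA2_256',
                 (SELECT c.CustomerName, c.PhoneNumber, c.DeliveryAddressLine1
                  FOR JSON PATH, INCLUDE_NULL_VALUES)) AS RowHash  -- keep NULLs in the hash
FROM Sales.Customers AS c;
-- Compare RowHash for incoming versus existing rows; a mismatch means something changed.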


Generating Realistic-Looking Data With Markov Chains

Phil Factor shows how to use Markov chain generation in T-SQL to generate realistic-looking country names:

How did we do this? We started with a table that took each word, added two spaces at the beginning and a |, followed by two subsequent spaces, at the end. This allowed us to map the frequency of each three-letter combination in a collection of words. Any language is made up of common combinations of characters with a few wild exceptions. For words to look right, they must follow this distribution. This distribution will change in various parts of a word, so you need all this information.

So what would happen if, instead of feeding the name of countries into the batch, we do the names of people?

My favorite name from the list was Kuwatian Samoa.
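If you want to experiment with the trigram-mapping step Phil describes, a rough sketch (with an invented source table and a simplified single-character terminator) might look like this:

-- Pad each word, then count every three-character window.
WITH Padded AS
(
    SELECT N'  ' + CountryName + N'|' AS PaddedName
    FROM dbo.Countries  -- hypothetical source table
),
Positions AS
(
    SELECT TOP (200) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
    FROM sys.all_objects
)
SELECT SUBSTRING(p.PaddedName, n.pos, 3) AS Trigram,
       COUNT(*) AS Frequency
FROM Padded AS p
    INNER JOIN Positions AS n
        ON n.pos <= LEN(p.PaddedName) - 2
GROUP BY SUBSTRING(p.PaddedName, n.pos, 3)
ORDER BY Frequency DESC;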


Windows Server 2019: USB-Based File Share Witness Support

Dave Bermingham introduces us to a new feature coming in Windows Server 2019:

I’m very excited to hear that coming in Windows Server 2019 there will be a few new features regarding the File Share Witness for the Failover Cluster Quorum. The feature that many of my customers have been asking about for many years is finally arriving…a File Share Witness on a USB stick!

Okay, they didn’t really ask for that specifically, but many of my customers wanted to deploy a simple 2-node cluster in each store location, branch office, etc. They didn’t want the added expense of a SAN to leverage a Disk Witness, and they weren’t too keen on (or just didn’t have the connectivity for) a Cloud Witness in Azure. Many of these customers simply decided to forgo clustering, or they used an alternative clustering solution like the SIOS Protection Suite.

Now they have a viable alternative coming in Windows Server 2019. By leveraging a supported router, a USB disk inserted into the router can be configured with a file share that can be used as the witness. This eliminates the need for a 3rd server or internet connectivity.

I can’t see this being extremely useful in most scenarios, though that could be a lack of imagination on my part.
