Press "Enter" to skip to content

Author: Kevin Feasel

Against Abused Data Types

Reitse Eskens hates misused data types:

First up. A large amount of my work has to do with ETL processes. There are a lot of things that can go wrong there, but one of the main issues is wrong estimations of size. When we read data from a source system to transfer it to the data warehouse environment, we have to match the data types. A varchar(10) in the source will have to be a varchar(10) in the target. Easy enough. But now the source gets an update, and with that update the source data type goes from varchar(10) to varchar(12). When the supplier informs us, we change the data type accordingly and everything is fine. When for some reason the update is missed, issues will arise. Because of course it’s the primary key that got enlarged, and duplicates will start to form.

The other way around happens as well. Some tools check out the source, see a varchar column and, when no-one notices, will create an nvarchar(2000) column. Joy will arise when this column contains one or two characters while the optimizer expects at least a thousand characters.

I’m in almost complete agreement with this notion, with the exception that I think sql_variant is an abomination and its existence in a database is ipso facto proof that the designer came up with (or was forced into) a bad solution.
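
As a minimal T-SQL sketch of the failure mode Reitse describes (table name and key values invented for illustration), silent truncation can collapse distinct source keys into duplicates:

-- The target column is still varchar(10); the source now ships 12-character keys.
CREATE TABLE #Staging (CustomerKey varchar(10) NOT NULL);

SET ANSI_WARNINGS OFF;  -- mimics drivers/settings which truncate silently

-- Two distinct 12-character source keys...
INSERT INTO #Staging (CustomerKey)
VALUES ('CUST00000142'), ('CUST00000187');

-- ...collapse into the same 10-character value.
SELECT CustomerKey, COUNT(*) AS NumRows
FROM #Staging
GROUP BY CustomerKey
HAVING COUNT(*) > 1;

SET ANSI_WARNINGS ON;
DROP TABLE #Staging;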


A Mini-Lab: SQL Server 2019 on Docker

Ayman El-Ghazali has a three-parter for us. Part one involves installing Docker for Desktop on Windows and creating a SQL Server container:

Recently, with the help of a colleague at work, I’ve started to dabble a little with containers. I had a customer that requested some specific code to be tested, and I realized that I didn’t have my own local instance of SQL Server running (always good to have a local one). I decided to make this process easier than the traditional route of creating a virtual machine, and to learn a new technology along the way. In this series of posts, I’m going to document my process of creating a Mini Data Lab for SQL Server on my desktop using Docker. It is intended for beginners and is in no way an article on best practices or production deployments.
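
The post has the full walkthrough; the core step, run from PowerShell, looks something like the following sketch (the container name and password are placeholders):

docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=YourStrong!Passw0rd" `
    -p 1433:1433 --name sql2019 `
    -d mcr.microsoft.com/mssql/server:2019-latest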

Part two includes persistent storage and some of the other niceties of hosting a database in a container:

Let’s first take a look at the way I have my disk/folder structure laid out. Again, this is on my personal computer so it’s not a best practice for production and more suitable for development environments.

For each container, I’m creating a separate folder with the MSSQL paths that I need to put my databases, transaction log, and backup files on. Additionally, under the DockerMount folder I have a folder called sqldockershared (which I will put some shared content in later).
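
Mapping that folder layout into the container might look something like this sketch (the host paths and names here are illustrative, not Ayman's exact layout):

docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=YourStrong!Passw0rd" `
    -p 1433:1433 `
    -v C:\DockerMount\sql2019\data:/var/opt/mssql/data `
    -v C:\DockerMount\sql2019\log:/var/opt/mssql/log `
    -v C:\DockerMount\sql2019\backup:/var/opt/mssql/backup `
    -v C:\DockerMount\sqldockershared:/var/opt/mssql/sqldockershared `
    --name sql2019 -d mcr.microsoft.com/mssql/server:2019-latest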

Part three is about configuration in existing containers and deploying a second container side-by-side:

For those that are more curious in changing other Instance level properties here is a list of configurable properties for SQL Server on Linux via Bash.
https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-configure-mssql-conf?view=sql-server-ver15

Now we have a great foundation to create another container, so let us go and do that now. The code is similar to that of the previously created container, with the exception of the file paths for the data, log, and backup files, and the port number. The SQLShare path will be the same so that we can run our initialization script from there.
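
As a sketch of both steps (the memory value, names, and port are illustrative):

# Change an instance-level setting via mssql-conf, then restart the container
docker exec sql2019 /opt/mssql/bin/mssql-conf set memory.memorylimitmb 4096
docker restart sql2019

# Second container: same image, different name, host port, and data folder
docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=YourStrong!Passw0rd" `
    -p 1434:1433 `
    -v C:\DockerMount\sql2019b\data:/var/opt/mssql/data `
    -v C:\DockerMount\sqldockershared:/var/opt/mssql/sqldockershared `
    --name sql2019b -d mcr.microsoft.com/mssql/server:2019-latest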

I’m bought-in on containers. There are still some pains around containers for production databases, but “some pain” is a much better experience than a few years ago, when the answer to the question of whether you want to use containers in production for databases was “Are you mad?”


Row-Level Security and UseRelationship

Teo Lachev points out an issue when combining row-level security with the USERELATIONSHIP() function in a Tabular model:

You’ve created a beautiful, wide-open Tabular model. You use USERELATIONSHIP() to switch relationships on and off. Everything works and everyone is pleased. Then RLS sneaks in, such as when external users need access, and you must secure on some dimension table. You create a role, specify a row filter, test the role, and get greeted with:

The UseRelationship() and CrossFilter() functions may not be used when querying ‘<dimension table>’ because it is constrained by row-level security defined on ‘<dimension table>’ or related tables.

Read on to learn what the issue is and one potential workaround.
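
For reference, the error fires on the familiar pattern below once RLS constrains the dimension; the measure, table, and column names here are invented:

Sales by Ship Date :=
CALCULATE (
    [Total Sales],
    USERELATIONSHIP ( 'Date'[DateKey], Sales[ShipDateKey] )
)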


The Geography Data Type

Greg Dodd talks about the GEOGRAPHY data type:

If you read through the docs on data types, you hit the “Other data types” section, and you start scratching your head: cursor, hierarchyid, rowversion. I’m sure XML will have a special place for some (love or hate).

The data type that I think is the coolest has to be Spatial Geography. Under the covers, it’s probably the same as Geometry, but nowhere near as much fun. What is Geography? It’s what it sounds like – a way to store latitude and longitude data that will let you query it back again and plot it on a map, or measure distances from it.

Also read Rob Farley’s note on the topic, which mirrors my thoughts: spatial data types are quite relevant for comparison work. And a bit of data manipulation in the database can save a lot of network traffic.
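
A minimal sketch of that sort of distance query, using geography::Point and STDistance (the coordinates are approximate city centers):

DECLARE @seattle  geography = geography::Point(47.6062, -122.3321, 4326);
DECLARE @portland geography = geography::Point(45.5152, -122.6784, 4326);

-- STDistance returns meters for SRID 4326
SELECT @seattle.STDistance(@portland) / 1000.0 AS DistanceInKm;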


In Praise of the XML Data Type

Eitan Blumin likes an underappreciated data type:

The xml data type has been in SQL Server since SQL Server 2005, and it introduced a lot of very powerful and useful capabilities that had never before been seen in SQL Server. It was the first-ever data type with built-in CLR methods (i.e., where you can write a dot after a column name and execute some kind of method, for example: mycolumn.nodes(…)).

The xml data type and the functionality around it made it relatively easy to “refactor” a resultset from a relational structure into a scalar structure (i.e. a single XML document) using the FOR XML directive, and vice versa (a single XML document into a relational structure) using the nodes(), value(), and query() methods.
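
A small sketch of that round trip (the element names are invented):

-- Relational to scalar: one XML document from a resultset
DECLARE @x xml =
(
    SELECT name, object_id
    FROM sys.tables
    FOR XML PATH('table'), ROOT('tables'), TYPE
);

-- Scalar back to relational: shred the document with nodes() and value()
SELECT t.n.value('(name/text())[1]', 'nvarchar(128)') AS TableName,
       t.n.value('(object_id/text())[1]', 'int') AS ObjectId
FROM @x.nodes('/tables/table') AS t(n);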

Show me a thousand posts and I don’t expect to see XML show up as a favorite type more than three or four times. Eitan also shares a least favorite type, and I do expect that one to show up on the most-hated list quite frequently.


Getting Started with data.table

Gary Hutson has a primer on data.table:

This example uses the copy of the data frame we made, grouping organisation code by the type of attendance. I then want to summarise the mean admissions by type and organisation code.

Pivots can be implemented in data.table in the following way:
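
The actual code is in the post; a hedged sketch of what the aggregation and pivot might look like in data.table (column names invented):

library(data.table)

dt <- data.table(
  org_code    = c("A", "A", "B", "B"),
  type        = c("Type 1", "Type 2", "Type 1", "Type 2"),
  attendances = c(100, 50, 80, 30)
)

# Mean attendances by organisation code and type
dt[, .(mean_attendances = mean(attendances)), by = .(org_code, type)]

# Pivot: one row per organisation code, one column per attendance type
dcast(dt, org_code ~ type, value.var = "attendances")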

I’ve never been the biggest fan of the syntax for data.table, but the performance is unquestionably there, and that makes it worth learning. H/T R-bloggers.


Executing Parameterized Notebooks via Azure Data Studio

Julie Koesmarno takes us through three methods for executing parameterized notebooks in Azure Data Studio:

In the February 2021 release, Azure Data Studio (v1.26.1) added parameterized URI execution. See the “Preview of passing parameters through URI” section and the Parameterization of Notebooks in Azure Data Studio article on Microsoft Docs.

So, in total, there are three ways of executing a parameterized notebook (from another notebook). Check out the demo files here:

Click through for the notebooks.
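
For context, Azure Data Studio's notebook parameterization follows the Papermill convention: the first code cell is tagged "parameters" and holds default values, which injected parameters override at execution time. A hypothetical parameters cell (variable names invented):

# Cell tagged "parameters" -- defaults, overridden when the notebook is executed
x = 2.0
y = 5.0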


Getting the Stack Overflow Database for Demos

Brent Ozar has an update on getting the latest version of the Stack Overflow data set:

If you only have a limited amount of bandwidth, you don’t have to keep seeding the database after you get it – I’ve got it hosted on a handful of seedboxes around the world.

As with the original data dump, these are provided under cc-by-sa 4.0 license. That means you are free to share it and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not me):

Click through for those attribution details as well as links to get the demo database in various sizes.


Execution Plan Operator Timings

Paul White dives into a murky problem:

SQL Server 2014 SP2 and later produce runtime (“actual”) execution plans that can include elapsed time and CPU usage for each execution plan operator (see KB3170113 and this blog post by Pedro Lopes).

Interpreting these numbers is not always as straightforward as one might expect. There are important differences between row mode and batch mode execution, as well as tricky issues with row mode parallelism. SQL Server makes some timing adjustments in parallel plans to promote consistency, but they are not perfectly implemented. This can make it difficult to draw sound performance-tuning conclusions.

This article aims to help you understand where the timings come from in each case, and how they can be best interpreted in context.

My default advice is to read what Paul White writes, and this is no exception.
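
To see the raw numbers Paul is interpreting, capture a runtime plan and inspect the per-operator counters; a quick sketch:

-- Request the runtime ("actual") plan alongside the results
SET STATISTICS XML ON;

SELECT COUNT(*)
FROM sys.objects AS o
JOIN sys.columns AS c
    ON c.object_id = o.object_id;

SET STATISTICS XML OFF;

-- In the plan XML, each operator's RunTimeCountersPerThread elements
-- carry ActualElapsedms and ActualCPUms (SQL Server 2014 SP2 and later).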
