2021-03-10 – Curated SQL

Date Cleaning with PySpark

Published 2021-03-10 by Kevin Feasel

Robert J. Blackburn needs to do some cleanup work:

The function will accept the dataframe and a list of columns to process. Because of syntax restrictions, the steps are broken out into multiple statements and a sub-function. Luckily, Spark’s lazy evaluation will optimize them at runtime.

Click through for the function in question.

A Mini-Lab: SQL Server 2019 on Docker

Published 2021-03-10 by Kevin Feasel

Ayman El-Ghazali has a three-parter for us. Part one involves installing Docker Desktop on Windows and creating a SQL Server container:

Recently, with the help of a colleague at work, I’ve started to dabble a little with containers. I had a customer that requested some specific code to be tested, and I realized that I didn’t have my own local instance of SQL running (always good to have a local one). I decided to try to make this process easier instead of going the traditional route of creating a Virtual Machine and also to help me learn a new technology. In this series of posts, I’m going to document my process of creating a Mini Data Lab for SQL Server on my desktop using Docker. It is intended for beginners and is in no way an article on best practices or production deployments.

Part two includes persistent storage and some of the other niceties of hosting a database in a container:

Let’s first take a look at the way I have my disk/folder structure laid out. Again, this is on my personal computer so it’s not a best practice for production and more suitable for development environments.
For each container, I’m creating a separate folder with the MSSQL paths where I need to put my database, transaction log, and backup files. Additionally, under the DockerMount folder I have a folder called sqldockershared (which I will put some shared content in later).

Part three is about configuration in existing containers and deploying a second container side-by-side:

For those who are curious about changing other instance-level properties, here is a list of configurable properties for SQL Server on Linux via Bash.
https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-configure-mssql-conf?view=sql-server-ver15
Now we have a great foundation to create another container, so let us go and do that now. The code is similar to the previously created container with the exception of the file path for the data, log, and backup files and the port number. The SQLShare path will be the same so that we can run our initialization script from there.
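
As a rough illustration of that last point (this is not Ayman's code), here is a minimal Python sketch that builds the `docker run` arguments for two side-by-side containers. The container names, host folder layout, host ports, the `/sqlshare` mount point, and the password placeholder are all assumptions made for the example; only the DockerMount and sqldockershared folder names come from the quoted posts.

```python
from typing import List

IMAGE = "mcr.microsoft.com/mssql/server:2019-latest"


def docker_run_args(name: str, host_root: str, shared_dir: str,
                    host_port: int, sa_password: str) -> List[str]:
    """Build the `docker run` arguments for one SQL Server container."""
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--env", "ACCEPT_EULA=Y",
        "--env", f"MSSQL_SA_PASSWORD={sa_password}",
        # Tell SQL Server where to place new data, log, and backup files.
        "--env", "MSSQL_DATA_DIR=/var/opt/mssql/data",
        "--env", "MSSQL_LOG_DIR=/var/opt/mssql/log",
        "--env", "MSSQL_BACKUP_DIR=/var/opt/mssql/backup",
        # Each container publishes SQL Server's port 1433 on its own host port.
        "--publish", f"{host_port}:1433",
        # Per-container host folders (hypothetical layout).
        "--volume", f"{host_root}/data:/var/opt/mssql/data",
        "--volume", f"{host_root}/log:/var/opt/mssql/log",
        "--volume", f"{host_root}/backup:/var/opt/mssql/backup",
        # Shared folder, identical for every container (hypothetical mount point).
        "--volume", f"{shared_dir}:/sqlshare",
        IMAGE,
    ]


SHARED = "C:/DockerMount/sqldockershared"
first = docker_run_args("sql1", "C:/DockerMount/sql1", SHARED, 1433,
                        "<your-strong-password>")
second = docker_run_args("sql2", "C:/DockerMount/sql2", SHARED, 1434,
                         "<your-strong-password>")

# Show only the arguments that differ between the two containers.
for a, b in zip(first, second):
    if a != b:
        print(f"{a}  ->  {b}")
```

Only the container name, the published host port, and the three per-container folders change; the shared mount stays the same. Passing either list to `subprocess.run(..., check=True)` would actually start the container, assuming Docker is installed and the host folders exist.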

I’m bought-in on containers. There are still some pains around containers for production databases, but “some pain” is a much better experience than a few years ago, when the answer to the question of whether you want to use containers in production for databases was “Are you mad?”

Against Abused Data Types

Published 2021-03-10 by Kevin Feasel

Reitse Eskens hates misused data types:

First up. A large amount of my work has to do with ETL processes. There are a lot of things that can go wrong there, but one of the main issues is wrong estimations on size. When we read data from a source system to transfer it to the data warehouse environment, we have to match the data types. A varchar(10) in the source will have to be a varchar(10) in the target. Easy enough. But now the source gets an update, and with that update the source data type goes from varchar(10) to varchar(12). When the supplier informs us, we change the data type accordingly and everything is fine. When for some reason the update is missed, issues will arise. Because of course it’s the primary key that got enlarged, and duplicates will start to form.
The other way around happens as well. Some tools check out the source, see a varchar column and, when no one notices, will create an nvarchar(2000) column. Joy will arise when this column contains one or two characters when the optimizer expects at least a thousand characters.
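
To make the first failure mode concrete, here is a small Python sketch of how a missed widening from 10 to 12 characters can produce duplicate keys. It assumes the load step truncates silently, which depends on the tool and its settings (some loads raise a truncation error instead); the key values and function names are invented for the example.

```python
from collections import Counter
from typing import Iterable, List


def load_into_fixed_width(keys: Iterable[str], width: int) -> List[str]:
    """Mimic a load that silently truncates each key to `width` characters."""
    return [key[:width] for key in keys]


def duplicated(keys: Iterable[str]) -> List[str]:
    """Return the key values that occur more than once, in sorted order."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# The source now issues 12-character keys (hypothetical values) ...
source_keys = ["CUST-0000101", "CUST-0000102", "CUST-0000201"]

# ... but the target column is still 10 characters wide.
target_keys = load_into_fixed_width(source_keys, 10)

print(duplicated(source_keys))  # [] : unique in the source
print(duplicated(target_keys))  # ['CUST-00001'] : two keys collapsed into one
```

The keys are unique at full width and only collide once the last two characters are cut off, which is why the problem tends to surface as a primary key violation rather than as an obvious type mismatch.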

I’m in almost complete agreement with this notion, with the exception that I think sql_variant is an abomination and its existence in a database is ipso facto proof that the designer came up with (or was forced into) a bad solution.

Row-Level Security and UseRelationship

Published 2021-03-10 by Kevin Feasel

Teo Lachev points out an issue when combining row-level security with the USERELATIONSHIP() function in a Tabular model:

You’ve created a beautiful, wide-open Tabular model. You use USERELATIONSHIP() to switch relationships on and off. Everything works and everyone is pleased. Then RLS sneaks in, such as when external users need access, and you must secure on some dimension table. You create a role, specify a row filter, test the role, and get greeted with:
The UseRelationship() and CrossFilter() functions may not be used when querying ‘<dimension table>’ because it is constrained by row-level security defined on ‘<dimension table>’ or related tables.

Read on to learn what the issue is and one potential workaround.

The Geography Data Type

Published 2021-03-10 by Kevin Feasel

Greg Dodd talks about the GEOGRAPHY data type:

If you read through the docs on data types, you hit the “Other data types” section, and you start scratching your head: cursor, hierarchyid, rowversion. I’m sure XML will have a special place for some (love or hate).
The datatype that I think is the coolest has to be Spatial Geography. Under the covers, it’s probably the same as Geometry, but nowhere near as much fun. What is Geography? It’s what it sounds like – a way to store Latitude and Longitude data that will let you query it back again and plot it on a map, or measure distances from it.
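
For a feel of what "measure distances" involves, here is a minimal Python sketch of the haversine formula for the great-circle distance between two latitude/longitude points. It is an illustration only: it treats the Earth as a sphere with a mean radius, whereas the GEOGRAPHY type measures distance on an ellipsoid (for example via its STDistance method), so the two will differ slightly.

```python
import math

# Mean Earth radius in kilometres; a spherical approximation.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator is a quarter of the circumference.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1), round(math.pi / 2 * EARTH_RADIUS_KM, 1))
```

Doing this kind of calculation inside the database, rather than pulling every row out to compute it in application code, is the network-traffic saving that the spatial types make possible.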

Also read Rob Farley’s note on the topic, which mirrors my thoughts: spatial data types are quite relevant for comparison work. And a bit of data manipulation in the database can save a lot of network traffic.

In Praise of the XML Data Type

Published 2021-03-10 by Kevin Feasel

Eitan Blumin likes an underappreciated data type:

The xml data type in SQL Server has existed since about SQL Server 2005, and it introduced a lot of very powerful and useful capabilities that were never before seen in SQL Server. It was the first-ever data type with built-in CLR methods (i.e. where you can write a dot after a column name and execute some kind of method, for example mycolumn.nodes(…)).
The xml data type and the functionality around it made it relatively easy to “refactor” a resultset from a relational structure into a scalar structure (i.e. a single XML document) using the FOR XML directive, and vice versa (single XML document into a relational structure) using the nodes(), value(), and query() methods.
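
The round trip described above can be sketched outside the database as well. The following Python example uses the standard library's ElementTree to fold a list of rows into one XML document and shred it back into rows. It is only an analogy for FOR XML and nodes()/value(), not SQL Server's implementation; the element names, attribute-per-column layout, and sample rows are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def rows_to_xml(rows: List[Dict[str, object]],
                root_name: str = "rows", row_name: str = "row") -> str:
    """Fold a relational-style row set into a single XML document."""
    root = ET.Element(root_name)
    for row in rows:
        # One element per row, one attribute per column.
        ET.SubElement(root, row_name, {col: str(val) for col, val in row.items()})
    return ET.tostring(root, encoding="unicode")


def xml_to_rows(document: str, row_name: str = "row") -> List[Dict[str, str]]:
    """Shred an XML document back into rows; every value comes back as text."""
    root = ET.fromstring(document)
    return [dict(element.attrib) for element in root.iter(row_name)]


people = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
document = rows_to_xml(people)
print(document)
print(xml_to_rows(document))
```

Note that the values come back as strings, so the caller has to convert them to the right types again, much as value() in T-SQL takes a target data type as an argument.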

Show me a thousand posts and I don’t expect to see XML show up as a favorite type more than three or four times. Eitan also shares a least favorite type, and I do expect that one to show up on the most-hated list quite frequently.

Things to Like about Integers

Published 2021-03-10 by Kevin Feasel

Brent Ozar is a fan of integers:

I love integers.
Furthermore, I think we’ve all been taking them for granted.
Integers are clean, elegant datatypes that do exactly what they’re supposed to. They’re the Golden Retrievers of the database world: they have a few shortcomings, but so, so few. And so for this month’s T-SQL Tuesday, I wanna share my love for integers with you.

Click through for four things to like.
