Working With Jupyter Notebooks And Airflow On Hadoop

Mark Litwintschik shows us an interesting demonstration of running Jupyter Notebooks as well as automating tasks with Airflow on Hadoop:

The following will create a ~/airflow folder, setup a SQLite 3 database used to store Airflow’s state and configuration set via the Web UI, upgrade the configuration schema and create a folder for the Python-based jobs code Airflow will run.

$ cd ~
$ airflow initdb
$ airflow upgradedb
$ mkdir -p ~/airflow/dags

By default Presto’s Web UI, Spark’s Web UI and Airflow’s Web UI all use TCP port 8080. If you launch Presto after Spark then Presto will fail to start. If you start Spark after Presto then Presto will launch on 8080 and the Spark Master Server will take 8081 and keep trying higher ports until it finds one that is free. Spark will then pick an even higher port number for the Spark Worker Web UI. This overlap normally isn’t an issue as in a production setting these services would normally live on separate machines.

Read the whole thing.

How DynamoDB Indexing Works

Shubham Agarwal explains how indexing works within DynamoDB:

Global secondary index in DynamoDb – An index with a partition key and a sort key that can be different from the base table. A global secondary index is very helpful when you need to query your data without primary key.

  •  The primary key of a global secondary index can be partition key or composite  (partition key and sort key).

  • Global secondary indexes can be created at the same time that you create a table. You can also add a new global secondary index to an existing table, or delete an existing global secondary index

  • A global secondary index lets you query over the entire table, across all partitions.

  • The index partition key and sort key (if present) can be any base table attributes of type string, number, or binary.

  • With global secondary index queries or scans, you can only request the attributes that are projected into the index. DynamoDB will not fetch any attributes from the table.

  • There are no size restrictions for global secondary indexes.

Click through to learn more about these as well as local secondary indexes.

Auto-Generated Filtered Stats

Dmitry Piliugin shows one way that we can optimize a query whose main problem is poor row estimates:

A model variation is a new concept in the cardinality estimation framework 2014, that allows easily turn on and off some model assumptions and cardinality estimation algorithms. Model variations are based on a mechanism of pluggable heuristics and may be used in special cases. I think they are left for Microsoft support to be able to address some client’s CE issues pointwise.

Today we are going to view some interesting model variation, that creates filtered statistics on-the-fly. I should give a disclaimer here.

Warning: All the information below is presented for purely educational and curiosity purposes. This is completely undocumented and unsupported and should not ever be used in production systems unless Microsoft support will recommend you. More to the point, the usage of this model variation may affect the overall server performance in a negative way. This should be used for experiments and in the test environment only.

It’s interesting reading, though do heed that warning.  This also isn’t a quick operation (seeing as how the database engine is creating filtered statistics), so it’s not a first-best choice.  But worth keeping your back pocket.

Plotting From SQL Server Machine Learning Services

Kevin Feasel



Marlon Ribunal has a quick demo showing how to generate a ggplot2 plot using SQL Server Machine Learning Services:

Let’s install the package if it hasn’t been installed yet. The easiest way to do that is to run RGUI.exe that came with your SQL Server 2017 In-Database Machine Learning installation. You can find it here:


Take note that you need to run the executable as Administrator. Also, if you’ve installed the R engine prior to your SQL Server 2017 In-Database Machine Learning with R, you have to explicitly tell the R package installer where you want your package installed.

> install.packages("ggplot2", lib="C:\\Program Files\\Microsoft SQL Server\\MSSQL14.MSSQLSERVER\\R_SERVICES\\library", dep = TRUE)

dep = TRUE tells the installer to install dependencies. ggplot2 depends on a lot of other packages. You can check dependencies using MiniCRAN.

Another option for installation is to bootstrap install via T-SQL:  you can execute external scripts which run install.packages() directly rather than using RGUI, if that makes more sense with your deployment process.

Stored Procedure To Build Filegroups And Files

Tracy Boggiano has a lengthy script which dynamically builds out files and filegroups for you:

Here are the six different ways you can run this procedure to files and filegroups for your database:

  1. Set up partitions with a set size for each file

  2. Set up partitions based on the size of a table

  3. By monthly date range with a set size for each file

  4. By monthly date range with sized based on the size of a table

  5. One file with a set size

  6. One file based on the size of a table

She then has examples for each call, so check it out.

The Value Of Spinlocks

Jeremiah Peschka walks us through the concepts behind spinlocks:

Here’s our fist stab at a spinlock, badly written in C:

// lock_value is an integer that's shared between multiple threads
while (lock_value != 0) { // spin
lock_value = 1;
lock_value = 0;

The general idea here is correct – we have some lock_value and we only want to allow processes into the critical section of code if the another process doesn’t “hold the lock.” Holding the lock means that lock_value is set to a non-zero value.

There are few things that make this code bad. These things, coincidentally, are part of what make concurrent programming difficult.

Spinlocks are a critical part of maintaining internal consistency, but they can also accidentally ruin performance on your system if done wrong.  Read the whole thing.

“Failed To Delete Database” In Azure

Kevin Feasel



Arun Sirpal walks us through a reason why you might not be able to delete an Azure SQL Database:

Failed to delete the database: TestDB. ErrorCode: 400 ErrorMessage: Database ‘TestDB’ cannot be deleted because it is used as a sync metadata database which still contains sync groups and/or sync agents.

After some digging around I remembered that I tried to setup a data sync group where this database was involved in bi-directional synchronization with another Azure SQL Database. The other database was deleted a long time ago and clicking on the sync option within the portal confirmed that nothing was actually ever connected or ready.

Arun walks us through the wrong way before getting us down the right path.  I recommend the latter.

Searching For Valid Characters For T-SQL Regular Identifiers

Kevin Feasel



Solomon Rutzky is on a mission to find a national treasure:

Quite often these types of things are not that easy. Yes, it is very tempting to assume that the limited test is good enough and that we did indeed find the exact list of characters (plus we would need to add in the four extra characters: at sign (@), dollar sign ($), number sign (#), and underscore (_)). However, based on my experiences, it seems that more often than not, doing an exhaustive test results in a slightly different answer that invalidates the previous conclusion (which was based on the limited test). So, while it does take more time to do more extensive testing, it seems like we have little choice if we truly want to know how these things actually work.

What that means is, at the very least, we need to get the complete list of characters accepted by SQL Server for non-delimited identifiers to make sure that the totals match the number of code points returned by the searches done in Step 1.

This post is an interesting dive into the oddities of Unicode, but leaves us on a cliffhanger.  Also, full-crazy Nicolas Cage beats mullet-wearing Tom Hanks any day.


April 2018
« Mar