Press "Enter" to skip to content

Curated SQL Posts

What’s New In Hadoop 3.0?

Shubham Sinha explains some of the changes coming to Hadoop:

Integrating EC with HDFS maintains the same fault tolerance with improved storage efficiency. As an example, a 3x-replicated file with 6 blocks will consume 6*3 = 18 blocks of disk space. But with an EC (6 data, 3 parity) deployment, it will only consume 9 blocks (6 data blocks + 3 parity blocks) of disk space. This keeps the storage overhead to no more than 50%.

Since erasure coding requires additional overhead to reconstruct data (it has to perform remote reads), it is generally used for storing less frequently accessed data. Before deploying erasure coding, users should consider all of its overheads: storage, network, and CPU.

To support erasure coding effectively, HDFS needed some changes in its architecture. Let us take a look at those architectural changes.

There are some nice features coming to Hadoop version 3.
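In Hadoop 3.0, erasure coding is applied per directory through the hdfs ec subcommand. A minimal sketch, assuming a 3.0 cluster with the Reed-Solomon 6+3 policy available and a /data/cold directory of infrequently accessed files (the path is a placeholder):

# List the erasure coding policies the cluster knows about
hdfs ec -listPolicies

# Apply the Reed-Solomon 6+3 policy (6 data + 3 parity blocks, so ~50% storage
# overhead versus 200% for 3x replication) to a directory of cold data
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k

# Confirm which policy is now in effect on the directory
hdfs ec -getPolicy -path /data/cold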


Optimizing Kafka

Yeva Byzek explains different tuning options available within Apache Kafka:

Without needing to make any changes to Kafka configuration parameters, you can set up a development Kafka environment and test basic functionality. Yet the fact that Kafka runs straight off the shelf does not mean you won’t want to do some tuning before you go into production. The reason to tune is that different use cases will have different sets of requirements that will drive different service goals. To optimize for those service goals, there are Kafka configuration parameters that you should change. In fact, the Kafka design itself provides configuration flexibility to users, and to make sure your Kafka deployment is optimized for your service goals, you absolutely should investigate tuning the settings of some configuration parameters and benchmarking in your own environment. Ideally, you should do that before you go to production, or at least before you scale out to a larger cluster size.

We have written a white paper to help you identify those service goals, configure your Kafka deployment to optimize for them, and ensure that you are achieving them through monitoring.

Read the whole thing, especially the part about throughput versus latency.
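To get a feel for the throughput-versus-latency trade-off, here is a hedged sketch using the kafka-producer-perf-test.sh tool that ships with Kafka: the first run uses near-default producer settings, the second leans toward throughput with larger batches, a short linger, and compression. The broker address and topic name are placeholders.

# Baseline run with near-default producer settings
kafka-producer-perf-test.sh --topic perf-test --num-records 1000000 \
  --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=broker:9092 acks=1

# Throughput-leaning run: bigger batches, a short linger, and compression
# (these generally raise throughput at the cost of some per-record latency)
kafka-producer-perf-test.sh --topic perf-test --num-records 1000000 \
  --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=broker:9092 acks=1 \
    batch.size=262144 linger.ms=20 compression.type=lz4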


Why SQL On Linux

David Klee explains the benefits of SQL Server on Linux:

First and foremost (IMHO), Microsoft wants to appeal to developers. They want their development stack to run anywhere there are developers. Notably, Microsoft just released Visual Studio 2017 for Mac on May 10th! Many developers out there run on non-Microsoft workstations, notably Apple computers. Apple’s OSX operating system is originally derived from the FreeBSD operating system. FreeBSD and other *BSD operating systems share much in common with Linux. So, if you can make SQL Server work on the Apple, you’ve probably made it work on Linux. Arguably, covering these two platforms nails just about every widely adopted development platform out there.

Microsoft also wants to appeal to a broader customer base, which means exploring the other environments that software runs on. An exceptionally high number of the world’s servers are powered by Linux. It’s lean, mean, stable, and powerful. Lots of shops refuse to run a Windows-based server for a number of reasons, including that their in-house IT staff only have Linux knowledge. These same shops are most likely pressured to run a SQL Server for various applications. I know a number of third-party vended applications that require a SQL Server, and previously, if an organization dictated no Windows-based servers, that meant the application would never be adopted in the organization, no matter how well it would function.

David provides a good explanation and sets up the context behind his upcoming SQL Server on Linux series.


So You Want A FAQ-Bot

Steph Locke shows how easy it can be to create a Q&A bot:

Now we need to go to Azure and finish our bot.

Add a new Bot Service. You’ll need to give it a name and set which region you want to host it in. It will then set up everything in the background, which takes a couple of minutes.

Once it is successfully deployed, navigate to your bot service and Create an App, making sure to copy and paste the values from the new tab into the interface. Select the Q&A Bot type. It should bring up a popup that allows you to select your bot from a dropdown.

Bots are fun and worth learning, but that technology is still in its infancy, regardless of whose bot platform you’re using.
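If you want to query the underlying knowledge base outside of the bot, QnA Maker exposes a simple REST endpoint. A hedged sketch with curl; the region, API version, knowledge base ID, and subscription key below are placeholders and may not match your deployment, so check the URL and keys shown in the QnA Maker portal.

# Ask the knowledge base a question directly (region, ID, and key are placeholders)
curl -X POST \
  "https://westus.api.cognitive.microsoft.com/qnamaker/v2.0/knowledgebases/<KB_ID>/generateAnswer" \
  -H "Ocp-Apim-Subscription-Key: <YOUR_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I reset my password?"}'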


Triggers And Memory-Optimized Table Modifications

Jack Li troubleshoots a customer issue when trying to modify a memory-optimized table:

Recently I assisted on a customer issue where the customer wasn’t able to alter a memory-optimized table, receiving the following error:

Msg 41317, Level 16, State 3, Procedure ddl, Line 4 [Batch Start Line 35]
A user transaction that accesses memory optimized tables or natively compiled modules cannot access more than one user database or databases model and msdb, and it cannot write to master.

If you access a memory-optimized table, you can’t span databases or access model or msdb. The ALTER statement doesn’t involve any other database.

It turns out there was a DDL trigger defined on the instance that wrote data to msdb.  Click through for Jack’s repro script.  I’d be able to use memory-optimized tables a lot more frequently (to the chagrin of company DBAs, perhaps) if they supported cross-database operations.
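If you hit Msg 41317 on a statement that seemingly touches only one database, a quick way to look for the kind of culprit Jack found is to list DDL triggers. A small sketch using sqlcmd; the server and database names are placeholders:

# Server-scoped DDL triggers that could fire during the ALTER
sqlcmd -S yourserver -E -Q "SELECT name, is_disabled FROM sys.server_triggers;"

# Database-scoped DDL triggers in the affected database
sqlcmd -S yourserver -E -d YourDatabase -Q "SELECT name, is_disabled FROM sys.triggers WHERE parent_class_desc = 'DATABASE';"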


Using Azure Cloud Shell

Jeffrey Verheul shows off a bit of Azure Cloud Shell:

Connecting to a database
Now that your Cloud Shell is ready to go, you can start using Bash. This means you can also use sqlcmd from within Bash.

You can connect to a database with sqlcmd by using the following command:

sqlcmd -S servername.database.windows.net -U username -P password

Once the connection to your database has been made, you can run queries against it.

There’s no PowerShell support yet, but Bash is currently supported and PowerShell is in the works.
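Once connected, you can also run one-off queries straight from Bash with the -Q switch; a small sketch, with the server, credentials, and database as placeholders:

# Run a single query non-interactively against a specific database
sqlcmd -S servername.database.windows.net -U username -P password \
  -d YourDatabase \
  -Q "SELECT TOP (5) name, create_date FROM sys.objects ORDER BY create_date DESC;"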


Creating An Azure Database For MySQL

Arun Sirpal shows how to create a new MySQL database in Azure:

Here you have the concept of compute units. No such thing as DTUs here but just as confusing.

Compute Units are a measure of CPU processing throughput guaranteed to be available to a single Azure Database for MySQL server. A Compute Unit is a blended measure of CPU and memory resources. In general, 50 Compute Units equate to half a core, 100 Compute Units equate to one core, and 2000 Compute Units equate to twenty cores of guaranteed processing throughput available to your server. I am not going to rehash the official documentation on these concepts, so I recommend reading https://docs.microsoft.com/en-gb/azure/mysql/concepts-compute-unit-and-storage

Different database product, different metric, it seems.  Check out Arun’s post as he walks you through the process step by step.
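If you would rather script the creation than click through the portal, the Azure CLI has an az mysql server create command. A hedged sketch; the names, region, and credentials are placeholders, and the flag that sets compute units or pricing tier has changed between CLI versions, so check az mysql server create --help for your version.

# Create an Azure Database for MySQL server (all names and credentials are placeholders)
az mysql server create \
  --resource-group my-rg \
  --name my-mysql-server \
  --location westeurope \
  --admin-user myadmin \
  --admin-password '<a-strong-password>'
# The compute units / pricing tier are set with an additional sku or compute-units
# parameter whose exact name depends on your CLI version.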


Azure MySQL Backups

Grant Fritchey focuses on an area where Azure’s MySQL Platform as a Service offering really makes sense:

Why Is MySQL Platform as a Service Important?

I am going to answer this question. There are a lot of advantages to creating, using and developing against data storage within a PaaS offering. One of the biggest for me is backups. Microsoft is automatically taking backups of the MySQL databases you create within Azure. These are real, full backups. Microsoft validates the backups. As I write this, you’ll have the ability to restore your entire database, to any point in time, at intervals of five minutes, over the preceding 35 days. By programming against a MySQL database within Azure, you are gaining protection of the information you’re storing within your database, and you don’t have to do anything to benefit from this. It’s all part of the service.

Read the whole thing.
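The point-in-time restore Grant describes is also scriptable; a hedged sketch with the Azure CLI, where the server names, resource group, and timestamp are placeholders and flag names may vary by CLI version:

# Restore a new server from the automatic backups of an existing one,
# to a specific point in time within the retention window
az mysql server restore \
  --resource-group my-rg \
  --name my-mysql-server-restored \
  --source-server my-mysql-server \
  --restore-point-in-time "2017-06-15T13:10:00Z"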


Sub-Second Hive Analytics

Carter Shanklin and Slim Bouguerra have started a series on using Hive and Druid to obtain sub-second SQL queries over terabytes of data:

We’ll show how the Hive/Druid integration delivers ultra-fast SQL analytics that can be consumed from your favorite BI tool to get accelerated business results.  And we will show benchmark results of BI queries running in just milliseconds over a 1TB dataset.

What Is Druid?

Druid is a high-performance, column-oriented, distributed data store, which is well suited for user-facing analytic applications and real-time architectures. Druid is included as a technical preview in HDP 2.6 and you can read more about Druid on our project page, or at the project website.

This first post is mostly about Druid, which sounds like it might eventually become a very interesting technology for implementing Kimball-style warehouse models but for the whole “Joins?  We don’t need no steenkin’ joins” philosophy.  But when used as one engine component (as mentioned in the post), I can see it being quite useful.
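To give a flavor of what the integration looks like from the Hive side, here is a hedged sketch of creating a Druid-backed table from Beeline. The connection string, source table, and columns are placeholders, and the exact DDL may differ from what the series shows; the key pieces are the Druid storage handler and a timestamp column exposed to Druid as __time.

# Save the CTAS statement to a file and run it through Beeline
cat > create_druid_table.sql <<'SQL'
CREATE TABLE sales_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ('druid.segment.granularity' = 'MONTH')
AS
SELECT CAST(order_date AS timestamp) AS `__time`,
       customer_region,
       SUM(revenue) AS revenue
FROM sales_fact
GROUP BY CAST(order_date AS timestamp), customer_region;
SQL

beeline -u jdbc:hive2://hiveserver:10000 -f create_druid_table.sql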


Keeping Up With Analytics

Jen Underwood discusses the need to stay relevant in analytics and shares some tips on how to do so:

Although most analytics applications today still leverage older data warehouse and OLAP technologies on-premises, the pace of the cloud shift is significantly increasing. Infrastructure is getting better and is almost invisible in mature markets. Cloud fears are subsiding as more organizations witness the triumphs of early adopters. Instant, easy cloud solutions continue to win the hearts and minds of non-technical users. Cloud also accelerates time to market allowing for innovation at faster speeds than ever before. As data and analytics professionals, be sure to make time to learn a variety of cloud and hybrid analytics tools.

Exploring novel technologies across various ecosystems in the cloud world is usually as simple as spinning up a cloud image or service to get started. There are literally zillions of free and low cost resources for learning. As you dive into a new world of data, you will find common analytics architectures, design patterns, and types of technologies (hybrid connectivity, storage, compute, microservices, IoT, streaming, orchestration, database, big data, visualization, artificial intelligence, etc.) being used to solve problems.

It’s worth reading the whole thing.
