Press "Enter" to skip to content

Author: Kevin Feasel

Tips When Writing Extended Events To Files

Jason Brimhall has some tips to help you use the file target in Extended Events:

This first little tip comes from a painful experience. It is common sense to only try to create files in a directory that exists, but sometimes that directory has to be different on different systems. Then comes a little copy and paste of the last code that worked. You think you are golden, but you forgot that one little tweak to the directory being used. Oops.

Read on to see how SQL Server exposes that error, and then Jason shows us a different how-not-to with file targets.
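As a point of reference, the directory in question is the one baked into the file target's filename argument, so a minimal sketch of a session definition (the session name, event, and path here are invented for illustration) looks something like this:

    CREATE EVENT SESSION [TrackErrors] ON SERVER
    ADD EVENT sqlserver.error_reported
    ADD TARGET package0.event_file (SET filename = N'C:\XE\TrackErrors.xel')  -- C:\XE must already exist on this server
    WITH (STARTUP_STATE = OFF);

That filename value is the bit to double-check when you copy a working session from one server to another.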

Parallel Processing With The Pool Object In Python

Sanjay Kumar takes us through parallel processing in Python:

Parallel processing comes in two varieties of execution: synchronous and asynchronous.
In synchronous execution, once a process starts, it locks the main program until it completes.
Asynchronous execution doesn’t require locking; it performs tasks quickly, but the results can come back in a rearranged order.

Click through for a few examples using Pool.

Training A Text Classifier Against Books

Julia Silge builds a text classifier to differentiate Pride and Prejudice from War of the Worlds:

Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using cv.glmnet().

Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.

Load Multiple Input Data Sets For ML Services

Niels Berglund shows us a way to get more than one input data set passed into SQL Server Machine Learning Services:

This post came about due to a question on the Microsoft Machine Learning Server forum. The question was whether there are any plans by Microsoft to support more than one input dataset (@input_data_1) in sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database and retrieve data.
However, the poster was well aware of that, but for various reasons he did not want to do it that way; he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter, and then used the binary representation as an input parameter (yeah, this sounds confusing, but bear with me). I did some research (read: Googling) and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.

This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for multiple input data sets without the need for hacks.
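For flavor, here is a rough sketch of the workaround described above (the table names, columns, and the R merge logic are hypothetical; see Niels's post and the linked StackOverflow answer for the real details). The first call serializes the "extra" data set into a varbinary output parameter, and the second call passes that binary back in alongside @input_data_1:

    DECLARE @extra varbinary(max);

    -- Step 1: serialize the second data set into a binary output parameter
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'extra <- serialize(InputDataSet, NULL)',
        @input_data_1 = N'SELECT CustomerID, SalesAmount FROM dbo.SecondDataSet',
        @params = N'@extra varbinary(max) OUTPUT',
        @extra = @extra OUTPUT;

    -- Step 2: push the binary in as a parameter and unserialize it inside the script
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
            second_df <- unserialize(extra)
            OutputDataSet <- merge(InputDataSet, second_df, by = "CustomerID")',
        @input_data_1 = N'SELECT CustomerID, CustomerName FROM dbo.FirstDataSet',
        @params = N'@extra varbinary(max)',
        @extra = @extra;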

Returning NULL on NULL Input In UDFs

Jonathan Kehayias shows us a performance improvement you can get if your user-defined function is expected to return NULL if you pass in NULLs for inputs:

I was really curious about the RETURNS NULL ON NULL INPUT function option, so I decided to do some testing. I was very surprised to find out that it’s actually a form of scalar UDF optimization that has been in the product since at least SQL Server 2008 R2.
It turns out that if you know that a scalar UDF will always return a NULL result when a NULL input is provided, then the UDF should ALWAYS be created with the RETURNS NULL ON NULL INPUT option, because then SQL Server doesn’t even run the function definition at all for any rows where the input is NULL – short-circuiting it in effect and avoiding the wasted execution of the function body.

The more often you pass in NULL to that function, the better your performance will be relative to the default case.
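A minimal example of the option (the function and table here are made up):

    CREATE FUNCTION dbo.SquareIt (@Value int)
    RETURNS int
    WITH RETURNS NULL ON NULL INPUT  -- the body below never runs when @Value is NULL
    AS
    BEGIN
        RETURN @Value * @Value;
    END;
    GO

    -- For rows where SomeColumn is NULL, SQL Server short-circuits the call and returns NULL directly.
    SELECT dbo.SquareIt(SomeColumn) AS Squared
    FROM dbo.SomeTable;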

Exporting To Text With SQL Server: Comparing Methods

Phil Factor shows us several ways of exporting data from SQL Server to files and gives us size and time comparisons:

I enjoy pulling the data out of AdventureWorks. It is a great test harness. What is the quickest way of doing it? Well, everyone knows it is native BCP, but how much faster is that format than tab-delimited or comma-delimited BCP? Can we quickly output data in XML? Is there a way of outputting array-in-array JSON reasonably quickly? Of course, the answer is going to vary from system to system, and across versions, but any data about this is usually welcome.
In addition to these questions, I wanted to know more about how much space these files take up, either raw or zipped. We’re about to find out. We’ll test all that, using good ol’ BCP and SQLCMD.
My motivation for doing this was to explore ways of quickly transferring data to MongoDB, and to test out a way of producing array-in-array JSON at a respectable turn of speed. It turned out to be tricky. The easy and obvious ways were slow.

As is usual for Phil, this article is done quite well.

Generating Tally Tables With Snowflake DB

Koen Verbeeck shows us how to use the GENERATOR function to build a tally table in Snowflake:

I’m starting a little series on some of the nice features/capabilities in Snowflake (the cloud data warehouse). In each part, I’ll highlight something that I think is interesting enough to share. It might be some SQL function that I’d really like to see in SQL Server, it might be something else.
In the first part: the GENERATOR function. In short, it lets you generate a virtual table with a specified number of rows, or in database lingo: a tally table (or numbers table). A tally table can help you solve a lot of problems in SQL, but the TL;DR version is that it replaces loops/cursors most of the time and allows you to tackle the issue in a true set-based manner.
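To give a flavor of it, a minimal numbers-table sketch in Snowflake might look like this (the row count is arbitrary; ROW_NUMBER() is used because SEQ4() itself can contain gaps):

    SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS n
    FROM TABLE(GENERATOR(ROWCOUNT => 10000));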

Naturally, as I read the article, I got the Bad Religion song stuck in my head. That’s an occupational hazard, I suppose.

Power BI Tab Order

Meagan Longoria shows us how to set the tab order for our Power BI elements:

Tab order is the order in which users interact with the items on a page using the keyboard. Generally, we want tab order to be predictable and to closely match the visual order on the page (unless there is a good reason to deviate). If you press the tab key in a Power BI report, you might be surprised at the seemingly random order in which you move from visual to visual on the page. The default order is the order in which the visuals were placed on the page in Power BI Desktop, or the last modified order in PowerBI.com if you have edited your report there.

I had not actually tried to tab through a dashboard in Power BI, so this is news to me.

Reasons Why We Get Identity Column Gaps

Steve Jones walks through several reasons why you might see gaps in identity columns:

Deleting Rows
This is noted in the tweet as a cause, but let’s test this.
One of the common ways that we get gaps in identity values is when rows are deleted. Let’s remove the row with Steve in it.

Steve explains a few others, but even that’s not complete: identity columns can jump after the service restarts as well.
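To make the delete case concrete, here is a quick sketch of my own (the table and values are invented):

    CREATE TABLE dbo.People (ID int IDENTITY(1,1), Name varchar(50));
    INSERT dbo.People (Name) VALUES ('Steve'), ('Anna'), ('Raj');  -- IDs 1, 2, 3
    DELETE FROM dbo.People WHERE Name = 'Steve';                   -- ID 1 is gone for good
    INSERT dbo.People (Name) VALUES ('Mei');                       -- gets ID 4, leaving a gap at 1
    SELECT ID, Name FROM dbo.People;                               -- returns 2, 3, 4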

In short, please do not use identity values in cases where you need to guarantee sequentiality (like check numbers or invoice numbers). Don’t use sequences either, as they’ll behave similarly.

Vectorization With Apache Hive And Parquet Tables

Vihang Karajgaonkar, et al, take us through a performance improvement in Apache Hive when using Parquet tables:

The performance benchmarks on CDH 6.0 show that enabling Parquet vectorization significantly improves performance for a typical ETL workload. In the test workload (TPC-DS), enabling Parquet vectorization gave 26.5% performance improvement on average (geomean value of runtime for all the queries). Vectorization achieves these performance improvements by reducing the number of virtual function calls and leveraging the SIMD instructions on modern processors. A query is vectorized in Hive when certain conditions like supported column data-types and expressions are satisfied. However, if the query cannot be vectorized, its execution falls back to a non-vectorized execution. Overall, for workloads which use the Parquet file format on most modern processors, enabling Parquet vectorization can lead to better query performance in CDH 6.0 and beyond.

This is worth looking into, especially if you are on the Cloudera stack.
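If you want to try it, the relevant session-level switch is straightforward (a sketch assuming CDH 6 defaults; the table here is hypothetical):

    SET hive.vectorized.execution.enabled=true;

    -- Queries over Parquet-backed tables can then use the vectorized readers,
    -- provided the column types and expressions involved are supported.
    SELECT category, COUNT(*) AS order_count
    FROM sales_parquet
    GROUP BY category;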
