Press "Enter" to skip to content

Category: Self-Promotion

Where Polybase Stats Live

I dig into where the statistics against a Polybase table actually live:

Today, we learned that Polybase statistics are stored in the same way as other statistics; as far as SQL Server is concerned, they’re just more statistics built from a table (remembering that the way stats get created involves loading data into a temp table and building stats off of that temp table).  We can do most of what you’d expect with these stats, but beware calling sys.dm_db_stats_properties() on Polybase stats, as they may not show up.

Also, remember that you cannot maintain, auto-create, auto-update, or otherwise modify these stats.  The only way to modify Polybase stats is to drop and re-create them, and if you’re dealing with a large enough table, you might want to take a sample.

The result isn’t very surprising in retrospect, and it’s good that “stats are stats are stats” is the correct answer.

Comments closed

Configuring Polybase

I have a blog post up on configuring Polybase:

Microsoft’s next recommendation is to make sure that predicate pushdown is enabled.  To do that, we’re going to go back to the Hadoop VM and grab our yarn.application.classpath from there.  To do that, cd to /etc/hadoop/conf/ and vi yarn-site.xml (or use whatever other text reader you want).  Copy the value for yarn.application.classpath, which should be a pretty long string.  Mine looks like:


Now that you have a copy of that value, go to your SQL Server installation directory (by default, C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf) and open up yarn-site.xml.  Paste the value into the corresponding yarn.application.classpath setting and you’re good to go.

This is part one of a series on using Polybase.

Comments closed

SQL Data Partners Podcast: The Wide World Of Data

Carlos L. Chacon was nice enough to interview me on his podcast:

The expansion of data sets and increased expectations of businesses for analysis and modeling of data has led developers to create a number of database products to meet those needs. As data professionals, it is incumbent upon us to understand how these tools work and put them to their best use–before somebody else puts them to sub-optimal use.  I am joined by Kevin Feasel who walks us through some of the technologies available and sorts out under what circumstances we want to consider using each one.

Show notes are on the SQL Data Partners podcast site.  My presentation slides are available online.  And if I get just a few more people to dig Aphyr as much as I do, the world will be a better place.

Comments closed


I walk through a scenario which underscores the importance of normalization:

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators).  I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.

The implicit story here is, you can find someone who knows how to use tally tables, how to concatenate strings of data (quickly!), who knows how to tie various pieces of the puzzle together, and so on…or design the database the right way and avoid this pain later.

Comments closed

Reorganize Columnstore Indexes

I have a new script available to reorganize columnstore indexes:

Note that this script requires SQL Server 2016 (or later) because the database engine team made some great changes to columnstore indexes, allowing us to use REORGANIZE to clear out deleted rows and compact row groups together, as well as its previous job of marking open delta stores as available for compression.

The code is available as a Gist for now, at least until I decide what to do with it.  Comments are welcome, especially if I’m missing a major reorganize condition.

As mentioned, comments are welcome.

Comments closed

SQL Injection Blacklists Are Bad

Eli Leiba created a function to try to generate a blacklist against SQL injection:

The suggested solution presented here involves creating a user defined T-SQL scalar function that checks the input string for any suspicious key words that might indicate the SQL injection intents.

The function checks the input string against a set of pre-defined keywords that are known to be used in SQL injection cases.

I get the intent here, but blacklists don’t work.

The first line of defense that many developers come up with is a blacklist: we know that keywords like “select,” “insert,” and “drop” are necessary to perform a SQL injection attack, so if we just ban those keywords, everything should be fine, right? Alas, life is not so simple; this leads to a number of problems with blacklists in general, as well as in this particular case.

The second-biggest problem with blacklists is that they could block people from performing legitimate requests. For example, a user at a paint company’s website may wish to search for “drop cloths,” so a naïve blacklist, outlawing use of the word “drop” in a search would lead to false positives.

The biggest problem is that, unless extreme care is taken, the blacklist will still let through malicious code. One of the big failures with SQL injection blacklists is that there are a number of different white-space characters: hex 0x20 (space), 0x09 (tab), 0x0A, 0x0B, 0x0C, 0x0D, and 0xA0 are all legitimate white-space as far as a SQL Server query is concerned. If the blacklist is looking for “drop table,” it is looking for the word “drop,” followed by a 0x20 character, followed by the word “table.” If we replace the 0x20 with a 0x09, it sails right by the blacklist.

With this particular blacklist, you have a pretty high probability of false positives:  the list includes dashes, “tran,” “update,” “while,” “grant,” and even “go.”  These are tokens used in SQL injection attempts, but they’re also very common words or word segments in English.  This means that if you’re trying to blacklist a publicly-accessible search box which reads common English phrases, the incidence of false positive is going to be high enough that the blacklist changes.  But even if it doesn’t, a dedicated attacker can still get around your blacklist; as the old saying goes, the attacker only needs to be right once.

Comments closed

Migrating A VM To Azure

I have a post up on how I migrated my presentation VM to Azure:

Once sysprep was done, I needed to find a way to get the VMDK files converted to VHDs.  A blog post turned me on to StarWind Software’s V2V Converter.  It’s a free tool which allows you to convert virtual hard drive files from one format to another.  Installing this tool let me turn my set of VMDKs into one 45GB VHD.  One note is that, at least on my machine, I needed to run the V2V Converter from a command prompt; executing the app directly from the Start menu would cause the app to appear for a moment and then disappear, as though some error killed the program.  The tool installs by default in “%programfiles(x86)%\StarWind Software\StarWind V2V Image Converter\StarV2V.exe” From there, I just needed to get that big image into Azure.

This VM is really a Plan C or Plan D for me, but it’s good to have layers of redundancy.

Comments closed

Installing And Using SQL Server R Services

I have three blog posts on installing and using R in SQL Server.

First, installing SQL Server R Services:

I’m excited that CTP 3 of SQL Server 2016 is publicly available, in no small part because it is our first look at SQL Server R Services.  In this post, I’m going to walk through installing Don’t-Call-It-SSRS on a machine.

Then, using RODBC to connect a Linux machine with RStudio installed to a SQL Server instance:

Getting a Linux machine to talk to a SQL Server instance is harder than it should be.  Yes, Microsoft has a Linux ODBC driver and some easy setup instructions…if you’re using Red Hat or SuSE.  Hopefully this helps you get connected.

If you’re using RStudio on Windows, it’s a lot easier:  create a DSN using your ODBC Data Sources.

Finally, using SQL Server R Services:

So, what’s the major use of SQL Server R Services?  Early on, I see batch processing as the main driver here.  The whole point of getting involved with Revolution R is to create sever-quality R, so imagine a SQL Agent job which runs this procedure once a night against some raw data set.  The R job could build a model, process that data, and return a result set.  You take that result set and feed it into a table for reporting purposes.  I’d like to see more uses, but this is probably the first one we’ll see in the wild.

It’s a preview of a V1 product.  Keep that in mind.

The first and third posts are for CTP 3, so beware the time-sensitive material warnings.

Comments closed

Welcome To Curated SQL

Curated SQL is a simple premise:  act as a clearinghouse for high-quality SQL Server material.  Basically, we want to distill the tremendous number of articles and blog posts and make it easy for you to find the best of the web.  This is a curated site, so all posts are lovingly hand-crafted, with nary a bot to be found.

If you want to learn more, read our About page.  Also follow us on Twitter.  And then, scroll up and enjoy the show.

Comments closed