Curated SQL Posts

Sparklyr On HDInsight

Ali Zaidi has a walkthrough on using sparklyr on HDInsight:

The majority of Spark is written in Scala (~80% of Spark core), which is a functional programming language. Functional programming languages emphasize functional purity (the output only depends on the inputs) and strive to avoid side-effects. One important component of most functional programming languages is their lazy evaluation. While it might seem odd that we would appreciate laziness from our computing tools, lazy evaluation is an effective way of ensuring computations are evaluated in the most efficient manner possible.

Lazy evaluation allows Spark SQL to heavily optimize queries. When a user submits a query to Spark SQL, Spark composes the components of the SQL query into a logical plan. The logical plan is basically a recipe Spark SQL creates in order to evaluate the desired query. Spark SQL then submits the logical plan to its highly optimized engine called Catalyst, which optimizes this plan into a physical plan of action that is executed inside Spark’s computation engine (a series of coordinating JVMs).
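
To make the idea of deferring work until a plan can be optimized a bit more concrete, here is a minimal sketch in plain Python. It is not Spark or Catalyst code: the `LazyQuery` class and its methods are hypothetical names invented for this illustration, and the only "optimization" shown is fusing adjacent filters into one step before anything runs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

# A plan step is a (kind, function) pair, where kind is "filter" or "select".
Step = Tuple[str, Callable[[Any], Any]]


def _both(p: Callable[[Any], bool], q: Callable[[Any], bool]) -> Callable[[Any], bool]:
    """Combine two predicates into a single predicate."""
    return lambda row: p(row) and q(row)


@dataclass(frozen=True)
class LazyQuery:
    """Hypothetical toy query: records steps as a plan and runs nothing until collect()."""
    source: Iterable[Any]
    plan: Tuple[Step, ...] = ()

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyQuery":
        # Only records the step; no rows are touched here.
        return LazyQuery(self.source, self.plan + (("filter", predicate),))

    def select(self, fn: Callable[[Any], Any]) -> "LazyQuery":
        return LazyQuery(self.source, self.plan + (("select", fn),))

    def optimized_plan(self) -> List[Step]:
        # Toy optimization: fuse adjacent filters into one step.
        fused: List[Step] = []
        for kind, fn in self.plan:
            if kind == "filter" and fused and fused[-1][0] == "filter":
                fused[-1] = ("filter", _both(fused[-1][1], fn))
            else:
                fused.append((kind, fn))
        return fused

    def collect(self) -> List[Any]:
        # Execution happens only here, against the optimized plan.
        rows = iter(self.source)
        for kind, fn in self.optimized_plan():
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


query = (
    LazyQuery(range(10))
    .filter(lambda x: x % 2 == 0)
    .filter(lambda x: x > 2)
    .select(lambda x: x * 10)
)
result = query.collect()
print(result)
```

The three recorded steps collapse to two before execution, which is the same shape of idea (on a very small scale) as turning a logical plan into a cheaper plan before running it.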

Read on for more description and code.

Performance Of IN

Daniel Janik looks at how the IN clause behaves differently based on the number of items in the list:

As you can see, the second query is much slower, and the extra value in the IN caused late filtering. This is a limitation on some types of operators such as this clustered index scan.

There isn’t just a limitation of 15 input values. There’s also one at 64. On the 65th input value, the list will be converted to a constant scan, which is then sorted and joined. Interestingly enough, the list in my demo query is already sorted ascending.
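
If you want to try the boundaries yourself, a small helper can generate IN lists on either side of each threshold. This is only a sketch: it assumes the interesting sizes are 15/16 and 64/65 as the excerpt describes, and the table and column names (`dbo.Posts`, `Id`) are placeholders rather than anything from the original demo.

```python
from typing import Dict, List, Sequence


def build_in_query(table: str, column: str, values: Sequence[int]) -> str:
    """Build a SELECT with an integer IN list (values are coerced to int)."""
    in_list = ", ".join(str(int(v)) for v in values)
    return f"SELECT COUNT(*) FROM {table} WHERE {column} IN ({in_list});"


def threshold_queries(
    table: str, column: str, sizes: Sequence[int] = (15, 16, 64, 65)
) -> Dict[int, str]:
    """One query per list size, using the ascending values 1..n."""
    return {n: build_in_query(table, column, list(range(1, n + 1))) for n in sizes}


# Hypothetical table and column names; substitute your own.
queries = threshold_queries("dbo.Posts", "Id")
for size, sql in queries.items():
    print(size, sql)
```

Run each generated statement with the actual execution plan turned on and compare the plans on either side of each boundary.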

Read the whole thing.

Bulk Administration

Kenneth Fisher discusses the bulk administration right:

So as with all permissions, we only grant them if there is an actual need, right? And the best practice of least privilege says that if someone has to be able to do a bulk load on a table, then we should grant the bulk load to that one table, right? There’s the rub. Bulk admin permissions are at the instance level and are not granular in any way. I.e., you can’t grant it specifically to a single database or table. It’s all or nothing.

Read on for Kenneth’s thoughts.

SSRS + Power BI Desktop

Andrew Peterson walks through the steps to check out the SSRS 2016 preview which supports Power BI Desktop:

SSRS 2016 supporting Power BI Desktop reports is now in preview on Azure. But many of us would rather review this in our own virtual environment, and more specifically, VirtualBox. Well, now you can.

Our starting point was a blog post by Microsoft employee Christopher Finlan outlining the steps needed to set up this preview in a Hyper-V environment. A great start, but what we wanted was the ability to run it in VirtualBox. Fortunately for us, running the downloaded VHD in VirtualBox is much easier than Hyper-V.

Click through for the instructions.

Brackets Don’t Improve Performance

Jay Robinson shows that wrapping identifiers with brackets does nothing for performance:

Anyway, this obsession had me thinking – does wrapping identifiers in square brackets save SQL Server any time? Does it say to the optimizer, “Hey, I PROMISE this whole thing inside these square brackets is an identifier. Cross my heart.” And the optimizer takes your code at its word and doesn’t look through its list of reserved keywords for one that matches AccountCreateDate or address_line_2?

The answer is… no. Throwing every identifier into square brackets doesn’t speed it up at all. Here’s the test:

Read on for the test.

PolyBase External Data Source To Hadoop

I take a look at connecting to a Hadoop cluster for PolyBase:

There are a couple of things I want to point out here.  First, the Type is HADOOP, one of the three types currently available:  HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).

Second, the Location is my name node on port 8020.  If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and go to HDFS and then Configs.  In the Advanced tab, you can see the name node.

There are different options available for different sources, but this post is focused on Hadoop.
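
As a rough illustration of how the Type and Location settings fit together, the following Python sketch assembles a `CREATE EXTERNAL DATA SOURCE` statement as a string. The host and port come from the excerpt; the data source name `HDP` and the helper function are assumptions made up for this example, and the sketch leaves out optional settings.

```python
def external_data_source_ddl(name: str, name_node_host: str, port: int = 8020) -> str:
    """Return T-SQL for a Hadoop external data source pointing at the name node."""
    location = f"hdfs://{name_node_host}:{port}"
    return (
        f"CREATE EXTERNAL DATA SOURCE {name} WITH (\n"
        f"    TYPE = HADOOP,\n"
        f"    LOCATION = '{location}'\n"
        f");"
    )


# "HDP" is a hypothetical data source name; the host is the sandbox from the excerpt.
ddl = external_data_source_ddl("HDP", "sandbox.hortonworks.com")
print(ddl)
```

Note that the Location uses the name node port (8020), not the Ambari web port (8080).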

The Value Of Unused Indexes

Erik Darling provides a scenario in which an index which does not get used in an execution plan can nonetheless help query performance:

We can see an example of this with unique indexes and constraints, but another possibility is that the created index had better statistical information via the histogram. When you add an index, you get Fresh Hot Stats, whereas the index you were using could be many modifications behind current for various reasons. That can happen if you have a big table and don’t hit auto-update thresholds often, if you’re not manually updating statistics somehow, or if you’re running into ascending key weirdness. These are all sane potential reasons. One insane potential reason is if you have auto create stats turned off, and the index you create is on a column that didn’t have a statistics object associated with it. But you’d see plan warnings about operators not having associated statistics.

Again, we’re going to focus on how ADDING an index your query doesn’t use can help. I found out the hard way that both unique indexes and constraints can cease being helpful to cardinality estimation when their statistics get out of date.

This is sort of like a triple bank shot solution:  even if it works that one time, there are easier ways to do it, and those ways are more likely to succeed to boot.

Comments And Performance

Aaron Bertrand looks at whether comments affect query performance:

Every once in a while, a conversation crops up where people are convinced that comments either do or don’t have an impact on performance.

In general, I will say that, no, comments do not impact performance, but there is always room for an “it depends” disclaimer.

I’m glad that there’s no appreciable difference.  Even if there were, good comments are valuable enough to make me not care about performance implications.  But fortunately, that’s not a trade-off I have to make.

PowerShell Cmdlets For SSRS

Aaron Nelson reports that there are now PowerShell cmdlets for SQL Server Reporting Services:

I have been testing these commands for several weeks, and so far my favorite command is Write-RsFolderContent because it allows you to write the .RDL & .RSD files from a directory on your machine to your SSRS folder. Like the whole thing. You don’t have to throw it into a loop or anything. Try it out!

This is a wonderful replacement for the old RSScripter app (of which I still have a copy squirreled away somewhere).
