Press "Enter" to skip to content

Curated SQL Posts

Debugging Spark In HDInsight

Sajib Mahmood gives various methods for debugging Spark applications running on an HDInsight cluster:

Spark Application Master

To access the Spark UI for the running application and get more detailed information on its execution, use the Application Master link and navigate through the tabs, which contain more information on jobs, stages, executors, and so on.

These methods also apply to on-prem Spark clusters, although the resource locations might be a little different.


Parallel Stats Sampling

SQL Scotsman shows which statistics-building operations are parallel and which are single-threaded:

“Starting with SQL Server 2016, sampling of data to build statistics is done in parallel, when using compatibility level 130, to improve the performance of statistics collection. The query optimiser will use parallel sample statistics whenever a table size exceeds a certain threshold.”

As per the previous demos, prior to SQL Server 2016 the gathering of sampled statistics is a serial, single-threaded operation; only statistics operations gathered using a full scan would qualify for parallelism. In SQL Server 2016, all automatic and manual statistics operations qualify for parallelism.
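As a rough illustration of the behavior (the database and table names here are hypothetical), a sampled statistics update like the one below becomes eligible for a parallel plan once the database is at compatibility level 130:

-- Hypothetical names; adjust to your environment.
ALTER DATABASE SalesDemo SET COMPATIBILITY_LEVEL = 130;
GO

-- Under compatibility level 130, this sampled statistics update
-- can get a parallel plan; before SQL Server 2016 it would have
-- run single-threaded.
UPDATE STATISTICS dbo.BigSalesTable WITH SAMPLE 10 PERCENT;

-- Prior to 2016, only a full-scan statistics operation qualified:
UPDATE STATISTICS dbo.BigSalesTable WITH FULLSCAN;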

He also has a neat trick for invalidating stats on a large table, so check out this article-length blog post.


Understanding Azure SQL Elastic Pool

Vincent-Philippe Lauzon explains how SQL Elastic Pools work and why we might want to use them in Azure:

Along came Elastic Pool.  Interestingly, Elastic Pools brought back the notion of a centralized compute shared across databases.  Unlike on-premises SQL Server, though, that compute doesn’t sit with the server itself but with a new resource called an elastic pool.

This allows us to provision certain compute, i.e. DTUs, to a pool and share it across many databases.

His example uses a large number of small databases, where the total load at any given moment is never the sum of the individual expected loads.  Another reason to use a pool is for cross-database queries in Azure.
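On that last point, cross-database queries in Azure SQL Database go through the elastic query feature, which exposes a table in another database as a local external table. A minimal sketch, with hypothetical server, database, credential, and table names:

-- Run in the "head" database that issues the cross-database query.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';

CREATE DATABASE SCOPED CREDENTIAL RemoteCred
WITH IDENTITY = 'remote_login', SECRET = '<remote login password>';

CREATE EXTERNAL DATA SOURCE RemoteOrdersDb
WITH (
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'OrdersDb',
    CREDENTIAL = RemoteCred
);

-- The column list must match the remote table's definition.
CREATE EXTERNAL TABLE dbo.Orders
(
    OrderId INT,
    CustomerId INT,
    OrderDate DATETIME2
)
WITH (DATA_SOURCE = RemoteOrdersDb);

-- Queries against the external table are sent to the remote database.
SELECT TOP (10) * FROM dbo.Orders;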


Power BI Alerts

Nicolo Grando shows how to create an alert in Power BI when a measure reaches a certain mark:

Set alerts to notify you when data in your dashboards changes beyond limits you set. Alerts work for numeric tiles featuring cards and gauges. Only you can see the alerts you set, even if you share your dashboard. Data alerts are fully synchronized across platforms.

Click through to see how to set one up.

That’s useful for turning Power BI dashboards into partial alerting systems.


Stats Histogram DMV

Erik Darling looks at a new DMV in vNext CTP 1.1:

It’s not exactly perfect

For instance, if you just let it loose without filters, you get a severe error. The same thing happens if you try to filter on one of the columns in the function rather than a column in sys.stats.

Very cool.  It’s one step closer to removing our dependency on DBCC SHOW_STATISTICS.
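For reference, the working pattern looks something like this (the table name is hypothetical): filter on sys.stats columns and CROSS APPLY into the new function, rather than filtering on the function's own columns.

-- Filter on sys.stats, then CROSS APPLY into the function;
-- per Erik's post, filtering directly on the function's columns
-- is what triggers the severe error.
SELECT  s.name AS stats_name,
        h.step_number,
        h.range_high_key,
        h.range_rows,
        h.equal_rows,
        h.distinct_range_rows
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID('dbo.Users');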


Deploying VMs To Azure Using PowerShell

Rob Sewell shows how to use PowerShell to create your own Azure VM instance of the Microsoft data science virtual machine:

First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically, you first have to log in to the portal and click some buttons.

In the Portal, click New, then Marketplace, and then search for data science. Choose the Windows Data Science Machine, and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started.” Clicking this will lead to the following blade.

Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.


Identity Column Rollback

David Alcock figures out how identity columns behave when transactions get rolled back:

Identity columns are a very commonly used feature within tables in SQL Server. Basically, when specified as an identity, a column will automatically increment by the specified value; so if we have an identity increment of 1 and insert 5 rows, they will automatically be numbered 1 to 5.
One cautionary measure with identities is that they don’t reset themselves when rows are deleted. If we delete rows 4 and 5, the next row will still be populated as identity 6. That’s fine, but what happens if we roll back an insert?

Read on for the answer, or run the quick repro below to see it for yourself.
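Here is a minimal repro sketch you can run in a scratch database (the table name is made up):

CREATE TABLE dbo.IdentityTest
(
    Id  INT IDENTITY(1, 1) NOT NULL,
    Val CHAR(1) NOT NULL
);

INSERT dbo.IdentityTest (Val) VALUES ('a'), ('b'), ('c');

BEGIN TRANSACTION;
    INSERT dbo.IdentityTest (Val) VALUES ('d'), ('e');
ROLLBACK TRANSACTION;

-- The rollback removes the rows, but the identity values consumed
-- inside the transaction are not handed back.
SELECT IDENT_CURRENT('dbo.IdentityTest') AS LastIdentityValue;

INSERT dbo.IdentityTest (Val) VALUES ('f');
SELECT Id, Val FROM dbo.IdentityTest;

DROP TABLE dbo.IdentityTest;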


Partition Handling In Spark 2.1

Eric Liang, et al., discuss a change to Spark 2.1 which will make certain partitioned table access faster:

In Spark 2.1, we drastically improve the initial latency of queries that touch a small fraction of table partitions. In some cases, queries that took tens of minutes on a fresh Spark cluster now execute in seconds. Our improvements cut down on table memory overheads, and make the SQL experience starting cold comparable to that on a “hot” cluster with table metadata fully cached in memory.
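To make the scenario concrete, here is a hypothetical Spark SQL example (table and partition column names are made up): with Spark 2.1’s deferred partition metadata loading, a query that touches one partition no longer pays to list and cache metadata for every partition in the table.

-- A date-partitioned data source table (hypothetical names).
CREATE TABLE logs (event STRING, ts TIMESTAMP, ds STRING)
USING parquet
PARTITIONED BY (ds);

-- Touches a single partition; in Spark 2.1 only that partition's
-- metadata needs to be fetched from the catalog, so a cold cluster
-- answers this far faster than before.
SELECT COUNT(*)
FROM logs
WHERE ds = '2017-01-09';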

This looks like a nice improvement in Spark.


Power BI On-Prem In 2017

Paul Turley points out a blog post from the Reporting Services team:

When will we have this next Technical Preview?

We’re targeting January 2017 to release this next Technical Preview.

What’s the release vehicle for a production-ready version?

We plan to release the production-ready version in the next SQL Server release wave. We won’t be releasing it in a Service Pack, Cumulative Update, or other form of update for SSRS 2016.

When will we have a production-ready version?

We’re targeting availability in mid-2017.

That makes it sound like they’re pushing it to coincide with the vNext release.


Multidplyr

Matt Dancho shows how to use multidplyr to perform parallel processing on data cleansing activities:

There’s nothing more frustrating than waiting for long-running R scripts to iteratively run. I’ve recently come across a new-ish package for parallel processing that plays nicely with the tidyverse: multidplyr. The package has saved me countless hours when applied to long-running, iterative scripts. In this post, I’ll discuss the workflow to parallelize your code, and I’ll go through a real world example of collecting stock prices where it improves speed by over 5X for a process that normally takes 2 minutes or so. Once you grasp the workflow, the parallelization can be applied to almost any iterative scripts regardless of application.

This is a longer article, but if you’re using dplyr with R today, it’s worth a read.
