Connecting Apache Drill To Power BI

Bryan Smith shows how to connect Apache Drill to Power BI:

Clicking Next takes me to the From ODBC dialog.  Here, I click on the Advanced options item, ignoring the Data Source Name (DSN) drop-down, and enter a connection string with the appropriate substitution for the host parameter:

driver={MapR Drill ODBC Driver};connectiontype=Direct;;port=31010;authenticationtype=No Authentication

Notice the connection string employs a Direct connection type, indicating that the app will speak directly to one of the nodes in the cluster (as identified by the host parameter) and not to the ZooKeeper service. ZooKeeper is in use on the cluster but is not exposed externally, given the network security group changes made during my earlier deployment.  Even if ZooKeeper were exposed, it tracks the nodes of the cluster using their internal names so that any app outside the virtual network containing the cluster would not be able to leverage the information in ZooKeeper to form a connection.  The only option that works here is the Direct connection type.

It’s worth reading the whole thing, as well as checking out the UserVoice suggestion for implementing full Apache Drill support.

Debugging Spark In HDInsight

Sajib Mahmood gives various methods for debugging Spark applications running on an HDInsight cluster:

Spark Application Master

To access Spark UI for the running application and get more detailed information on its execution use the Application Master link and navigate through different tabs containing more information on jobs, stages, executors and so on.

These methods also apply for on-prem Spark clusters, although the resource locations might be a little different.

Parallel Stats Sampling

SQL Scotsman shows which statistics-building operations are parallel and which are single-threaded:

“Starting with SQL Server 2016, sampling of data to build statistics is done in parallel, when using compatibility level 130, to improve the performance of statistics collection. The query optimiser will use parallel sample statistics whenever a table size exceeds a certain threshold.”

As per the previous demos, prior to SQL Server 2016, the gathering of sampled statistics are serial, single-threaded operations.  Only statistic operations gathered using a full scan would qualify for parallelism.  Now in SQL Server 2016, all automatic and manual statistic operations qualify for parallelism.

He also has a neat trick for invalidating stats on a large table, so check out this article-length blog post.

Understanding Azure SQL Elastic Pool

Kevin Feasel



Vincent-Philippe Lauzon explains how SQL Elastic Pools work and why we might want to use them in Azure:

Along came Elastic Pool.  Interestingly, Elastic Pools brought back the notion of a centralized compute shared across databases.  Unlike on premise SQL Server on premise though, that compute doesn’t sit with the server itself but with a new resource called an elastic pool.

This allows us to provision certain compute, i.e. DTUs, to a pool and share it across many databases.

His example is using a large number of small databases, where the total load is never the sum of individual expected loads.  Another reason to use a pool is for cross-database queries in Azure.

Power BI Alerts

Nicolo Grando shows how to create an alert in Power BI when a measure reaches a certain mark:

Set alerts to notify you when data in your dashboards changes beyond limits you set. Alerts work for numeric tiles featuring cards and gauges. Only you can see the alerts you set, even if you share your dashboard. Data alerts are fully synchronized across platforms.

How to do it?

That’s useful for turning Power BI dashboards into partial alerting systems.

Stats Histogram DMV

Erik Darling looks at a new DMV in vNext CTP 1.1:

It’s not exactly perfect

For instance, if you just let it loose without filters, you get a severe error. The same thing happens if you try to filter on one of the columns in the function, rather than a column in sys.stats, like this.

Very cool.  It’s one step closer to us removing our dependencies on DBCC SHOW_STATISTICS.

Deploying VMs To Azure Using Powershell

Rob Sewell shows how to use Powershell to create your own Azure VM instance of the Microsoft data science virtual machine:

First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically  you first have to login to the portal and click some buttons.

In the Portal click new and then marketplace and then search for data science. Choose the Windows Data Science Machine and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started” Clicking this will lead to the following blade.

Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.

Identity Column Rollback

David Alcock figures out how identity columns behave when transactions get rolled back:

Identity columns are a very commonly used feature within tables in SQL Server. Basically when specified as an identity a column will automatically increment by the specified value; so if we have an identity increment of 1 and insert 5 rows they will automatically be numbered 1 to 5.
One cautionary measure with identities is that they don’t reset themselves when rows are deleted. If we delete rows 4 and 5 the next row will still be populated as identity 6. That’s fine, but what happens if we rollback an insert.

Read on for the answer.


December 2016
« Nov Jan »