Using sparklyr

Kevin Feasel


R, Spark

Hossein Falaki and Xiangrui Meng show how to use sparklyr on a Databricks Spark cluster:

We collaborated with our friends at RStudio to enable sparklyr to seamlessly work in Databricks clusters. Starting with sparklyr version 0.6, there is a new connection method in sparklyr: databricks. When calling spark_connect(method = "databricks") in a Databricks R Notebook, sparklyr will connect to the spark cluster of that notebook. As this cluster is fully managed, you do not need to specify any other information such as version, SPARK_HOME, etc.

I’d lean toward sparklyr over SparkR because of the former’s tidyverse-centric view.

Play Axis Custom Visual

Devin Knight continues his Power BI custom visuals series:

In this module you will learn how to use the Play Axis Power BI Custom Visual.  The Play Axis visual works like a dynamic slicer that animates your other report visuals without needing to click every time you want to change your filter value.

This is a valuable custom visual when dealing with time series data, but as Devin shows, you can iterate through other sets, like a set of employee names.

Linux Administrative Basics For The SQL Server DBA

David Klee continues his SQL Server on Linux series with a discussion of basic Linux installation and usage:

You’ll want to learn the syntax for one of the console-based text editors. My personal favorite is ‘vi‘. It’s quick, streamlined, but does have a significant learning curve. Emacs is another editor that works great. Many others are out there, and your options open even more if you’re using a GUI. You’ll need an editor to edit configuration files.

The folder structure of Linux is one of the biggest changes. Whereas Windows is based off of an arbitrary drive-letter assignment system that dates back to the DOS era, Linux is is based off of a tree structure. All folders and files are based on a single point, ‘/’ or the root folder, and everything is based off of folders from this point. Certain folders from Windows, such as C:\Windows, C:\Users\username, or %WINDOWSTEMP%, are mapped to certain folders within the Linux operating system.

This is really high-level stuff; if you’re looking at administering a Linux box in a production environment, I’d highly recommend taking the time to learn Linux in detail.


Jovan Popovic points out the performance difference in using the WITH clause in an OPENJSON query:


Here are results of the queries:

SQL Server Execution Times: CPU time = 656 ms, elapsed time = 651 ms.
SQL Server Execution Times: CPU time = 204 ms, elapsed time = 197 ms.

As you can see, WITH clause specify that OPENJSON should immediately return properties from the JSON array without second parsing. Performance of the queries might be increased 3 times if you avoid double parsing.

That’s a pretty big difference when you specify the relevant data model elements.

Making Calls With IoT Hub

Kevin Feasel



Rolf Tesmer combines Azure IoT Hub with Twilio to make phone calls based on incoming messages:

When the IoT Hub is created you will get an endpoint hosted in Azure.  This is the target for the JSON events being generated from the mobile device.

Azure IoT Hubs are more complex than an Azure Event Hub, perform a lot more device based functions and also have stronger security capability.  However, operationally they work pretty much the same.  

If you want to learn more about the differences between the two Hubs then this is a great article –

It’s a neat tutorial for a fun weekend project.

Reporting Services Report Schedules

Jason Brimhall has a doozy of a query for figuring out SQL Server Reporting Services report schedules:

In pulling the data together from the two sources, I opted to return two result sets. Not just two disparate result sets, but rather two result sets that each pertained to both the agent job information as well as the ReportServer scheduling data. For instance, I took all of the subscriptions in the ReportServer and joined that data to the job system to glean information from there into one result set. And I did the reverse as well. You will see when looking at the query and data. One of the reasons for doing it this way was to make this easier to assimilate into an SSRS style report.

There’s a 680-line script ahead.

Scaling CheckTable With Respect To CPUs

Kevin Feasel



Lonny Niederstadt has a couple of posts on scaling DBCC CHECKTABLE based on degree of parallelism.  First, he looks at running the command with physical_only:

So we can use this formula when dop, elapsed_ms, and cpu_ms are known:

DOP * elapsed_ms = cpu_ms + idle_ms

That allows the 8 checktable operations to be summarized in this graph.  From DOP1 to DOP 8 the cpu_ms of the operation is extremely steady.  From DOP 1 to DOP 4 there are significant decreases in elapsed time as dop increases.  After dop 4, reduction in elapsed time is slight.  Throughout the tested range, idle_ms increased at a nearly linear rate.

In his second post, he looks at full CHECKTABLE runs and not just physical_only:

So the good news for today is that checktable operations without the physical_only option scale farther/better on my test tables than checktable with physical_only.  While with physical_only scaling benefits in elapsed time are primarily seen only to dop 4, without the physical_only option elapsed time benefits to increasing dop extend at least to dop 8.
And we saw that the shape of scalability graphs is pretty volatile 🙂  That’s largely because modest changes in elapsed time are multiplied by dop in this calculation to arrive at the idle_ms number – that idle_ms number is the one that changes shape most readily.

These are prologue posts to a discussion on the OLEDB wait type.

Cleaning Up Erroneous Data Using Temporal Tables

Bert Wagner shows how to use a temporal table to clean up incorrectly entered data if you already have a temporal table in place:

The good news is that all of your data is still intact — it’s been copied over to the historical table. Phew!

Now all you need to do is rollback this inadvertent row insertion and make your tables look just like you did before you started breaking them.

This should be easy right?

Well not exactly — there’s no automatic way to roll back the data in a temporal table. However, that doesn’t mean we can’t write some clever queries to accomplish the same thing.

Read the whole thing, as there’s a multi-step process.

Smart Transaction Log Backup Stats

Parikshit Savjani explains how you can use a new DMV to create smart transaction log backups:

In sys.dm_db_log_stats, you will find a new column log_since_last_log_backup_mb which can be used in your backup script to trigger a transaction log backup when log generated since last backup exceeds a threshold value. With smart transaction log backup, the transaction log backup size would be consistent and predictable avoiding autogrows from transactional burst activity on the database. The resulting pattern from the transaction log backup would be similar to below.

The new sys.dm_db_log_stats DMV looks to be quite useful.

Where-Object Versus Where Method

Adam Bertram explains the difference between the where-object and the where method in Powershell:

The Where-Object command is a sort of generic filtering command. Use this command to filter any kind of object in PowerShell. The Where-Object command has a FilterScript parameter, which is a scriptblock that allows the user to place code in it. If this scriptblock contains code that returns anything but $false, $null, or an empty string, it will allow whichever object the user passes to it.

For example, let’s say I’ve got a folder full of files. I’d like to see only text files and only those text files modified today. To make this happen, I can use the provider-specific filter with the Get-ChildItem command and also the Where-Object command.

Read on to see how that compares to the where method.  Given the latter’s limitations, I’ll probably stick to where-object anytime performance is not critical.