Setting Up SparklyR In Azure

David Smith shows how you can spin up a Spark cluster in Azure and install SparklyR on top of it:

The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark’s distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.

If you don’t happen to have a cluster of Spark-enabled machines set up in a nearby well-ventilated closet, you can easily set one up in your favorite cloud service. For Azure, one option is to launch a Spark cluster in HDInsight, which also includes the extensions of Microsoft ML Server. While this service recently had a significant price reduction, it’s still more expensive than running a “vanilla” Spark-and-R cluster. If you’d like to take the vanilla route, a new guide details how to set up Spark cluster on Azure for use with SparklyR.

Read on for more details.

Apache NiFi 1.5 Updates

Tim Spann shows off some nice additions to Apache NiFi:

Another cool processor that I will talk about in greater detail in future articles is the much-requested Spark Processor. The ExecuteSparkInteractive processor with its Livy Controller Service gives you a much better alternative to my hacky REST integration to calling Apache Spark batch and machine learning jobs.

There are a number of enhancements, new processors, and upgrades I’m excited about, but the main reason I am writing today is because of a new feature that allows for having an Agile SDLC with Apache NiFi. This is now enabled by Apache NiFi Registry. It’s as simple as a quick git clone or download and then, you’ll use Apache Maven to install Apache NiFi Registry and start it. This process will become even easier with future Ambari integration for a CLI-free install.

To integrate the Registry with Apache NiFi, you need to add a Registry Client. It’s very simple to add the default local one — see below.

There are several new features in the latest release.

How Meltdown And Spectre Have Affected Spark Performance

Chris Stevens, et al, show how DAtabricks customers have fared in a post-Meltdown+Spectre world:

On AWS, we have observed a small performance degradation up to 5% since January 4th. On i3-series instance types, where we cache data on the local NVMe SSDs (Databricks Cache), we have observed a degradation up to 5%. On r3-series instance types, in which the benchmark jobs read data exclusively from remote storage (S3), we have observed a smaller increase of up to 3%. The greater percentage slowdown for the i3 instance type is explained by the larger number of syscalls performed when reading from the local SSD cache.

The chart below shows before and after January 3rd in AWS for a r3-series (memory optimized) and i3-series (storage optimized) based cluster.  Both tests fixed to the same runtime version and cluster size. The data represents the average of the full benchmark’s runtime per day, for a total of 7 days prior to January 3 (before is in blue) and 7 days after January 3 (after is in red). We exclude January 3rd to prevent partial results.  As mentioned, the i3-series has the Databricks Cache enabled on the local SSDs, resulting in roughly half of the total execution time (faster) compared to the r3-series results.

Overall, they’re seeing a degredation of 2-5%.  Click through for some more information on how they collected their metrics.

Comparing Data Lake Job Runs

Yanan Cai shows how to compare stats on different executions of a job:

Troubleshooting issues in recurring job is a time-consuming task. It starts with searching through the Job Browser to find instances of a recurring job and identifying both baseline and anomalous performance. This is followed by multi-way comparisons between job instances to figure out what has been changed in the query, data or environment. This is followed by analysis to discover which changes may have performance impact. While this is happening production workloads continue to under-perform or go down.

Azure Data Lake Tools for Visual Studio now makes it easy to spot anomalies and quickly trace the key characteristics across recurring job instances allowing for an efficient debugging experience. The Pipeline Browser automatically groups recurring jobs to simplify discovery of all runs. The Related Job View collects data about inputs, outputs and execution across multiple runs into a single visualization.

Read on for more.

Out Of Memory Conditions And In-Memory OLTP

Ned Otter explains how a memory-optimized database can run out of its most critical resource:

What can cause a memory-optimized database to run out of memory? It could be that resource consumption (memory) exceeded:

  • the relevant percentage of committed_target_kb from the sys.dm_os_sys_info DMV (explained in a moment)
  • MAX_MEMORY_PERCENT value of a Resource Pool that the database is bound to (if running Enterprise Edition and using Resource Governor)


  • garbage collection is not operational (the purpose of GC is to reclaim memory consumed by stale row versions)
  • updates to memory-optimized table variables caused row versions to be created, and because GC does not operate on table variables, you ran out of memory (for table variables that have a very large amount of rows)

The only thing that can prevent GC from working is a long running transaction.

This is the final post of Ned’s resource issues miniseries and it’s been a good one.

Visual Principles And Dashboards

I continue my series on dashboard visualization by looking at pictures:

In a bit more detail, you can make a dashboard glanceable by following these guidelines:

  1. Ensure that there is clear purpose in your metric design and display.  In other words, think about which metrics you want to show, how you want to show them, and where you put metrics in relation to one another.

  2. Group metrics by function into sections.  Look at the dashboard above.  It has four clusters of metrics:  those around revenue, new customers, revenue per customer, and customer acquisition cost.  All of the revenue metrics are clustered in the top-left quadrant of the dashboard.  Furthermore, all revenue-related metrics (that is, revenue metrics and revenue per customer metrics) are on the left-hand side of the dashboard, so the CEO can focus on that half and learn about revenue and revenue per customer.  She doesn’t need to look in the top-left corner for one revenue measure and in the bottom right for another; she can focus down to a portion of the dashboard and get an answer.

  3. It should be easy to see and differentiate those clusters of metrics.  Our natural instinct might be to put borders around the clusters, but whitespace is your friend—remember, less is more.  If you add a bit more whitespace between clusters of measures, you’ll make it easy for people to see that there’s a difference without distracting them with unnecessary lines.

I cover the Rule of Thirds, Glanceability, and Color Vision Deficiency, three important considerations for a designer.

SQL Server Backups To Azure Blob Storage

Kevin Hill shows  how to configure SQL Server to back up a database to Azure blob storage:

Note the “no blobs found” in the container.  After a successful backup, you will see it here.

Click on ‘Container Properties’ to get the URL for this specific container…this will be used in Backup and Restore statements.  Click the button next to the URL to copy it.  For now just remember where this is or copy it to Notepad, Query window etc.  When we start to build our T-SQL statements, we will need both the Access key from earlier and the URL.

Kevin gives clear, step by step instructions on the process.

Specifying IP Address On A Point To Site VPN

Kevin Feasel



John Morehouse shows how to force a particular IP address when building an Azure point-to-site VPN:

Recently, I got to work with a client on something interesting. We implemented transactional replication to send data to an Azure virtual machine.  This was being done to perform some testing for a project.

Given that the two machines were NOT within the same Active Directory domain, we wanted to make sure our client’s data was protected, so we utilized a Point-to-Site VPN to facilitate this.  With the client using a VPN connection, this helps to ensure that any data transmitted to the virtual machine is encrypted and secured.  Note, the process on how to configure and implement the VPN connection is for another blog post.

SQL Server replication requires the use of a server name rather than just the IP addresses. This meant that the virtual machine in Azure had to use an entry in the local host file that was pointed back to the client’s machine.

The down side?  When the VPN connection drops (it happens), the client machine obtains a new IP address upon reconnecting.  Potentially now the host file would have the incorrect IP address and needs to be updated.

Read on to see how John was able to solve this.


Mark Wilkinson shows off some fun stuff you can do with the OUTPUT clause:

A common command in the Linux world is the tee command. What tee allows you to do is pipe the output of a command to a file as well as the console. This same functionality can be implemented using multiple OUTPUT clauses in a T-SQL statement. In this example we are going to update a few hundred records. When the update statement is run, not only will it update the MyGuid table but it will update a log table and also return the result of the update. This is accomplished by using two OUTPUT clauses.

Check it out.  I don’t use OUTPUT that often, but it can be quite useful when in a pinch or if you want to prevent scanning a table twice.

Getting dbatools To Version 1.0

Simone Bizzotto explains what it’s going to take to get dbatools up to version 1.0:

We’re looking for contributors to help us finally reach version 1.0. Currently, we are on par with Gmail’s beta schedule: a whopping 4 years. But, we’re almost there and need your help finalizing our changes. If you’re interested in helping us bring 1.0 alive, we identified four areas with 5 primary contacts on the SQL Server Community Slack:

  • Standardize param names (@wsmelton)
  • Create tests for existing functions (@cl and @niphlod)
  • Review existing function documentation (@alevyinroc or @gbargsley)
  • Prepare for 1.0 with “code style” (Bill of Health, more on that later)

As you can see, a few of us are the main reference (on GitHub and Slack, mostly) for each area.

Read the whole thing and, if you’ve found dbatools to be helpful in the past, see if there’s anything you can do to help them out a little in return.


January 2018
« Dec Feb »