Using Sparklyr To Analyze Flight Data

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

Wall Clocks Lie

Lonny Niederstadt digs into some strange behavior on one of his VMs:

Rats.  That was a really good idea, too.  Nothin’.  Or….
(I want to call attention to the change in maximum axis value for CPU utilization.  I nearly *always* use 100% as the maximum value, In these last two graphs I’ve used a maximum value of 24% because utilization of each of the 4 vcpus is really low and I felt it made the graphs slightly more readable.)
Hey!! Perfmon was collecting 30 second samples.  But two intervals after 2:06:22 is 2:07:52 rather than 2:07:22.  The system lost 30 seconds in the span of a minute.  And two intervals past 2:07:52 is 2:11:22 – the system lost 150 seconds in a minute?

It would appear that there are a couple of issues with this box, one of which is a lack of appreciation of the current time.

Azure Price Cuts

Kevin Feasel



Brad Sams reports that Azure VM and Azure Blob Storage prices are going down:

Microsoft, Amazon and now Google are in a heated cloud race to grab as much market share as they can as they know that once a company starts using their service, the likelihood of switching platforms is low. With more services being offered via cloud vendors and more companies diving into these platforms, Microsoft and Amazon are frequently cutting prices to create a competitive advantage.

On this edition of ‘cloud cuts’, Microsoft is slashing prices on some of its Azure Virtual Machines and its Blob storage. The company is dropping the prices on compute-optimized instances – F Series and general purpose instances – A1; the company says pricing cuts on its D-series general purpose instances will happen in the near future.

Blob storage is down to 2 cents per GB per month for hot storage.  That’s slightly below S3’s 2.3 cents per GB per month.

Integrating Data Lake Storage With SQL Data Warehouse

Sachin Sheth alerts us to a new integration point between Azure Data Lake Storage and Azure SQL Data Warehouse via Polybase:

Most common patterns using Azure Data Lake Store (ADLS) involve customers ingesting and storing raw data into ADLS. This data is then cooked and prepared by analytic workloads like Azure Data Lake Analytics and HDInsight. Once cooked this data is then explored using engines like Azure SQL Data Warehouse. One key pain point for customers is having to wait for a substantial time after the data was cooked to be able to explore it and gather insights. This was because the data stored in ADLS would have to be loaded into SQL Data Warehouse using tools row-by-row insertion. But now, you don’t have to wait that long anymore. With the new SQL Data Warehouse PolyBase support for ADLS, you will now be able to load and access the cooked data rapidly and lessen your time to start performing interactive analytics. PolyBase support will allow to you access unstructured/semi-structured files in ADLS faster because of a highly scalable loading design. You can load the files stored in ADLS into SQL Data Warehouse to perform analytics with fast response times or you use can the files in ADLS as external tables. So get ready to unlock the value stored in your petabytes of data stored in ADLS.

I’ve been waiting for this support, and I’m happy that they were able to integrate the two products.

Upgrading Replication To SQL Server 2016

Amit Banerjee explains the steps for upgrading replicated SQL Server instances to 2016:

  • A Distributor can be any version as long as it is greater than or equal to the Publisher version (in many cases the Distributor is the same instance as the Publisher).

  • A Publisher can be any version as long as it less than or equal to the Distributor version.

  • Subscriber version depends on the type of publication:

    • A Subscriber to a transactional publication can be any version within two versions (n-2) of the Publisher version. For example: a SQL Server 2012 Publisher can have SQL Server 2014 and SQL Server 2016 Subscribers; and a SQL Server 2016 Publisher can have SQL Server 2014 and SQL Server 2012 Subscribers.

    • A Subscriber to a merge publication can be any version less than or equal to the Publisher version.

There are several options available here, and it’s worth reading through if you’re looking to upgrade soon.


Kevin Feasel



Tomaz Kastrun explains how the RevoScaleR package is useful:

RevoScaleR package and computational function were designed for parallel computation with no memory limitation, mainly because this package introduced it’s own file format, called XDF. eXternal Data Frame was designed for fast processing of smaller chunks of data, and gains it’s efficiency when reading and writing the XDF data by loading chucks of data into RAM one by at a time and only what is needed. The way this is done, means no limitations for the size of RAM, computations run much faster (because it is using C++ to write these algorithms, which is faster than original, which were written in interpretative language). Data scientist still make a single R call, bur R will use distrubuteR component to determine, how many cores, sockets and threads are available and then launch smaller portion of load into each thread, analyze data a bit at a time. With XDF, data is retrieved many times, but since it is 5-10times smaller (as I have already shown in previous blog posts when compared to *.txt or *.csv files), and it is written and stored into XDF file the same way as it was extracted from the memory, it enables faster computations, because no parsing of data chunks is required and because of the way, how data is stored, is maximizes the retrieval time of the data.

If you’re using SQL Server R Services, these rx functions will become very important to you.

Azure SQL Database Extended Events

Arun Sirpal compares on-prem extended events to what’s available in Azure SQL Database:

There are 22 actions and 261 events. Naturally less than your local based SQL Servers, for example on my local 2014 machine running the above query returned 50 actions and 284 events.

There are a few subtle differences and a couple not-so-subtle differences, so it’s worth digging into if you plan to spin up an Azure SQL Database database.

Local Azure Data Lake

Julie Koesmarno shows how to set up Azure Data Lake for local testing:

Late last year, I presented a Cognitive Intelligence demo using Azure Data Lake (ADL) at PASS Summit keynote. It was a fun and quick demo! Watch it here :)

In case you’re new to ADL, you can now (since Dec 2015) develop, compile and run ADL locally in Visual Studio. This is huge! Because you don’t have to worry about your ADL Analytics Unit (AU) consumptions. Plus, this allows you to try it before you buy it too!

Click through for the step-by-step installation instructions.

SSRS Sender Display Names

Andy Mallon shows how to customize a display name for a sender e-mail address in Reporting Services:

The business wants the sending email address to be pretty. They want to identify the application, project, company, or… who knows what. The business wants anything but the server name. They might want it to come from [email protected] or [email protected] or [email protected] or…who knows what.

The problem with these more generic sender addresses is that they don’t help me troubleshoot. Is [email protected] coming from SSRS or the application server? Which SSRS server is [email protected] coming from? I don’t want to check 10 SSRS servers to figure out which one sent a report.

Click through to find out how to make the troubleshooting DBA and the business side happy.

Understanding Memory Grants

Erik Darling explains memory grants in SQL Server:

Our query memory grants range from around 8 MB to around 560 MB. This isn’t even ordering BY the larger columns, this is just doing the work to sort results by them. Even if you’re a smarty pants, and you don’t use unnecessary ORDER BY clauses in your queries, SQL may inject them into your query plans to support operations that require sorted data. Things like stream aggregates, merge joins, and occasionally key lookups may still be considered a ‘cheaper’ option by the optimizer, even with a sort in the plan.

Of course, in our query plans, we have warnings on the last two queries, which had to order the VARCHAR(8000) column.

This shows just how much difference a simple column size can make.


February 2017
« Jan Mar »