John Mount shows off replyr, which is dplyr for remote, distributed data sets (think SparkR or sparklyr):

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).

We will work a trivial example: taking a quick peek at your data. The analyst should always be able to and willing to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.

Read on for more details.

R 3.3.3 Released

David Smith alerts us to R 3.3.3:

The R core group announced today the release of R 3.3.3 (code-name: “Another Canoe”). As the wrap-up release of the R 3.3 series, this update mainly contains minor bug-fixes. (Bigger changes are planned for R 3.4.0, expected in mid-April.) Binaries for the Windows version are already up on the CRAN master site, and binaries for all platforms will appear on your local CRAN mirror within the next couple of days.

For now, I’m holding out until R 3.4.0.


Jiang Mouren has a two-parter on WebHCat.  First, how it works:

SSH shell/Oozie hive action directly interact with YARN for HIVE execution where as Program using HdInsight Jobs SDK/ADF (Azure Data Factory) uses WebHCat REST interface to submit the jobs.

WebHCat is a REST interface for remote jobs (Hive, Pig, Scoop, MapReduce) execution. WebHCat translates the job submission requests into YARN applications and reports the status based on the YARN application status. WebHCat results are coming from YARN and troubleshooting some of them needs to go to YARN.

Then, how to debug issues:

2.1.2. WebHCat times out

HDInsight Gateway times out responses which take longer than 2Minutes resulting in “502 BadGateway”. WebHCat queries YARN services for job status and if they take longer than the request might timeout.

When this happens collect the following logs for further investigation:

/var/log/webchat. Typical contents of directory will be like

  • webhcat.log is the log4j log to which server writes logs
  • webhcat-console.log is stdout of server is started.
  • webhcat-console-error.log is stderr of server process

NOTE: webhcat.log will roll-over daily hence files like webhcat.log.YYYY-MM-DD will also present. For logs to a specific time range make sure that appropriate file is selected.

Because HDInsight doesn’t support WebHDFS, WebHCat is the primary method for cluster access, so it’s good to know.

Image Viewer Custom Visual

Devin Knight continues his Power BI custom visuals series with the image viewer:

In this module you will learn how to use the Image Viewer Power BI Custom Visual.  The Image Viewer visual helps in displaying images based on an image URL stored in your data.

This is an interesting visual.

Collapsable Subqueries

Dmitry Pilugin notes a new query simplification rule in SQL Server vNext:

You may see that in the first plan, there are two clustered index scans of the table SalesOrderDetail, however the subquery is exactly the same “exists (select * from Sales.SalesOrderDetail f where f.SalesOrderID = d.SalesOrderID)” but referenced twice.

In the second case, compiled under next compatibility level, the double reference of the subquery is collapsed and we see only one reference to the SalesOrderDetails table and more efficient plan, despite the query still has two subqueries with SalesOrderDetails.

In the third case, also compiled under vNext level, we see the second branch with the SalesOrderDetail again, but that is because we turned off the rule CollapseIdenticalScalarSubquery with an undocumented hint queryruleoff (which I originally described in my blog post).

I think Dmitry has the expected use case nailed:  ORMs.  But I can see people writing (well, copy-pasting) similar queries, so maybe it’ll be useful in more contexts as well.

Checkpoints And Memory-Optimized Filegroups

Jack Li explains why, even without a memory-optimized table, you can see XTP checkpoints in your database:

What are checkpoint files?

They are data and delta files as documented in Durability for Memory-Optimized Tables. When you use disk based tables, the data is written to data files.  Even though data is stored in memory for memory optimized tables, SQL Server still needs to persists data for disaster recovery.  Data for memory optimized tables is stored in what we call checkpoint files.  Data file contains rows from insert and update operations. Delta file contains deleted rows.  Over time, these files can be ‘merged’ increase efficiency.  Unneeded files after the merge can be removed eventually (but this can only happen after a log backup).

Click through for a demo script to see this in action.

Power BI Themes

Gogula Aralingam notes that Power BI can now support basic skinning:

The March 2017 update of Power BI Desktop comes with a preview of Themes. Right now it is in its simplest of forms: You manually create a JSON file that has a very few attributes that can set basic color themes to your reports. So all you have to do is create file that looks like this:

Click through for an example.  This isn’t a true fix for the lack of Color Vision Deficiency support, but you can plug in safe colors (for example, this article includes some) and skirt the issue until there’s real support.

Saving Docker Images

Andrew Pruski shows how to save a Docker image:

What I’m going to do now is save my custom image to a location [C:\temp] on my host server. So I’ll run: –

docker save -o myfirstimage.tar myfirstimage

N.B. – I’m saving the file as a .tar as this seems to be the format that works best for loading the image. I’ve tried exporting to other formats (.zip) but it always seems to become corrupt. The .tar format has always worked for me so I’ll stick with it.

If I’m understanding Andrew correctly, that’s because docker save will only generate a tar archive.  Keep reading for more details, including how to restore a Docker image to another machine.

Using Desired State Configuration To Install SQL Server

Chris Lumnah shows how to use Powershell Desired State Configuration to automate installation and configuration of a new SQL Server instance:

So the installation of SQL Server is now fairly straightforward. The wizard does a nice job of guiding you along the way. 2016 even includes best practice suggestions for tempdb and instance file initialization. Along the way, Microsoft as given us ways to automate the installation of SQL Server. You can sysprep an instance, but this does not really automate the installation. It just helps create a template of an instance. At the end of the day, you still need to do things manually. You can also use a configuration file to assist here. This is a great step forward, but it does not allow for all of the things you need to do to configure a SQL server.

Powershell does. Desired State Configuration (DSC) is functionality built into Powershell that allows for the installation and configuration of a SQL Server.

Chris includes his script as well as a link for more information on DSC in case you aren’t familiar with the concept.


March 2017
« Feb