Spark Changes In HDP 2.6

Vinay Shukla and Syed Mahmood talk about what’s new with Spark and Zeppelin in the Hortonworks Data Platform 2.6 update:


Most data scientists use R & Python and with SparkR & PySpark respectively they can continue to leverage their familiarity with the R & Python languages. However, they need to use the Spark API to leverage machine learning with Spark and to take advantage of distributed computations. Both SparkR & PySpark are evolving rapidly and SparkR now supports a number of machine learning algorithms such as LDA, ALS, RF, GMM, GBT, etc. Another key improvement in SparkR is the ability to deploy a package interactively. This will help Data Scientists deploy their favorite R package in their own environment without stepping on other users.

PySpark now also supports deploying VirtualEnv and this will allow PySpark users to deploy their libraries in their own individual deployments.

There are several large changes, so check it out.

Custom ggplot2 Subplots

Ilya Kashnitsky shows how to create custom subplots using ggplot2:

Actually, ggplot2 is a very powerful and flexible tool that allows you to draw figures with quite a complex layout. Today I want to show the code that aligns six square plots (actually, maps) just as in the figure above. And it’s all about the handy function ggplot2::annotation_custom(). Since I used the layout more than once, I wrapped the code that produced it into a function that takes a list of 6 square plots as an input and yields the arranged figure with arrows as an output. Here is the commented code of the function.

This is the difference between “I’m just going to throw some stuff on there” (which is how I tend to operate) and a well-thought-out visual layout.

Microsoft R Open 3.4.0

David Smith announces Microsoft R Open 3.4.0:

R 3.4.0 (upon which MRO 3.4.0 is based) is a major update to the R language, with many fixes and improvements. Most notably, R 3.4.0 introduces a just-in-time (JIT) compiler to improve performance of the scripts and functions that you write. There have been a few minor tweaks to the language itself, but in general functions and packages written for R 3.3.x should work the same in R 3.4.0. As usual, MRO points to a fixed CRAN snapshot from May 1 2017, but you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).

The version of Microsoft R Server shipping with SQL Server 2017 will still be based on 3.3.3, but I’m going to guess that a new version of Microsoft R Server supporting 3.4.0 will ship in the next several months.

Disabling The Identity Cache

Andrew Pruski looks at a configuration setting in SQL Server 2017 which prevents SQL Server from bumping the next identity value on failover or restart:

Now we can check the data in the table:

SELECT * FROM [dbo].[TestTable]

As expected, there’s a gap in the IDs. Why it jumps to 1002 is discussed in the connect item.

OK, now let’s try running the same code again but this time we will disable the identity cache.

This doesn’t eliminate gaps altogether; those can still happen upon rollback of a transaction.  This is reason #1 why you should not use identity columns as proper sequences.
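
To see why a rollback leaves a gap, here is a toy model in Python. It is only a sketch of the idea, not how SQL Server is implemented: the class and method names are made up, and the one assumption it encodes is that an identity value, once handed out, is never handed back.

```python
class ToyIdentityTable:
    """Hypothetical stand-in for a table with an identity column."""

    def __init__(self):
        self.next_id = 1   # next identity value to hand out
        self.rows = []     # committed rows only

    def insert(self, value, commit=True):
        new_id = self.next_id
        self.next_id += 1  # the counter moves on even if we roll back
        if commit:
            self.rows.append((new_id, value))
        return new_id


table = ToyIdentityTable()
table.insert("first")
table.insert("second", commit=False)  # simulated rollback
table.insert("third")

committed_ids = [row_id for row_id, _ in table.rows]
print(committed_ids)  # [1, 3]: value 2 was consumed by the rolled-back insert
```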

Building A Concatenated Tooltip In Power BI

Devin Knight has started a new series, walking through problems his clients have faced implementing Power BI solutions.  In this edition, Devin wants to build a comma-delimited list to display on a tooltip:

This works perfectly for Stock because it automatically summarizes the value, but you’ll notice above that the tooltip for Subcategory has an interesting behavior. Rather than displaying the list of the values in Subcategory, it actually just shows the very first value. This happens because the Tooltip field requires that any column used in it be able to aggregate or roll up the values into what’s shown on the chart. Since Subcategory is just a text field, Power BI automatically applies the FIRST function to return the first value that appears. You could optionally change this from FIRST to either LAST, COUNT, or COUNTDISTINCT.

So the real problem I want to solve here is rather than only showing the first subcategory how do I list all the subcategories in a comma separated list in the tooltip? Let’s walk through a couple possible designs to this solution.

Read on for two different designs, including the code to implement the solutions.

Fisher’s Exact Test

Mala Mahadevan explains Fisher’s Exact Test and provides examples in T-SQL and R:

The decision rule in two sample tests of hypothesis depends on three factors:

  1. Whether the test is upper, lower or two tailed (meaning the comparison is greater, lesser or both sides of gender and speaker count)
  2. The level of significance or degree of accuracy needed
  3. The form of test statistic

Our test here is to just find out if gender and speaker count are related so it is a two tailed test. The level of significance we can use is the most commonly used 95% which is also the default in R for Fisher’s Test. The form of the test statistic is P value. So our decision rule would be that gender and speaker category are related if P value is less than 0.05.

Click through for the R code followed by a code sample which should explain why you don’t want to do it in T-SQL.
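
To make the decision rule concrete, here is a minimal sketch of a two-tailed Fisher’s exact test on a 2x2 table, written in Python with only the standard library. The counts are hypothetical (they are not Mala’s data), the function name is mine, and this is not the code from the post, which uses R.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # holding the row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Hypothetical counts: rows are gender, columns are speaker category.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
related = p_value < 0.05  # the decision rule from the excerpt

print(round(p_value, 4), related)  # 0.4857 False
```

With these made-up counts the P value is well above 0.05, so the rule says we cannot call the two variables related.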

Generating Comma-Delimited Strings

Shane O’Neill has a good reason to upgrade to SQL Server 2017, which is not having to deal with FOR XML PATH hacks anymore:

Commas are all the rage nowadays:

There are a fair number of questions nowadays about returning data from a database in a comma separated string. Sure the application should probably do that but hey, database servers are expensive, why not get some bang for your bucks!

Protip:  I use the colon as a separator because sometimes commas will get stuck in the Ethernet cable.
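
SQL Server 2017 adds STRING_AGG for this job. For anyone who wants to see the shape of the operation outside the database, here is a small Python sketch of "group the rows, then join each group with a separator." The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def delimited_list_by_group(rows, separator=", "):
    """Collapse (group, value) pairs into one delimited string per group."""
    grouped = defaultdict(list)
    for group, value in rows:
        grouped[group].append(value)
    return {group: separator.join(values) for group, values in grouped.items()}


# Hypothetical data: (category, subcategory) pairs.
rows = [("Bikes", "Road"), ("Bikes", "Mountain"), ("Helmets", "Sport")]
result = delimited_list_by_group(rows)

print(result)  # {'Bikes': 'Road, Mountain', 'Helmets': 'Sport'}
```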


High Availability For SSAS

James Serra enumerates high availability options for SQL Server Analysis Services:

If you are looking at providing high availability (HA) for SSAS, here are 3 options:

  1. Install SSAS on a Windows Server Failover Cluster (WSFC). Here’s a good article. The main issue with this option is that SSAS isn’t cluster-aware, so if Windows is “OK” but SSAS (the service) is hung, it won’t fail over

Read on for the other options.

Understanding CROSS APPLY

Andy Levy has a T-SQL programming breakthrough:

Finally, this week I had a breakthrough. I was working on updating a bunch of data but it was breaking on a small subset of that data. In this case, I was attempting to JOIN two tables on fields that should have been INTs, but in a very small number of cases one side was using a comma-delimited string. The user told me that someone else had done these updates in the past and didn’t encounter the problem I was having (so I knew that it was something I was doing “wrong”), but given that it was only a handful of broken updates she was OK with manually doing the updates (we were scripting it because we were updating potentially tens of thousands of records).

I am not OK with manually fixing this in the future. I wanted to know how the other DBA had done it before. I dug into some history and found CROSS APPLY. My nemesis. I was determined to figure out how to use it this time.

The APPLY operator is extremely powerful in the right set of circumstances.  Andy shows the “classic” use case, but there are a number of other uses for the operator.
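
If APPLY is hard to picture, it may help to think of it as "run a function once per row and keep every row the function returns." The Python sketch below mimics that for the comma-delimited case Andy describes. It is an analogy only: the data and function name are hypothetical and this is not Andy's query.

```python
def split_ids(csv_text):
    """Stand-in for a table-valued function: one integer per list item."""
    return [int(part) for part in csv_text.split(",") if part.strip()]


# Hypothetical source rows: a name plus a comma-delimited list of IDs.
source_rows = [("A", "10,20"), ("B", "30"), ("C", "")]

# For each source row, "apply" the function and pair the row with every
# value it returns. Row C yields nothing, so it drops out, which is how
# CROSS APPLY behaves (OUTER APPLY would keep it with a NULL).
applied = [
    (name, item_id)
    for name, ids in source_rows
    for item_id in split_ids(ids)
]

print(applied)  # [('A', 10), ('A', 20), ('B', 30)]
```

Once the list is flattened like this, the IDs are plain integers again and can be joined to the other table as usual.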

Figuring Out Virtual Sockets And Cores

Denny Cherry looks at a few considerations regarding virtual sockets and cores for a VM running SQL Server:

Standard Edition

You want 1×6 (one socket, 6 cores) because standard edition will only use the first 4 sockets in a server (up to 16 cores combined). There’s no getting around that.

From a NUMA perspective, as long as vNUMA at the hypervisor is disabled then it doesn’t matter, as SQL Server standard edition isn’t NUMA aware (NUMA awareness is an Enterprise Edition feature).

Read on for a more nuanced answer when it comes to Enterprise Edition.
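
The Standard Edition arithmetic is easy to check with a few lines of Python. This sketch takes the limits quoted above (4 sockets, 16 cores) as defaults; they are parameters because the actual limits depend on your SQL Server version, and the function name is hypothetical.

```python
def usable_cores(sockets, cores_per_socket, max_sockets=4, max_cores=16):
    """Cores the instance can use under a socket cap and a core cap."""
    visible_sockets = min(sockets, max_sockets)
    return min(visible_sockets * cores_per_socket, max_cores)


one_by_six = usable_cores(sockets=1, cores_per_socket=6)
six_by_one = usable_cores(sockets=6, cores_per_socket=1)

print(one_by_six)  # 6: one socket, all six cores are usable
print(six_by_one)  # 4: only the first four single-core sockets count
```

Same six vCPUs either way, but the six-socket layout leaves two of them idle.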

