Microsoft ML Server 9.3 Released

Nagesh Pabbisetty announces Microsoft Machine Learning Server 9.3:

In ML Server 9.3, we have added support for SQL compute context in ML Server and in R Client running on Linux platforms, so data scientists who work on Linux workstations can directly use in-database analytics with SQL Server compute context. Additionally, the SQLRUtils package can now be used to package R scripts into T-SQL stored procedures and run them from an R environment on Linux clients.

An interesting scenario enabled by the addition of SQL Server compute context in ML Server running on Linux is that organizations can now provide a browser-based interface for accessing SQL Server compute context, with RStudio Server and ML Server running on a Linux machine connecting to SQL Server.

Since introducing the revoscalepy library in the last release of ML Server and SQL Server 2017, we have shipped several additions and improvements in the Python APIs as part of CU releases of SQL Server 2017. We have added APIs such as rx_create_col_info and rx_get_var_info that make it easier to get column information, especially with a large number of columns. We added rx_serialize_model for easy model serialization. We have also improved performance when working with string data in different scenarios.

This also gets you up to R 3.4.3. H/T David Smith

Looping In Python And R

Kevin Feasel


Dmitry Kisler has a quick comparison of looping speed in Python and R:

This post is about R versus Python in terms of the time they require to loop and generate pseudo-random numbers. To accomplish the task, the following steps were performed in Python and R:

  (1) loop 100k times (ii is the loop index)
  (2) generate a random integer number out of the array of integers from 1 to the current loop index ii (ii+1 for Python)
  (3) output elapsed time at the probe loop steps: ii (ii+1 for Python) in [10, 100, 1000, 5000, 10000, 25000, 50000, 75000, 100000]

The findings were mostly unsurprising to me, though there was one unexpected twist.
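
For a sense of what the Python side of such a test might look like, here is a minimal sketch of the three quoted steps. It is not Dmitry's actual benchmark code: the function name time_random_loop and the choice of random.randint and time.perf_counter are assumptions made for illustration.

```python
import random
import time

# Probe steps quoted in the post.
PROBES = (10, 100, 1000, 5000, 10000, 25000, 50000, 75000, 100000)


def time_random_loop(n_iterations=100_000, probes=PROBES):
    """Return {probe_step: elapsed_seconds} for a loop that draws random integers.

    Hypothetical helper: the original post may structure its timing differently.
    """
    probe_set = set(probes)
    timings = {}
    start = time.perf_counter()
    for ii in range(n_iterations):
        # Python's index starts at 0, so the upper bound is ii + 1.
        random.randint(1, ii + 1)
        if ii + 1 in probe_set:
            timings[ii + 1] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for step, elapsed in time_random_loop().items():
        print(f"{step:>7} iterations: {elapsed:.4f} s")
```

Absolute timings will vary by machine and interpreter version, so the numbers are only meaningful relative to an R run on the same hardware.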

Starting A Data Science Project: Business Understanding

I continue my data science project series:

As you listen to these types of questions, your goal is to nail down a specific problem with a specific answer.  You want to narrow down the scope to something that your team can achieve, ideally something with a built-in measure for success.  For example, here are a few specific problems that we could go solve:

  • Find a model which predicts quarterly sales to within 5% no later than 30 days into the quarter.
  • Given a title and description for a product, tell me a listing category which Amazon will, with at least 90% confidence, consider valid for this product.
  • Determine the top three factors which most affect the number of years the first owner holds onto our mid-range sedan.

With a specific problem in mind, you can look for relevant data.  Of course, you’ll probably need to modify the scope of this problem over time as you gather new information, but this gives you a starting point for success.  Also, don’t expect something as clear-cut as the above early on; instead, people will hem and haw, not quite sure what they really want.  You can take a fuzzy goal into data acquisition, but as you acquire data, you will want to work with the champion to focus down to a targeted and valuable problem.

Read on for several references to big sacks of cash.  After becoming a manager, I’ve become much more attuned to the idea of receiving big sacks of cash.

Documenting Reporting Services Installations

Craig Porteous explains the types of things you should document in SQL Server Reporting Services:

If you’re using Kerberos authentication with Reporting Services you’ll at least have to update the rsReportServer.config file with the correct authentication mode. Beyond that you have SPNs on your SSRS domain Service account to consider. This may be managed by your domain administrators but it’s still a good plan to back these up or check that someone is backing these up. It is too easy to clear SPNs or constrained delegation settings accidentally.

Having a backup will also give you a good reference point for building new environments. You can of course easily see SPNs from the command line using this command:


What this boils down to is, if you lost your Reporting Services installation today, could you get things back to the way they were?

Concept Help In PowerShell

Fred Weinmann gives us a reading list of PowerShell built-in documentation:

Newcomers to PowerShell are often faced with a lot of things to learn, a lot of disorienting features and little in the way of obvious guidance. While there are quite a few good books out there by now, finding the right one is not always easy.
That said, PowerShell comes with a lot of built in documentation:
Get-Help about_*
Will list a whole lot of articles explaining many a concept, but finding just what you need – especially before you know all the right words to search for – is a lot harder.

There’s a lot to this and more seasoned developers can skip a lot of it, but if you’re just learning about PowerShell, I think it’s a nice ordering of where to start learning.

Gapless Sequential Values

Gail Shaw shows how to build gapless sequential values:

To be clear, I don’t think this is a good idea. The identity column works well if a sequential series of numbers is needed. If the number sequence needs to span tables, then the sequence object is a good replacement.

But, there will always be some requirements that insist on gap-less sequences, or insist on not using identity (probably for ‘compatibility’ reasons), so let’s see how to do it properly.

For true surrogate keys, this is a bad idea because it is both unnecessary and overkill.  Where this becomes useful is cases where an auditor is expecting a proper sequence without any gaps, such as invoice numbers, check numbers, or purchase order numbers.  As Gail mentions, those are uncommon scenarios.
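
To illustrate the general idea (not Gail's T-SQL, which you should click through for), here is a minimal sketch using Python's built-in sqlite3 module. The table names invoice_counter and invoices and the helper functions are hypothetical. The point it shows is that the counter increment and the insert share one transaction, so a failed insert rolls the counter back and leaves no gap.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a one-row counter table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoice_counter (last_value INTEGER NOT NULL)")
    conn.execute("INSERT INTO invoice_counter VALUES (0)")
    conn.execute(
        "CREATE TABLE invoices (invoice_number INTEGER PRIMARY KEY, customer TEXT)"
    )
    conn.commit()
    return conn


def create_invoice(conn, customer, fail=False):
    """Take the next number and insert the invoice in a single transaction.

    Returns the invoice number, or None if the simulated failure occurred.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE invoice_counter SET last_value = last_value + 1")
            number = conn.execute(
                "SELECT last_value FROM invoice_counter"
            ).fetchone()[0]
            if fail:
                raise RuntimeError("simulated failure after taking a number")
            conn.execute(
                "INSERT INTO invoices (invoice_number, customer) VALUES (?, ?)",
                (number, customer),
            )
            return number
    except RuntimeError:
        return None


if __name__ == "__main__":
    db = make_db()
    print(create_invoice(db, "Alice"))
    print(create_invoice(db, "Bob", fail=True))
    print(create_invoice(db, "Carol"))
```

The trade-off is that every writer has to wait on the single counter row, which is a large part of why this approach is a poor fit for ordinary surrogate keys.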

Adaptive Join Behavior

Grant Fritchey explains adaptive joins in SQL Server 2017:

Currently the adaptive join only works with columnstore indexes, but according to Microsoft, at some point, they will also work with rowstore. The concept is simple. For larger data sets, frequently (but not always, let’s not try to cover every possible caveat, it depends, right), a hash join is much faster than a loops join. For smaller data sets, frequently, a loops join is faster. Wouldn’t it be nice if we could change the join type, on the fly, so that the most effective join was used depending on the data in the query. Ta-da, enter the adaptive join.

First, the statistics are used at compile time for the tables we’re joining. Based on those statistics, a row target is set. Below that threshold, a loops join will be used. Above that threshold, a hash join. The way the row count is determined is that the operator will always build the hash table. With the hash table built and loaded, it will know how many rows it has. If it’s going to do a loops join, the hash table is tossed and a loops join commences. If the threshold has been passed on the row counts and it’s going to do a hash join, it already has the hash table built and proceeds to do a hash join. It’s easy to understand. Let’s see it in action.

Click through to see it in action.  It’s not a world-changer yet, but as it becomes available to rowstore queries (without the filtered, empty columnstore index trick), I think people will come to appreciate it.
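
As a rough mental model of the decision Grant describes, here is a toy Python sketch. It is not how SQL Server implements the operator: the function adaptive_join, its parameters, and the dictionary standing in for an index are all assumptions for illustration, and treating a row count exactly equal to the threshold as a hash join is an arbitrary choice.

```python
from collections import defaultdict


def adaptive_join(outer_rows, inner_rows, inner_index, key, threshold):
    """Join two row sets, picking the strategy only after counting the outer rows.

    outer_rows / inner_rows: iterables of dicts.
    inner_index: dict mapping key value -> list of inner rows (stands in for an index seek).
    Returns (strategy, list of (outer_row, inner_row) pairs).
    """
    # Always build the hash table first; building it reveals the actual row count.
    hash_table = defaultdict(list)
    row_count = 0
    for row in outer_rows:
        hash_table[row[key]].append(row)
        row_count += 1

    if row_count < threshold:
        # Few rows: do one index lookup per outer row (loops join).
        strategy = "loops"
        joined = [
            (outer, inner)
            for bucket in hash_table.values()
            for outer in bucket
            for inner in inner_index.get(outer[key], [])
        ]
    else:
        # Many rows: scan the inner input and probe the hash table already built.
        strategy = "hash"
        joined = [
            (outer, inner)
            for inner in inner_rows
            for outer in hash_table.get(inner[key], [])
        ]
    return strategy, joined


if __name__ == "__main__":
    customers = [{"id": 1}, {"id": 2}]
    orders = [{"id": 1, "item": "a"}, {"id": 1, "item": "b"}, {"id": 3, "item": "c"}]
    order_index = defaultdict(list)
    for order in orders:
        order_index[order["id"]].append(order)
    print(adaptive_join(customers, orders, order_index, "id", threshold=10))
    print(adaptive_join(customers, orders, order_index, "id", threshold=1))
```

Both branches return the same matches; only the access pattern differs, which is the whole appeal of deferring the choice until the row count is known.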


Randolph West argues against using the SMALLDATETIME data type:

But let’s say you don’t need that kind of accuracy and are happy with a granularity to the nearest minute. Maybe you’re storing time cards and don’t think it’s necessary to store seconds. As discussed in the Fundamentals series, you really want to choose the most appropriate datatype for your data.

Enter SMALLDATETIME, which rounds up or down to the nearest minute. The seconds value for any SMALLDATETIME is 00. Values of 29.999 seconds or higher are automatically rounded up to the nearest minute, while values of 29.998 seconds or lower are rounded down.

Read on to see Randolph’s explanation of why he recommends against using SMALLDATETIME.
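
The rounding rule in the quote is easy to try out for yourself. Below is a minimal Python sketch that mimics it; the function name to_smalldatetime is hypothetical, the sketch ignores the data type's limited date range, and it treats anything strictly between 29.998 and 29.999 seconds as rounding down, which is an assumption the quoted rule does not cover.

```python
from datetime import datetime, timedelta

# 29.999 seconds expressed in microseconds, to avoid floating-point comparisons.
ROUND_UP_FROM_US = 29_999_000


def to_smalldatetime(value):
    """Round a datetime to the nearest minute following the quoted SMALLDATETIME rule."""
    within_minute_us = value.second * 1_000_000 + value.microsecond
    truncated = value.replace(second=0, microsecond=0)
    if within_minute_us >= ROUND_UP_FROM_US:
        return truncated + timedelta(minutes=1)
    return truncated


if __name__ == "__main__":
    print(to_smalldatetime(datetime(2018, 2, 27, 10, 15, 29, 998000)))
    print(to_smalldatetime(datetime(2018, 2, 27, 10, 15, 29, 999000)))
    print(to_smalldatetime(datetime(2018, 2, 27, 23, 59, 45)))
```

Note the last case: a value late in the final minute of the day rounds forward into the next date, which is the kind of surprise that matters for something like time cards.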

Building A Biml Project: Adding Tasks

Martin Schoombee continues his series on building a Biml project by adding SSIS tasks:

The <DirectInput> tag is pretty self-explanatory and executes the stored procedure (in this case). It could also have been a SQL query.

The <Results> element contains a <Result> child element for each field returned in our query. I found this a little confusing at first, but the Name property’s value should reflect the name of the field that is returned from your query or stored proc.

The VariableName property is then used to map the returned field’s value to the relevant variable, and note that we don’t use the “User::<variable>” notation that we see elsewhere in SSIS, but we refer to the scope of the variable with the “User.” prefix. I’m guessing that the change in syntax is because of some other dependencies and/or potential clashes between the Biml compiler and SSIS itself.

This has been a good introductory-level series so far; check it out if you haven’t already.

Launching A Sparklyr Cluster

David Smith shows how to launch a sparklyr cluster in Azure:

When you’re finished, shut down your cluster using the aztk spark cluster delete command. (While you can delete the nodes from the Pools view in the Azure portal, the command does some additional cleanup for you.) You’ll be charged for each node in the cluster at the usual VM rates for as long as the cluster is provisioned. (One cost-saving option is to use low-priority VMs for the nodes, for savings of up to 90% compared to the usual rates.)

That’s it! Once you get used to it, it’s all quick and easy; the longest part is waiting for the cluster to spin up in Step 5. This is just a summary, but for the full details, see the guide SparklyR on Azure with AZTK.

It’ll take a bit more than five minutes to get started, but it is a good sight easier than building the servers yourself.

