Neural Nets On Spark

Nisha Muktewar and Seth Hendrickson show how to use Deeplearning4j to build deep learning models on Hadoop and Spark:

Modern convolutional networks can have several hundred million parameters. One of the top-performing neural networks in the Large Scale Visual Recognition Challenge (also known as “ImageNet”), has 140 million parameters to train! These networks not only take a lot of compute and storage resources (even with a cluster of GPUs, they can take weeks to train), but also require a lot of data. With only 30000 images, it is not practical to train such a complex model on Caltech-256 as there are not enough examples to adequately learn so many parameters. Instead, it is better to employ a method called transfer learning, which involves taking a pre-trained model and repurposing it for other use cases. Transfer learning can also greatly reduce the computational burden and remove the need for large swaths of specialized compute resources like GPUs.

It is possible to repurpose these models because convolutional neural networks tend to learn very general features when trained on image datasets, and this type of feature learning is often useful on other image datasets. For example, a network trained on ImageNet is likely to have learned how to recognize shapes, facial features, patterns, text, and so on, which will no doubt be useful for the Caltech-256 dataset.

This is a longer post, but on an extremely interesting topic.

Running H2O In R On Azure HDInsight

Daisy Deng shows how to configure HDInsight to be able to run the H2O package in R rather than Python or Scala:

We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for header node:

And run the following action for the worker node:

Please consult Customize Linux-based HDInsight clusters using Script Action for more details.

Click through for the full process.

Using Power Query’s Combine Feature

Matt Allington explains how to use the Combine feature in Power Query:

Now the good news is that somehow the combine button has done a reasonable job at combining the files (see 1 below).  Actually I wanted to unpivot the month columns, but I will come back to that later.

The bad news is all of the activity over on the left hand side.  What the…?  There are 5 additional “Queries” on the left (numbered 2 – 6) that do all sorts of things.  Let me tell you what each of these are and then come back and explain how to use/interpret these things.

2. This is a parameter that can be used to change the sample file

3. This is the link to the sample file that was selected originally (I selected the first file in the folder and this is the link to that file).

4. This is the “by example query” – the most important query to know about.

5. This is an auto generated function that goes with 4 above.

6. This is the final output query (it is the query that is displayed in 1 above).

Matt clearly explains the whole process, so read on if you need to combine files of Power Query.

Indirect Checkpoint And Non-Yielding Scheduler Problems

Parikshit Savjani has a post describing an issue you might experience with indirect checkpoint post SQL Server-2012:

One of the scenarios where skewed distribution of dirty pages in the DPList is common is tempdb. Starting SQL Server 2016, indirect checkpoint is turned ON by default with target_recovery_time set to 60 for model database. Since tempdb database is derived from model during startup, it inherits this property from model database and has indirect checkpoint enabled by default. As a result of the skewed DPList distribution in tempdb, depending on the workload, you may experience excessive spinlock contention and exponential backoffs on DPList on tempdb. In scenarios when the DPList has grown very long, the recovery writer may produce a non-yielding scheduler dump as it iterates through the long list (20k-30k) and tries to acquire spinlock and waits with exponential backoff if spinlock is taken by multiple IOC routines for removal of pages.

This is worth taking a close read.

Bacpacing In Azure

Derik Hammer shows how to use a bacpac file to deploy an existing database to Azure SQL Database:

The recommended method for working with Azure is always PowerShell. The Azure portal and SSMS are tools there for your convenience but they do not scale well. If you have multiple databases to migrate, potentially from multiple servers, using PowerShell will be much more efficient. Scripting your Azure work makes it repeatable and works towards the Infrastructure as Code concept.

In this demonstration, the below steps will be used.

  1. Export the bacpac file to a local directory with sqlpackage.exe.

  2. Copy the bacpac to Azure Blob Storage with AzCopy.exe

  3. Use the PowerShell AzureRM module and cmdlets to create an Azure SQL Database from the bacpac file.

Derik shows the point-and-click way as well as the Powershell way.

Remote Installation Of Powershell On SQL Server 2017

Tracy Boggiano has a script to install the SQL Server 2017 Powershell module via Powershell remoting:

Once the file is copied to the server locally you can use run the below script from your local machine to install SQL Server PowerShell 2017 for all user to use on the server.  The Get-CMSHosts function can be found on my blog here. You will need to download PsExec from here and extract to location on your local computer and provide the path.

NOTE: This is a hotfix that requires a RESTART. Be careful in PRODUCTION.

Click through for the script.

Polybase Design Patterns On Azure SQL DW

Simon Whiteley continues his Polybase on Azure SQL Data Warehouse series.  First, he covers data loading patterns:

That’s enough about data loading for now, there’s another major use case for Polybase that we haven’t yet discussed. Many data processing solutions have a huge, unwieldy overnight batch job that performs aggregates, lookups, analytics and various other calculations.

However, it is often the case that this is not timely enough for many business requirements. This is where Polybase can help.

If we have an External Table over the newest files, this will read these new records at query time. We can write a view that combines the External Table with our batch-produced table. This will obviously go a little slower than usual, given it has to read data from flat files each time, however the results returned will be up to date.

Simon then covers the Create Table As Select statement:

In order to utilise SQLDW effectively, we write SQL for our transformations, rather than relying on external tools such as SSIS. This ensures the work is being done by our compute nodes and, therefore, can be scaled up to increase performance.

General best practice, therefore, would be write stored procedures for each of the data movements we want to occur. This allows us to add in auditing, logging etc. But what we’re interested in here is the core data movement itself.

Writing a traditional INSERT statement isn’t the fastest way to get data into a table. There is a special syntax which creates a new table and inserts into it, that is automatically configured for optimal bulk loading, this is the CTAS, or “Create Table as Select” statement.

This is a pair of interesting posts from Simon.

Self-Analysis Of SQL Server Dump Files

Arun Sirpal walks through the SQL Server Diagnostics preview:

Notice the region to upload – If you are using a work machine I would suggest getting authorisation. The great thing here is that this is GDPR compliant.

Once ready hit the upload button, it goes through 3 phases. Upload, Analysis and a recommendation.

It sends your dump files to an external service, which is important enough to point out.  If you want more details on the product, Rony Chatterjee has a FAQ.

New T-SQL Functions

Dennes Torres shows off four T-SQL functions releasing in SQL Server 2017:


Translate does the work of several replace functions, simplifying some queries.

The function is called ‘Translate’ because its main objective: transform one kind of information in another by doing a bunch of replaces.

For example: GeoJson and WKT are two different formats for coordinates. In GeoJson a coordinate is represented using the format ‘[137.4, 72.3]’ while in WKT a point is represented using the format ‘(137.4 72.3)’.

We would need several ‘Replace’s to transform GeoJson format in WKT format and the reverse. The ‘Translate’ function can do this easily.

I knew the first three, but Translate sneaked right past me.


June 2017
« May Jul »