Scaling Out Random Forest

Denis C. Bauer, et al, explain VariantSpark RF, a random forest algorithm designed for huge numbers of variables:

VariantSpark RF starts by randomly assigning subsets of the data to Spark Executors for decision tree building (Fig 1). It then calculates the best split over all nodes and trees simultaneously. This implementation avoids communication bottlenecks between Spark Driver and Executors as information exchange is minimal, allowing it to build large numbers of trees efficiently. This surveys the solution space appropriately to cater for millions of features and thousands of samples.

Furthermore, VariantSpark RF has memory efficient representation of genomics data, optimized communication patterns and computation batching. It also provides efficient implementation of Out-Of-Bag (OOB) error, which substantially simplifies parameter tuning over the computationally more costly alternative of cross-validation.

We implemented VariantSpark RF in scala as it is the most performant interface languages to Apache Spark. Also, new updates to Spark and the interacting APIs will be deployed in scala first, which has been important when working on top of a fast evolving framework.

Give it a read.  Thankfully, I exhibit few of the traits of the degenerative disease known as Hipsterism.

sparklyr 0.6 Released

Javier Luraschi announces sparklyr 0.6:

We’re excited to announce a new release of the sparklyr package, available in CRAN today! sparklyr 0.6 introduces new features to:

  • Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.

  • Connect to External Data Sources using spark_read_source()spark_write_source()spark_read_jdbc() and spark_write_jdbc().

  • Use the Latest Frameworks including dplyr 0.7DBI 0.7RStudio 1.1and Spark 2.2.

I’ve been impressed with sparklyr so far.

Reducing Dimensionality

Antoine Guillot explains some of the basic concepts of variable reduction in a data analysis:

Each of these people can be represented as points in a 3 Dimensional space. With a gross approximation, each people is in a 50*50*200 (cm) cube. If we use a resolution of 1cm and three color channels, then can be represented by 1,000,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200=10,000 variables.
The number of variables was divided by 100 ! And if your goal is to detect human vs cat, or even men vs women, the data from the shadow may be enough.

Read on for intuitive discussions of techniques like principal component analysis and linear discriminant analysis.  H/T R-Bloggers

Moving SQL Server Data Files

Jana Sattainathan walks us through the process of moving a SQL Server data file from one drive to another:

Space got tight on a drive and I knew that there was space on another drive. I had already set this particular database with multiple secondary file groups/files (.ndf files) instead of a huge and single .mdf file (which would have made the whole thing a lot harder).

The procedure to relocate data files is extremely simple

Click through for Jana’s seven salvos of administrator success.

Automating tSQLt Tests

James Anderson shows how to integrate tSQLt tests as part of his ReadyRoll pipeline:

By default, ReadyRoll will ignore tSQLt objects, including our tests. We don’t want ReadyRoll to script out the tSQLt objects, but we do want it to script our tests. To set our filter we need to unload the project in VS and edit the project file. Add the following to the section named ReadyRoll Script Generation Section:

James’s series is really coming together at this point, so if you haven’t been reading, check out the links in his post.

T-SQL Tuesday 92 Roundup

Raul Gonzalez wraps up T-SQL Tuesday #92:

On July 2017’s event the proposed topic was aimed for all you to share those little secrets that made your tummy burn after pressing F5.

Since early in the morning I’ve been reading your posts which makes me very happy and feel the topic was certainly well accepted by the community.

In order of published date these are the posts that took part in this month’s event.

Click through to see the 17 entries this month.

Finding DAX Measures In Use

Matt Allington shows an easy way to enumerate the DAX measures in a Power Pivot workbook in Excel:

I have written articles before about how you can extract measures from a data model using DAX Studio and also using Power Pivot Utilities.  These are both excellent tools in their own right and I encourage you to read up on those previous articles to learn more about these tools. Today however I am going to share another way you can extract a list of measures from an Excel Power Pivot Workbook without needing to install either of these 2 (excellent) software products.  I often get called in to help people with their workbooks and sometimes they don’t have the right software installed for me to extract the list of measures (ie DAX Studio or PPU).  This article explains how to extract the measures quickly without installing anything.

Matt uses a simple SQL statement to pull measure data into an Excel table, making it easy to retain the set of measures.  There are some built-in documentation possibilities with this.

Filesystem Enumeration DMV

Erik Darling points out a new DMV in SQL Server 2017:

SQL Server 2017 RC1 dropping recently reminded me of a couple things I wanted to blog about finding in there. One that I thought was rather interesting is a new iTVF called dm_os_enumerate_filesystem. It looks like a partial replacement for xp_cmdshell in that, well, you can do the equivalent of running a dir command, with some filtering.

The addition of a search filter is particularly nice, since the dircommand isn’t exactly robust in that area. If you’ve ever wanted to filter on a date, well… That’s probably when PowerShell got invented.

If I run a simple call to the new function like so…

Click through to see Erik use the new function.

MERGE With Deletion

Kevin Wilkie shows an example of deleting data as part of a merge operation:

The last time we were together, we learned how to use the MERGE statement when we wanted to insert rows that didn’t exist and update rows that didn’t. This time we’re going to add onto that. We’re adding the seldom used, but delightfully potent – delete rows that no longer exist in the original table.

MERGE is an enticing but dangerous piece of syntax.  It looks so nice until you realize how many bugs and oddities there are in the command.


August 2017
« Jul