VariantSpark RF starts by randomly assigning subsets of the data to Spark Executors for decision tree building (Fig 1). It then calculates the best split over all nodes and trees simultaneously. This implementation avoids communication bottlenecks between the Spark Driver and Executors, as information exchange is minimal, allowing it to build large numbers of trees efficiently. This surveys the solution space appropriately to cater for millions of features and thousands of samples.
Furthermore, VariantSpark RF has a memory-efficient representation of genomic data, optimized communication patterns, and computation batching. It also provides an efficient implementation of Out-Of-Bag (OOB) error, which substantially simplifies parameter tuning over the computationally more costly alternative of cross-validation.
We implemented VariantSpark RF in Scala, as it is the most performant interface language to Apache Spark. Also, new updates to Spark and the interacting APIs are deployed in Scala first, which has been important when working on top of a fast-evolving framework.
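The OOB idea the excerpt leans on, that every bootstrap sample leaves roughly a third of the rows aside as free validation data, can be sketched in a few lines. This is illustrative Python only, not VariantSpark's Scala implementation:

```python
import random

random.seed(0)
n_samples, n_trees = 200, 500

# Each "tree" trains on a bootstrap sample (drawn with replacement).
# Rows the sample missed are out-of-bag (OOB) for that tree and can
# score it without a separate holdout or cross-validation folds.
oob_counts = [0] * n_samples
for _ in range(n_trees):
    in_bag = {random.randrange(n_samples) for _ in range(n_samples)}
    for row in range(n_samples):
        if row not in in_bag:
            oob_counts[row] += 1

# On average a row is OOB for about exp(-1), i.e. ~36.8%, of the trees,
# so every row accumulates plenty of "unseen" predictions to average over.
avg_oob_fraction = sum(oob_counts) / (n_samples * n_trees)
print(f"average OOB fraction: {avg_oob_fraction:.3f}")
```

Aggregating each row's predictions from only the trees that never saw it yields the OOB error estimate at no extra training cost, which is why it is the cheap alternative to cross-validation the excerpt describes.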
Give it a read. Thankfully, I exhibit few of the traits of the degenerative disease known as Hipsterism.
Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.
Connect to External Data Sources using
I’ve been impressed with sparklyr so far.
Each of these people can be represented as a point in a 3-dimensional space. As a gross approximation, each person fits inside a 50*50*200 (cm) box. If we use a resolution of 1 cm and three color channels, each person can be represented by 1,500,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200 = 10,000 variables.
The number of variables has been divided by 150! And if your goal is to detect human vs cat, or even men vs women, the data from the shadow may be enough.
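Spelling out the arithmetic (using the post's assumptions of a 50×50×200 cm box, 1 cm resolution, and three color channels):

```python
# Variables for a full-color 3D scan at 1 cm resolution:
# a 50 x 50 x 200 cm bounding box, 3 color channels per voxel.
person_vars = 50 * 50 * 200 * 3   # 1,500,000

# Variables for the 2D black-and-white shadow: 50 x 200 pixels.
shadow_vars = 50 * 200            # 10,000

reduction = person_vars // shadow_vars  # 150x fewer variables
print(person_vars, shadow_vars, reduction)
```

The point survives the exact numbers: projecting away dimensions that don't matter for the task shrinks the feature space by orders of magnitude.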
Read on for intuitive discussions of techniques like principal component analysis and linear discriminant analysis. H/T R-Bloggers
Space got tight on a drive and I knew that there was space on another drive. I had already set this particular database with multiple secondary file groups/files (.ndf files) instead of a huge and single .mdf file (which would have made the whole thing a lot harder).
The procedure to relocate data files is extremely simple.
Click through for Jana’s seven salvos of administrator success.
By default, ReadyRoll will ignore tSQLt objects, including our tests. We don’t want ReadyRoll to script out the tSQLt objects, but we do want it to script our tests. To set our filter we need to unload the project in VS and edit the project file. Add the following to the section named ReadyRoll Script Generation Section:
James’s series is really coming together at this point, so if you haven’t been reading, check out the links in his post.
For July 2017’s event, the proposed topic asked all of you to share those little secrets that made your tummy burn after pressing F5.
Since early in the morning I’ve been reading your posts, which makes me very happy and makes me feel the topic was certainly well received by the community.
In order of publication date, these are the posts that took part in this month’s event.
Click through to see the 17 entries this month.
I have written articles before about how you can extract measures from a data model using DAX Studio, and also using Power Pivot Utilities. These are both excellent tools in their own right, and I encourage you to read those previous articles to learn more about them. Today, however, I am going to share another way you can extract a list of measures from an Excel Power Pivot workbook without needing to install either of these two (excellent) software products. I often get called in to help people with their workbooks, and sometimes they don’t have the right software installed for me to extract the list of measures (i.e., DAX Studio or PPU). This article explains how to extract the measures quickly without installing anything.
Matt uses a simple SQL statement to pull measure data into an Excel table, making it easy to retain the set of measures. There are some built-in documentation possibilities with this.
SQL Server 2017 RC1 dropping recently reminded me of a couple things I wanted to blog about finding in there. One that I thought was rather interesting is a new iTVF called dm_os_enumerate_filesystem. It looks like a partial replacement for xp_cmdshell in that, well, you can do the equivalent of running a dir command, with some filtering.
The addition of a search filter is particularly nice, since the dir command isn’t exactly robust in that area. If you’ve ever wanted to filter on a date, well… That’s probably when PowerShell got invented.
If I run a simple call to the new function like so…
Click through to see Erik use the new function.
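Erik's post has the actual T-SQL calls. Purely for intuition, here is a rough Python analogue of a filtered, date-aware directory listing, the combination the excerpt says plain dir lacks (the function name and demo files here are made up for illustration):

```python
import os
import tempfile
import time
from fnmatch import fnmatch

# Hypothetical analogue: list files under `path` matching `pattern`,
# optionally keeping only those modified after a given Unix timestamp.
def list_files(path, pattern="*", modified_after=None):
    for entry in os.scandir(path):
        if entry.is_file() and fnmatch(entry.name, pattern):
            if modified_after is None or entry.stat().st_mtime > modified_after:
                yield entry.name

# Demo on a throwaway directory with one stale and one fresh log file.
tmp = tempfile.mkdtemp()
for name in ("old.log", "new.log", "readme.txt"):
    open(os.path.join(tmp, name), "w").close()
os.utime(os.path.join(tmp, "old.log"), (0, 0))  # backdate to the epoch

recent_logs = sorted(list_files(tmp, "*.log", modified_after=time.time() - 3600))
print(recent_logs)  # ['new.log']
```

The name-pattern plus date-predicate combination is exactly the filtering story that makes the new DMF appealing compared to shelling out via xp_cmdshell.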
The last time we were together, we learned how to use the MERGE statement when we wanted to insert rows that didn’t exist and update rows that did. This time we’re going to add onto that: the seldom used but delightfully potent option to delete rows that no longer exist in the original table.
MERGE is an enticing but dangerous piece of syntax. It looks so nice until you realize how many bugs and oddities there are in the command.
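The semantics of the three MERGE branches (update matched rows, insert missing ones, delete rows absent from the source) boil down to simple set logic over the join keys. A minimal Python sketch of that logic, not T-SQL:

```python
# Synchronize `target` with `source`; both are dicts of key -> row.
def merge(target, source):
    for key, row in source.items():
        if key in target:
            target[key] = row      # WHEN MATCHED THEN UPDATE
        else:
            target[key] = row      # WHEN NOT MATCHED [BY TARGET] THEN INSERT
    for key in list(target):
        if key not in source:
            del target[key]        # WHEN NOT MATCHED BY SOURCE THEN DELETE

target = {1: "alpha", 2: "beta", 3: "gamma"}
source = {2: "BETA", 4: "delta"}
merge(target, source)
print(target)  # {2: 'BETA', 4: 'delta'}
```

Seeing the delete branch as "remove everything the source no longer vouches for" also makes clear why it is dangerous: an accidentally under-filtered source silently wipes target rows.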