Hadoop For .NET Developers

Kevin Feasel



Elton Stoneman has a new Pluralsight course out:

My latest Pluralsight course is out now:

Hadoop for .NET Developers

It takes you through running Hadoop on Windows and using .NET to write MapReduce queries – proving that you can do Big Data on the Microsoft stack.

The course has five modules, starting with the architecture of Hadoop and working through a proof-of-concept approach, evaluating different options for running Hadoop and integrating it with .NET.

I’ve liked Elton’s courses, as he’s one of the few trainers who really takes the time to show how you can integrate .NET languages into a Hadoop ecosystem; the general philosophy elsewhere is “go learn Java and Scala and Python and …”

Self-Paced HDInsight Training

Ashish Thapliyal introduces three EdX courses on HDInsight:

Implementing Real-Time Analysis with Hadoop in Azure HDInsight

In this four-week course, you’ll learn how to implement low-latency and streaming Big Data solutions using Hadoop technologies like HBase, Storm, and Spark on Microsoft Azure HDInsight.

Course Syllabus

Use HBase to implement low-latency NoSQL data stores.
Use Storm to implement real-time streaming analytics solutions.
Use Spark for high-performance interactive data analysis.

These are free courses on EdX.  I personally wouldn’t bother getting the certificate, but hey, it’s your money.

Hortonworks HDP 2.5 Available

Hortonworks has a new version of their data platform, 2.5:

We are very pleased to announce that the Hortonworks Data Platform (HDP) Version 2.5 is now generally available for download. As part of an Open and Connected Data Platforms offering from Hortonworks, HDP 2.5 brings a variety of enhancements across all elements of the platform spanning data science, data access to security to governance.

At Hadoop Summit 2016 San Jose on 06/28/2016, we unveiled the latest innovation package within Hortonworks Data Platform 2.5.

The top points of interest:  Spark 2, Kafka 0.10.0, Ambari 2.4, and Storm 1.0.1.  These are four big projects with major improvements.  Looks like I’ve got something to do this weekend…

Turbo LogShip

Richie Lee announces a tool to make log shipping more powerful:

To resolve this, you can restore files under NORECOVERY, then switch to STANDBY: when restoring a log backup, you have two restore choices: NORECOVERY and STANDBY. Both these choices will allow further log restores, but STANDBY is the option to choose if you want the database to be read-only. NORECOVERY leaves the database in a transactionally inconsistent state: it does not roll back uncommitted transactions into a tuf file. So it is possible to restore the log files in NORECOVERY mode, and then restore a final log with the STANDBY option to enable the database to be read-only (it is pretty neat that you can switch between STANDBY and NORECOVERY in this way). We can do this because we honestly don’t care about all those in-between restores being transactionally consistent. Sadly, this option is not an out-of-the-box operation, and so requires writing a custom job to restore the log files. I’ve read about a few methods online to achieve this, and I have written my own custom restore process.

Check out Richie’s project on GitHub.
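
To make the restore ordering concrete, here is a minimal sketch in Python that only builds the T-SQL text: every log backup except the last is restored WITH NORECOVERY, and the final one WITH STANDBY. This is not Richie's implementation; the function name, database name, and file paths are made up for illustration, and nothing here connects to SQL Server.

```python
def build_restore_script(database, log_backups, undo_file):
    """Return RESTORE LOG statements for log_backups, in the order given.

    All backups but the last use NORECOVERY (fast, no undo file written);
    the last uses STANDBY so the database ends up readable.
    Note: names and paths are interpolated as-is, with no quoting or
    escaping, so this is for illustration only.
    """
    if not log_backups:
        return []

    statements = []
    for path in log_backups[:-1]:
        statements.append(
            f"RESTORE LOG [{database}] FROM DISK = N'{path}' WITH NORECOVERY;"
        )

    # The final restore switches to STANDBY, which writes the undo (tuf) file.
    statements.append(
        f"RESTORE LOG [{database}] FROM DISK = N'{log_backups[-1]}' "
        f"WITH STANDBY = N'{undo_file}';"
    )
    return statements


if __name__ == "__main__":
    # Hypothetical database and file names.
    for statement in build_restore_script(
        "Sales", ["Sales_0100.trn", "Sales_0115.trn", "Sales_0130.trn"], "Sales.tuf"
    ):
        print(statement)
```

The point of the ordering is that only one restore pays the cost of producing the undo file, no matter how many log backups are queued up.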

Benchmarking Azure SQL Database Wait Stats

John Sterrett explains wait stats and which stats are most important for Azure SQL Database:

With an instance of SQL Server, regardless of using IaaS or on-premises, you would want to focus on all the waits that are occurring in your instance because the resources are dedicated to you.

In database as a service (DaaS), Microsoft gives you a special DMV that makes troubleshooting performance in Azure easier than any other competitor.  This feature is the dm_db_wait_stats DMV.  This DMV allows us specifically to get the details behind why our queries are waiting within our database and not the shared environment.  Once again, it is worth repeating: wait statistics for our database in a shared environment.

Click through for a stored procedure John has provided to collect wait stats in a Waits schema.
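
Once you have collected snapshots, the usual next step is to compare two of them, because the wait time columns in the DMV are cumulative counters. Here is a rough Python sketch of that comparison. It is not John's stored procedure; the function name and the sample numbers are invented, and each snapshot is assumed to be a simple mapping of wait type to cumulative wait_time_ms.

```python
def wait_stats_delta(before, after):
    """Return (wait_type, added_wait_ms) pairs, largest increase first.

    before and after map wait_type -> cumulative wait_time_ms at two
    points in time. Wait types with no increase are left out.
    """
    delta = {}
    for wait_type, after_ms in after.items():
        # A wait type missing from the earlier snapshot started at zero.
        increase = after_ms - before.get(wait_type, 0)
        if increase > 0:
            delta[wait_type] = increase
    return sorted(delta.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented sample values, purely for illustration.
    earlier = {"PAGEIOLATCH_SH": 1000, "WRITELOG": 500}
    later = {"PAGEIOLATCH_SH": 1800, "WRITELOG": 500, "SOS_SCHEDULER_YIELD": 300}
    for wait_type, added_ms in wait_stats_delta(earlier, later):
        print(wait_type, added_ms)
```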

Visualizing NFL Data

Allison Tharp looks at NFL play-by-play data using R:

Let’s look at how teams played on offense depending on where they were on the field (their yardline) and the down they were on.  The fields in our dataframe that we will care about here are yfog (yards from own goal), type (rush or pass), dwn (current down number: 1, 2, 3, or 4).  We will want a table with each of these columns as well as a sum column.  That way, we can see how many times a pass attempt was done on the 4th down when a team was X yards from their own goal.

To do this, we will use a package called plyr.  The Internet says that this package makes it easy for us to split data, mess with it, and then put it back together.  I am not convinced the tool is easy, but I haven’t spent too much time with it.

Check it out for some ideas on what you can do with R.
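
The split-count-combine step Allison describes is small enough to sketch outside of R. Below is an analogous version in plain Python rather than plyr, so it is not her code. The column names yfog, type, and dwn come from the post; the function name and the sample rows are invented.

```python
from collections import Counter


def count_plays(plays):
    """Count plays for each (yfog, type, dwn) combination.

    plays is an iterable of dicts with the keys yfog (yards from own goal),
    type ("RUSH" or "PASS" in this sketch), and dwn (down number).
    """
    return Counter((play["yfog"], play["type"], play["dwn"]) for play in plays)


if __name__ == "__main__":
    # Invented sample rows, not real play-by-play data.
    sample = [
        {"yfog": 35, "type": "PASS", "dwn": 4},
        {"yfog": 35, "type": "PASS", "dwn": 4},
        {"yfog": 35, "type": "RUSH", "dwn": 4},
        {"yfog": 20, "type": "RUSH", "dwn": 1},
    ]
    for (yfog, play_type, dwn), total in sorted(count_plays(sample).items()):
        print(yfog, play_type, dwn, total)
```

Each key in the result is one row of the table the post asks for, and the count is the sum column.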


David Kun introduces the R Database Layer:

It is important to note that the SQL statements generated in the background are not executed unless explicitly requested by the command as.data.frame. Hence, you can merge, filter and aggregate your dataset on the database side and load only the result set into memory for R.

In general the design principle behind RDBL is to keep the models as close as possible to the usual data.frame logic, including (as shown later in detail) commands like aggregate, referencing columns by the $ operator and features like logical indexing using the [] operator.

Check it out.  I’m not particularly excited about this for one simple reason:  SQL is a better data retrieval and connection DSL than an R-based mapper.  I get the value of sticking to one language as much as possible.  I also get that not all queries need to be well-optimized—for example, you might be running queries on a local machine or against a slice of data which is not connected to an operational production environment.  But I’m a big fan of using the right tool for the job, and the right tool for working with relational databases (and the “relational” part there is perhaps optional) is SQL.
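
For anyone unfamiliar with the deferred-execution idea in the quote, here is a toy sketch in Python using the standard sqlite3 module. It is not RDBL's API: the LazyTable class and its where, to_sql, and collect methods are hypothetical names, with collect standing in for the role as.data.frame plays in the quote. Filters only accumulate SQL text, and nothing touches the database until collect is called.

```python
import sqlite3


class LazyTable:
    """Builds a SELECT statement step by step; runs it only on collect()."""

    def __init__(self, conn, table, filters=(), params=()):
        self.conn = conn
        self.table = table
        self.filters = tuple(filters)
        self.params = tuple(params)

    def where(self, condition, *params):
        # Return a new object; no SQL is executed here.
        return LazyTable(
            self.conn,
            self.table,
            self.filters + (condition,),
            self.params + params,
        )

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

    def collect(self):
        # The only place where the database is actually queried.
        return self.conn.execute(self.to_sql(), self.params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 50), (3, 90)])

    query = LazyTable(conn, "orders").where("amount > ?", 20).where("id < ?", 3)
    print(query.to_sql())   # still just text at this point
    print(query.collect())  # the filtered rows are fetched only now
```

Note that the sketch ends up generating SQL anyway, which is rather the point of the commentary above.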

Biml And Metadata

Ben Weissman provides an example of using metadata to drive conditional data loading:

Now that we’ve defined connections, databases and schemas we still need to add our table metadata.

We’re going to do that by looping across all our databases marked as a source in Biml, retrieving the list of required tables from SQL (located in View vMyBimlMeta_Tables) and creating a table tag for each table which will also reference back to the corresponding target system. That also allows us to use the same table names multiple times. Again, we’ll store some additional data in annotations.

This is an interesting concept.  Check it out.

Index Row Sizes

Kendra Little explains the rules behind how large a non-clustered index row can be:

So make sure you really need all that junk in your nonclustered index trunk. Er, key.

But even with the expanded size of key columns, sometimes I get asked a question: do columns that “secretly” get added to the key of a nonclustered index count against the maximum allowed nonclustered index key length?

You can read the short answer, but I recommend reading the full post.

Budapest satRday

The first satRdays event will take place in Budapest on September 3rd:

This is a very exciting project with great interest from the R and more general data science community — in the past short 2 months (since we opened registration for the conference):

  • More than 160 persons signed up and paid for attendance from 17 countries so far (around 50-50% mix of academic and industry tickets, 30-70% mix of foreign and Hungarian attendees)

  • We received almost 40 voluntary talk proposals in a few weeks of time while the CfP was open

  • 25 selected & awesome speakers agreed to present at the conference

I’d like to see this take off, similar to SQL Saturdays.


August 2016