Dr. Elephant: Where Does My Hadoop Cluster Hurt?

Carl Steinbach looks back at Dr. Elephant one year later:

What we needed to introduce to the job-tuning equation was a series of questions like those asked by a physician making a diagnosis: a step-by-step process that guides the user through the problem-solving process, while also educating them at the same time.

So we created Dr. Elephant, a system that automatically detects under-performing jobs, diagnoses the root cause, and guides the owner of the job through the treatment process. Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Perhaps most importantly, Dr. Elephant makes it easy to act on these insights by making job-level performance tuning accessible to users regardless of their previous skill level. In the process, Dr. Elephant has helped to ease the tension that previously existed between user productivity on one side and cluster efficiency on the other.

LinkedIn has made this project open source if you want to check it out in your environment.

TensorFlow With YARN

Wangda Tan and Vinod Kumar Vavilapalli show how to control TensorFlow jobs with YARN:

YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.

With TensorFlow, one can get started with deep learning without much knowledge about advanced math models and optimization algorithms.

If you have GPU-equipped hardware, and you want to run TensorFlow, going through the process of setting up hardware, installing the bits, and optionally also dealing with faults, scaling the app up and down etc. becomes cumbersome really fast. Instead, integrating TensorFlow to YARN allows us to seamlessly manage resources across machine learning / deep learning workloads and other YARN workloads like MapReduce, Spark, Hive, etc.

Read on for more details, including a demo video.

dplyr Tricks

Kevin Feasel



Bruno Rodrigues shares a few lesser-known tricks in R’s dplyr package:

Removing unneeded columns

Did you know that you can use - in front of a column name to remove it from a data frame?

There are a few good tricks in here.  H/T R Bloggers.

Rolling Out An Analytics Project

Christina Prevalsky shares some thoughts on considerations when implementing an analytics project:

The earlier you address data quality the better; the less time your end users spend on data wrangling, and the more they can focus on high value analytics. As your organization’s data infrastructure matures, migrating from spreadsheets to databases and data warehouses, data quality checks should be formally defined, documented, and automated. Exceptions should either be handled automatically during data intake using predefined business rules logic or require immediate user intervention to correct any errors.

Providing clean, centralized, and analytics-ready data to end users should not be a one-way process. By allowing end users to focus on high-value analytics, like data mining, network graphs, clustering, etc., they can uncover certain outliers and anomalies in the data. Effective data management should include a feedback loop to communicate these findings and, if necessary, incorporate any changes in the ETL processes, making centralized data management more dynamic and flexible.

The big question to ask is, “what problem are we trying to solve?”  That will help determine the answer to many of the questions, including how you store the data, how you expose the data, and even which data you collect and keep.

Air Travel Route Maps With ggplot2

Peter Prevos wants to create a pretty map of flights he’s taken:

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.

The end result is imperfect (as Peter mentions, ggmap isn’t wrapping around), but does fit the bill for being eye-catching.

Continuous Deployment In A Box

Ed Elliott has been working on a very interesting project:

What does this do?

Unblock-File *.ps1 – removes a flag that windows puts on files to stop them being run if they have been downloaded over the internet.
.\ContinuousDeploymentFTW.ps1 – runs the install script which actually:

  • Downloads chocolatey
  • Installs git
  • Installs Jenkins 2
  • Guides you how to configure Jenkins
  • Creates a local git repo
  • Creates a SSDT project which is configured with a test project and ssdt and all the references that normally cause people problems
  • Creates a local Jenkins build which monitors your local git repo for changes
  • When code is checked into the repo, the Jenkins job jumps into action and…

If you check into the default branch “master” then Jenkins:

  • Builds the SSDT project
  • Deploys the project to the unit test database
  • Runs the tSQLt unit tests
  • Generates a deployment script for the “production” database

and what you have there is continuous delivery in a box

Click through for a video where Ed shows how it all works.

Replication Extended Events

Drew Furgiuele goes hunting for the most dangerous creature of all, replication-related extended events:

Extended events are great; they have all the goodness of profiler except you don’t use profiler. Win/win! More to the point, extended events let you quickly and easily view, sort, and aggregate events that occur on your instances. They also have powerful filters (really, a “where” clause) to limit noise. You have way more control over what you monitor, how you store the data, and how you view and use it. This makes them perfect use to track replicated transactions, since we want to measure at both an individual level and the aggregate.

I fired up management studio and went to “New Session” looking for some replication event goodness and I found…

… nothing. I tried looking for events that had even parts of the name replication in it. No such thing, apparently.

This doesn’t deter Drew and he ends up building some interesting events to infer the correct answers.

Ignoring LoadGeneratorLocationError

Melissa Connors shows how to ignore LoadGeneratorLocationError errors in Visual Studio load tests:

I use Visual Studio for performance testing and overhead analysis with the SentryOne products. Currently, I have Microsoft Visual Studio Enterprise 2015 Version 14.0.25431.01 Update 3 installed. Since the first edition of 2015 (possibly even Visual Studio 2013), I’ve received a LoadGeneratorLocationError during each Load Test execution.

Since I am running the test locally, this error is noise. Furthermore, no one wants to see an error in an otherwise successful test. It simply ruins the final results report. In addition, when the Load Test was created, “On-premise Load Test” was selected, which makes this frustrating. Possibly more frustrating is that it’s called “On-premise” when you get started in the New Load Test Wizard.

Read on for the answer.

Unnecessary, Mandatory Work

Lukas Eder lays out one of the biggest performance drains today:

We’re using 8x as much memory in the database when doing SELECT * rather than SELECT film, rating. That’s not really surprising though, is it? We knew that. Yet we accepted it in many many of our queries where we simply didn’t need all that data. We generated needless, mandatory work for the database, and it does sum up. We’re using 8x too much memory (the number will differ, of course).

Now, all the other steps (disk I/O, wire transfer, client memory consumption) are also affected in the same way, but I’m skipping those.

This article is absolutely worth reading and sharing with developers.

Finding Physical Row Location

Wayne Sheffield shows how to find the physical location of a row in SQL Server:

Acquiring the physical location of a row

SQL Server 2008 introduced a new virtual system column: “%%physloc%%”. “%%physloc%%” returns the file_id, page_id and slot_id information for the current row, in a binary format. Thankfully, SQL Server also includes a couple of functions to split this binary data into a more useful format. Unfortunately, Microsoft has not documented either the column or the functions.

Read on for two functions you can use to format this data more nicely, as well as a short re-write Wayne did to improve performance of one of them.


March 2017
« Feb Apr »