Distributed Unit Testing

Cloudera shows off their distributed unit testing framework:

This distributed testing infrastructure started out as a Cloudera hackathon project in 2014. Todd Lipcon and I worked on a shared backend for running test tasks on a cluster, with Todd focusing on onboarding the Apache Kudu (incubating) tests, and myself on Apache Hadoop. Our prototype implementation reduced the runtime of the 1,700+ Hadoop unit tests from 8.5 hours to 15 minutes.

Since then, we’ve spent time improving the infrastructure and on-boarding additional projects. Besides Kudu and Hadoop, our distributed testing infrastructure is also being used by our Apache Hive and Apache HBase teams. We can now run all the Hadoop unit tests in less than 10 minutes!

Finally, we’re happy to announce that both our infrastructure and code are public! You can browse the webUI at http://dist-test.cloudera.org and see all the source code (ASLv2 licensed) at the cloudera/dist_test github repository. This infrastructure is already being used at upstream Apache to run the Kudu pre-commit tests.

This is an interesting look at how to scale out unit tests.  It’s a bit of a long read (especially with all the videos) but worth your time.

Microsoft Atop Hadoop Cloud Solutions

Forrester has named Microsoft a leader in the Hadoop cloud solutions space:

This week, we’re excited that Forrester recognized Microsoft Azure as a leader in their Big Data Hadoop Cloud Solutions. Apache Hadoop as a technology has become popular amongst organizations to unlock insights from data of all size, shape, and speed. Hadoop power solutions to help businesses improve their performance, educators to better connect with the needs of their students, medical professionals to improve the quality of their care, or researchers to accelerate new advancements in science.

As an example, Ultra Tendency uses Hadoop to achieve something not possible before – visualize more than 27 million distinct sensor readings to give Japanese citizens accurate, up-to-date information about the radiation contamination from the Fukushima nuclear plant meltdown. More and more organizations are also deploying Hadoop in the cloud with 47% of Forrester’s respondents to a 2015 survey increasing their cloud deployments either by 5-10% (37%) or more than 10% (10%).1 This makes sense because the cloud allows you to scale elastically on demand to handle the processing of any amount of data.

AWS and IBM also have very good solutions, and Google is trying to get a stronger foothold on the cloud game.

HBase’s Failure To Catch On

Kevin Feasel



Matt Asay has an interesting article on how HBase started as a big thing but has fizzled since:

Ex-Googler (and current Amazon Web Services employee) Tim Bray argues “there is a real cost to this continuous widening of the base of knowledge a developer has to have to remain relevant.” RedMonk analyst Stephen O’Grady takes this a step further: “It could be that we’re approaching the too-much-of-a-good-thing stage. In which case, the logical outcome will be a gradual slowing of fragmentation followed by gradual consolidation.”

In other words, niche data stores that do one thing really well are giving way to more generally applicable databases that can serve a broader range of enterprise needs.

The second part of Keep’s sentence above, however, spells out another reason HBase is struggling: It’s really hard to use.

I have a statement which is 90% serious and 10% joke:  a database product is truly mature once it supports SQL.  So what’s the answer for HBase?  The current attempt at an answer is Phoenix, which is…SQL for HBase.

Hortonworks Revenue Growth

Kevin Feasel



Alex Woodie reports that Hortonworks has seen its revenue grow 85% 1Q YOY:

Support subscription revenue during the quarter was up sharply from $13.1 million to $27.6 million, an increase of 110 percent compared to the first quarter of 2015, which was Hortonworks’ first quarter as a public company following an IPO in late 2014. Professional services revenue accounted for $13.7 million in revenue, a 49 percent increase.

Hortonworks holds about 40% of the Hadoop market share, with Cloudera holding another 40%.

Hadoop And SQL Server Are Complements

Kevin Feasel



Jim Scott explains that Hadoop and relational databases solve different problems:

That’s the basics. Peeling back the onion more reveals other distinct differences, further making the case more strongly for a Hadoop-RDBMS coexistence strategy. RDBMS has the backing of the biggest names in the software industry, and as such has fostered an install base of IT talent probably second to none. RDBMS integrate very well with other systems, and represent a very mature technology having venerable, 40-year old roots. RDBMS are baked into the very fabric of just about every mid-to large sized IT organization in the world. Believe it – RDBMS aren’t going away any time soon, nor should they.

Relational databases have a strong mathematical footing which provides unparalleled data integrity.  Hadoop has a strong mathematical footing which provides near-linear scale out.  The key is knowing the problem you need to solve and how to integrate the results.

Recalculating Days

Brian Mitchell shows how to re-calculate prior days in Azure Data Lake using partitioning:

The question is what is the right time period to use? The answer is it depends on the size of your partitions.  Generally, for managed tables in U-SQL, you want to target about 1 GB per partition.  So, if you are bringing in say 800 mb per day then daily partitions are about right.  If instead you are bringing in 20 GB per day, you should look at hourly partitions of the data.

In this post, I’d like to take a look at two common scenarios that people run into.  The first is full re-compute of partitions data and the second is a partial re-compute of a partition.  The examples I will be using are based off of the U-SQL Ambulance Demo’s on Github and will be added to the solution for ease of your consumption.

The ability to reprocess data is vital in any ETL or ELT process.

Simplifying Spark Application Development

Ian Hellstrom has scripts to simplify Apache Spark application rollout:

When creating Apache Spark applications the basic structure is pretty much the same: for sbt you need the same build.sbt, the same imports, and the skeleton application looks the same. All that really changes is the main entry point, that is the fully qualified class. Since that’s easy to automate, I present a couple of shell scripts that help you create the basic building blocks to kick-start Spark application development and allow you to easily upgrade versions in the configuration.

Check these out if you’re interested in Spark.

Architecting Semi-Structured Data Solutions

James Serra gives four architectural scenarios for handling large quantities of semi-structured data:

An evolution of the three previous scenarios that provides multiple options for the various technologies.  Data may be harmonized and analyzed in the data lake or moved out to a EDW when more quality and performance is needed, or when users simply want control.  ELT is usually used instead of ETL (see Difference between ETL and ELT).  The goal of this scenario is to support any future data needs no matter what the variety, volume, or velocity of the data.

Hub-and-spoke should be your ultimate goal.  See Why use a data lake? for more details on the various tools and technologies that can be used for the modern data warehouse.

Check it out for a high-level architectural view of contemporary warehousing choices.  I prefer having both systems in play:  the EDW answers known business questions and gives you back report information relatively quickly; whereas the Hadoop cluster allows you to do spelunking, data cleansing, and answer unanticipated business questions.


Ayman El-Ghazali talks Polybase:

HDFS is a distributed file system that works differently than what we’re used to in the Windows OS side of things; the general principle is to use cheap commodity hardware that replicates data in order to account for availability and to prevent loss of data. With that in mind, it makes a great use case to store a lot of data cheaply for archiving purposes or can be used to store large quantities of data that been to be processed in large quantities as well.

For more information please visit: https://msdn.microsoft.com/en-us/library/mt143171.aspx

Now if you want to try it out for yourself, make sure you install the PolyBase Engine (from the SQL Server setup) and feel free to try the modified code sample below.

Polybase is, without a doubt, my favorite SQL Server 2016 feature.  I am excited to put this through its paces in a production environment.

Spark + R Webinar

Kevin Feasel


Hadoop, R, Spark

David Smith points out a recent webinar on combining Microsoft R Server with HDInsight:

As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.

I don’t normally link to webinars (because they tend to violate my “should be viewable in a coffee break” rule of thumb) but I have a soft spot in my heart for these technologies.  If you want to dig into more “mainstream” (off the Microsoft platform) Spark + R fun, check out SparkR.


August 2017
« Jul