Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem:

H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and Both APIs use the same underlying algorithm implementations, however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.

H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.

The post only has a few lines of code but there are a lot of working parts under the surface.

Erasure Coding In Hadoop

Guy Shilo explains erasure coding, a new feature in Hadoop 3:

The benefits are, of course, space-saving, and for large files also improved performance (blocks striped across datanodes can be read in parallel, and less blocks are written because there is no x3 replication). The larger the file the more notable is the performance gain.

Erasure encoding is disabled by default and you can enable it for only certain directories in HDFS. Some articles like this one suggest thatbest practice is to enable Erasure coding only for “cold” data that you do not write often, and for “hot” data use regular replication. However, in my tests I did not witness any problem dealing with hot data (maybe it’s evident in larger scales).

Click through for the full story on how it works.

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto:

Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. Hive is made up of over one million lines of Java and 100K lines of C++ code. Any libraries they share are out-weighted by the unique approaches they’ve taken in the architecture surrounding their SQL parsers, query planners, optimizers, code generators and execution engines when it comes to tabular form conversion.

I recently benchmarked Spark 2.4.0 and Presto 0.214 and found that Spark out-performed Presto when it comes to ORC-based queries. In this post I’m going to examine the ORC writing performance of these two engines plus Hive and see which can convert CSV files into ORC files the fastest.

The results surprised me.

Cloudera Data Platform

Kevin Feasel



Alex Woodie reports on the new Cloudera’s business plan:

“Once we’ve delivered that and got past it, we then want to get to a second subsequent version, which you can start to upgrade and migrate to, and that will be the go-forward platform,” he said. “Obviously the key part of CDP is delivering not just the workloads you have today but new and intuitive experiences around key workloads such as data warehousing, data flow, the edge or streaming, AI and machine learning.”
The company also announced that CDH 5.x and 6.x and HDP 3.x will be supported through January 2022, which is in-line with previous guidance the company has given. This company believes that three years is plenty of time for customers to plan their migration paths from older CDH and HDP versions to the unified CDP product. Support for HDP 2.x will end before that time.

Also of interest: the integration of Hortonworks Data Flow into CDH and Cloudera Data Science Workbench into HDP.

Converting Spark RDDs To DataFrames

Franco Cano shows us how we can convert a Resilient Distributed Dataset in Spark to a DataFrame:

Sometimes you need transform you RRDs to DataFrames because DataFrames have a lot optimization options.
Let’s see how this is done.

This works, but there is a performance cost to converting a large RDD to a DataFrame (or vice versa). With that in mind, sticking to one type when you can is typically better.

Data Transformation Tools In The Azure Space

James Serra gives us an overview of the major tools you would use for ETL and ELT in Azure:

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL).  Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data.  There is good news and bad news when it comes to which product to use.  The good news is there are a lot of products to choose from.  The bad news is there are a lot of products to choose from :-).  I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part

The only surprise is the non-mention of Azure Data Lake Analytics, and there is a good conversation in the comments section explaining why.

Apache Airflow Now A Top-Level Project

Fokko Driesprong announces that Apache Airflow is now a top-level Apache project:

Today is a great day for Apache Airflow as it graduates from incubating status to a Top-Level Apache project. This is the next step of maturity for Airflow. For those unfamiliar, Airflow is an orchestration tool to schedule and orchestrate your data workflows. From ETL to training of models, or any other arbitrary tasks. Unlike other orchestrators, everything is written in Python, which makes it easy to use for both engineers and scientists. Having everything in code means that it is easy to version and maintain.

Airflow has been getting some hype lately, especially in the AWS space.

Databricks Library Utilities For Notebooks

Srinath Shankar and Todd Greenstein announce a new feature in Databricks Runtime 5.1:

We can see that there are no libraries installed and scoped specifically to this notebook.  Now I’m going to install a later version of SciPy, restart the python interpreter, and then run that same helper function we ran previously to list any libraries installed and scoped specifically to this notebook session. When using the list() function PyPI libraries scoped to this notebook session are displayed as  <library_name>-<version_number>-<repo>, and (empty) indicates that the corresponding part has no specification. This also works with wheel and egg install artifacts, but for the sake of this example we’ll just be installing the single package directly.

This does seem easier than dropping to a shell and installing with Pip, especially if you need different versions of libraries.

A Compendium Of Kafka Links

Kevin Feasel



Manas Dash shares some interesting Kafka-related articles, case studies, and books:

1. Kafka in a Nutshell. Published on September 25, 2015, by Kevin Sookocheff. Kevin’s article is all about Kafka in a nutshell. He says “Kafka is quickly becoming the backbone of many organization’s data pipelines — and with good reason. By using Kafka as a message bus we achieve a high level of parallelism and decoupling between data producers and data consumers, making our architecture more flexible and adaptable to change.” If you have not read about Kafka yet, you must go through it. This is more like an executive summary of the what, where, and why of Kafka.

Read on for several more articles, as well as a few case studies and two books.

Generating Test Data In Kafka

Yeva Byzek takes us through the Kafka Connect Datagen connector:

Short of using real data from a real source, you do have a few options on how to generate more interesting test data for your topics. One option is to write your own client. Kafka has many programming language options—you choose: Java, Python, Go, .NET, Erlang, Rust—the list goes on. You can write your own Kafka client applications that produce any kind of records to a Kafka topic, and then you’re set.
But wouldn’t it be great if you could generate data locally to just fill topics with messages? Fortunately, you’re in luck! Because we have those data generators.

Click through for a demonstration.


February 2019
« Jan