When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark:

There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and data science to a large dataset, connecting BI/Visualization tools, etc.
But its no panacea, right?

Let’s consider the cases where using Spark would be no less than a nightmare.

No tool is perfect at everything. Click through for a few use cases where the Spark experience degrades quickly.

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow:

Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.

Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.

With this feature, PySpark CrossValidator and TrainValidationSplit will automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric.  For example, calling CrossValidator.fit() will log one parent run.  Under this run, CrossValidator will log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric.  Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.

Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.

TensorFrames: Spark Plus TensorFlow

Adi Polak gives us an introduction to TensorFrames:

In all TensorFrames functionality, the DataFrame is sent together with the computations graph. The DataFrame represents the distributed data, meaning in every machine there is a chunk of the data that will go through the graph operations/ transformations. This will happen in every machine with the relevant data. Tungsten binary format is the actual binary in-memory data that goes through the transformation, first to Apache Spark Java object and from there it is sent to TensorFlow Jave API for graph calculations. This all happens in the Spark Worker process, the Spark worker process can spin many tasks which mean various calculation at the same time over the in-memory data.

An interesting bit of turnabout here is that the Scala API is the underdeveloped one; normally for Spark, the Python API is the Johnny-Come-Lately version.

MLflow 1.0 Released

Clemens Mewald and Matei Zaharia announce the release of MLflow 1.0:

Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package download rate has reached close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.

The release is publicly available starting today. Install MLflow 1.0 using PyPl, read our documentation to get started, and provide feedback on GitHub. Below we describe just a few of the new features in MLflow 1.0. Please refer to the release notes for a full list.

And it looks like they’re going to keep pushing on it from there.

Connecting PolyBase to Spark

I have a blog post connecting PolyBase to a Spark cluster:

If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:

import org.apache.spark.sql.functions._
INT(SUMLEV) AS SummaryLevel,
POPTYPE AS PopulationType,
FROM NorthCarolinaPopulation

It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.

An Introduction to Azure Databricks

Brad Llewellyn has an introduction to Azure Databricks:

So, what is Azure Databricks?  To answer this question, let’s start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

It’s Hadoop turtles all the way down.

Using Notebooks with ElasticMapReduce

Vignesh Rajamani and Nikki Rouda show off ElasticMapReduce Notebooks:

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as a one node. This gives you a budget-friendly sandbox environment to develop your Spark application.

Notebooks are everywhere. And for good reason.

From pandas to Spark with koalas

Achilleus tries out Koalas:

Python is widely used programming language when it comes to Data science workloads and Python has way too many different libraries to back this fact. Most of the data scientists are familiar with Python and pandas mostly. But the main issue with Pandas is it works great for small and medium datasets but not so good on big-data workloads. The challenge now becomes to convert the existing pandas code to pyspark code. This is just not straight forward and has a lot of performance hits if python UDFs are used without much care.

Koalas tries to address the first problem ie lessen the friction of learning different APIs to port their existing Pandas code to Pyspark. With Koalas, we can just directly replace the existing pandas code with Koalas. As far as the performance goes, there are no numbers yet as it is still in the initial phase of the project. But this definitely looks promising though.

Read on for some initial thoughts on the product, including a few gotchas.

Overriding Spark Dependencies

Landon Robinson shows how to override a Spark dependency located on the classpath:

This doesn’t draw the line exactly where the method changed from private to public, but generally speaking:
– gson-2.2.4.jar: the method is private, and therefore too old for use here
– gson-2.6.1: the method is public, and works fine.
Somewhere between the two, the method’s status changed.

So, because I had some functionality that required the method be public and accessible, it was important I specify the right version in my dependency manager (SBT). “That’s easy,” I thought. “No problem.”

Spoilers: there was a problem.

ACID Transactions on Spark

Achilleus explains one of the big announcements at Spark+AI Summit 2019:

Delta Lake is basically a compute layer that would sit on top of your existing On Prem HDFS cluster, your favourite Cloud storage or even run it locally on your laptop(Best part)! Data is stored on the above-mentioned storage as versioned Parquet files. Any data that is read using Spark can be used to read and write with Delta Lake. Delta lakes provides an unified platform to support both Batch Processing and Stream processing workloads on a single platform.

Read on to understand just how useful this is.


June 2019
« May