There are three major concepts for us to understand about Azure Databricks, Clusters, Code and Data. We will dig into each of these in due time. For this post, we’re going to talk about Clusters. Clusters are where the work is done. Clusters themselves do not store any code or data. Instead, they operate the physical resources that are used to perform the computations. So, it’s possible (and even advised) to develop code against small development clusters, then leverage the same code against larger production-grade clusters for deployment. Let’s start by creating a small cluster.
Read on for an example.
We’ve partnered with the Data Services team at Amazon to bring the Glue Catalog to Databricks. Databricks Runtime can now use Glue as a drop-in replacement for the Hive metastore. This provides several immediate benefits:
– Simplifies manageability by using the same glue catalog across multiple Databricks workspaces.
– Simplifies integrated security by using IAM Role Passthrough for metadata in Glue.
– Provides easier access to metadata across the Amazon stack and access to data catalogued in Glue.
There are some interesting changes in here.
There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and data science to a large dataset, connecting BI/Visualization tools, etc.
But its no panacea, right?
Let’s consider the cases where using Spark would be no less than a nightmare.
No tool is perfect at everything. Click through for a few use cases where the Spark experience degrades quickly.
Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools
TrainValidationSplit. These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.
Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.
With this feature, PySpark
TrainValidationSplitwill automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric. For example, calling
CrossValidator.fit()will log one parent run. Under this run,
CrossValidatorwill log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric. Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.
Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.
In all TensorFrames functionality, the DataFrame is sent together with the computations graph. The DataFrame represents the distributed data, meaning in every machine there is a chunk of the data that will go through the graph operations/ transformations. This will happen in every machine with the relevant data. Tungsten binary format is the actual binary in-memory data that goes through the transformation, first to Apache Spark Java object and from there it is sent to TensorFlow Jave API for graph calculations. This all happens in the Spark Worker process, the Spark worker process can spin many tasks which mean various calculation at the same time over the in-memory data.
An interesting bit of turnabout here is that the Scala API is the underdeveloped one; normally for Spark, the Python API is the Johnny-Come-Lately version.
Today we are excited to announce the release of MLflow 1.0. Since its launch one year ago, MLflow has been deployed at thousands of organizations to manage their production machine learning workloads, and has become generally available on services like Managed MLflow on Databricks. The MLflow community has grown to over 100 contributors, and the MLflow PyPI package download rate has reached close to 600K times a month. The 1.0 release not only marks the maturity and stability of the APIs, but also adds a number of frequently requested features and improvements.
The release is publicly available starting today. Install MLflow 1.0 using PyPl, read our documentation to get started, and provide feedback on GitHub. Below we describe just a few of the new features in MLflow 1.0. Please refer to the release notes for a full list.
And it looks like they’re going to keep pushing on it from there.
If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:
INT(SUMLEV) AS SummaryLevel,
INT(COUNTY) AS CountyID,
INT(PLACE) AS PlaceID,
BOOLEAN(PRIMGEO_FLAG) AS IsPrimaryGeography,
NAME AS Name,
POPTYPE AS PopulationType,
INT(YEAR) AS Year,
INT(POPULATION) AS Population
POPULATION <> 'A'
It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.
So, what is Azure Databricks? To answer this question, let’s start all the way at the bottom of the hole and climb up. So, what is Hadoop? Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations. It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things. However, Hadoop solutions are extremely complex to manage and develop. So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems. Apache Spark is one of these platforms. You can read more about Apache Hadoop here and here.
It’s Hadoop turtles all the way down.
One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.
You can also connect EMR Notebooks to an EMR cluster as small as a one node. This gives you a budget-friendly sandbox environment to develop your Spark application.
Notebooks are everywhere. And for good reason.
Python is widely used programming language when it comes to Data science workloads and Python has way too many different libraries to back this fact. Most of the data scientists are familiar with Python and pandas mostly. But the main issue with Pandas is it works great for small and medium datasets but not so good on big-data workloads. The challenge now becomes to convert the existing
pysparkcode. This is just not straight forward and has a lot of performance hits if python UDFs are used without much care.
Koalas tries to address the first problem ie lessen the friction of learning different APIs to port their existing Pandas code to Pyspark. With Koalas, we can just directly replace the existing pandas code with Koalas. As far as the performance goes, there are no numbers yet as it is still in the initial phase of the project. But this definitely looks promising though.
Read on for some initial thoughts on the product, including a few gotchas.