
Category: Spark

Kryo Serialization in Spark

Pinku Swargiary shows us how to configure Spark to use Kryo serialization:

If you need a performance boost and also need to reduce memory usage, Kryo is definitely for you. Serialization has the most impact on join and grouping operations, which usually involve data shuffling. The less data there is to shuffle, the faster the operation will be.
Caching also has an impact when caching to disk or when data is spilled over from memory to disk.

Also, if we look at the size metrics below for both Java and Kryo, we can see the difference.

Sounds like it’s better overall but requires some custom configuration.
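
As a rough idea of what that custom configuration involves, here is a minimal sketch that renders Kryo-related settings as `spark-submit` arguments. The property names are standard Spark settings, but the values, the `com.example.Person` class, and the `to_submit_args` helper are my own assumptions for illustration, not something taken from the linked post.

```python
# Spark properties that switch the JVM serializer from Java to Kryo.
# The property names are real Spark settings; the values are illustrative.
kryo_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Hypothetical class to register with Kryo; replace with your own types.
    "spark.kryo.classesToRegister": "com.example.Person",
    # Upper bound on the Kryo buffer; raise it if large objects fail to serialize.
    "spark.kryoserializer.buffer.max": "128m",
}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(kryo_conf)))
```

The same key/value pairs can go into `spark-defaults.conf` or be set on the session builder instead; the dict just keeps them in one place.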


Azure Databricks and Delta Lake

Brad Llewellyn starts a new series on Delta Lake in Azure Databricks:

Saving the data in Delta format is as simple as replacing the .format("parquet") function with .format("delta"). However, we see a major difference when we look at the table creation. When creating a table using Delta, we don't have to specify the schema, because the schema is already strongly defined when we save the data. We also see that Delta tables can be easily queried using the same SQL we're used to. Next, let's compare what the raw files look like by examining the blob storage container that we are storing them in.

There are some good demos in this post and it promises to be a nice series.
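
To make the format swap concrete, here is a small sketch. It assumes `df` is a Spark DataFrame; the function names, table name, and paths are hypothetical and are not taken from Brad's post.

```python
def save_table(df, path, fmt="delta"):
    """Write a Spark DataFrame to `path`.

    Passing fmt="parquet" instead is the only change needed to get Parquet.
    """
    df.write.format(fmt).mode("overwrite").save(path)


def create_table_sql(table_name, path):
    """Build a CREATE TABLE statement with no column list.

    Delta picks up the schema from the files already saved at `path`.
    """
    return f"CREATE TABLE {table_name} USING DELTA LOCATION '{path}'"
```

In a notebook, that might look like `save_table(df, "/mnt/delta/sales")` followed by `spark.sql(create_table_sql("sales", "/mnt/delta/sales"))`, after which the table can be queried with ordinary SQL.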


Converting Databricks Notebooks to ipynb

Dave Wentzel shows how we can convert a Databricks notebook (in DBC format) to a normal Jupyter notebook (in ipynb format):

Databricks natively stores its notebook files by default as DBC files, a closed, binary format. A .dbc file has a nice benefit of being self-contained. One dbc file can consist of an entire folder of notebooks and supporting files. But other than that, dbc files are frankly obnoxious.

Read on to see how to convert between these two formats.


JupyterLab Integration for Databricks

Bernhard Walter announces an integration between JupyterLab and Databricks:

This blog post starts with a quick overview of what using a remote Databricks cluster from your local JupyterLab looks like. It then provides an end-to-end example of working with JupyterLab Integration, followed by an explanation of how it differs from Databricks Connect. If you want to try it yourself, the last section explains the installation.

I like this a lot, as it fights back a bit against the balkanization of data science: it means I don’t need to keep one set of notebooks here and another set of notebooks there and a third set of notebooks somewhere else.


Databricks + Azure Synapse Analytics

David Meyer and Clinton Ford explain how you can integrate Azure Databricks with Azure Synapse Analytics:

In the last two years since it first became available, thousands of companies have adopted Azure Databricks, making it one of the fastest growing data and AI services on Microsoft Azure. Customers now process over 2 exabytes per month with millions of server-hours spinning up every day. All of this is driven by organizations like Electrolux, Shell, and renewables.AI that are using Azure Databricks to process data at massive scale for data science and analytics.

Within this amazing adoption is a specific solution architecture to highlight called the Modern Data Warehouse (MDW). Earlier this year we wrote about the performance and scale benefits of this solution, and part of the pattern’s success has been our close integration to Azure SQL Data Warehouse with a high-performance connector that was jointly engineered to make it fast and easy to move data between the two services.

Something interesting about Synapse is that its implementation of Spark is not the same as the Databricks implementation (perhaps for licensing reasons). But that doesn’t stop us from using Databricks to process and curate data for Synapse Analytics.


Azure Synapse Analytics, Nee Azure SQL DW

John Macintire explains Azure Synapse Analytics:

A cloud native, distributed SQL processing engine is at the foundation of Azure Synapse and is what enables the service to support the most demanding enterprise data warehousing workloads. This week at Ignite we introduced a number of exciting features to make data warehousing with Azure Synapse easier and allow organizations to use SQL for a broader set of analytics use cases.

There’s a fair amount of marketing-speak in here, but the gist is Azure SQL Data Warehouse + Spark + on-demand serverless queries (so you can, among other things, write T-SQL against your HDFS data). I think it has a better chance of long-lasting success than Azure SQL Data Warehouse.


Joining RDDs in Spark

Brad Llewellyn takes us through more Spark RDD and DataFrame exercises, including joins:

We can make use of the built-in .join() function for RDDs.  Similar to the .aggregateByKey() function we saw in the previous post, the .join() function for RDDs requires a 2-element tuple, with the first element being the key and the second element being the value.  So, we need to use the .map() function to restructure our RDDs to store the keys in the first element and the original array/tuple in the second element.  After the join, we end up with an awkward nested structure of arrays and tuples that we need to restructure using another .map() function, leading to a lengthy code snippet.

This is a place where DataFrames make so much more sense.
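
To show the shapes involved, here is a plain-Python stand-in for the pattern described above. It is not Spark code: `key_by` and `join` are hypothetical helpers that only mimic what `.map()` and an inner `.join()` produce on pair RDDs, and the sample rows are made up.

```python
from collections import defaultdict


def key_by(rows, key_index):
    """Mimic rdd.map(lambda r: (r[key_index], r)): pair each row with its key."""
    return [(row[key_index], row) for row in rows]


def join(left, right):
    """Mimic an inner rdd.join(): emit (key, (left_value, right_value)) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


# Made-up sample data: (person_id, name) and (order_id, person_id, amount).
people = [(1, "Ann"), (2, "Bob")]
orders = [(101, 1, 25.0), (102, 1, 10.0), (103, 3, 7.5)]

# Step 1: restructure both sides into (key, original_row) pairs, then join.
joined = join(key_by(people, 0), key_by(orders, 1))
# Each element is nested, e.g. (1, ((1, "Ann"), (101, 1, 25.0))).

# Step 2: a second map-style pass flattens the nesting back into simple rows.
flat = [(p[0], p[1], o[0], o[2]) for _, (p, o) in joined]
print(flat)
```

With DataFrames, the equivalent is a single join on a named column, with no keying or flattening passes.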


Azure AD Credential Passthrough and Databricks

Anna Shrestinian, et al, explain how Azure Databricks enables Azure Active Directory credential passthrough when working with Azure Data Lake Storage Gen2:

Azure Data Lake Storage (ADLS) Gen2, which became generally available earlier this year, is quickly becoming the standard for data storage in Azure for analytics consumption. ADLS Gen2 enables a hierarchical file system that extends Azure Blob Storage capabilities and provides enhanced manageability, security and performance.

The hierarchical file system provides granular access control to ADLS Gen2. Role-based access control (RBAC) could be used to grant role assignments to top-level resources and POSIX-compliant access control lists (ACLs) allow for finer-grained permissions at the folder and file level. These features allow users to securely access their data within Azure Databricks using the Azure Blob File System (ABFS) driver, which is built into the Databricks Runtime.

There are some tradeoffs involved, particularly around using High Concurrency clusters (or limiting yourself to one user account), but it’s a nice bit of added value when you’re a heavy Azure user.


A New Notebook Tool: Polynote

Jeremy Smith, et al, announce a new product:

We are pleased to announce the open-source launch of Polynote: a new, polyglot notebook with first-class Scala support, Apache Spark integration, multi-language interoperability including Scala, Python, and SQL, as-you-type autocomplete, and more.

Polynote provides data scientists and machine learning researchers with a notebook environment that allows them the freedom to seamlessly integrate our JVM-based ML platform — which makes heavy use of Scala — with the Python ecosystem’s popular machine learning and visualization libraries. It has seen substantial adoption among Netflix’s personalization and recommendation teams, and it is now being integrated with the rest of our research platform.

There are some nice pieces to it, especially around language interop.


Spark Transformations and Actions

Divyansh Jain differentiates the key sets of functions in Spark:

One point to note here: when you apply a transformation to an RDD, Spark does not perform the operation immediately. Instead, it creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD, and the function used for the transformation. It keeps building this graph, using those references, until you apply an action to the last RDD in the chain. That is why transformations in Spark are lazy.

Read on for more details.
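
The lazy behavior is easy to model outside of Spark. The sketch below is a toy stand-in for an RDD, not Spark's implementation: `LazyPipeline` is a hypothetical class that records transformations as a list of steps (a simple linear chain rather than a full DAG) and only runs them when an action is called.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations are recorded, actions execute them."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        # Transformation: return a new pipeline with one more recorded step.
        return LazyPipeline(self.source, self.steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: nothing is evaluated here either.
        return LazyPipeline(self.source, self.steps + (("filter", predicate),))

    def collect(self):
        # Action: replay every recorded step against the source data.
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def double(x):
    calls.append(x)  # Track when the function actually runs.
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # Still 0: no transformation has run yet.
result = pipeline.collect()       # The action triggers the whole chain.
print(calls_before_action, result)
```

In Spark itself, the same split applies: calls such as `map` and `filter` only extend the plan, while actions such as `collect` and `count` trigger execution.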
