Press "Enter" to skip to content

Category: Spark

Troubleshooting Spark Performance

Bikas Saha and Mridul Murlidharan explain some of the basics of performance tuning with Apache Spark:

Our objective was to build a system that would provide an intuitive insight into Spark jobs that not just provides visibility but also codifies the best practices and deep experience we have gained after years of debugging and optimizing Spark jobs. The main design objectives were to be
– Intuitive and easy – Big data practitioners should be able to navigate and ramp quickly
– Concise and focused – Hide the complexity and scale but present all necessary information in a way that does not overwhelm the end user
– Batteries included – Provide actionable recommendations for a self service experience, especially for users who are less familiar with Spark
– Extensible – To enable additions of deep dives for the most common and difficult scenarios as we come across them

The tool looks pretty interesting and I’m hoping it will be part of the open source suite at Cloudera.

Comments closed

Spark Memory Management on EMR

Karunanithi Shanmugam gives us some tips on memory management for Spark in Amazon’s ElasticMapReduce:

Amazon EMR provides high-level information on how it sets the default values for Spark parameters in the release guide. These values are automatically set in the spark-defaults settings based on the core and task instance types in the cluster.

To use all the resources available in a cluster, set the maximizeResourceAllocation parameter to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets these parameters in the spark-defaults settings. Even with this setting, generally the default numbers are low and the application doesn’t use the full strength of the cluster. For example, the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster.

Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads. Using Amazon EMR release version 4.4.0 and later, dynamic allocation is enabled by default (as described in the Spark documentation).

There’s a lot in here, much of which applies to Spark in general and not just EMR.

Comments closed

CSV Data Ingestion with Spark

Jean Georges Perrin shows how you can easily load CSV data with Spark:

Fortunately for you, Apache Spark offers a variety of options for ingesting those CSV files. Ingesting CSV is easy and schema inference is a powerful feature.
Let’s have a look at more advanced examples with more options that illustrate the complexity of CSV files in the outside world. You’ll first look at the file you’ll ingest, and understand its specifications. You’ll then have a look at the result and finally build the mini-application to achieve the result. This pattern repeats for each format.

It’s good to see some of the lesser-used features pop up like date format and multi-line support (which I hadn’t even known about).

Comments closed

Benefits of Partitioning in Spark

The Hadoop in Real World team take a look at how appropriate partitioning can make your Spark jobs much faster:

Shuffle is an expensive operation whether you do it with plain old MapReduce programs or with Spark. Shuffle is he process of bringing Key Value pairs from different mappers (or tasks in Spark) by Key in to a single reducer (task in Spark). So all key value pairs of the same key will end up in one task (node). So we can loop through the key value pairs and do the needed aggregation.

Since production jobs usually involve a lot of tasks in Spark, the key value pairs movement between nodes during shuffle (from one task to another) will cause a significant bottleneck. In some cases Shuffle is not avoidable but in many instances you could avoid shuffle by structuring your data little differently. Avoiding shuffle will have an positive impact on performance.

Read the whole thing. Getting partitions right is critical to writing scalable Spark jobs.

Comments closed

Memory Management in Spark

Rishitesh Mishra has started a new series on slow or failing Spark applications and starts with the big reason:

If we were to got all Spark developers to vote, out of memory (OOM) conditions would surely be the number one problem everyone has faced. This comes as no big surprise as Spark’s architecture is memory-centric. Some of the most common causes of OOM are:
* Incorrect usage of Spark
* High concurrency
* Inefficient queries
* Incorrect configuration

Definitely worth the read.

Comments closed

Getting Started with Azure Databricks

Brad Llewellyn has a tutorial for Azure Databricks:

Databricks is a managed Spark framework, similar to what we saw with HDInsight in the previous post.  The major difference between the two technologies is that HDInsight is more of a managed provisioning service for Hadoop, while Databricks is more like a managed Spark platform.  In other words, HDInsight is a good choice if we need the ability to manage the cluster ourselves, but don’t want to deal with provisioning, while Databricks is a good choice when we simply want to have a Spark environment for running our code with little need for maintenance or management.

Azure Databricks is not a Microsoft product.  It is owned and managed by the company Databricks and available in Azure and AWS.  However, Databricks is a “first party offering” in Azure.  This means that Microsoft offers the same level of support, functionality and integration as it would with any of its own products.  You can read more about Azure Databricks herehereand here.

Click through for a demonstration of the product.

Comments closed

Developing Big Data Cluster Spark Jobs with IntelliJ

Jenny Jiang shows how we can use IntelliJ IDEA to develop Spark jobs against SQL Server Big Data Clusters:

We’re delighted to release the Azure Toolkit for IntelliJ support for SQL Server Big Data Cluster Spark job development and submission. For first-time Spark developers, it can often be hard to get started and build their first application, with long and tedious development cycles in the integrated development environment (IDE). This toolkit empowers new users to get started with Spark in just a few minutes. Experienced Spark developers also find it faster and easier to iterate their development cycle.

The toolkit extends IntelliJ support for the Spark job life cycle starting from creation, authoring, and debugging, through submission of jobs to SQL Server Big Data Clusters. It enables you to enjoy a native Scala and Java Spark application development experience and quickly start a project using built-in templates and sample code. The integration with SQL Server Big Data Cluster empowers you to quickly submit a job to the big data cluster as well as monitor its progress. The Spark console allows you to check schemas, preview data, and validate your code logic in a shell-like environment while you can develop Spark batch jobs within the same toolkit.

It looks pretty good from my vantage point.

Comments closed

The SparkSession Object

Achilleus explains SparkSession and why it’s useful even if you know about SparkContext:

Spark session is a unified entry point of a spark application from Spark 2.0. It provides a way to interact with various spark’s functionality with a lesser number of constructs. Instead of having a spark context, hive context, SQL context, now all of it is encapsulated in a Spark session.

Read on to learn more about SparkSession and how you can use it.

Comments closed

Deploying Azure Databricks in a Custom VNET

Abhinav Garg and Anna Shrestinian explain how you can use VNET injection with Azure Databricks:

To make the above possible, we provide a Bring Your Own VNET (also called VNET Injection) feature, which allows customers to deploy the Azure Databricks clusters (data plane) in their own-managed VNETs. Such workspaces could be deployed using Azure Portal, or in an automated fashion using ARM Templates, which could be run using Azure CLI, Azure Powershell, Azure Python SDK, etc.

With this capability, the Databricks workspace NSG is also managed by the customer. We manage a set of inbound and outbound NSG rules using a Network Intent Policy, as those are required for secure, bidirectional communication with the control/management plane. 

This is a good article if the defaults won’t get past corporate security.

Comments closed

Bring .NET Support to Spark

I have a request that you vote up a Spark issue:

There is a Jira ticket for the Apache Spark project, SPARK-27006. The gist of this ticket is to bring .NET support to Spark, specifically by supporting DataFrames in C# (and hopefully F#). No support for Datasets or RDDs is included in here, but giving .NET developers DataFrame access would make it easy for us to write code which interacts with Spark SQL and a good chunk of the SparkSession object.

You an click through and read everything I have to say, but do go to the Spark ticket and vote for .NET support.

Comments closed