
Category: Spark

Installing Apache Mesos On EC2

Anubhav Tarar has a guide for setting up Apache Mesos along with Spark and Hadoop on EC2:

Apache Mesos is an open source project for managing computer clusters, originally developed at the University of California. It sits between the application layer and the operating system so that applications run efficiently in a large-scale distributed environment.

In this blog, we will see how to set up a Mesos client and master on EC2 from scratch.

Read on for the step-by-step guide.

Comments closed

PySpark DataFrame Transformations

Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:

We wanted to look at some more Data Frames, with a bigger data set, and more precisely at some transformation techniques. We often say that most of the leg work in machine learning is data cleansing. Similarly, we can affirm that a clever and insightful aggregation query on a large dataset can only be executed after a considerable amount of work has gone into formatting, filtering, and massaging the data: data wrangling.

Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.

It is important to note that almost nothing in this article is specific to Azure Databricks; it would work with any distribution of Apache Spark.

The notebook used for this article is persisted on GitHub.

Read on for explanation, or check out the notebook to work on it at your own pace.


Installing Spark On Windows

Nigel Meakins is starting a new series on Spark and his first post involves installing Spark on Windows:

WinUtils provides a number of HDFS-emulating utilities that allow us to run Spark as though it were talking to an HDFS storage system (at least to a certain degree). Without this you will get all manner of file system-related issues with Spark and won’t get off the launchpad.

Within the WinUtils archive you may have a number of Hortonworks Data Platform versioned folders. For the version of Spark I’m using, 2.2.1, I have chosen hadoop-2.7.1\bin for my files. Unzip and copy the contents of the bin directory to a directory of your choice. It must, however, be called ‘bin’ in order to be located by the calling programs. I actually placed mine in the C:\Spark\bin directory together with the other executables that Spark uses, but this is not essential.

Once done, you will need to set the following environment variable:

HADOOP_HOME = <your winutils ‘bin’ parent directory>

Note we don’t include the \bin, so for my example this is C:\Spark.

I have a post on installing Spark on Windows that might help if you get stuck on the WinUtils part.
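
The layout described above can be checked with a short script. This is a minimal sketch in Python, assuming the executable is named winutils.exe and sits in the ‘bin’ folder under HADOOP_HOME; the function name check_hadoop_home is made up for illustration.

```python
import os
from pathlib import Path
from typing import Optional


def check_hadoop_home(hadoop_home: Optional[str] = None) -> bool:
    """Return True if <hadoop_home>/bin/winutils.exe exists.

    Falls back to the HADOOP_HOME environment variable when no path is given.
    """
    root = hadoop_home or os.environ.get("HADOOP_HOME")
    if not root:
        return False  # variable is not set at all
    # HADOOP_HOME points at the parent of 'bin', not at 'bin' itself.
    return (Path(root) / "bin" / "winutils.exe").is_file()


if __name__ == "__main__":
    print("HADOOP_HOME looks usable:", check_hadoop_home())
```

Passing the ‘bin’ directory itself (for example C:\Spark\bin instead of C:\Spark) makes the check fail, which mirrors the mistake the excerpt warns about.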


Functions In Spark

Fisseha Berhane continues his RDDs vs DataFrames vs SparkSQL series, this time looking at functions:

Let’s use Spark SQL and DataFrame APIs to retrieve companies ranked by sales totals from the SalesOrderHeader and SalesLTCustomer tables. We will display the first 10 rows from the solution using each method to compare our answers and make sure we are doing it right.

All three approaches give the same results, though the SQL approach seems to me to be the easiest.


Leveraging Hive In Pyspark

Fisseha Berhane shows how to use Spark to connect Python to Hive:

If we are using earlier Spark versions, we have to use HiveContext, which is a variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
In this tutorial, I am using standalone Spark. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory.

As shown below, initially, we do not have metastore_db, but after we instantiate SparkSession with Hive support, we see that metastore_db has been created. Further, when we execute a create database command, spark-warehouse is created.

Click through for a bunch of examples.


Unit Testing Spark Streaming DStreams

Anuj Saxena shows how to create unit tests for DStreams in Spark Streaming:

The method ‘testOperation’ takes the output of the operation performed on the ‘inputPair’ and checks whether it is equal to the ‘outputPair’. Just like this, we can test our business logic.

This short snippet lets you test your business logic without forcing you to create even a Spark session. You can mock the whole streaming environment and test your business logic easily.

This was a simple example of unary operations on DStreams. Similarly, we can test binary operations and window operations on DStreams.
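
The idea in the excerpt, checking an operation's output against the expected output without starting Spark, can be sketched in plain Python. This is an analogue only, not the Scala test helper from the post: check_operation, word_count_batch, and the sample batches are hypothetical, and each "batch" is just a list standing in for one micro-batch of a DStream.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Batch = List[str]
Counts = List[Tuple[str, int]]


def word_count_batch(lines: Batch) -> Counts:
    """Business logic under test: count words in one batch, sorted by word."""
    counts = Counter(word for line in lines for word in line.split())
    return sorted(counts.items())


def check_operation(
    input_batches: Sequence[Batch],
    operation: Callable[[Batch], Counts],
    expected_batches: Sequence[Counts],
) -> bool:
    """Apply the operation to every input batch and compare with the expected output."""
    actual = [operation(batch) for batch in input_batches]
    return actual == list(expected_batches)


# Two hypothetical micro-batches and the output we expect for each.
input_batches = [["a b a"], ["b c", "c"]]
expected = [[("a", 2), ("b", 1)], [("b", 1), ("c", 2)]]

print(check_operation(input_batches, word_count_batch, expected))
```

Keeping the per-batch logic in an ordinary function is what makes this kind of test cheap: the function can be exercised directly, and the streaming framework only has to call it.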

Click through for an example with code.


Set Operations In Spark

Fisseha Berhane compares SparkSQL, DataFrames, and classic RDDs when performing certain set-based operations:

In this fourth part, we will see set operators in Spark the RDD way, the DataFrame way and the SparkSQL way.
Also, check out my other recent blog posts on Spark: “Analyzing the Bible and the Quran using Spark” and “Spark DataFrames: Exploring Chicago Crimes.”

The data and the notebooks can be downloaded from my GitHub repository.
The three types of set operators in the RDD, DataFrame, and SQL approaches are shown below.

This is where SparkSQL (and SQL in general) shines, although the DataFrame approach is also compact.
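
To make the SQL flavor concrete, here is a small sketch of the UNION, INTERSECT, and EXCEPT keywords, assuming those are the three kinds of set operators the post covers. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a Spark installation; Spark SQL accepts the same three keywords, but the tables a and b and the helper ids are invented for this example.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (3), (4);
    """
)


def ids(query: str) -> list:
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(query))


# Rows that appear in either table (duplicates removed).
union_ids = ids("SELECT id FROM a UNION SELECT id FROM b")
# Rows that appear in both tables.
intersect_ids = ids("SELECT id FROM a INTERSECT SELECT id FROM b")
# Rows that appear in a but not in b.
except_ids = ids("SELECT id FROM a EXCEPT SELECT id FROM b")

print(union_ids, intersect_ids, except_ids)
```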


Setting Up SparklyR In Azure

David Smith shows how you can spin up a Spark cluster in Azure and install SparklyR on top of it:

The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark’s distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.

If you don’t happen to have a cluster of Spark-enabled machines set up in a nearby well-ventilated closet, you can easily set one up in your favorite cloud service. For Azure, one option is to launch a Spark cluster in HDInsight, which also includes the extensions of Microsoft ML Server. While this service recently had a significant price reduction, it’s still more expensive than running a “vanilla” Spark-and-R cluster. If you’d like to take the vanilla route, a new guide details how to set up a Spark cluster on Azure for use with SparklyR.

Read on for more details.


How Meltdown And Spectre Have Affected Spark Performance

Chris Stevens, et al., show how Databricks customers have fared in a post-Meltdown+Spectre world:

On AWS, we have observed a small performance degradation of up to 5% since January 4th. On i3-series instance types, where we cache data on the local NVMe SSDs (Databricks Cache), we have observed a degradation of up to 5%. On r3-series instance types, in which the benchmark jobs read data exclusively from remote storage (S3), we have observed a smaller increase of up to 3%. The greater percentage slowdown for the i3 instance type is explained by the larger number of syscalls performed when reading from the local SSD cache.

The chart below shows before and after January 3rd in AWS for an r3-series (memory optimized) and an i3-series (storage optimized) based cluster. Both tests are fixed to the same runtime version and cluster size. The data represents the average of the full benchmark’s runtime per day, for a total of 7 days prior to January 3 (before is in blue) and 7 days after January 3 (after is in red). We exclude January 3rd to prevent partial results. As mentioned, the i3-series has the Databricks Cache enabled on the local SSDs, resulting in roughly half of the total execution time (faster) compared to the r3-series results.
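
The before-and-after comparison in the excerpt comes down to simple arithmetic on two 7-day windows. The sketch below shows one way to compute such a percentage; the function percent_slowdown and the daily runtimes are hypothetical, not figures from the Databricks benchmark.

```python
from statistics import mean
from typing import Sequence


def percent_slowdown(before: Sequence[float], after: Sequence[float]) -> float:
    """Percent change in mean runtime from the 'before' window to the 'after' window."""
    baseline = mean(before)
    return (mean(after) - baseline) / baseline * 100.0


# Hypothetical daily benchmark runtimes in minutes, 7 days on each side.
before_days = [100.0, 101.0, 99.0, 100.0, 100.0, 102.0, 98.0]
after_days = [104.0, 105.0, 103.0, 104.0, 104.0, 106.0, 102.0]

print(f"{percent_slowdown(before_days, after_days):.1f}% slower")
```

With these made-up numbers the mean moves from 100 to 104 minutes, which is a 4% slowdown.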

Overall, they’re seeing a degradation of 2-5%. Click through for some more information on how they collected their metrics.


Spark And NVMe

Alicja Luszczak, et al., introduce NVMe caching in the Databricks distribution of Spark:

A particularly important and widespread use case is caching the results of scan operations. This allows the users to eliminate the low throughput associated with reading remote data. For this reason, many users who intend to run the same or similar workload repeatedly decide to invest extra development time into manually optimizing their application, by instructing Spark exactly what files to cache and when to do it, hence the name “explicit caching.”

For all its utility, Spark cache also has a number of shortcomings. First, when the data is cached in the main memory, it takes up space that could be better used for other purposes during query execution, for example, for shuffles or hash tables. Second, when the data is cached on the disk, it has to be deserialized when read — a process that is too slow to adequately utilize the high read bandwidths commonly offered by the NVMe SSDs. As a result, occasionally Spark applications actually find their performance regressing when turning on Spark caching.

Third, having to plan ahead and explicitly declare which data should be cached is challenging for the users who want to interactively explore the data or build reports. While Spark cache gives data engineers all the knobs to tune, data scientists often find it difficult to reason about the cache, especially in a multi-tenant setting, where engineers still require the results to be returned as quickly as possible in order to keep the iteration time short.

Read on for more details, as well as performance comparisons.
