In many Spark applications, performance benefit is obtained from caching the data if reused several times in the applications instead of reading them each time from persistent storage. However, there can be situations when the entire data cannot be cached in the cluster due to resource constraint in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.
In the notebook we declare a
Student case classwith
yearas members. The application is required to find out the number of students by
Partitioning is an interesting idea for trying to speed up Spark performance by keeping everything in memory even when your entire data set is a bit too large.
So how do we do this? Well, the first thing to do is get the number of swearwords per minute. I mentioned that for the original article someone just counted every swearwords, in our case, we’re just going to parse a subtitle file, and extract the swear words from that.
Without going into too much detail, you can find the code I’ve experimtend with in this gist (it’s very ugly code, since I just hacked something together that worked).
Jos includes counts for four movies. This link does contain a few bad words, but if you get past that, it’s a good pattern for analyzing word counts in general.
The HDInsight Tool for Eclipse extends Eclipse to allow you to create and develop HDInsight Spark applications and easily submit Spark jobs to Microsoft Azure HDInsight Spark clusters using the Eclipse development environment. It integrates seamlessly with Azure, enabling you to easily navigate HDInsight Spark clusters and to view associated Azure storage accounts. To further boost productivity, the HDInsight tool for Eclipse also offers the capability to view Spark job history and display detailed job logs.
Check out the link for videos and additional resources.
We are proud to introduce the Getting Started with Apache Spark on Databricks Guide. This step-by-step guide illustrates how to leverage the Databricks’ platform to work with Apache Spark. Our just-in-time data platform simplifies common challenges when working with Spark: data integration, real-time experimentation, and robust deployment of production applications.
Databricks provides a simple, just-in-time data platform designed for data analysts, data scientists, and engineers. Using Databricks, this step-by-step guide helps you solve real-world Data Sciences and Data Engineering scenarios with Apache Spark. It will help you familiarize yourself with the Spark UI, learn how to create Spark jobs, load data and work with Datasets, get familiar with Spark’s DataFrames and Datasets API, run machine learning algorithms, and understand the basic concepts behind Spark Streaming.
If you are at all interested in distributed databases, Spark is a must-learn.
There are a couple of notes with these clusters:
These are not powerful clusters. Don’t expect to crunch huge data sets with them. Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.
The cluster will automatically terminate after one hour without activity. The paid version does not have this limitation.
You interact with the cluster using notebooks rather than opening a command prompt. In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.
Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free. Read the whole thing.
At the core of Apache Spark is the notion of data abstraction as distributed collection of objects. This data abstraction, called Resilient Distributed Dataset (RDD), allows you to write programs that transform these distributed datasets.
RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure.
Some of these concepts are new to Spark 2.0, but all are worth learning.
To use this post to play around with streaming data, you need an AWS account and AWS CLI configured on your machine. The entire pattern can be implemented in few simple steps:
Create an Amazon Kinesis stream.
Spin up an EMR cluster with Hadoop, Spark, and Zeppelin applications from advanced options.
Use a Simple Java producer to push random IoT events data into the Amazon Kinesis stream.
Connect to the Zeppelin notebook.
Import the Zeppelin notebook from GitHub.
Analyze and visualize the streaming data.
This is a good way of getting started with streaming data. I’ve grown quite fond of notebooks in the short time that I’ve used them, as they make it very easy for people who know what they’re doing to provide code and information to people who want to know what they’re doing.
Once you have identified and broken down the Spark and associated infrastructure and application components you want to monitor, you need to understand the metrics that you should really care about that affects the performance of your application as well as your infrastructure. Let’s dig deeper into some of the things you should care about monitoring.
In Spark, it is well known that Memory related issues are typical if you haven’t paid attention to the memory usage when building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically, the executors and the driver. Garbage collection stalls or abnormality in patterns can increase back pressure.
There are a few metrics of note here. Check it out.
Spark provides metrics for each of the above components through different endpoints. For example, if you want to look at the Spark driver details, you need to know the exact URL, which keeps changing over time–Spark keeps you guessing on the URL. The typical problem is when you start your driver in cluster mode. How do you detect on which worker node the driver was started? Once there, how do you identify the port on which the Spark driver exposes its UI? This seems to be a common annoying issue for most developers and DevOps professionals who are managing Spark clusters. In fact, most end up running their driver in client mode as a workaround, so they have a fixed URL endpoint to look at. However, this is being done at the cost of losing failover protection for the driver. Your monitoring solution should be automatically able to figure out where the driver for your application is running, find out the port for the application and automatically configure itself to start collecting metrics.
For a dynamic infrastructure like Spark, your cluster can get resized on the fly. You must ensure your newly spawned components (Workers, executors) are automatically configured for monitoring. There is no room for manual intervention here. You shouldn’t miss out monitoring newer processes that show up on the cluster. On the other hand, you shouldn’t be generating false alerts when executors get moved around. A general monitoring solution will typically start alerting you if an executor gets killed and starts up on a new worker–this is because generic monitoring solutions just monitor your port to check if it’s up or down. With a real time streaming system like Spark, the core idea is that things can move around all the time.
Spark does add a bit of complexity to monitoring, but there are solutions in place. Read the whole thing.
meinestadt.de web servers generate up to 20 million user sessions per day, which can easily result in up to several thousand HTTP GET requests per second during peak times (and expected to scale to much higher volumes in the future). Although there is a permanent fraction of bad requests, at times the number of bad requests jumps.
The meinestadt.de approach is to use a Spark Streaming application to feed an Impala table every n minutes with the current counts of HTTP status codes within the n minutes window. Analysts and engineers query the table via standard BI tools to detect bad requests.
What follows is a fairly detailed architectural walkthrough as well as configuration and implementation work. It’s a fairly long read, but if you’re interested in delving into Hadoop, it’s a good place to start.