Press "Enter" to skip to content

Category: Spark

PySpark With MapR

Justin Brandenburg has a tutorial on combining Python and Spark on the MapR platform:

Looking at the first 5 records of the RDD

kddcup_data.take(5)
This output is difficult to read. This is because we are asking PySpark to show us data that is in the RDD format. PySpark has a DataFrame functionality. If the Python version is 2.7 or higher, you can utilize the pandas package. However, pandas doesn’t work on Python versions 2.6, so we use the Spark SQL functionality to create DataFrames for exploration.

The full example is a fairly simple k-means clustering process, which is a great introduction to PySpark.

Comments closed

SparkSession

Jules Damji shows off SparkSession:

Beyond a time-bounded interaction, SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with DataFrame and Dataset APIs. Most importantly, it curbs the number of concepts and constructs a developer has to juggle while interacting with Spark.

In this blog and its accompanying Databricks notebook, we will explore SparkSession functionality in Spark 2.0.

This looks to be an easier method for integrating various parts of Spark in one user session.  Read the whole thing.

Comments closed

Connecting Spark and Riak

Pavel Hardak discusses the Riak Connector for Apache Spark:

Modeled using principles from the “AWS Dynamo” paper, Riak KV buckets are good for scenarios which require frequent, small data-sized operations in near real-time, especially workloads with reads, writes, and updates — something which might cause data corruption in some distributed databases or bring them to “crawl” under bigger workloads. In Riak, each data item is replicated on several nodes, which allows the database to process a huge number of operations with very low latency while having unique anti-corruption and conflict-resolution mechanisms. However, integration with Apache Spark requires a very different mode of operation — extracting large amounts of data in bulk, so that Spark can do its “magic” in memory over the whole data set. One approach to solve this challenge is to create a myriad of Spark workers, each asking for several data items. This approach works well with Riak, but it creates unacceptable overhead on the Spark side.

This is interesting in that it ties together two data platforms whose strengths are almost the opposite:  one is great for fast, small writes of single records and the other is great for operating on large batches of data.

Comments closed

Unit Testing Of Spark Streaming

Felipe Fernandez shows how to unit test Spark Streaming:

Controlling the lifecycle of Spark can be cumbersome and tedious. Fortunately, Spark Testing Baseproject offers us Scala Traits that handle those low-level details for us. Streaming has an extra bit of complexity as we need to produce data for ingestion in a timely way. At the same time, Spark internal clock needs to tick in a controlled way if we want to test timed operations as sliding windows.

This is part one of a series.  I’m interesting in seeing where this goes.

Comments closed

Securing Spark Shuffle

Cheng Xu uses Apache Commons Crypto to secure data when Spark shuffles off to disk:

The basic steps can be described as follows:

  1. When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.

  2. When shuffle happens, the shuffle writer will first compress the plaintext if compression is enabled. Spark will use the randomly generated Initial Vector (IV) and keys obtained from the credentials to encrypt the plaintext by using CryptoOutputStream from Crypto.

  3. CryptoOutputStream will encrypt the shuffle data and write it to the disk as it arrives. The first 16 bytes of the encrypted output file are preserved to store the initial vector.

  4. For the read path, the first 16 bytes are used to initialize the IV, which is provided to CryptoInputStreamalong with the user’s credentials. The decrypted data is then provided to Spark’s shuffle mechanism for further processing.

Once you have things optimized, the performance hit is surprisingly small.

Comments closed

Microsoft R Server On Spark

Max Kaznady, et al, discuss using Microsoft R Server on Spark to perform rapid prototyping against the NYC Taxi dataset:

Once the cluster is created, you can connect to the edge node where MRS is already pre-installed by SSHing to r-server.YOURCLUSTERNAME-ssh.azurehdinsight.net with the credentials which you supplied during the cluster creation process. In order to do this in MobaXterm, you can go to Sessions, then New Sessions and then SSH.

The default installation of HDI Spark on Linux cluster does not come with RStudio Server installed on the edge node. RStudio Server is a popular open source integrated development environment (IDE) available for R that provides a browser-based IDE for use by remote clients. This tool allows you to benefit from all the power of R, Spark and Microsoft HDInsight cluster through your browser. In order to install RStudio you can follow the steps detailed in the guide, which reduces to running a script on the edge node.

If you’ve been meaning to get further into Spark & R, this is a great article to follow along with on your own.

Comments closed

Developing Spark Applications In .NET

Kaarthik Sivashanmugam talks about Mobius, a Microsoft-driven .NET wrapper for Spark:

The C# language binding to Spark is similar to the Python and R bindings. In fact, Mobius follows the same design pattern and leverages the existing implementation of language binding components in Spark where applicable for consistency and reuse. The following picture shows the dependency between the .NET application and the C# API in Mobius, which internally depends on Spark’s public API in Scala and Java and extends PythonRDD from PySpark to implement CSharpRDD.

Looks like there’s some fuzziness on just how well F# is supported.  Still, this is very exciting as a way of bridging the gap for .NET developers.

Comments closed

Structured Streaming

Matei Zaharia, et al discuss how to use structured streaming in Apache Spark 2.0:

In Structured Streaming, we tackle the issue of semantics head-on by making a strong guarantee about the system: at any time, the output of the application is equivalent to executing a batch job on a prefix of the data. For example, in our monitoring application, the result table in MySQL will always be equivalent to taking a prefix of each phone’s update stream (whatever data made it to the system so far) and running the SQL query we showed above. There will never be “open” events counted faster than “close” events, duplicate updates on failure, etc. Structured Streaming automatically handles consistency and reliability both within the engine and in interactions with external systems (e.g. updating MySQL transactionally).

If you want to learn more about streaming data using Spark, check this out.

Comments closed

Don’t Use Cron For Scheduling Hadoop Jobs

Matthew Rathbone explains why cron is not a great choice for scheduling Hadoop and Spark jobs:

Reason 3: Poor transparency for teammates

Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.

Matthew then gives us four alternative products.

Comments closed

Structured Streaming

Andrew Ray explains streaming solutions using Spark 2.0:

If you are familiar with traditional Spark streaming you may notice that the above example is lacking an explicit batch duration. In structured streaming the equivalent feature is a trigger. By default it will run batches as quickly as possible, starting the next batch as soon as more data is available and the previous batch is complete. You can also set a more traditional fixed batch interval for your trigger. In the future more flexible trigger options will be added.

A related consequence is that windows are no longer forced to be a multiple of the batch duration. Furthermore, windows needn’t be only on processing time anymore, we can rearrange events that may have been delayed or arrived out of order and window by event time. Suppose our input stream had a column event_time that we wanted to do windowed counts on. Then we could do something like the following to get counts of events in a 1 minute window:

Right now, there are some pretty strict limitations on this new streaming, but I imagine they’ll loosen up quite soon.

Comments closed