Press "Enter" to skip to content

Category: Spark

Tuning Kafka And Spark Data Pipelines

Larry Murdock explains the tuning options available to Kafka and Spark Streams:

Kafka is not the Ferrari of messaging middleware, rather it is the salt flats rocket car. It is fast, but don’t expect to find an AUX jack for your iPhone. Everything is stripped down for speed.

Compared to other messaging middleware, the core is simpler and handles fewer features. It is a transaction log and its job is to take the message you sent asynchronously and write it to disk as soon as possible, returning an acknowledgement once it is committed via an optional callback. You can force a degree of synchronicity by chaining a get to the send call, but that is kind of cheating Kafka’s intention. It does not send it on to a receiver. It only does pub-sub. It does not handle back pressure for you.

I like this as a high-level overview of the different options available.  Definitely gets a More Research Is Required tag, but this post helps you figure out where to go next.

Comments closed

replyr

John Mount shows off replyr, which is dplyr for remote, distributed data sets (think SparkR or sparklyr):

Suppose we had a large data set hosted on a Spark cluster that we wished to work with using dplyr and sparklyr (for this article we will simulate such using data loaded into Spark from the nycflights13 package).

We will work a trivial example: taking a quick peek at your data. The analyst should always be able to and willing to look at the data.

It is easy to look at the top of the data, or any specific set of rows of the data.

Read on for more details.

Comments closed

Building A Neural Net

Shirin Glander has a great post on using Spark + sparklyr + h2o + rsparkling to build a neural net to study arrhythmia of the heart:

The data I am using to demonstrate the building of neural nets is the arrhythmia dataset from UC Irvine’s machine learning database. It contains 279 features from ECG heart rhythm diagnostics and one output column. I am not going to rename the feature columns because they are too many and the descriptions are too complex. Also, we don’t need to know specifically which features we are looking at for building the models. For a description of each feature, see https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names. The output column defines 16 classes: class 1 samples are from healthy ECGs, the remaining classes belong to different types of arrhythmia, with class 16 being all remaining arrhythmia cases that didn’t fit into distinct classes.

Very interesting post.

Comments closed

Broadcast Nested Loop Joins In Spark

Reynold Xin, et al, debug an interesting test case:

While we were pretty happy with the improvement, we noticed that one of the test cases in Databricks started failing. To simulate a hanging query, the test case performed a cross join to produce 1 trillion rows.

spark.range(1000 * 1000).crossJoin(spark.range(1000 * 1000)).count()

On a single node, we expected this query would run infinitely or “hang.” To our surprise, we started seeing this test case failing nondeterministically because sometimes it completed on our Jenkins infrastructure in less than one second, the time limit we put on this query.

You’re not going to get this performance against a real data set, but it was interesting reading their troubleshooting notes.

Comments closed

Analyzing Flight Data With Sparklyr

Aki Ariga continues his sparklyr series with some analysis of US flight data:

In this post, we will show you a visualization and build a predictive model of US flights with sparklyr. Flight visualization code is based on this article.

This post assumes you already have the following tables:

You should make these tables available through Apache Hive or Apache Impala (incubating) with Hue.

There’s some setup work to get this going, but getting a handle on sparklyr looks to be a good idea if you’re in the analytics space.

Comments closed

Spark Data Frame Checkpoints

Jean Georges Perrin introduces checkpoints on Spark data frames:

Basically, I use a checkpoint if I want to freeze the content of my data frame before I do something else. It can be in the scenario of iterative algorithms (as mentioned in the Javadoc) but also in recursive algorithms or simply branching out a data frame to run different kinds of analytics on both.

Spark has been offering checkpoints on streaming since earlier versions (at least v1.2.0), but checkpoints on data frames are a different beast.

This could also be very useful for a quality control flow:  perform operation A, and if it doesn’t generate good enough results, roll back and try operation B.

Comments closed

Apache Zeppelin 0.7.0

Vinay Shulka announces Apache Zeppelin 0.7.0:

SPARK IMPROVEMENTS

This release also adds support for Spark 2 including version Spark 2.1. Zeppelin now also links to Spark History Server UI from Zeppelin so users can more easily track Spark jobs. The Livy interpreter now supports specifying packages with the job.

SECURITY IMPROVEMENTS

The major security improvement in Zeppelin 0.7.0 is using Apache Knox’s LDAP Realm to connect to LDAP. Zeppelin home page now lists only the nodes to which the user is authorized to access. Zeppelin now also has the ability to support PAM based authentication.

The full list of improvements is available here

This visualization platform is growing up nicely.

Comments closed

Using Sparklyr To Analyze Flight Data

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

Comments closed

Spark UDFs

Curtis Howard explains how to create a user-defined function in Apache Spark:

It’s important to understand the performance implications of Apache Spark’s UDF features.  Python UDFs for example (such as our CTOF function) result in data being serialized between the executor JVM and the Python interpreter running the UDF logic – this significantly reduces performance as compared to UDF implementations in Java or Scala.  Potential solutions to alleviate this serialization bottleneck include:

  1. Accessing a Hive UDF from PySpark as discussed in the previous section.  The Java UDF implementation is accessible directly by the executor JVM.  Note again that this approach only provides access to the UDF from the Apache Spark’s SQL query language.
  2. Making use of the approach also shown to access UDFs implemented in Java or Scala from PySpark, as we demonstrated using the previously defined Scala UDAF example.

In general, UDF logic should be as lean as possible, given that it will be called for each row.  As an example, a step in the UDF logic taking 100 milliseconds to complete will quickly lead to major performance issues when scaling to 1 billion rows.

Definitely worth a read.  UDFs in Spark can come at a performance penalty, so they aren’t free.

Comments closed

Structured Streaming With Spark

Tathagata Das, et al, discuss enterprise-grade streaming of structured data using Spark:

Fortunately, Structured Streaming makes it easy to convert these periodic batch jobs to a real-time data pipeline. Streaming jobs are expressed using the same APIs as batch data. Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.

In the rest of post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned, parquet data warehouse. AWS CloudTrail allows us to track all actions performed in a variety of AWS accounts, by delivering gzipped JSON logs files to a S3 bucket. These files enable a variety of business and mission critical intelligence, such as cost attribution and security monitoring. However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON logs files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourselves.

This introductory post discusses some of the architecture and setup, and they promise additional posts getting into finer details.

Comments closed