Suppose we had a large data set hosted on a
Sparkcluster that we wished to work with using
sparklyr(for this article we will simulate such using data loaded into
We will work a trivial example: taking a quick peek at your data. The analyst should always be able to and willing to look at the data.
It is easy to look at the top of the data, or any specific set of rows of the data.
Read on for more details.
The data I am using to demonstrate the building of neural nets is the arrhythmia dataset from UC Irvine’s machine learning database. It contains 279 features from ECG heart rhythm diagnostics and one output column. I am not going to rename the feature columns because they are too many and the descriptions are too complex. Also, we don’t need to know specifically which features we are looking at for building the models. For a description of each feature, see https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names. The output column defines 16 classes: class 1 samples are from healthy ECGs, the remaining classes belong to different types of arrhythmia, with class 16 being all remaining arrhythmia cases that didn’t fit into distinct classes.
Very interesting post.
While we were pretty happy with the improvement, we noticed that one of the test cases in Databricks started failing. To simulate a hanging query, the test case performed a cross join to produce 1 trillion rows.
spark.range(1000 * 1000).crossJoin(spark.range(1000 * 1000)).count()
On a single node, we expected this query would run infinitely or “hang.” To our surprise, we started seeing this test case failing nondeterministically because sometimes it completed on our Jenkins infrastructure in less than one second, the time limit we put on this query.
You’re not going to get this performance against a real data set, but it was interesting reading their troubleshooting notes.
In this post, we will show you a visualization and build a predictive model of US flights with sparklyr. Flight visualization code is based on this article.
This post assumes you already have the following tables:
- Airlines data as
airlines_bi_pq. It is assumed to be on S3, but you can put it into HDFS. See also the Ibis project.
- Airports data converted into Parquet format as
airports_new_pq. See also 2009 ASA Data Expo.
You should make these tables available through Apache Hive or Apache Impala (incubating) with Hue.
There’s some setup work to get this going, but getting a handle on sparklyr looks to be a good idea if you’re in the analytics space.
Basically, I use a checkpoint if I want to freeze the content of my data frame before I do something else. It can be in the scenario of iterative algorithms (as mentioned in the Javadoc) but also in recursive algorithms or simply branching out a data frame to run different kinds of analytics on both.
Spark has been offering checkpoints on streaming since earlier versions (at least v1.2.0), but checkpoints on data frames are a different beast.
This could also be very useful for a quality control flow: perform operation A, and if it doesn’t generate good enough results, roll back and try operation B.
This release also adds support for Spark 2 including version Spark 2.1. Zeppelin now also links to Spark History Server UI from Zeppelin so users can more easily track Spark jobs. The Livy interpreter now supports specifying packages with the job.
The major security improvement in Zeppelin 0.7.0 is using Apache Knox’s LDAP Realm to connect to LDAP. Zeppelin home page now lists only the nodes to which the user is authorized to access. Zeppelin now also has the ability to support PAM based authentication.
The full list of improvements is available here
This visualization platform is growing up nicely.
Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!
You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.
Sparklyr is an exciting technology for distributed data analysis.
It’s important to understand the performance implications of Apache Spark’s UDF features. Python UDFs for example (such as our CTOF function) result in data being serialized between the executor JVM and the Python interpreter running the UDF logic – this significantly reduces performance as compared to UDF implementations in Java or Scala. Potential solutions to alleviate this serialization bottleneck include:
- Accessing a Hive UDF from PySpark as discussed in the previous section. The Java UDF implementation is accessible directly by the executor JVM. Note again that this approach only provides access to the UDF from the Apache Spark’s SQL query language.
- Making use of the approach also shown to access UDFs implemented in Java or Scala from PySpark, as we demonstrated using the previously defined Scala UDAF example.
In general, UDF logic should be as lean as possible, given that it will be called for each row. As an example, a step in the UDF logic taking 100 milliseconds to complete will quickly lead to major performance issues when scaling to 1 billion rows.
Definitely worth a read. UDFs in Spark can come at a performance penalty, so they aren’t free.
Fortunately, Structured Streaming makes it easy to convert these periodic batch jobs to a real-time data pipeline. Streaming jobs are expressed using the same APIs as batch data. Additionally, the engine provides the same fault-tolerance and data consistency guarantees as periodic batch jobs, while providing much lower end-to-end latency.
In the rest of post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned, parquet data warehouse. AWS CloudTrail allows us to track all actions performed in a variety of AWS accounts, by delivering gzipped JSON logs files to a S3 bucket. These files enable a variety of business and mission critical intelligence, such as cost attribution and security monitoring. However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON logs files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourselves.
This introductory post discusses some of the architecture and setup, and they promise additional posts getting into finer details.
The main issues for these applications were caused by trying to run a development system’s code, tested on AWS instances on a physical, on-premise cluster running on real data. The original developer was never given access to the production cluster or the real data.
Apache Ignite was a huge source of problems, principally because it is such a new project that nobody had any real experience with it and also because it is not a very mature project yet.
I found this article fascinating, particularly because the answer was a lot more than just “throw some more hardware at the problem.”