Controlling the lifecycle of Spark can be cumbersome and tedious. Fortunately, Spark Testing Baseproject offers us Scala Traits that handle those low-level details for us. Streaming has an extra bit of complexity as we need to produce data for ingestion in a timely way. At the same time, Spark internal clock needs to tick in a controlled way if we want to test timed operations as sliding windows.
This is part one of a series. I’m interesting in seeing where this goes.
The basic steps can be described as follows:
When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.
When shuffle happens, the shuffle writer will first compress the plaintext if compression is enabled. Spark will use the randomly generated Initial Vector (IV) and keys obtained from the credentials to encrypt the plaintext by using
CryptoOutputStreamwill encrypt the shuffle data and write it to the disk as it arrives. The first 16 bytes of the encrypted output file are preserved to store the initial vector.
For the read path, the first 16 bytes are used to initialize the IV, which is provided to
CryptoInputStreamalong with the user’s credentials. The decrypted data is then provided to Spark’s shuffle mechanism for further processing.
Once you have things optimized, the performance hit is surprisingly small.
Once the cluster is created, you can connect to the edge node where MRS is already pre-installed by SSHing to r-server.YOURCLUSTERNAME-ssh.azurehdinsight.net with the credentials which you supplied during the cluster creation process. In order to do this in MobaXterm, you can go to Sessions, then New Sessions and then SSH.
The default installation of HDI Spark on Linux cluster does not come with RStudio Server installed on the edge node. RStudio Server is a popular open source integrated development environment (IDE) available for R that provides a browser-based IDE for use by remote clients. This tool allows you to benefit from all the power of R, Spark and Microsoft HDInsight cluster through your browser. In order to install RStudio you can follow the steps detailed in the guide, which reduces to running a script on the edge node.
If you’ve been meaning to get further into Spark & R, this is a great article to follow along with on your own.
The C# language binding to Spark is similar to the Python and R bindings. In fact, Mobius follows the same design pattern and leverages the existing implementation of language binding components in Spark where applicable for consistency and reuse. The following picture shows the dependency between the .NET application and the C# API in Mobius, which internally depends on Spark’s public API in Scala and Java and extends PythonRDD from PySpark to implement CSharpRDD.
Looks like there’s some fuzziness on just how well F# is supported. Still, this is very exciting as a way of bridging the gap for .NET developers.
In Structured Streaming, we tackle the issue of semantics head-on by making a strong guarantee about the system: at any time, the output of the application is equivalent to executing a batch job on a prefix of the data. For example, in our monitoring application, the result table in MySQL will always be equivalent to taking a prefix of each phone’s update stream (whatever data made it to the system so far) and running the SQL query we showed above. There will never be “open” events counted faster than “close” events, duplicate updates on failure, etc. Structured Streaming automatically handles consistency and reliability both within the engine and in interactions with external systems (e.g. updating MySQL transactionally).
If you want to learn more about streaming data using Spark, check this out.
Reason 3: Poor transparency for teammates
Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.
Matthew then gives us four alternative products.
If you are familiar with traditional Spark streaming you may notice that the above example is lacking an explicit batch duration. In structured streaming the equivalent feature is a trigger. By default it will run batches as quickly as possible, starting the next batch as soon as more data is available and the previous batch is complete. You can also set a more traditional fixed batch interval for your trigger. In the future more flexible trigger options will be added.
A related consequence is that windows are no longer forced to be a multiple of the batch duration. Furthermore, windows needn’t be only on processing time anymore, we can rearrange events that may have been delayed or arrived out of order and window by event time. Suppose our input stream had a column event_time that we wanted to do windowed counts on. Then we could do something like the following to get counts of events in a 1 minute window:
Right now, there are some pretty strict limitations on this new streaming, but I imagine they’ll loosen up quite soon.
Apache Spark 2.0 has officially been released. Vinay Shukla gives us some highlights:
Project Tungsten has completed another major phase and with new completely new stage code generation, significant performance improvements have been delivered. Parquet and ORC file processing have also delivered performance improvements.
Databricks Community Edition offers (tiny) free clusters with Spark 2.0 on top of Scala 2.10 and Scala 2.11.
The Spark-Hbase Connector provides an easy way to store and access data from HBase clusters with Spark jobs. HBase is really successful for highest level of data scale needs. Thus, existing Spark customers should definitely explore this storage option. Similarly, if the customers are already having HDinsight HBase clusters and they want to access their data by Spark jobs then there is no need to move data to any other storage medium. In both the cases, the connector will be extremely useful.
I’m not the biggest fan of HBase, but if it’s part of your environment, you should definitely look at this Spark connector.
Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. similar to each other than entities in other clusters. As opposed to classification problems where the goal is to learn based on examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning task.
There are many different clustering algorithms and a central notion in all of those is the definition of ’similarity’ between the entities that are being grouped. Different clustering algorithms may have different ways of measuring the similarity. In many clustering algorithms, another common notion is the so-called cluster center, which is a basis to represent the cluster. For example, in K-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.
This is a fairly lengthy article but if you want to get into machine learning with Spark, it’s a good one.