Clustering is the task of assigning entities to groups based on similarities among those entities. The goal is to construct clusters such that entities in one cluster are more similar to each other than to entities in other clusters. As opposed to classification problems, where the goal is to learn from labeled examples, clustering involves learning from observation. For this reason, it is a form of unsupervised learning.
There are many different clustering algorithms, and a central notion in all of them is the definition of ‘similarity’ between the entities being grouped. Different clustering algorithms may measure similarity in different ways. Another common notion in many clustering algorithms is the cluster center, which serves as a representative of the cluster. For example, in the K-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.
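To make the cluster-center idea concrete, here is a minimal K-means sketch in Scala using Spark ML's KMeans and VectorAssembler. The sample points, column names, and choice of k are placeholders of mine, not anything from the article.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KMeansSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical two-dimensional points; in practice you would load these from storage.
    val points = Seq((1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)).toDF("x", "y")

    // Spark ML expects the input features gathered into a single vector column.
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")
      .transform(points)

    // Fit K-means with k = 2; each cluster center is the arithmetic mean of its points.
    val model = new KMeans().setK(2).setSeed(1L).fit(features)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```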
This is a fairly lengthy article, but if you want to get into machine learning with Spark, it’s a good one.
My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.
With that in mind, let’s start messing around.
SparkR is a bit of a mindset change from traditional R.
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing of large data sets even easier, the DataFrame allows developers to impose a structure onto a distributed collection of data, enabling higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data; and it makes Spark accessible to a wider audience beyond specialized data engineers.
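As a quick illustration of that contrast, here is a small Scala sketch; the Person case class and sample rows are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Made-up record type for the example.
case class Person(name: String, age: Int)

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()
    import spark.implicits._

    // An RDD is an opaque collection of objects...
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bo", 29)))

    // ...while a DataFrame carries named columns, so you can query it declaratively.
    val df = rdd.toDF()
    df.select("name").where($"age" > 30).show()

    spark.stop()
  }
}
```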
With Spark 2.0, the balance moves in favor of the more structured data types. What’s old is new; what’s unstructured is structured…
Is it a functional programming language? Is it an object-oriented programming language? The answer to both questions is yes! Scala is an object-functional programming language. The good old well-known stuff is all in Scala: you can build complex applications by means of objects and classes. On the other hand, Scala tries to teach programmers a paradigm called functional programming, in which a computation is treated as the evaluation of a mathematical function. In that sense, everything in Scala is an expression that evaluates to a value. You might wonder why you would ever need functional programming if you are used to object-oriented programming. The issue is that in imperative programming you change state over and over again. This is not allowed in functional programming: changing state causes side effects and makes your application less transparent. An imperative application is therefore often hard to debug, while a functional program is easier to debug since it does not change state. A concrete example is given below.
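The author's own example did not make it into this excerpt, but a sketch of the kind of contrast he means might look like this (my illustration, not the article's code):

```scala
// Imperative style: a mutable accumulator whose state changes on every iteration.
def sumImperative(xs: List[Int]): Int = {
  var total = 0             // mutable state
  for (x <- xs) total += x  // side effect: total is reassigned
  total
}

// Functional style: no mutation, just the evaluation of an expression.
def sumFunctional(xs: List[Int]): Int =
  xs.foldLeft(0)(_ + _)
```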
Scala is to Java as F# is to C#.
I like to think of Pig as a high-level pipeline of Map/Reduce commands. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.
Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore via HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.
But Pig also has limitations that can make it a poor programming paradigm for your needs.
Philippe includes a couple of examples in Pig, PySpark, and SparkSQL. Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.
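Philippe's examples are in Pig, PySpark, and SparkSQL, but the "each step described atomically" style carries over to Scala as well. Here is a rough sketch of my own (the file path and column names are invented):

```scala
import org.apache.spark.sql.SparkSession

// A Pig script reads like a pipeline of named, atomic steps (LOAD, FILTER, GROUP, ...).
// The same style translates to Spark when each transformation is bound to its own name.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipelineSketch").getOrCreate()

    val raw        = spark.read.option("header", "true").csv("/data/orders.csv")
    val shipped    = raw.filter("status = 'shipped'")
    val byCustomer = shipped.groupBy("customer_id").count()

    byCustomer.write.mode("overwrite").parquet("/data/orders_by_customer")
    spark.stop()
  }
}
```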
In many Spark applications, a performance benefit is obtained by caching data that is reused several times in the application, instead of reading it from persistent storage each time. However, there are situations when the entire data set cannot be cached in the cluster due to resource constraints in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.
In the notebook we declare a Student case class with year among its members. The application is required to find the number of students by year.
Partitioning is an interesting idea for trying to speed up Spark performance by keeping everything in memory even when your entire data set is a bit too large.
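Here is a rough Scala sketch of the two partial-caching ideas, reusing the Student example; the other fields, the data source, and the filter predicate are assumptions of mine, not from the post.

```scala
import org.apache.spark.sql.SparkSession

// Made-up record type; only 'year' is mentioned in the excerpt.
case class Student(name: String, department: String, year: Int)

object PartialCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PartialCacheSketch").getOrCreate()
    import spark.implicits._

    val students = spark.read.parquet("/data/students").as[Student]

    // Vertical partitioning: cache only the column(s) the repeated queries need.
    val years = students.select($"year").cache()

    // Horizontal partitioning: cache only the rows you expect to revisit.
    val recent = students.filter($"year" >= 2015).cache()

    years.groupBy("year").count().show()
    recent.count()

    spark.stop()
  }
}
```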
So how do we do this? Well, the first thing to do is get the number of swear words per minute. I mentioned that for the original article someone just counted every swear word; in our case, we’re just going to parse a subtitle file and extract the swear words from that.
Without going into too much detail, you can find the code I’ve experimented with in this gist (it’s very ugly code, since I just hacked something together that worked).
Jos includes counts for four movies. This link does contain a few bad words, but if you get past that, it’s a good pattern for analyzing word counts in general.
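If you just want the shape of the approach, a stripped-down sketch might look like the following. It is not the code from the gist; it assumes .srt-style timestamp lines such as "00:12:34,567 --> 00:12:36,000" and uses a placeholder word list.

```scala
import scala.io.Source

object SwearCountSketch {
  // Placeholder list; substitute whichever words you are counting.
  val swearWords = Set("darn", "heck")

  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("movie.srt").getLines().toList

    var currentMinute = 0
    val counts = scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)

    for (line <- lines) {
      if (line.contains("-->")) {
        // "00:12:34,567 --> ..." : minutes elapsed = hours * 60 + minutes
        val Array(h, m, _) = line.take(8).split(":")
        currentMinute = h.toInt * 60 + m.toInt
      } else {
        val words = line.toLowerCase.split("\\W+")
        counts(currentMinute) += words.count(swearWords.contains)
      }
    }

    counts.toSeq.sortBy(_._1).foreach { case (minute, n) => println(s"$minute\t$n") }
  }
}
```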
The HDInsight Tool for Eclipse extends Eclipse to allow you to create and develop HDInsight Spark applications and easily submit Spark jobs to Microsoft Azure HDInsight Spark clusters using the Eclipse development environment. It integrates seamlessly with Azure, enabling you to easily navigate HDInsight Spark clusters and to view associated Azure storage accounts. To further boost productivity, the HDInsight Tool for Eclipse also offers the capability to view Spark job history and display detailed job logs.
Check out the link for videos and additional resources.
We are proud to introduce the Getting Started with Apache Spark on Databricks Guide. This step-by-step guide illustrates how to leverage the Databricks platform to work with Apache Spark. Our just-in-time data platform simplifies common challenges when working with Spark: data integration, real-time experimentation, and robust deployment of production applications.
Databricks provides a simple, just-in-time data platform designed for data analysts, data scientists, and engineers. This step-by-step guide helps you use Databricks to solve real-world data science and data engineering scenarios with Apache Spark. It will help you familiarize yourself with the Spark UI, learn how to create Spark jobs, load data and work with Datasets, get familiar with Spark’s DataFrames and Datasets API, run machine learning algorithms, and understand the basic concepts behind Spark Streaming.
If you are at all interested in distributed data processing, Spark is a must-learn.
A few notes about these clusters:
These are not powerful clusters. Don’t expect to crunch huge data sets with them. Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.
The cluster will automatically terminate after one hour without activity. The paid version does not have this limitation.
You interact with the cluster using notebooks rather than opening a command prompt. In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.
Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free. Read the whole thing.