Press "Enter" to skip to content

Category: Spark

Spark 2.0 Out

Apache Spark 2.0 has officially been released.  Vinay Shukla gives us some highlights:

Performance
Project Tungsten has completed another major phase, and with the completely new whole-stage code generation, significant performance improvements have been delivered. Parquet and ORC file processing has also seen performance improvements.

Databricks Community Edition offers (tiny) free clusters with Spark 2.0 on top of Scala 2.10 and Scala 2.11.


Using The Spark-HBase Connector

Anunay Tiwari shows how to use the Spark-HBase connector in HDInsight:

The Spark-HBase Connector provides an easy way to store and access data from HBase clusters with Spark jobs. HBase is really successful for the highest levels of data-scale needs, so existing Spark customers should definitely explore this storage option. Similarly, if customers already have HDInsight HBase clusters and want to access their data from Spark jobs, there is no need to move the data to any other storage medium. In either case, the connector will be extremely useful.

I’m not the biggest fan of HBase, but if it’s part of your environment, you should definitely look at this Spark connector.
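
To give a rough idea of what that looks like in practice, here’s a minimal sketch assuming the Hortonworks shc connector that the HDInsight documentation builds on; the table, column family, and column names are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    val spark = SparkSession.builder().appName("hbase-read").getOrCreate()
    import spark.implicits._

    // Catalog mapping a (hypothetical) HBase table to DataFrame columns
    val catalog = """{
      "table":{"namespace":"default", "name":"Contacts"},
      "rowkey":"key",
      "columns":{
        "rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
        "name":{"cf":"Personal", "col":"Name", "type":"string"},
        "office":{"cf":"Office", "col":"Address", "type":"string"}
      }
    }"""

    // Read the HBase table as a DataFrame and query it like any other
    val contacts = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    contacts.filter($"office".isNotNull).show()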


Clustering With Spark

Konur Unyelioglu shows how to implement k-means and Gaussian mixture clustering techniques in Apache Spark using MLlib:

Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. more similar to each other, than entities in other clusters. As opposed to classification problems, where the goal is to learn based on examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning.

There are many different clustering algorithms, and a central notion in all of them is the definition of ‘similarity’ between the entities being grouped. Different clustering algorithms may have different ways of measuring that similarity. Another common notion in many clustering algorithms is the so-called cluster center, which serves as the basis for representing the cluster. For example, in the k-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.

This is a fairly lengthy article, but if you want to get into machine learning with Spark, it’s a good one.
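
For a quick taste before diving into the full article, a bare-bones k-means run against MLlib’s RDD-based API looks roughly like this (it assumes a spark-shell session where sc already exists; the file path, k, and iteration count are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Each line of the input file is assumed to hold space-separated numeric features
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Fit a k-means model with k clusters and a capped number of iterations
    val k = 3
    val maxIterations = 20
    val model = KMeans.train(points, k, maxIterations)

    // Inspect the learned cluster centers and the within-set sum of squared errors
    model.clusterCenters.foreach(println)
    println(s"WSSSE = ${model.computeCost(points)}")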


SparkR + Zeppelin

I take a look at using SparkR and Zeppelin:

My goal is to do some of the things that I did in my Touching on Advanced Topics post.  Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below.  As a result, I was only able to do some—but not all—of the anticipated work.  I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

SparkR is a bit of a mindset change from traditional R.


Understanding Spark APIs

Jules Damji explains when to use RDDs, when to use DataFrames, and when to use Datasets in Spark:

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data set processing even easier, the DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data; and it makes Spark accessible to a wider audience, beyond specialized data engineers.

With Spark 2.0, the balance moves in favor of the more structured data types.  What’s old is new; what’s unstructured is structured…
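
As a quick illustration of the distinction (my own sketch, not from Jules’s article), here is the same filter expressed against each of the three APIs in Spark 2.0:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("api-compare").getOrCreate()
    import spark.implicits._

    // RDD: functional transformations over JVM objects, no Catalyst optimization
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 34), Person("Bo", 29)))
    val rddAdults = rdd.filter(_.age >= 30)

    // DataFrame: named columns and Catalyst optimization, but rows are untyped at compile time
    val df = rdd.toDF()
    val dfAdults = df.filter($"age" >= 30)

    // Dataset: named columns plus compile-time type safety
    val ds = df.as[Person]
    val dsAdults = ds.filter(_.age >= 30)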


Why Learn Scala?

Kevin Jacobs explains why Scala is a useful language:

Is it a functional programming language? Is it an object-oriented programming language? The answer to both questions is yes! Scala is an object-functional programming language. The good old well-known stuff is all in Scala. You can build complex applications by means of objects and classes. On the other hand, Scala tries to teach programmers a paradigm called functional programming. In functional programming, a computation is treated as the evaluation of a mathematical function. So in that sense, everything in Scala is an evaluation. You might wonder why you would ever need functional programming if you are used to object-oriented programming. Well, the case is that in imperative programming you are changing the state over and over again. This is not allowed in functional programming. Changing the state causes side effects and makes your application less transparent. An imperative application is therefore often hard to debug, while a functional program is easy to debug since it does not change the state. A concrete example is given below.
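
The excerpt cuts off before Kevin’s example, so here is a small sketch in the same spirit: the imperative version mutates a variable on every iteration, while the functional version expresses the same computation with no mutation at all:

    // Imperative: mutable state changes on every pass through the loop
    def sumImperative(xs: List[Int]): Int = {
      var total = 0
      for (x <- xs) total += x
      total
    }

    // Functional: no mutation, just the evaluation of a fold
    def sumFunctional(xs: List[Int]): Int =
      xs.foldLeft(0)(_ + _)

    println(sumImperative(List(1, 2, 3)))  // 6
    println(sumFunctional(List(1, 2, 3)))  // 6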

Scala is to Java as F# is to C#.


Going From Pig To Spark

Philippe de Cuzey introduces Spark to people already familiar with Pig:

I like to think of Pig as a high-level Map/Reduce commands pipeline. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.

Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.

But sometimes, Pig has some limitations that make it a poor programming paradigm to fit your needs.

Philippe includes a couple of examples in Pig, PySpark, and SparkSQL.  Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.
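
Philippe’s examples are in PySpark and Spark SQL; just to give the flavor in Scala (a toy example of my own, with a made-up events schema), a typical Pig LOAD / FILTER / GROUP / FOREACH pipeline maps to something like this:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("pig-to-spark").getOrCreate()
    import spark.implicits._

    // LOAD: read a CSV with (user, action, duration) columns -- the schema is hypothetical
    val events = spark.read.option("header", "true").csv("hdfs:///data/events.csv")

    // FILTER ... BY, GROUP ... BY, and FOREACH ... GENERATE, all in one chain
    val summary = events
      .filter($"action" === "click")
      .groupBy($"user")
      .agg(count($"action").as("clicks"))

    summary.show()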


Data Frame Partial Caching

Arijit Tarafdar shows how to capture partitions of a data frame in Spark, either horizontally or vertically:

In many Spark applications, a performance benefit is obtained from caching the data if it is reused several times in the application instead of being read each time from persistent storage. However, there can be situations when the entire data set cannot be cached in the cluster due to resource constraints in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.

In the notebook we declare a Student case class with name, subject, major, school and year as members. The application is required to find out the number of students by name, subject, major, school and year.

Partitioning is an interesting idea for trying to speed up Spark performance by keeping the hot portion of your data in memory even when the entire data set is a bit too large to cache.
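
In rough terms, the two schemes boil down to caching a projection (vertical) or a selection (horizontal) of the full data frame. Here is a sketch using the Student example from the post, with column and predicate choices of my own:

    import org.apache.spark.sql.SparkSession

    case class Student(name: String, subject: String, major: String, school: String, year: Int)

    val spark = SparkSession.builder().appName("partial-cache").getOrCreate()
    import spark.implicits._

    // Hypothetical source file; any DataFrame of students would do
    val students = spark.read.parquet("hdfs:///data/students.parquet").as[Student]

    // Vertical partitioning: cache only the columns the repeated aggregations need
    val slim = students.select($"name", $"year").cache()

    // Horizontal partitioning: cache only the rows that will be reused
    val recent = students.filter($"year" >= 2015).cache()

    slim.groupBy($"year").count().show()
    recent.groupBy($"school").count().show()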


Graphing Swear Words In Movies

Jos Dirksen uses Spark and D3 to count and graph swear words in movies:

So how do we do this? Well, the first thing to do is get the number of swear words per minute. I mentioned that for the original article someone just counted every swear word; in our case, we’re just going to parse a subtitle file and extract the swear words from that.

Without going into too much detail, you can find the code I’ve experimented with in this gist (it’s very ugly code, since I just hacked something together that worked).

Jos includes counts for four movies.  This link does contain a few bad words, but if you get past that, it’s a good pattern for analyzing word counts in general.
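
The word-counting core of that pattern is simple enough to sketch (the word list and file name are placeholders, a real subtitle parser would also strip timestamps and markup, and spark here is the usual spark-shell session):

    // Count occurrences of a fixed set of words in a subtitle file
    val watchList = Set("darn", "heck", "gosh")  // stand-ins for the actual list

    val counts = spark.sparkContext.textFile("hdfs:///data/movie.srt")
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(watchList.contains)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (word, n) => println(s"$word: $n") }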


HDInsight Tool For Eclipse

Xiaoyong Zhu reports that the HDInsight tool for Eclipse is now generally available:

The HDInsight Tool for Eclipse extends Eclipse to allow you to create and develop HDInsight Spark applications and easily submit Spark jobs to Microsoft Azure HDInsight Spark clusters using the Eclipse development environment.  It integrates seamlessly with Azure, enabling you to easily navigate HDInsight Spark clusters and to view associated Azure storage accounts. To further boost productivity, the HDInsight tool for Eclipse also offers the capability to view Spark job history and display detailed job logs.

Check out the link for videos and additional resources.
