Spark – Page 70 – Curated SQL

SparkR + Zeppelin

Published 2016-07-21 by Kevin Feasel

I take a look at using SparkR and Zeppelin:

My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

SparkR is a bit of a mindset change from traditional R.

Comments closed

Understanding Spark APIs

Published 2016-07-19 by Kevin Feasel

Jules Damji explains when to use RDDs, when to use DataFrames, and when to use Datasets in Spark:

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data sets processing even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain specific language API to manipulate your distributed data; and makes Spark accessible to a wider audience, beyond specialized data engineers.

With Spark 2.0, the balance moves in favor of the more structured data types. What’s old is new; what’s unstructured is structured…

Comments closed

Why Learn Scala?

Published 2016-07-19 by Kevin Feasel

Kevin Jacobs explains why Scala is a useful language:

Is it a functional programming language? Is it an object-oriented programming language? The answer to both questions is yes! Scala is a object-functional programming language. The good old well-known stuff is all in Scala. You can build complex applications by the means of objects and classes. On the other hand, Scala tries to teach programmers a paradigm called functional programming. In functional programming, a computation is treated as the evaluation of a mathematical function. So in that sense, everything in Scala is an evaluation. You might wonder why you would ever need functional programming if you are used to object-oriented programming. Well, the case is that in imperative programming you are changing the state over and over again. This is not allowed in functional programming. The changing of the state causes side-effects and makes your application less transparent. A imperative application is therefore often hard to debug while a functional program is easy to debug since it does not change the state. A concrete example is given below

Scala is to Java as F# is to C#.

Comments closed

Going From Pig To Spark

Published 2016-07-14 by Kevin Feasel

Philippe de Cuzey introduces Spark to people already familiar with Pig:

I like to think of Pig as a high-level Map/Reduce commands pipeline. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.

Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.

But sometimes, Pig has some limitations that makes it a poor programming paradigm to fit your needs.

Philippe includes a couple of examples in Pig, PySpark, and SparkSQL. Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.

Comments closed

Data Frame Partial Caching

Published 2016-07-11 by Kevin Feasel

Arijit Tarafdar shows how to capture partitions of a data frame in Spark, either horizontally or vertically:

In many Spark applications, performance benefit is obtained from caching the data if reused several times in the applications instead of reading them each time from persistent storage. However, there can be situations when the entire data cannot be cached in the cluster due to resource constraint in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.

In the notebook we declare a Student case class with name, subject, major, school and year as members. The application is required to find out the number of students by name, subject, major, school and year.

Partitioning is an interesting idea for trying to speed up Spark performance by keeping everything in memory even when your entire data set is a bit too large.

Comments closed

Graphing Swear Words In Movies

Published 2016-07-06 by Kevin Feasel

Jos Dirksen uses Spark and D3 to count and graph swear words in movies:

So how do we do this? Well, the first thing to do is get the number of swearwords per minute. I mentioned that for the original article someone just counted every swearwords, in our case, we’re just going to parse a subtitle file, and extract the swear words from that.

Without going into too much detail, you can find the code I’ve experimtend with in this gist (it’s very ugly code, since I just hacked something together that worked).

Jos includes counts for four movies. This link does contain a few bad words, but if you get past that, it’s a good pattern for analyzing word counts in general.

Comments closed

HDInsight Tool For Eclipse

Published 2016-07-05 by Kevin Feasel

Xiaoyong Zhu reports that the HDInight tool for Eclipse is now generally available:

The HDInsight Tool for Eclipse extends Eclipse to allow you to create and develop HDInsight Spark applications and easily submit Spark jobs to Microsoft Azure HDInsight Spark clusters using the Eclipse development environment. It integrates seamlessly with Azure, enabling you to easily navigate HDInsight Spark clusters and to view associated Azure storage accounts. To further boost productivity, the HDInsight tool for Eclipse also offers the capability to view Spark job history and display detailed job logs.

Check out the link for videos and additional resources.

Comments closed

Getting Started With Spark

Published 2016-07-01 by Kevin Feasel

Denny Lee announces a new Spark intro guide:

We are proud to introduce the Getting Started with Apache Spark on Databricks Guide. This step-by-step guide illustrates how to leverage the Databricks’ platform to work with Apache Spark. Our just-in-time data platform simplifies common challenges when working with Spark: data integration, real-time experimentation, and robust deployment of production applications.

Databricks provides a simple, just-in-time data platform designed for data analysts, data scientists, and engineers. Using Databricks, this step-by-step guide helps you solve real-world Data Sciences and Data Engineering scenarios with Apache Spark. It will help you familiarize yourself with the Spark UI, learn how to create Spark jobs, load data and work with Datasets, get familiar with Spark’s DataFrames and Datasets API, run machine learning algorithms, and understand the basic concepts behind Spark Streaming.

If you are at all interested in distributed databases, Spark is a must-learn.

Comments closed

Getting Started With Spark

Published 2016-06-30 by Kevin Feasel

I discuss getting up and running with Databricks Community Edition:

There are a couple of notes with these clusters:

These are not powerful clusters. Don’t expect to crunch huge data sets with them. Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.
The cluster will automatically terminate after one hour without activity. The paid version does not have this limitation.
You interact with the cluster using notebooks rather than opening a command prompt. In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.

Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free. Read the whole thing.

Comments closed

Basic Spark Terminology

Published 2016-06-28 by Kevin Feasel

Denny Lee and Jules Damji explain some of the key terms and concepts around Apache Spark:

At the core of Apache Spark is the notion of data abstraction as distributed collection of objects. This data abstraction, called Resilient Distributed Dataset (RDD), allows you to write programs that transform these distributed datasets.

RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure.

Some of these concepts are new to Spark 2.0, but all are worth learning.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: Spark