Press "Enter" to skip to content

Category: Spark

Spark 2.0 Out

Apache Spark 2.0 has officially been released.  Vinay Shukla gives us some highlights:

Performance
Project Tungsten has completed another major phase, and with the completely new whole-stage code generation, significant performance improvements have been delivered. Parquet and ORC file processing has also seen performance improvements.

Databricks Community Edition offers (tiny) free clusters with Spark 2.0 on top of Scala 2.10 and Scala 2.11.


Using The Spark-HBase Connector

Anunay Tiwari shows how to use the Spark-HBase connector in HDInsight:

The Spark-HBase Connector provides an easy way to store and access data from HBase clusters with Spark jobs. HBase is really successful for the highest levels of data-scale needs, so existing Spark customers should definitely explore this storage option. Similarly, if customers already have HDInsight HBase clusters and want to access their data from Spark jobs, there is no need to move the data to any other storage medium. In either case, the connector will be extremely useful.

I’m not the biggest fan of HBase, but if it’s part of your environment, you should definitely look at this Spark connector.
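
To give a rough idea of what that looks like in practice, here’s a minimal sketch assuming the Hortonworks shc connector that the HDInsight documentation builds on; the table, column family, and column names are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    val spark = SparkSession.builder().appName("hbase-read").getOrCreate()
    import spark.implicits._

    // Catalog mapping a (hypothetical) HBase table to DataFrame columns
    val catalog = """{
      "table":{"namespace":"default", "name":"Contacts"},
      "rowkey":"key",
      "columns":{
        "rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
        "name":{"cf":"Personal", "col":"Name", "type":"string"},
        "office":{"cf":"Office", "col":"Address", "type":"string"}
      }
    }"""

    // Read the HBase table as a DataFrame and query it like any other
    val contacts = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    contacts.filter($"office".isNotNull).show()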


Clustering With Spark

Konur Unyelioglu shows how to implement k-means and Gaussian mixture clustering techniques in Apache Spark using MLlib:

Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. more similar to each other, than entities in other clusters. As opposed to classification problems, where the goal is to learn based on examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning.

There are many different clustering algorithms, and a central notion in all of them is the definition of ‘similarity’ between the entities being grouped. Different clustering algorithms may have different ways of measuring that similarity. Another common notion in many clustering algorithms is the so-called cluster center, which serves as the basis for representing the cluster. For example, in the k-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.

This is a fairly lengthy article, but if you want to get into machine learning with Spark, it’s a good one.
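
For a quick taste before diving into the full article, a bare-bones k-means run against MLlib’s RDD-based API looks roughly like this (it assumes a spark-shell session where sc already exists; the file path, k, and iteration count are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Each line of the input file is assumed to hold space-separated numeric features
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Fit a k-means model with k clusters and a capped number of iterations
    val k = 3
    val maxIterations = 20
    val model = KMeans.train(points, k, maxIterations)

    // Inspect the learned cluster centers and the within-set sum of squared errors
    model.clusterCenters.foreach(println)
    println(s"WSSSE = ${model.computeCost(points)}")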


SparkR + Zeppelin

I take a look at using SparkR and Zeppelin:

My goal is to do some of the things that I did in my Touching on Advanced Topics post.  Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below.  As a result, I was only able to do some—but not all—of the anticipated work.  I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

SparkR is a bit of a mindset change from traditional R.


Understanding Spark APIs

Jules Damji explains when to use RDDs, when to use DataFrames, and when to use Datasets in Spark:

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data set processing even easier, the DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data; and it makes Spark accessible to a wider audience, beyond specialized data engineers.

With Spark 2.0, the balance moves in favor of the more structured data types.  What’s old is new; what’s unstructured is structured…
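
As a quick illustration of the distinction (my own sketch, not from Jules’s article), here is the same filter expressed against each of the three APIs in Spark 2.0:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("api-compare").getOrCreate()
    import spark.implicits._

    // RDD: functional transformations over JVM objects, no Catalyst optimization
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 34), Person("Bo", 29)))
    val rddAdults = rdd.filter(_.age >= 30)

    // DataFrame: named columns and Catalyst optimization, but rows are untyped at compile time
    val df = rdd.toDF()
    val dfAdults = df.filter($"age" >= 30)

    // Dataset: named columns plus compile-time type safety
    val ds = df.as[Person]
    val dsAdults = ds.filter(_.age >= 30)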


Why Learn Scala?

Kevin Jacobs explains why Scala is a useful language:

Is it a functional programming language? Is it an object-oriented programming language? The answer to both questions is yes! Scala is an object-functional programming language. The good old well-known stuff is all in Scala. You can build complex applications by means of objects and classes. On the other hand, Scala tries to teach programmers a paradigm called functional programming. In functional programming, a computation is treated as the evaluation of a mathematical function. So in that sense, everything in Scala is an evaluation. You might wonder why you would ever need functional programming if you are used to object-oriented programming. Well, the case is that in imperative programming you are changing the state over and over again. This is not allowed in functional programming. Changing the state causes side effects and makes your application less transparent. An imperative application is therefore often hard to debug, while a functional program is easy to debug since it does not change the state. A concrete example is given below.
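
The excerpt cuts off before Kevin’s example, so here is a small sketch in the same spirit: the imperative version mutates a variable on every iteration, while the functional version expresses the same computation with no mutation at all:

    // Imperative: mutable state changes on every pass through the loop
    def sumImperative(xs: List[Int]): Int = {
      var total = 0
      for (x <- xs) total += x
      total
    }

    // Functional: no mutation, just the evaluation of a fold
    def sumFunctional(xs: List[Int]): Int =
      xs.foldLeft(0)(_ + _)

    println(sumImperative(List(1, 2, 3)))  // 6
    println(sumFunctional(List(1, 2, 3)))  // 6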

Scala is to Java as F# is to C#.


Going From Pig To Spark

Philippe de Cuzey introduces Spark to people already familiar with Pig:

I like to think of Pig as a high-level Map/Reduce commands pipeline. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.

Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.

But sometimes, Pig has some limitations that make it a poor programming paradigm to fit your needs.

Philippe includes a couple of examples in Pig, PySpark, and SparkSQL.  Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.
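
Philippe’s examples are in PySpark and Spark SQL; just to give the flavor in Scala (a toy example of my own, with a made-up events schema), a typical Pig LOAD / FILTER / GROUP / FOREACH pipeline maps to something like this:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("pig-to-spark").getOrCreate()
    import spark.implicits._

    // LOAD: read a CSV with (user, action, duration) columns -- the schema is hypothetical
    val events = spark.read.option("header", "true").csv("hdfs:///data/events.csv")

    // FILTER ... BY, GROUP ... BY, and FOREACH ... GENERATE, all in one chain
    val summary = events
      .filter($"action" === "click")
      .groupBy($"user")
      .agg(count($"action").as("clicks"))

    summary.show()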


Data Frame Partial Caching

Arijit Tarafdar shows how to capture partitions of a data frame in Spark, either horizontally or vertically:

In many Spark applications, a performance benefit is obtained from caching the data if it is reused several times in the application instead of being read each time from persistent storage. However, there can be situations when the entire data set cannot be cached in the cluster due to resource constraints in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.

In the notebook we declare a Student case class with name, subject, major, school and year as members. The application is required to find out the number of students by name, subject, major, school and year.

Partitioning is an interesting idea for trying to speed up Spark performance by keeping the hot portion of your data in memory even when the entire data set is a bit too large to cache.
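
In rough terms, the two schemes boil down to caching a projection (vertical) or a selection (horizontal) of the full data frame. Here is a sketch using the Student example from the post, with column and predicate choices of my own:

    import org.apache.spark.sql.SparkSession

    case class Student(name: String, subject: String, major: String, school: String, year: Int)

    val spark = SparkSession.builder().appName("partial-cache").getOrCreate()
    import spark.implicits._

    // Hypothetical source file; any DataFrame of students would do
    val students = spark.read.parquet("hdfs:///data/students.parquet").as[Student]

    // Vertical partitioning: cache only the columns the repeated aggregations need
    val slim = students.select($"name", $"year").cache()

    // Horizontal partitioning: cache only the rows that will be reused
    val recent = students.filter($"year" >= 2015).cache()

    slim.groupBy($"year").count().show()
    recent.groupBy($"school").count().show()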


Graphing Swear Words In Movies

Jos Dirksen uses Spark and D3 to count and graph swear words in movies:

So how do we do this? Well, the first thing to do is get the number of swear words per minute. I mentioned that for the original article someone just counted every swear word; in our case, we’re just going to parse a subtitle file and extract the swear words from that.

Without going into too much detail, you can find the code I’ve experimented with in this gist (it’s very ugly code, since I just hacked something together that worked).

Jos includes counts for four movies.  This link does contain a few bad words, but if you get past that, it’s a good pattern for analyzing word counts in general.
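
The word-counting core of that pattern is simple enough to sketch (the word list and file name are placeholders, a real subtitle parser would also strip timestamps and markup, and spark here is the usual spark-shell session):

    // Count occurrences of a fixed set of words in a subtitle file
    val watchList = Set("darn", "heck", "gosh")  // stand-ins for the actual list

    val counts = spark.sparkContext.textFile("hdfs:///data/movie.srt")
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(watchList.contains)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (word, n) => println(s"$word: $n") }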


HDInsight Tool For Eclipse

Xiaoyong Zhu reports that the HDInsight tool for Eclipse is now generally available:

The HDInsight Tool for Eclipse extends Eclipse to allow you to create and develop HDInsight Spark applications and easily submit Spark jobs to Microsoft Azure HDInsight Spark clusters using the Eclipse development environment.  It integrates seamlessly with Azure, enabling you to easily navigate HDInsight Spark clusters and to view associated Azure storage accounts. To further boost productivity, the HDInsight tool for Eclipse also offers the capability to view Spark job history and display detailed job logs.

Check out the link for videos and additional resources.
