Category: Spark

With Databricks RStudio Integration, both popular R packages for interacting with Apache Spark, SparkR or sparklyr can be used the inside the RStudio IDE on Databricks. When multiple users use a cluster, each creates a separate SparkR Context or sparklyr connection, but they are all talking to a single Databricks managed Spark application allowing unique opportunities for collaboration between users. Together, RStudio can take advantage of Databricks’ cluster management and Apache Spark to perform such as a massive model selection as noted in the figure below.

I like seeing this level of integration, especially from a language like R, which has historically been limited to operating on a single machine’s memory.

Comments closed

Tuning Spark Jobs Running On YARN

Published 2018-06-25 by Kevin Feasel

Anubhav Tarar shows us ways of optimizing YARN to run Apache Spark jobs:

1. yarn-client mode: In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. To manage the memory first make sure that you have your yarn-site.xml in spark,

spark.yarn.am.memory: To increase the memory you should set spark.yarn.am.memory property in spark-defaults.conf but make sure that you do not allocate more memory than capacity of node manager which is defined in yarn-site.xml as yarn.nodemanager.resource.memory-mb or you can also give it when you are running spark submit with –conf parameter

For example $SPARK_HOME/bin/spark-submit –conf spark.yarn.am.memory=1024m

Check it out for a few other configuration settings you can tweak.

Comments closed

Understanding A Spark Streaming Workflow

Published 2018-06-18 by Kevin Feasel

Himanshu Gupta continues a series on structured streaming using Spark Streaming:

Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a streaming input for wordCounts DataFrame.

Now, the only unknown thing in the above diagram is “Complete Mode“. It is nothing but one of the 3 output modes available in Structured Streaming. Since they are an important part of Structured Streaming, so, let’s read about them in detail:

Complete Mode – This mode updates the entire Result Table which is eventually written to the sink.
Append Mode – In this mode, only the new rows are appended in the Result Table and eventually sent to the sink.
Update Mode – At last, this mode updates only the rows that are changed in the Result Table since the last trigger. Also, only the new rows are sent to the sink. There is one peculiar thing to note about this mode, i.e., it is different from the Complete Mode in the way that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain any aggregations, it is equivalent to the Append mode.

Check it out.

Comments closed

Calculating TF-IDF Using Apache Spark

Published 2018-06-18 by Kevin Feasel

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark:

TF-IDF is used in a large variety of applications. Typical use cases include:

Document search.

Document tagging.

Text preprocessing and feature vector engineering for Machine Learning algorithms.

There is a vast number of resources on the web explaining the concept itself and the calculation algorithm. This article does not repeat the information in these other Internet resources, it just illustrates TF-IDF calculation with help of Apache Spark. Emml Asimadi, in his excellent article Understanding TF-IDF, shares an approach based on the old Spark RDD and the Python language. This article, on the other hand, uses the modern Spark SQL API and Scala language.

Although Spark MLlib has an API to calculate TF-IDF, this API is not convenient to learn the concept. MLlib tools are intended to generate feature vectors for ML algorithms. There is no way to figure out the weight for a particular term in a particular document. Well, let’s make it from scratch, this will sharpen our skills.

Read on for the solution. It seems that there tend to be better options today than TF-IDF for natural language problems, but it’s an easy algorithm to understand, so it’s useful as a first go.

Comments closed

Exception Handling In Scala

Published 2018-06-12 by Kevin Feasel

Shivangi Gupta shows off the Either keyword in Scala:

How to get values from Either?

There are many ways we will talk about all one by one. One way to get values is by doing left and right projection. We can not perform any operation i.e, map, filter etc; on Either. Either provide left and right methods to get the left and right projection. Projection on either allows us to apply functions like map, filter etc.

For example,
scala> val div = divide(14, 7)
div: scala.util.Either[String,Int] = Right(2)

scala> div.right
res1: scala.util.Either.RightProjection[String,Int] = RightProjection(Right(2))
When we applied right on either, it returned RightProjection. Now we can extract the value from right projection using get, but if there is no value the compiler will blow up using get.

There’s more to Scala exception handling than just try-catch.

Comments closed

Flattening JSON Data With Databricks

Published 2018-06-11 by Kevin Feasel

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:

With Databricks you get:

An easy way to infer the JSON schema and avoid creating it manually

Subtle changes in the JSON schema won’t break things

The ability to explode nested lists into rows in a very easy way (see the Notebook below)

Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe.

Click through for the notebook.

Comments closed

Databricks MLflow

Published 2018-06-06 by Kevin Feasel

Matai Zaharia announces a new Databricks offering:

MLflow is inspired by existing ML platforms, but it is designed to be open in two senses:

Open interface: MLflow is designed to work with any ML library, algorithm, deployment tool or language. It’s built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of only providing a small set of built-in functionality. This also makes it easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run.

Open source: We’re releasing MLflow as an open source project that users and library developers can extend. In addition, MLflow’s open format makes it very easy to share workflow steps and models across organizations if you wish to open source your code.

Mlflow is still currently in alpha, but we believe that it already offers a useful framework to work with ML code, and we would love to hear your feedback. In this post, we’ll introduce MLflow in detail and explain its components.

Even in alpha, it looks nice.

Comments closed

Stream-To-Stream Joins In Spark

Published 2018-05-25 by Kevin Feasel

Ayush Tiwari shows how to join a pair of streams in Apache Spark 2.3:

In Spark 2.3, it added support for stream-stream joins, i.e, we can join two streaming Datasets/DataFrames and in this blog we are going to see how beautifully spark now give support for joining the two streaming dataframes.

I this example, I am going to use
Apache Spark 2.3.0
Apache Kafka 0.11.0.1
Scala 2.11.8

Click through for the demo.

Comments closed

Spark: DataFrame To RDD For Data Cleansing

Published 2018-05-24 by Kevin Feasel

Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:

A problem can arise when one of the inner fields of the json, has undesired non-json values in some of the records.

For instance, an inner field might contains HTTP errors, that would be interpreted as a string, rather than as a struct.

As a result, our schema would look like:

root

|– headers: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

|– requestBody: string (nullable = true)

Instead of

root

|– headers: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

|– requestBody: struct (nullable = true)

| |– items: array (nullable = true)

| | |– element: struct (containsNull = true)

When trying to explode a “string” type, we will get a miss type error:

org.apache.spark.sql.AnalysisException: Can’t extract value from requestBody#10

Click through to see how to handle this scenario cleanly.

Comments closed

Using The Spark Connector To Speed Up Data Loads

Published 2018-05-14 by Kevin Feasel

Denzil Riberio explains how you can use the Spark connector for Azure SQL DB and SQL Server to speed up inserting data from Spark into SQL Server 15x over the native JDBC client:

Since the load was taking longer than expected, we examined the sys.dm_exec_requests DMV while load was running, and saw that there was a fair amount of latch contention on various pages, which wouldn’t not be expected if data was being loaded via a bulk API.

Examining the statements being executed, we saw that the JDBC driver uses sp_prepare followed by sp_execute for each inserted row; therefore, the operation is not a bulk insert. One can further example the Spark JDBC connector source code, it builds a batch consisting of singleton insert statements, and then executes the batch via the prep/exec model.

It’s the power of bulk insertion.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31