Spark – Page 66 – Curated SQL

Word Count In Spark 2.0

Published 2017-01-16 by Kevin Feasel

Anubhav Tarar has a word count app for Spark 2.0:

Now you have to perform the given steps:

Create a spark session from org.apache.spark.sql.sparksession api and specify your master and app name
Using the sparksession.read.txt method, read from the file wordcount.txt the return value of this method in a dataset. In case you don’t know what a data set looks like you can learn from this link.
Split this dataset of type string with white space and create a map which contains the occurence of each word in that data set.
Create a class prettyPrintMap for printing the result to console.

This Hello World app is a bit longer than the sheer minimum code necessary, as it includes a class for formatting results and some error handling.

Comments closed

Submitting A Spark Job On HDInsight

Published 2017-01-09 by Kevin Feasel

Bharath Venkatesh shows different ways to run a Spark job on HDInsight:

From HDI 3.5 onwards, our clusters come preinstalled with Zeppelin Notebooks. Much like Jupyter notebooks, Zeppelin is a web-based notebook that enables interactive data analytics. It provides built-in Spark intergration that allows for:

Automatic SparkContext and SQLContext injection

Runtime jar dependency loading from local filesystem or maven repository. Learn more about dependency loader.

Canceling job and displaying its progress

This MSDN article provides a quick easy-to-use onboarding guide to help get acclimatized to Zeppelin. You can also try several applications that come pre-installed on your cluster to get hands on experience of Zeppelin.

Zeppelin is probably my favorite method, but there are good reasons to use all of these.

Comments closed

Monitoring Car Data With Spark And Kafka

Published 2017-01-06 by Kevin Feasel

Carol McDonald builds a model to determine where Uber cars are clustered:

Uber trip data is published to a MapR Streams topic using the Kafka API. A Spark streaming application, subscribed to the topic, enriches the data with the cluster Id corresponding to the location using a k-means model, and publishes the results in JSON format to another topic. A Spark streaming application subscribed to the second topic analyzes the JSON messages in real time.

This is a fairly detailed post, well worth the read.

Comments closed

Spark 2.1

Published 2017-01-03 by Kevin Feasel

Reynold Xin announces Apache Spark 2.1:

Structured Streaming

Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications. The main goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.
- Event-time watermarks: This change lets applications hint to the system when events are considered “too late” and allows the system to bound internal state tracking late events.
- Support for all file-based formats and all file-based features: With these improvements, Structured Streaming can read and write all file-based formats, e.g. JSON, text, Avro, CSV. In addition, all file-based features—e.g. partitioned files and bucketing—are supported on all formats.
- Apache Kafka 0.10: This adds native support for Kafka 0.10, including manual assignment of starting offsets and rate limiting.

This is a pretty hefty release. Click through to read the whole thing.

Comments closed

Ten Notes On SparkR

Published 2017-01-03 by Kevin Feasel

Neil Dewar has a notebook with ten important things when migrating from R to SparkR:

Apache Spark Building Blocks. A high-level overview of Spark describes what is available for the R user.
SparkContext, SQLContext, and SparkSession. In Spark 1.x, SparkContext and SQLContext let you access Spark. In Spark 2.x, SparkSession becomes the primary method.
A DataFrame or a data.frame? Spark’s distributed DataFrame is different from R’s local data.frame. Knowing the differences lets you avoid simple mistakes.
Distributed Processing 101. Understanding the mechanics of Big Data processing helps you write efficient code—and not blow up your cluster’s master node.
Function Masking. Like all R libraries, SparkR masks some functions.
Specifying Rows. With Big Data and Spark, you generally select rows in DataFrames differently than in local R data.frames.
Sampling. Sample data in the right way, and use it as a tool for converting between big and small data.
Machine Learning. SparkR has a growing library of distributed ML algorithms.
Visualization.It can be hard to visualize big data, but there are tricks and tools which help.
Understanding Error Messages. For R users, Spark error messages can be daunting. Knowing how to parse them helps you find the relevant parts.

I highly recommend checking out the notebook.

Comments closed

Building A Multi-Node Hadoop Cluster With Spark

Published 2016-12-27 by Kevin Feasel

Rao Swati has a step-by-step instruction guide on how to set up a multi-node cluster with Hadoop 2.7.3 and Spark 1.6.2:

Important Notes:

Start-dfs.sh will start NameNode, SecondaryNamenode, DataNode on master and DataNode on all slaves node.

Start-yarn.sh will start NodeManager, ResourceManager on the master node and NodeManager on slaves.

Perform Hadoop namenode -format only once otherwise you will get an incompatible cluster_id exception. To resolve this error clear temporary data location for datanode i.e, remove the files present in $HADOOP_HOME/dfs/name/data folder.

If you’d like to set up your own Hadoop cluster rather than using one of the big vendors (Hortonworks, Cloudera, MapR) or a PaaS solution like HDInsight or ElasticMapReduce, this will give you a head start.

Comments closed

Getting Finer-Grained Security In Spark

Published 2016-12-22 by Kevin Feasel

Vadim Vaks explains how to get finer-grained permissions within Spark using Ranger and LLAP:

With LLAP enabled, Spark reads from HDFS go directly through LLAP. Besides conferring all of the aforementioned benefits on Spark, LLAP is also a natural place to enforce fine grain security policies. The only other capability required is a centralized authorization system. This need is met by Apache Ranger. Apache Ranger provides centralized authorization and audit services for many components that run on Yarn or rely on data from HDFS. Ranger allows authoring of security policies for: – HDFS – Yarn – Hive (Spark with LLAP) – HBase – Kafka – Storm – Solr – Atlas – Knox Each of the above services integrate with Ranger via a plugin that pulls the latest security policies, caches them, and then applies them at run time.

Read on for more details.

Comments closed

Spark Versus Flink

Published 2016-12-21 by Kevin Feasel

Sibanjan Das compares Apache Flink to Apache Spark:

The primitive concept of Apache Flink is the high-throughput and low-latency stream processing framework which also supports batch processing. The architecture is a flip of the other Big Data processing architectures where the primary notion was the batch processing framework. This is something that organizations have been looking for over the last decade. There is a need for platforms supporting low latency data movement for applications where even a millisecond delay can lead to severe consequences. The prospect of Apache Flink seems to be significant and looks like the goal for stream processing.

While comparing these two, don’t forget about Kafka Streams. We’ve entered the streaming era for Hadoop & friends, and it’s an exciting time.

Comments closed

Debugging Spark In HDInsight

Published 2016-12-20 by Kevin Feasel

Sajib Mahmood gives various methods for debugging Spark applications running on an HDInsight cluster:

Spark Application Master

To access Spark UI for the running application and get more detailed information on its execution use the Application Master link and navigate through different tabs containing more information on jobs, stages, executors and so on.

These methods also apply for on-prem Spark clusters, although the resource locations might be a little different.

Comments closed

Partition Handling In Spark 2.1

Published 2016-12-19 by Kevin Feasel

Eric Liang, et al, discuss a change to Spark 2.1 which will make certain partitioned table access faster:

In Spark 2.1, we drastically improve the initial latency of queries that touch a small fraction of table partitions. In some cases, queries that took tens of minutes on a fresh Spark cluster now execute in seconds. Our improvements cut down on table memory overheads, and make the SQL experience starting cold comparable to that on a “hot” cluster with table metadata fully cached in memory.

This looks like a nice improvement in Spark.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Spark

Structured Streaming

Important Notes: