Press "Enter" to skip to content

Category: Spark

Spark UDFs in Scala

Achilleus shows us how to create a user-defined function for Spark in Scala, as well as the performance drawbacks:

It is pretty straight forward and easy to create it in spark. Let’s say we have this customer data from Central Perk. If you look at the country data, it has a lot of discrepancies but we kinda know its the right country, it’s just that the way it is entered is not typical. Let’s say we need to normalize it to the USA that is similar with the help of a known dictionary.

The performance hit is often too much for me to accept, though that could just be that I write bad functions.

Comments closed

ETL with Spark and Hive

Emrah Mete gives us an example of using Apache Spark for ETL into Apache Hive:

Now let’s go to the construction of the sample application. In the example, we will first send the data from our Linux file system to the data storage unit of the Hadoop ecosystem (HDFS) (for example, Extraction). Then we will read the data we have written here with Spark and then we will apply a simple Transformation and write to Hive (Load). Hive is a substructure that allows us to query the data in the hadoop ecosystem, which is stored in this environment. With this infrastructure, we can easily query the data in our big data environment using SQL language.

Most of the things relational database professionals do are pretty much the same things that you do with Spark and Hive. There are differences in implementation and level of programming familiarity, but they’re pretty similar.

Comments closed

Processing Fixed-Width Files with Spark

Subhasish Guha shows how you can read a fixed-with file with Apache Spark:

A fixed width file is a very common flat file format when working with SAP, Mainframe, and Web Logs. Converting the data into a dataframe using metadata is always a challenge for Spark Developers. This particular article talks about all kinds of typical scenarios that a developer might face while working with a fixed witdth file. This solution is generic to any fixed width file and very easy to implement. This also takes care of the Tail Safe Stack as the RDD gets into the foldLeft operator.

It’s a little more complicated than with R, where stringr can handle fixed-width formats. But it’s not bad.

Comments closed

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.

Click through for the demo.

Comments closed

Pivoting Spark DataFrames

Unmesha Sreeveni shows how we can pivot a DataFrame in Apache Spark using one line of code:

A pivot can be thought of as translating rows into columns while applying one or more aggregations.

Lets see how we can achieve the same using the above dataframe.

We will pivot the data based on “Item” column.

Click through for the code. This is an area where dropping back into Scala or Python is a lot more lines-of-code efficient than sticking to SQL.

Comments closed

Troubleshooting Spark Performance

Bikas Saha and Mridul Murlidharan explain some of the basics of performance tuning with Apache Spark:

Our objective was to build a system that would provide an intuitive insight into Spark jobs that not just provides visibility but also codifies the best practices and deep experience we have gained after years of debugging and optimizing Spark jobs. The main design objectives were to be
– Intuitive and easy – Big data practitioners should be able to navigate and ramp quickly
– Concise and focused – Hide the complexity and scale but present all necessary information in a way that does not overwhelm the end user
– Batteries included – Provide actionable recommendations for a self service experience, especially for users who are less familiar with Spark
– Extensible – To enable additions of deep dives for the most common and difficult scenarios as we come across them

The tool looks pretty interesting and I’m hoping it will be part of the open source suite at Cloudera.

Comments closed

Spark Memory Management on EMR

Karunanithi Shanmugam gives us some tips on memory management for Spark in Amazon’s ElasticMapReduce:

Amazon EMR provides high-level information on how it sets the default values for Spark parameters in the release guide. These values are automatically set in the spark-defaults settings based on the core and task instance types in the cluster.

To use all the resources available in a cluster, set the maximizeResourceAllocation parameter to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets these parameters in the spark-defaults settings. Even with this setting, generally the default numbers are low and the application doesn’t use the full strength of the cluster. For example, the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster.

Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads. Using Amazon EMR release version 4.4.0 and later, dynamic allocation is enabled by default (as described in the Spark documentation).

There’s a lot in here, much of which applies to Spark in general and not just EMR.

Comments closed

CSV Data Ingestion with Spark

Jean Georges Perrin shows how you can easily load CSV data with Spark:

Fortunately for you, Apache Spark offers a variety of options for ingesting those CSV files. Ingesting CSV is easy and schema inference is a powerful feature.
Let’s have a look at more advanced examples with more options that illustrate the complexity of CSV files in the outside world. You’ll first look at the file you’ll ingest, and understand its specifications. You’ll then have a look at the result and finally build the mini-application to achieve the result. This pattern repeats for each format.

It’s good to see some of the lesser-used features pop up like date format and multi-line support (which I hadn’t even known about).

Comments closed

Benefits of Partitioning in Spark

The Hadoop in Real World team take a look at how appropriate partitioning can make your Spark jobs much faster:

Shuffle is an expensive operation whether you do it with plain old MapReduce programs or with Spark. Shuffle is he process of bringing Key Value pairs from different mappers (or tasks in Spark) by Key in to a single reducer (task in Spark). So all key value pairs of the same key will end up in one task (node). So we can loop through the key value pairs and do the needed aggregation.

Since production jobs usually involve a lot of tasks in Spark, the key value pairs movement between nodes during shuffle (from one task to another) will cause a significant bottleneck. In some cases Shuffle is not avoidable but in many instances you could avoid shuffle by structuring your data little differently. Avoiding shuffle will have an positive impact on performance.

Read the whole thing. Getting partitions right is critical to writing scalable Spark jobs.

Comments closed