Category: Hadoop

Processing Fixed-Width Files with Spark

Published 2019-04-24 by Kevin Feasel

Subhasish Guha shows how you can read a fixed-with file with Apache Spark:

A fixed width file is a very common flat file format when working with SAP, Mainframe, and Web Logs. Converting the data into a dataframe using metadata is always a challenge for Spark Developers. This particular article talks about all kinds of typical scenarios that a developer might face while working with a fixed witdth file. This solution is generic to any fixed width file and very easy to implement. This also takes care of the Tail Safe Stack as the RDD gets into the foldLeft operator.

It’s a little more complicated than with R, where stringr can handle fixed-width formats. But it’s not bad.

Comments closed

Sentiment Analysis with Spark on Qubole

Published 2019-04-24 by Kevin Feasel

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.
In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.

Click through for the demo.

Comments closed

Running Spark MLlib to Feed Power BI

Published 2019-04-24 by Kevin Feasel

Brad Llewellyn shows how you can take Spark MLlib results and feed them into Power BI:

MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX. It is a machine learning framework built from the ground up to be massively scalable and operate within Spark. This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data. You can read more about Spark MLlib here.
In order to leverage Spark MLlib, we obviously need a way to execute Spark code. In our minds, there’s no better tool for this than Azure Databricks. In the previous post, we covered the creation of an Azure Databricks environment. We’re going to reuse that environment for this post as well. We’ll also use the same dataset that we’ve been using, which contains information about individual customers. This dataset was originally designed to predict Income based on a number of factors. However, we left the income out of this dataset a few posts back for reasons that were important then. So, we’re actually going to use this dataset to predict “Hours Per Week” instead.

Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.

Comments closed

Flink: Batch as a Special Case of Streaming

Published 2019-04-23 by Kevin Feasel

Fabian Hueske and Aljoscha Krettek describe streaming versus batch processing in Apache Flink:

The Apache Flink project has followed the philosophy of taking a unified approach to batch and stream data processing, building on the core paradigm of “continuous processing of unbounded data streams” for a long time. If you think about it, carrying out offline processing of bounded data sets naturally fits the paradigm: these are just streams of recorded data that happen to end at some point in time.
Flink is not alone in this: there are other projects in the open source community that embrace “streaming first, with batch as a special case of streaming,” such as Apache Beam; and this philosophy has often been cited as a powerful way to greatly reduce the complexity of data infrastructures by building data applications that generalize across real-time and offline processing.

Check it out. At the end, the authors also describe Blink, a fork of Flink being (slowly) merged back in and which supports this paradigm.

Comments closed

Multi-Tenant Security in Kudu + Impala

Published 2019-04-23 by Kevin Feasel

Grant Henke shows how you can combine Apache Impala’s fine-grained authorization with Apache Kudu’s coarse-grained authentication for multi-tenant scenarios:

Kudu supports coarse-grained authorization of client requests based on the authenticated client Kerberos principal. The two levels of access which can be configured are:
1. Superuser – principals authorized as a superuser are able to perform certain administrative functionality such as using the kudu command line tool to diagnose or repair cluster issues.
2.User – principals authorized as a user are able to access and modify all data in the Kudu cluster. This includes the ability to create, drop, and alter tables as well as read, insert, update, and delete data.
Access levels are granted using whitelist-style Access Control Lists (ACLs), one for each of the two levels.

Read on to see how to tie it all together.

Comments closed

Pivoting Spark DataFrames

Published 2019-04-17 by Kevin Feasel

Unmesha Sreeveni shows how we can pivot a DataFrame in Apache Spark using one line of code:

A pivot can be thought of as translating rows into columns while applying one or more aggregations.
Lets see how we can achieve the same using the above dataframe.
We will pivot the data based on “Item” column.

Click through for the code. This is an area where dropping back into Scala or Python is a lot more lines-of-code efficient than sticking to SQL.

Comments closed

Troubleshooting Spark Performance

Published 2019-04-17 by Kevin Feasel

Bikas Saha and Mridul Murlidharan explain some of the basics of performance tuning with Apache Spark:

Our objective was to build a system that would provide an intuitive insight into Spark jobs that not just provides visibility but also codifies the best practices and deep experience we have gained after years of debugging and optimizing Spark jobs. The main design objectives were to be
– Intuitive and easy – Big data practitioners should be able to navigate and ramp quickly
– Concise and focused – Hide the complexity and scale but present all necessary information in a way that does not overwhelm the end user
– Batteries included – Provide actionable recommendations for a self service experience, especially for users who are less familiar with Spark
– Extensible – To enable additions of deep dives for the most common and difficult scenarios as we come across them

The tool looks pretty interesting and I’m hoping it will be part of the open source suite at Cloudera.

Comments closed

Exactly-Once Writes From Kafka To S3

Published 2019-04-15 by Kevin Feasel

Konstantine Karantasis takes us through writing from a Kafka topic into S3:

When customers were asking for an S3 connector, there were already several Kafka-to-S3 solutions out there at the time, so we had to decide whether to adopt an existing S3 connector, modify the Kafka Connect HDFS connector (as some developers attempted to do) or write a new connector from scratch.
We knew that our users needed three things from the connector:
1. Integration with the Kafka Connect API: Connect’s scaling and fault tolerance capabilities were important to have, and users didn’t want yet another system that they’d need to learn how to use, deploy and monitor.
2. Exactly once: Users didn’t want to waste expensive compute cycles on deduplicating their data. And no one likes missing events.
3. No extra dependencies: Especially dependencies on additional datastores. Kafka clients and the S3 SDK libraries should be all you need to get events from Kafka to S3. Simplicity rules, especially in a distributed systems world where simple is often the key to being reliable.
When we considered the existing connectors, we noticed that none of them delivered the reliability and exactly once capabilities we wanted. They treat S3 like it’s another file system—though it isn’t really. For example, S3 lacks file appends, it is eventually consistent, and listing a bucket is often a very slow operation.

Click through for a dive into what Confluent did and how it works.

Comments closed

Spark Memory Management on EMR

Published 2019-04-15 by Kevin Feasel

Karunanithi Shanmugam gives us some tips on memory management for Spark in Amazon’s ElasticMapReduce:

Amazon EMR provides high-level information on how it sets the default values for Spark parameters in the release guide. These values are automatically set in the spark-defaults settings based on the core and task instance types in the cluster.
To use all the resources available in a cluster, set the maximizeResourceAllocation parameter to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets these parameters in the spark-defaults settings. Even with this setting, generally the default numbers are low and the application doesn’t use the full strength of the cluster. For example, the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster.
Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads. Using Amazon EMR release version 4.4.0 and later, dynamic allocation is enabled by default (as described in the Spark documentation).

There’s a lot in here, much of which applies to Spark in general and not just EMR.

Comments closed

Measuring HDFS Cache Performance Gains

Published 2019-04-15 by Kevin Feasel

Guy Shilo tries out the HDFS centralized cache:

HDFS offers a caching mechanism that takes advantage of the Data nodes memory. Blocks are loaded in memory and pinned there so that when a client requests those blocks they can be served directly from memory which is much faster than disk. There are some 3rd party products out there that does the same, but this option comes with Hadoop out of the box.
Hadoop has a special set of commands for managing this cache – the cacheadmin commands.
You must explicitly cache a directory or a file, and in case you cache a directory the caching is not recursive and sub directories will not be cached automatically. The full documentation can be found here. I was curious to see if Cloudera has integrated cache commands into their Cloudera manager, but was surprised to see that their documentation about it is basically a copy of the Apache hadoop guide and you still have to use the command line cacheadmin.

Click through to see how it performed in Guy’s scenario.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31