Stream Processing With Kafka And Spark

Satendra Kumar has a slide deck looking at combining Spark Streaming with Kafka:

Knoldus organized a Meetup on Friday, 9 September 2016. Topics which were covered in this meetup are:

  1. Overview of Spark Streaming.

  2. Fault-tolerance Semantics & Performance Tuning.

  3. Spark Streaming Integration with  Kafka.

Click through for the slide deck.  Combine that with the AWS blog post on the same topic and you get a pretty good intro.

Kafka Plus Spark Streaming

Prasad Alle shows how to integrate Kafka with Spark Streaming on AWS:

Stream processing walkthrough

The entire pattern can be implemented in a few simple steps:

  1. Set up Kafka on AWS.

  2. Spin up an EMR 5.0 cluster with Hadoop, Hive, and Spark.

  3. Create a Kafka topic.

  4. Run the Spark Streaming app to process clickstream events.

  5. Use the Kafka producer app to publish clickstream events into Kafka topic.

  6. Explore clickstream events data with SparkSQL.

This is a pretty easy-to-follow walkthrough with some good tips at the end.

Whither Hadoop?

Kevin Feasel



Kashif Saiyed looks at recent trends in Hadoop:

  • 2016 and beyond – this is an interesting timing for “Big Data”. Cloudera’s valuation has dropped by 38%. Hortonwork’s valuation has dropped by almost 40%, forcing them to cut the professional services department. Pivotal has abandoned its Hadoop distribution, going to market jointly with Hortonworks. What happened and why? I think the main driver of this decline is enterprise customers that started adoption of technology in 2014-2015. After a couple of years playing around with “Big Data” they has finally understood that Hadoop is only an instrument for solving specific problems, it is not a turnkey solution to take over your competitors by leveraging the holy power of “Big Data”. Moreover, you don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters – Hadoop technology just doesn’t shine at this scale. All of this has caused a big wave of priorities re-evaluation by enterprises, shrinking their investments into “Big Data” and focusing on solving specific business problems.

There are some good points around product saturation and a general skills shortage, but even if you look at it pessimistically, this is a product with 30% market penetration, and which is currently making the move from being a large batch data processing product to a streaming + batch processing product.

Converting CSV To Parquet

Praveen Sripati shows how to use Spark Dataframes to convert a CSV file into a Parquet format:

In the previous blog, we looked at on converting the CSV format into Parquet format using Hive. It was a matter of creating a regular table, map it to the CSV data and finally move the data from the regular table to the Parquet table using the Insert Overwrite syntax. In this blog we will look at how to do the same thing with Spark using the dataframes feature.

Most of the code is basic setup; writing to Parquet is really a one-liner.

Comparing File Formats In Hadoop

Andrew Peterson points out performance comparisons for various Hadoop file formats:

According to a posting on the Hortonworks site, both the compression and the performance for ORC files are vastly superior to both plain text Hive tables and RCfile tables. For compression, ORC files are listed as 78% smaller than plain text files. And for performance, ORC files support predicate pushdown and improved indexing that can result in a 44x (4,400%) improvement. Needless to say, for Hive, ORC files will gain in popularity.  (you can read the posting here: ORC File in HDP 2: Better Compression, Better Performance).

There are several considerations around picking the correct file format, and it’s probably best to experiment with them in your specific environment.

Integrating Spark With Hive

Rahul Kumar wants to write Scala code to access the Hive datastore:

Hello geeks, we have discussed how to start programming with Spark in Scala. In this blog, we will discuss how we can use Hive with Spark 2.0.

When you start to work with Hive, you need HiveContext (inherits SqlContext), core-site.xml,hdfs-site.xml, and hive-site.xml for Spark. In case you don’t configure hive-site.xml then the context automatically creates metastore_db in the current directory and creates warehousedirectory indicated by HiveConf(which defaults user/hive/warehouse).

Rahul has made his demo code available on GitHub.

Apache Kudu

Kevin Feasel



Greg Rahn discusses Apache Kudu:

At a high level, Kudu is a new storage manager that enables durable single-record inserts, updates, and deletes, as well as fast and efficient columnar scans due to its in-memory row format and on-disk columnar format.  This architecture makes Kudu very attractive for data that arrives as a single record at a time or that may need to be modified at a later time.

Today, many users try to solve this challenge via a Lambda architecture, which presents inherent challenges by requiring different code bases and storage for the necessary batch and real-time components. Using Kudu and Impala together completely avoids this problematic complexity by easily and immediately making data inserted into Kudu available for querying and analytics via Impala. (For more technical details on how Impala and Kudu work together for analytical workloads, see this post.)

I’d jokingly say “Someday, somebody’s going to reinvent the relational database inside of Hadoop.”  But it seems like that’s less of a joke than a medium-term prediction.

HBase Transactions

Kevin Feasel



George Leopold describes Omid:

The transaction manager utilizes a lock-free approach to support multiple clients and relies on a centralized conflict detection component to resolve write-set collisions among concurrent transactions. Developers added that Omid requires no modifications to the underlying HBase key-value data store.

It also features a simplified API that mimics transaction manager APIs in relational databases. Client and server configuration processes also were simplified to help both application developers and system administrators.

Filing this one under the “What’s old is new again” category.

Comparing Impala To Redshift

Mostafa Mokhtar, et al, have a comparison of Apache Impala to Amazon Redshift:

For this analysis, we used TPC-DS on a 3TB dataset and selected 70 out of 99 the queries that run without any modifications or uses variants on both Redshift and Impala. We wanted to use a larger dataset (similar to what we’ve used in previous benchmarks), but due to Redshift’s data load times, we had to reduce the data size. (Note: This benchmark is derived from the TPC-DS benchmark and, as such, is not directly comparable to published TPC-DS results.)

This is coming from one of the two vendors, so take it with however many grains of salt you’d like.

HDP 2.5 Sandbox

Kevin Feasel



The Hortonworks Data Platform 2.5 sandbox is now available:


1) Download HDP Sandbox as a VM image(VMware and Virtualbox or Docker
2) Setup and Start the VM image.
3) Try a Sandbox tutorial, check out the list of free tutorials, or jump directly into an Hello to HDP hands-on tutorial.
4) Need more help? Visit the Hortonworks Community Connection(HCC) and interact directly with the community and our development team.

It looks like they’ve bumped up the RAM requirements to 8 GB and have added new tutorials.


July 2019
« Jun