Press "Enter" to skip to content

Category: Hadoop

Working with Hive in HDInsight

Brad Llewellyn takes us through building an HDInsight cluster and writing Hive queries against it:

Hive is a “SQL on Hadoop” technology that combines the scalable processing framework of the ecosystem with the coding simplicity of SQL.  Hive is very useful for performant batch processing on relational data, as it leverages all of the skills that most organizations already possess.  Hive LLAP (Low Latency Analytical Processing or Live Long and Process) is an extension of Hive that is designed to handle low latency queries over massive amounts of EXTERNAL data.  One of the coolest things about the Hadoop SQL ecosystem is that the technologies allow us to create SQL tables directly on top of structured and semi-structured data without having to import it into a proprietary format.  That’s exactly what we’re going to do in this post.  You can read more about Hive here and here and Hive LLAP here.

We understand that SQL queries don’t typically constitute traditional data science functionality.  However, the Hadoop ecosystem has a number of unique and interesting data science features that we can explore.  Hive happens to be one of the best starting points on that journey.

Click through for the screenshot-laden demonstration.
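
To give a flavor of what “SQL tables directly on top of files” looks like, here is a minimal sketch of an external table definition. It goes through Spark’s Hive support rather than the HDInsight Hive tooling the post uses, and the path and column names are placeholders:

import org.apache.spark.sql.SparkSession

// Sketch: define a Hive external table over delimited files already sitting in
// storage, then query it with SQL. Nothing gets imported or converted; the
// table is just metadata layered over the existing files.
val spark = SparkSession.builder()
  .appName("external-table-sketch")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS flights (
    flight_date STRING,
    carrier     STRING,
    dep_delay   INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/flights'   -- placeholder path
""")

spark.sql("SELECT carrier, AVG(dep_delay) FROM flights GROUP BY carrier").show()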

Monitoring Kafka Streams with JMX Metrics

Rishi Khandelwal provides a reference architecture for monitoring a Kafka Streams application using JMX Metrics and pushing the results into Graphite:

The service (application) exposes JMX metrics on a port, which are captured by the Jolokia Java agent. Jolokia then exposes those metrics on a port that is easily accessible through a REST endpoint (we call it the Jolokia URL). We then have JMX2Graphite, which polls the metrics from the Jolokia URL and pushes them to Graphite. Finally, Grafana reads the Graphite metrics and creates a beautiful dashboard for us, along with the alerts.

That is how the proposed monitoring solution works. Now let’s discuss its components.

There’s a bit of code/configuration in here as well, so check it out.
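
To make the first link in that chain a bit more concrete, here is a small sketch in Scala using only the standard JMX API. It lists the kafka.streams MBeans a Kafka Streams application registers in its own JVM; these are the same metrics a Jolokia agent would expose over HTTP for JMX2Graphite to poll:

import java.lang.management.ManagementFactory
import javax.management.ObjectName
import scala.jdk.CollectionConverters._

// Sketch: enumerate the Kafka Streams metrics registered with the local MBean
// server. Run this inside the same JVM as the streams application.
val mbeanServer = ManagementFactory.getPlatformMBeanServer
val streamsMetrics = mbeanServer
  .queryNames(new ObjectName("kafka.streams:*"), null)
  .asScala

streamsMetrics.foreach(name => println(name.getCanonicalName))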

Sample Spark-Submit Config Settings

Leela Prasad shares a few sample configuration settings for Spark-Submit jobs:

Before going further, let’s discuss the parameters below, which I have specified for a job.
spark.executor.cores=5 
spark.executor.instances=3
spark.executor.memory=20g
spark.driver.memory=5g 
spark.dynamicAllocation.enabled=true 
spark.dynamicAllocation.maxExecutors=10 

Click through to see what these do and why Leela chose these settings. The Spark documentation has the full list of settings but it’s good to hear explanations from practitioners.
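
As a rough companion sketch, the same settings can also be supplied programmatically when building the SparkSession; only the driver memory normally has to stay on the spark-submit command line, since the driver JVM is already running by the time this code executes:

import org.apache.spark.sql.SparkSession

// Sketch: the spark-submit settings above expressed as builder configs.
// spark.driver.memory is omitted here; pass it to spark-submit instead.
val spark = SparkSession.builder()
  .appName("config-sketch")
  .config("spark.executor.cores", "5")
  .config("spark.executor.instances", "3")
  .config("spark.executor.memory", "20g")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .getOrCreate()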

Datasets In Spark

Ayush Hooda explains the differences between DataFrames and Datasets in Apache Spark:

The Datasets API provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. You can define Dataset objects and then manipulate them using functional transformations (map, flatMap, filter, and so on) similar to an RDD. The benefits are that, unlike RDDs, these transformations are now applied on a structured and strongly typed distributed collection that allows Spark to leverage Spark SQL’s execution engine for optimization.

Read on for more details and a few examples of how to work with DataFrames and Datasets.
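
As a quick taste before you click through, here is a minimal sketch of a typed Dataset and a functional transformation; the case class and values are made up for illustration:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
import spark.implicits._

// A strongly typed Dataset[Person]; the lambda in filter is checked at compile time.
val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()
val adults = people.filter(_.age >= 21)
adults.show()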

Creating a UDF in KSQL

Saeed Barghi shows us how we can create user-defined functions in Kafka Streams’ KSQL language:

The UDF I want to implement here is DATEADD. If you’re familiar with SQL, you have definitely used it: it takes in a date, adds or subtracts a specified number to or from a specific part of the datetime, and spits out a new datetime.

To implement a User Defined Function (UDF or UDAF), you need to code your function in Java and then import the jar file into your KSQL server. You can read about the full process here; I’ll point out a couple of things that I believe you should pay attention to:

I think I prefer Spark’s method for UDFs in Spark SQL: create it in Scala and register the function.
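
For comparison, here is a rough sketch of the Spark approach; the function name and logic are made up, but the registration pattern is the standard spark.udf.register call:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()

// Register a DATEADD-style UDF: shift a timestamp by a number of days.
spark.udf.register("dateadd_days", (ts: Timestamp, days: Int) =>
  new Timestamp(ts.getTime + days.toLong * 24L * 60L * 60L * 1000L))

spark.sql("SELECT dateadd_days(current_timestamp(), 7) AS next_week").show(false)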

Kafka Connect Neo4j Sink

Michael Hunger and Andrea Santurbano announce a new sink for Kafka Connect which handles writing to Neo4j:

We’ve been using the work we did for the Kafka sink – Neo4j extension and have made it available via remote connections over our binary Bolt protocol. So you can stream your events from Apache Kafka® directly into Neo4j to create, update and enrich your graph structures. Then it is really up to you what you want to do with the event data.

The events can come from frontend systems, API notifications, other databases or streaming systems like Apache Spark™ and Apache Flink®.

For details on the format and internals, please see our previous article or the documentation for the Neo4j sink.

Read on for details and demos.

Improving Spark Auto-Scaling On ElasticMapReduce

Udit Mehrotra explains some of the ways Amazon ElasticMapReduce reduces the pain of node loss in Spark jobs:

The Automatic Scaling feature in Amazon EMR lets customers dynamically scale clusters in and out, based on cluster usage or other job-related metrics. These features help you use resources efficiently, but they can also cause EC2 instances to shut down in the middle of a running job. This could result in the loss of computation and data, which can affect the stability of the job or result in duplicate work through recomputing.

To gracefully shut down nodes without affecting running jobs, Amazon EMR uses Apache Hadoop‘s decommissioning mechanism, which the Amazon EMR team developed and contributed back to the community. This works well for most Hadoop workloads, but not so much for Apache Spark. Spark currently faces various shortcomings while dealing with node loss, which can cause jobs to get stuck trying to recover and recompute lost tasks and data, and in some cases eventually crash.

Auto-scaling doesn’t always mean scaling up.

SparkSession and its Component Contexts

The folks at Hadoop in Real World explain the difference between SparkSession, SparkContext, SQLContext, and HiveContext:

SQLContext is your gateway to SparkSQL. Here is how you create a SQLContext using the SparkContext.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Once you have the SQLContext, you can start working with DataFrames, Datasets, etc.

Knowing the right entry point is important.
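
For contrast, here is a minimal sketch of the SparkSession route (Spark 2.x and later), which puts the SparkContext, SQLContext, and Hive functionality behind a single entry point:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("entry-point-sketch")
  .enableHiveSupport()              // optional: HiveContext-style behavior
  .getOrCreate()

val sc = spark.sparkContext         // the underlying SparkContext
val sqlContext = spark.sqlContext   // legacy SQLContext, kept for compatibility

spark.sql("SELECT 1").show()        // DataFrame/Dataset work goes through spark directly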

Spark RDDs and DataFrames

Ayush Hooda explains the difference between RDDs and DataFrames:

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.

One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language, the results will be returned as a Dataset/DataFrame.

Before exploring these APIs, let’s understand the need for these APIs.

I like the piece about RDDs being better at explaining the how than the what.
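
As a small illustration of that point, here is a hedged sketch of the same word count written twice: once against the RDD API, spelling out how to compute it, and once against the DataFrame API, declaring what we want and leaving the plan to the optimizer. The data and column names are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("rdd-vs-df-sketch").getOrCreate()
import spark.implicits._

val lines = Seq("spark makes rdds", "spark makes dataframes")

// RDD version: we spell out *how* -- split, map to pairs, reduce by key.
val rddCounts = spark.sparkContext
  .parallelize(lines)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// DataFrame version: we declare *what* -- Spark SQL's optimizer plans the rest.
val dfCounts = lines.toDF("line")
  .select(explode(split(col("line"), " ")).as("word"))
  .groupBy("word")
  .count()

rddCounts.collect().foreach(println)
dfCounts.show()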
