Category: Hadoop

Spark on Docker on YARN on Cloud

Adam Antal has included all of the layers:

Bringing your own libraries to run a Spark job on a shared YARN cluster can be a huge pain. In the past, you had to install the dependencies independently on each host or use different Python package management software. Nowadays Docker provides a much simpler way of packaging and managing dependencies so users can easily share a cluster without running into each other, or waiting for central IT to install packages on every node. Today, we are excited to announce the preview of Spark on Docker on YARN, available in the CDP DataCenter 1.0 release.

Joking about stack length aside, this looks really useful.
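
If you want a sense of what this looks like from the Spark side, here's a hypothetical sketch using SparkLauncher. The image name, jar path, and main class are placeholders, and the exact set of properties can vary by Hadoop/CDP version, so treat it as a starting point rather than a recipe:

```scala
import org.apache.spark.launcher.SparkLauncher

// Hypothetical submission of a Spark job whose application master and executors
// run inside a Docker image on YARN. Image, jar, and class names are made up.
object SubmitDockerizedJob extends App {
  val image = "registry.example.com/spark-with-deps:1.0"

  val handle = new SparkLauncher()
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setAppResource("hdfs:///apps/my-spark-job.jar")
    .setMainClass("com.example.MyJob")
    .setConf("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
    .setConf("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", image)
    .setConf("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
    .setConf("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", image)
    .startApplication()
}
```

The same properties can be passed directly to spark-submit with --conf if you'd rather not launch programmatically.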

Optimal Kafka Partitioning

Paul Brebner is on a quest:

This blog provides an overview of two fundamental concepts in Apache Kafka: topics and partitions. While developing and scaling our Anomalia Machina application, we have discovered that distributed applications using Kafka and Cassandra clusters require careful tuning to achieve close to linear scalability, and critical variables included the number of Kafka topics and partitions. In this blog, we test that theory and answer questions like “What impact does increasing partitions have on throughput?” and “Is there an optimal number of partitions for a cluster to maximize write throughput?” And more!

Read on for some interesting findings.
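
If you want to run your own version of the experiment, the partition count is just a knob on the topic. Here's a minimal sketch using the Kafka AdminClient (broker address, topic name, and counts are placeholders):

```scala
import java.util.{Collections, Properties}

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

// Create a topic with an explicit partition count, then re-run the benchmark
// with different counts to see where throughput peaks.
object CreatePartitionedTopic extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val admin = AdminClient.create(props)
  try {
    val topic = new NewTopic("load-test-events", 12, 3.toShort) // 12 partitions, replication factor 3
    admin.createTopics(Collections.singleton(topic)).all().get()
  } finally {
    admin.close()
  }
}
```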

Accessing S3 Data from Apache Spark

Divyansh Jain shows how we can connect to AWS’s S3 using Apache Spark:

Now, coming to the actual topic: how to read data from an S3 bucket into Spark. Well, it is not as easy as just adding the Spark core dependencies to your Spark project and using spark.read to read your data from an S3 bucket.

So, to read data from an S3 bucket, below are the steps to be followed:

This isn’t a built-in source, so there is a little bit of work to do, but it’s not that bad.
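
For a rough idea of what's involved, here's a minimal sketch using the s3a connector. It assumes hadoop-aws and a matching AWS SDK are on the classpath, and the bucket, path, and credential handling are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object ReadFromS3 extends App {
  val spark = SparkSession.builder()
    .appName("read-from-s3")
    .master("local[*]")            // placeholder; on a cluster this comes from spark-submit
    .getOrCreate()

  // Credentials can also come from instance profiles or credential providers.
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

  val df = spark.read
    .option("header", "true")
    .csv("s3a://my-bucket/path/to/data/")   // hypothetical bucket and path

  df.show(10)
  spark.stop()
}
```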

Creating a Custom Partitioner for Apache Kafka

Swapnil Gosavi walks us through the process of creating a custom partitioner in Apache Kafka:

Assume we are collecting data from different departments. All the departments are sending data to a single topic named department. I planned five partitions for the topic. But I want two partitions dedicated to a specific department, named IT, and the remaining three partitions for the rest of the departments. How would you achieve this?

You can solve this requirement, and any other type of partitioning need, by implementing a custom partitioner.

This is quite useful when you don’t necessarily want topic explosion but you do want more than what the classic partitioner allows.
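
As a rough illustration of the idea (not the post's exact code), a custom partitioner for the scenario above might look like this. The key format and partition layout are assumptions:

```scala
import java.util

import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

// Route records keyed by department: "IT" goes to partitions 0-1,
// every other department is spread across partitions 2-4.
class DepartmentPartitioner extends Partitioner {
  override def configure(configs: util.Map[String, _]): Unit = ()

  override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                         value: Any, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val department = Option(key).map(_.toString).getOrElse("")
    if (department.equalsIgnoreCase("IT"))
      (util.Arrays.hashCode(valueBytes) & Int.MaxValue) % 2       // partitions 0 and 1
    else
      2 + (util.Arrays.hashCode(keyBytes) & Int.MaxValue) % 3     // partitions 2, 3 and 4
  }

  override def close(): Unit = ()
}
```

You would then register it on the producer with the partitioner.class setting (ProducerConfig.PARTITIONER_CLASS_CONFIG).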

Controlling IoT Devices via Databricks

Saeed Barghi takes us through building an interesting solution:

A few weeks ago I did a talk at AI Bootcamp here in Melbourne on how we can build a serverless solution on Azure that would take us one step closer to powering industrial machines with AI, using the same technology stack that is typically used to deliver IoT analytics use cases. I demoed a solution that received data from an IoT device, in this case a crane, compared the data with the result of a machine learning model that had run and written its predictions to a repository, in this case a CSV file, and then decided if any action needed to be taken on the machine, e.g. slowing the crane down if the wind picks up.

This was a really interesting article.

Using ksqlDB to Read Twitter Data

Robin Moffatt has a quick demonstration of ksqlDB:

I’m going to show you how to use ksqlDB to do the following:

– Configure the live ingest of a stream of data from an external source (in this case, Twitter)
– Filter the stream for certain columns
– Create a new stream populated only by messages that match a given predicate
– Build aggregate materialised views, and use pull queries to directly fetch the state from these

Let’s dive in! As always, you’ll find the full test rig for trying this out yourself on GitHub.

Performance Tuning Load of Partitioned Hive Tables on S3 with Spark

Dmitry Tolpeko walks us through a performance problem in Spark:

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads them into a daily partition of a Hive table. This is a typical job in a data lake; it is quite simple, but in my case it was very slow.

Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet, while the actual Spark job took just 38 minutes to run and the remaining time was spent on loading data into a Hive partition.

Let’s see what the reason for this behavior is and how we can improve the performance.

Read on to see how.
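
The article's exact fix is worth reading in full, but as a general sketch of the workload, here is one common way to write Parquet straight into a daily Hive partition from Spark, using dynamic partition overwrite so only the touched partition is replaced. Table, path, and column names are made up, and the parsing step is omitted:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit

object LoadDailyPartition extends App {
  val spark = SparkSession.builder()
    .appName("gz-to-parquet")
    .enableHiveSupport()
    // Overwrite only the partitions present in this batch, not the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()

  val day = "2020-01-28"
  val raw = spark.read.text(s"s3a://my-bucket/incoming/$day/")  // .gz files are decompressed transparently

  // ... parse the raw lines into proper columns here ...

  raw.withColumn("event_date", lit(day))
    .write
    .mode(SaveMode.Overwrite)
    .insertInto("warehouse.events")   // assumes a table partitioned by event_date, columns in matching order

  spark.stop()
}
```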

Repartitioning and Coalescing in Spark

Divyansh Jain contrasts repartitioning and coalescing in Spark:

What is Coalesce?

The coalesce method reduces the number of partitions in a DataFrame. Coalesce avoids a full shuffle: instead of creating new partitions and redistributing all of the data, it merges data into the existing partitions, which means it can only decrease the number of partitions.

What is Repartitioning?

The repartition method can be used to either increase or decrease the number of partitions in a DataFrame. Repartition is a full-shuffle operation: the whole dataset is taken out of the existing partitions and equally distributed into newly formed partitions.

Read on to learn good reasons to use both.
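
A quick way to see the difference for yourself (local Spark, made-up data):

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesce extends App {
  val spark = SparkSession.builder().appName("partitions").master("local[4]").getOrCreate()

  val df = spark.range(0, 1000000).toDF("id").repartition(8)
  println(df.rdd.getNumPartitions)                 // 8

  println(df.coalesce(4).rdd.getNumPartitions)     // 4  -- partitions merged, no full shuffle
  println(df.coalesce(16).rdd.getNumPartitions)    // 8  -- coalesce cannot increase the count

  println(df.repartition(16).rdd.getNumPartitions) // 16 -- full shuffle, evenly redistributed

  spark.stop()
}
```

Coalesce is the cheaper option when you only need fewer partitions (say, before writing files out); repartition is the one to reach for when you need more partitions or an even distribution.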

Testing Kafka Streams

Jukka Karvanen takes us through TopologyTestDriver in Apache Kafka Streams:

Similarly to testing a single record, it is possible to pipe Value lists into a TestInputTopic. Validating the output can be done record by record like before, or by using the readValueToList() method to see the big picture when validating the whole collection at the same time. For our example topology, when the test pipes in the values, it needs to validate the keys and use the readKeyValueToList() method.

Testing streams of data (regardless of the product) is enough of a mental shift that it’s not an easy problem. For that reason, I welcome any tool which simplifies the process.
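
For a feel of the mechanics, here is a minimal sketch against the TestInputTopic/TestOutputTopic API that ships with recent Kafka versions. The topology itself (an uppercase transform) and the topic names are made up:

```scala
import java.util.{Arrays, Properties}

import org.apache.kafka.common.serialization.{Serdes, StringDeserializer, StringSerializer}
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{StreamsBuilder, StreamsConfig, TopologyTestDriver}

object UppercaseTopologyTest extends App {
  // Toy topology: uppercase every value read from "input" and write it to "output".
  val builder = new StreamsBuilder()
  builder.stream[String, String]("input")
    .mapValues(new ValueMapper[String, String] { def apply(v: String): String = v.toUpperCase })
    .to("output")

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-test")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val driver = new TopologyTestDriver(builder.build(), props)
  try {
    val input  = driver.createInputTopic("input", new StringSerializer, new StringSerializer)
    val output = driver.createOutputTopic("output", new StringDeserializer, new StringDeserializer)

    // Pipe a whole list of values in, then validate the whole output collection at once.
    input.pipeValueList(Arrays.asList("a", "b", "c"))
    assert(output.readValuesToList() == Arrays.asList("A", "B", "C"))
  } finally {
    driver.close()
  }
}
```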

How Spark Runs on YARN with HDFS

Sarfaraz Hussain explains how some of the pieces of the Hadoop ecosystem fit together:

Once it verifies that everything is in place, it will assign a Job ID to the job and then place the job into a Job Queue.

So, in the Job Queue there can be multiple jobs waiting to be processed.

As soon as a job is assigned to the Job Queue, its corresponding information about the job, like the input/output path, the location of the jar, etc., is written into the temp location of HDFS.

Read the whole thing.
