Category: Hadoop

When we do a transformation on any RDD, it gives us a new RDD. But it does not start the execution of those transformations. The execution is performed only when an action is performed on the new RDD and gives us a final result.

So once you perform any action on an RDD, Spark context gives your program to the driver.

The driver creates the DAG (directed acyclic graph) or execution plan (job) for your program. Once the DAG is created, the driver divides this DAG into a number of stages. These stages are then divided into smaller tasks and all the tasks are given to the executors for execution.

Click through for more details.

Comments closed

Five Books For Learning Kafka

Published 2018-04-23 by Kevin Feasel

Data Flair has a guide to five books to help you learn Apache Kafka:

The book “Kafka: The Definitive Guide” is written by engineers from Confluent andLinkedIn who are responsible for developing Kafka.

They explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. It contains detailed examples as well. Through this including the replication protocol, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, the controller, and the storage layer. However, even if you are new to Apache Kafka as the application architect, developer, or production engineer, this practical guide shows you how to use this open source streaming platform to handle real-time data feeds.

I haven’t read any of them yet, but a couple look interesting enough to add to my to-read list.

Comments closed

Push-Based Alerting With Kafka Streams

Published 2018-04-16 by Kevin Feasel

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams:

Now we can query from it and show the aggregate window timestamp alongside the result:

ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \
HOST, INVALID_LOGIN_COUNT \
FROM INVALID_USERS_LOGINS_PER_HOST;
1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1
1521646620000 | 2018-03-21 15:37:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:18:00 | rpi-03 | 1
1521649260000 | 2018-03-21 16:21:00 | rpi-03 | 4
1521649320000 | 2018-03-21 16:22:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:38:00 | rpi-03 | 2

In the above query I’m displaying the aggregate window start time, ROWTIME (which is epoch), and converting it also to a display string, using TIMESTAMPTOSTRING. We can use this to easily query the stream for a given window of interest. For example, for the window beginning at 2018-03-21 16:21:00 we can see there were four invalid user login attempts. We can easily check the source data for this, using the ROWTIME in the above output for the window (16:21 – 16:22) as the bounds for the predicate:

It’s a very interesting use case.

Comments closed

Spark Architecture: The Spark Streaming Receiver

Published 2018-04-13 by Kevin Feasel

Oleksii Yermolenko gives us an overview of the Receiver object in Spark Streaming:

The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk storage for persistence operations. But this negatively impacts the overall application’s performance.

All incoming data is first aggregated within receiver into chunks called Blocks. After preconfigured interval of time called batchInterval Spark does logical aggregation of these blocks into another entity called Batch. Batch has links to all blocks formed by receivers and uses this information for generation of RDD. This is the main Spark’s entity which is used by the engine for the operations upon the data. Normally RDD would consist of a number of partitions where each partition would reference the block generated by the receiver on the start stage. Streaming application can have lots of receivers located at different physical nodes, so the actual data would be distributed across the cluster from the start. Batch interval is global for the whole application and is defined on the stage of creation of Streaming Context. Block generation interval is a receiver based property which could be defined through the configuration of spark.streaming.blockInterval property. By default blocks would be generated every 200ms but you can tune this property according to the nature of your data.

Read the whole thing, which includes some tips on design.

Comments closed

Hadoop 3.1 Released

Published 2018-04-13 by Kevin Feasel

Wangda Tan and Vinod Kumar Vavilapalli have a post on Hadoop 3.1.0:

This release is *not* yet ready for production use. Critical issues are being ironed out via testing and downstream adoption. Production users should wait for a 3.1.1/3.1.2 release.

The Hadoop community fixed 768 JIRAs (https://s.apache.org/apache-hadoop-3.1.0-all-tickets) in total as part of the 3.1.0 release. Of these fixes:
– 141 in Hadoop Common
– 266 in HDFS
– 329 in YARN
– 32 in MapReduce
Apache Hadoop 3.1.0 contains a number of significant features and enhancements.

YARN supporting GPUs and FPGAs is very interesting.

Comments closed

Using Hive: Tiered Or Decoupled Storage?

Published 2018-04-06 by Kevin Feasel

Brandon Wilson and Gopal Vijayaraghavan compare a series of Hive queries against EC2 instances with persistent storage and S3:

There are advantages and disadvantages to each approach. The tiered approach has the most flexibility for an operator to tune the performance of the cluster while trading off size of the hot data zone for better performance or smaller resource footprint. The downside of this approach is that, having data on HDFS, resizing the cluster is a slow and tedious process due to HDFS needing to be rebalanced to achieve performance and fault-tolerance expectations. Thus this architecture is generally only used for statically sized clusters with steady, well-known workloads.

The decoupled architecture, on the other hand, enables maximum flexibility for cluster growth and reduction. For example, a cluster could run at 100 nodes during the day to support analytics and reporting and then shrink to 24 nodes overnight to support smaller ETL workloads. Historically, the disadvantage to decoupling is that cloud storage is not local and therefore could drastically affect runtime of the analytical workloads (hence the hybrid approach of tiered storage). However, the advent of LLAP in Hive 2.0 has limited this overhead making the approach far more attractive. The dynamic cache within LLAP also means that we do not need to statically define what data is hot. It can be inferred at query time (i.e., as users access the data, that data will become hot). We will look closer at how LLAP closes the runtime gap in the next section.

Historically, the argument was that you should avoid S3 in part because it’s relatively flaky compared to disks (in terms of performance and in its eventual consistency model). It looks like that’s no longer a pressing concern.

Comments closed

Filtering On Kafka Streams

Published 2018-04-06 by Kevin Feasel

Robin Moffatt has a new series showing how to use Kafka Streams for dealing with syslog data:

syslog is one of those ubiquitous standards on which much of modern computing runs. Built into operating systems such as Linux, it’s also commonplace in networking and IoT devices like IP cameras. It provides a way for streaming log messages, along with metadata such as the source host, severity of the message, and so on. Sometimes the target is simply a local logfile, but more often it’s a centralised syslog server which in turn may log or process the messages further.

As a high-performance, distributed streaming platform, Apache Kafka® is a great tool for centralised ingestion of syslog data. Since Apache Kafka also persists data and supports native stream processing we don’t need to land it elsewhere before we can utilise the data. You can stream syslog data into Kafka in a variety of ways, including through Kafka Connect for which there is a dedicated syslog plugin.

In this post, we’re going to see how KSQL can be used to process syslog messages as they arrive in realtime.

Check it out.

Comments closed

Working With Jupyter Notebooks And Airflow On Hadoop

Published 2018-04-04 by Kevin Feasel

Mark Litwintschik shows us an interesting demonstration of running Jupyter Notebooks as well as automating tasks with Airflow on Hadoop:

The following will create a ~/airflow folder, setup a SQLite 3 database used to store Airflow’s state and configuration set via the Web UI, upgrade the configuration schema and create a folder for the Python-based jobs code Airflow will run.
$ cd ~
$ airflow initdb
$ airflow upgradedb
$ mkdir -p ~/airflow/dags
By default Presto’s Web UI, Spark’s Web UI and Airflow’s Web UI all use TCP port 8080. If you launch Presto after Spark then Presto will fail to start. If you start Spark after Presto then Presto will launch on 8080 and the Spark Master Server will take 8081 and keep trying higher ports until it finds one that is free. Spark will then pick an even higher port number for the Spark Worker Web UI. This overlap normally isn’t an issue as in a production setting these services would normally live on separate machines.

Read the whole thing.

Comments closed

HDFS Erasure Coding

Published 2018-04-02 by Kevin Feasel

Ayush Tiwari explains how HDFS Erasure Coding lets us keep the same data durability while improving storage efficiency:

In this particular example, when you look at the codeword, actual data cells are 6(blue cells) & 3(red cells) are the parity cells which are simply obtained by multiplied our data cells to generation matrix.

Storage failure can be recovered by the multiplying inverse of generator matrix with the extended codewords as long as ‘k’ out of ‘k+m’ cells are available.

Therefore, here Data Durability is 3 as it can handle 2 simultaneous failure, Storage Efficiency is 67% (as we are using only 1 extra block i.e. 6/9) & only we need to store half number of cells as compared to original number of cells, we can conclude that we also have only 50% overhead in it.

It’s a good explanation of one of the biggest improvements to HDFS over the past several years.

Comments closed

Avro Schema Compatibility In Kafka

Published 2018-03-27 by Kevin Feasel

Neha Bhardwaj walks us through an error in Kafka:

You might have come across a similar exception while working with AVRO schemas.

Kafka throws this exception due to a compatibility issue since the current schema is not compatible with the earlier schema registered on this topic.

You can check the current schema(s) on the topic using:
curl -X GET <a href=”http://localhost:8081/subjects//versions/”>http://localhost:8081/subjects//versions/

Read on to understand what this error means and how you can fix it if you see it.

Comments closed