Press "Enter" to skip to content

Category: Hadoop

Using Kafka To Go From Batch To Stream

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka.  Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

  • When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.

  • Every 24 hours, a batch job (could be Spark) would take all the new reviews and apply a spam filter to filter fraudulent reviews from legitimate ones.

  • New valid reviews are published to another database table (which contains all the historic valid reviews).

  • Another batch job or a SQL query computes new stats for courses. Stats include all-time average rating, all-time count of reviews, 90 days average rating, and 90 days count of reviews.

  • The website displays these metrics through a REST API when the user navigates a website.

Part two finishes up the story:

In the previous section, we learned about the early concepts of Kafka Streams, to take a stream and split it in two based on a spam evaluation function. Now, we need to perform some stateful computations such as aggregations, windowing in order to compute statistics on our stream of reviews.

Thankfully we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that gets new events every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.

This is a nice introduction to Kafka Streams using a realistic example.

Comments closed

Analytics Platform System V7 Released

Microsoft has released a new version of their Analytics Platform System:

Microsoft is pleased to announce that the Analytics Platform System (APS) appliance update 7 (AU7) is now generally available. APS is Microsoft’s scale-out Massively Parallel Processing (MPP) system based on SQL Server for data warehouse specific workloads on-premises.

Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on appliance update 6 (APS 2016) release as a foundation. Upgrading to APS appliance update 6 is a prerequisite to upgrade to appliance update 7.

This is useful for the six customers which can afford the licensing for APS.

Comments closed

Hortonworks Released HDP 2.6.5

Mitra Mohsenian and Roni Fontaine announce Hortonworks Data Platform 2.6.5:

We are excited to make several product announcements including the general availability of :

  • HDP 2.6.5
    • Apache Kafka 1.0
    • Apache Spark 2.3
  • Apache Ambari 2.6.2
  • SmartSense 1.4.5

HDP 2.6.5 is an important release for Hortonworks given it is the first release that enables Apache Kafka 1.0 and Apache Spark 2.3

It looks like Ubuntu 18.04 isn’t supported just yet, but I imagine that’s coming.

Comments closed

Using Burrow To Monitor Kafka

Gaurav Garg shows us how to install and configure Burrow, a tool for monitoring Apache Kafka clusters:

According to Burrow’s GitHub page: Burrow is a Kafka monitoring tool that keeps track of consumer lag. It does not provide any user interface to monitor. It provides several HTTP request endpoints to get information about Kafka clusters and consumer groups. Burrow also has a notifier system that can notify you (via email or at an HTTP endpoint) if a consumer group has met certain criteria.

Burrow is designed in a modular way that separates the work done into multiple subsystems. Below are the subsystems in Burrow.

  • Clusters: This component periodically updates the topic list and the last committed offset for each partition.
  • Consumers: This component fetches the information about consumer groups like consumer lag, etc.
  • Storage: This component stores all the information in a system.
  • Evaluator: Gets information from storage and checks the status of consumer groups, like if it’s consuming messages at a slow rate using consumer lag evaluation rules.
  • Notifier: Requests the status of a consumer group and sends a notification if certain criteria are met via email, etc.
  • HTTP server: Provides HTTP endpoints to fetch information about a cluster and consumers.

This looks like a good tool to hook into an existing monitoring solution.

Comments closed

A SQL Client For Apache Flink

Alex Woodie points out that Apache Flink now has a SQL client built in:

Apache Flink has contained SQL functionality since Flink version 1.1, which introduced a SQL API based on Apache Calcite and a table API, too. While the combined SQL and Table API today provides valuable ways for developers to apply well-understood relational data and SQL constructs to the world of stream data processing, its usefulness is somewhat limited.

For starters, only Scala and Java experts can avail themselves of API, according to the description of the new SQL client, which is codenamed FLIP-24. What’s more, any table program that was written with the SQL and Table API had to be packaged with Apache Maven, a Java-based project management tool, and submitted to the Flink cluster before running.

With the launch of the SQL CLI Client in Flink version 1.5, the Flink community is taking its support for SQL in a new direction. According to the FLIP-24 project page, providing an interactive shell will not only make Flink accessible to non-programmers, including data scientists, but it will also eliminate the need for a full IDE to program Flink apps. With millions of SQL-loving data analysts out there, the benefits could certainly be vast.

Good stuff.  Feasel’s Law in action.

Comments closed

Spark: DataFrame To RDD For Data Cleansing

Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames:

A problem can arise when one of the inner fields of the json, has undesired non-json values in some of the records.
For instance, an inner field might contains HTTP errors, that would be interpreted as a string, rather than as a struct.
As a result, our schema would look like:
root
 |– headers: struct (nullable = true)
 |    |– items: array (nullable = true)
 |    |    |– element: struct (containsNull = true)
 |– requestBody: string (nullable = true)
Instead of
root
 |– headers: struct (nullable = true)
 |    |– items: array (nullable = true)
 |    |    |– element: struct (containsNull = true)
 |– requestBody: struct (nullable = true)
 |    |– items: array (nullable = true)
 |    |    |– element: struct (containsNull = true)
When trying to explode a “string” type, we will get a miss type error:
org.apache.spark.sql.AnalysisException: Can’t extract value from requestBody#10

Click through to see how to handle this scenario cleanly.

Comments closed

Visualization Over Kafka And KSQL

Shant Hovsepian shows off a data visualization tool which can read Kafka Streams data:

KSQL is a game-changer not only for application developers but also for non-technical business users. How? The SQL interface opens up access to Kafka data to analytics platforms based on SQL. Business analysts who are accustomed to non-coding, drag-and-drop interfaces can now apply their analytical skills to Kafka. So instead of continually building new analytics outputs due to evolving business requirements, IT teams can hand a comprehensive analytics interface directly to the business analysts. Analysts get a self-service environment where they can independently build dashboards and applications.

Arcadia Data is a Confluent partner that is leading the charge for integrating visual analytics and BI technology directly with KSQL. We’ve been working to combine our existing analytics stack with KSQL to provide a platform that requires no complicated new skills for your analysts to visualize streaming data. Just as they will create semantic layers, build dashboards, and deploy analytical applications on batch data, they can now do the same on streaming data. Real-time analytics and visualizations for business users have largely been a misnomer until now. For example, some architectures enabled visualizations for end users by staging Kafka data into a separate data store, which added latency. KSQL removes that latency to let business users see the most recent data directly in Kafka and react immediately.

Click through for a couple repos and demos.

Comments closed

Installing Kafka On Ubuntu

Gaurav Garg has an article on installing Apache Kafka on a fresh Ubuntu installation:

For beginners, the default configurations of the Kafka broker are good enough, but for production-level setup, one must understand each configuration. I am going to explain some of these configurations.

  • broker.id: The ID of the broker instance in a cluster.
  • zookeeper.connect: The ZooKeeper address (can list multiple addresses comma-separated for the ZooKeeper cluster). Example: localhost:2181,localhost:2182.
  • zookeeper.connection.timeout.ms: Time to wait before going down if, for some reason, the broker is not able to connect.

Read the whole thing.

Comments closed

Serializing Data In Scala

Akhil Vijayan has a two-parter on serializing data in Scala.  In the first post, he looks at uPickle:

uPickle serializer is a lightweight Json library for scala. uPickle is built on top of uJson which are used for easy manipulation of json without the need of converting it to a scala case class. We can even use uJson as standalone too. In this blog, I will focus only on uPickle library.

Note: uPickle does not support Scala 2.10; only 2.11 and 2.12 are supported

uPickle (pronounced micro-pickle) is a lightweight JSON serialization library which is fast than many other json serializers. I will talk more about the comparison of different serializers in my next blog. This blog will cover all the basic stuff about uPickle.

Then, he follows up with a comparison to other serializers:

In my previous blog, I talked about how uPickle works. Now I will be comparing it will many other json serializers where I will be serializing and deserializing scala case class.

Before that let me discuss all the json serializers that I have used in my comparison. I will compare uPickle with PlayJson, Circe and Argonaut.

Check it out.

Comments closed