Category: Hadoop

ksqlDB 0.11.0

Published 2020-08-05 by Kevin Feasel

ksqlDB 0.11.0 contains improvements and fixes spanning stranded transient queries, overly aggressive schema compatibility checks, confusing behavior around casting nulls, bad schema management, and more. Here, we highlight a couple of additional, notable improvements.

Also on my backlog was Andy Coates, talking about key columns in ksqlDB:

ksqlDB 0.10 includes significant changes and improvements to how keys are handled. This is part of a series of enhancements that began with support for non-VARCHAR keys and will ultimately end with ksqlDB supporting multiple key columns and multiple key formats, including Avro, JSON, and Protobuf.
Before looking at the syntax changes in version 0.10, let’s first look at what is meant by keys in ksqlDB, the two types of key columns, and how this may differ from other SQL systems.

Read on, as it’s an interesting look at how different data architectures can mean radically different recommendations for key design.

Comments closed

Spark 3.0’s Structured Streaming UI

Published 2020-08-04 by Kevin Feasel

Genamo Yu, et al, show off the Structured Streaming user interface built into Apache Spark 3.0:

When a developer submits a streaming SQL query, it will be listed in the Structured Streaming tab, which includes both active streaming queries and completed streaming queries. Some basic information for streaming queries will be listed in the result table, including query name, status, ID, run ID, submitted time, query duration, last batch ID as well as the aggregate information, like average input rate and average process rate. There are three types of streaming query status, i.e., RUNNING, FINISHED and FAILED. All FINISHED and FAILED queries are listed in the completed streaming query table. The Error column shows the exception details of a failed query.

Read on to learn more.

Comments closed

Custom Windows in Apache Flink

Published 2020-07-31 by Kevin Feasel

Alexander Fedulov walks us through window options with Apache Flink:

In the previous articles of the series, we described how you can achieve flexible stream partitioning based on dynamically-updated configurations (a set of fraud-detection rules) and how you can utilize Flink’s Broadcast mechanism to distribute processing configuration at runtime among the relevant operators.
Following up directly where we left the discussion of the end-to-end solution last time, in this article we will describe how you can use the “Swiss knife” of Flink – the Process Function to create an implementation that is tailor-made to match your streaming business logic requirements. Our discussion will continue in the context of the Fraud Detection engine. We will also demonstrate how you can implement your own custom replacement for time windows for cases where the out-of-the-box windowing available from the DataStream API does not satisfy your requirements. In particular, we will look at the trade-offs that you can make when designing a solution which requires low-latency reactions to individual events.
This article will describe some high-level concepts that can be applied independently, but it is recommended that you review the material in part one and part two of the series as well as checkout the code base in order to make it easier to follow along.

It’s worth giving this a careful read.

Comments closed

Spark Director Reader in Hive

Published 2020-07-30 by Kevin Feasel

Anishek Agarwal, et al, announce a new reader for Hive Warehouse Connector:

Apache Hive supports transactional tables which provide ACID guarantees. There has been a significant amount of work that has gone into hive to make these transactional tables highly performant. Apache Spark provides some capabilities to access hive external tables but it cannot access hive managed tables. To access hive managed tables from spark Hive Warehouse Connector needs to be used.
We are happy to announce Spark Direct Reader mode in Hive Warehouse Connector which can read hive transactional tables directly from the filesystem. This feature has been available from CDP-Public-Cloud-2.0 (7.2.0.0) and CDP-DC-7.1 (7.1.1.0) releases onwards.
Hive Warehouse Connector (HWC) was available to provide access to managed tables in hive from spark, however since this involved communication with LLAP there was an additional hop to get the data and process it in spark vs the ability of spark to directly read the data from FileSystem for External tables. This leads to performance degradation in accessing data from managed tables vs external tables. Additionally a lot of use cases for HWC were associated with ETL jobs where a super user was running these jobs to update data in multiple tables hence authorization was not a strong business need for this case. HWC Spark Direct Reader is an additional mode available in HWC which tries to address the above concerns. This article describes the usage of spark direct reader to consume hive transactional table data in a spark application. It also introduces the methods and APIs to read hive transactional tables into spark dataframes. Finally, it demonstrates the transaction handling and semantics while using this reader.

Click through to learn how it works and see it in action.

Comments closed

Postgres Change Data Capture into Kafka

Published 2020-07-30 by Kevin Feasel

Abhishek Gupta walks us through an example of change data capture to track events:

Change Data Capture (CDC) is a technique used to track row-level changes in database tables in response to create, update and delete operations. Different databases use different techniques to expose these change data events – for example, logical decoding in PostgreSQL, MySQL binary log (binlog) etc. This is a powerful capability, but useful only if there is a way to tap into these event logs and make it available to other services which depend on that information.
Debezium does just that! It is a distributed platform that builds on top of Change Data Capture features available in different databases. It provides a set of Kafka Connect connectors which tap into row-level changes (using CDC) in database table(s) and convert them into event streams. These event streams are sent to Apache Kafka which is a scalable event streaming platform – a perfect fit! Once the change log events are in Kafka, they will be available to all the downstream applications.

Click through for the demo, using Azure components.

Comments closed

Building an End-to-End Streaming App with Flink SQL

Published 2020-07-29 by Kevin Feasel

Jark Wu lays down the guantlet:

Apache Flink 1.11 has released many exciting new features, including many developments in Flink SQL which is evolving at a fast pace. This article takes a closer look at how to quickly build streaming applications with Flink SQL from a practical point of view.
In the following sections, we describe how to integrate Kafka, MySQL, Elasticsearch, and Kibana with Flink SQL to analyze e-commerce user behavior in real-time. All exercises in this blogpost are performed in the Flink SQL CLI, and the entire process uses standard SQL syntax, without a single line of Java/Scala code or IDE installation.

Read on for a demo using only bash and Flink SQL.

Comments closed

Dates and Timestamps in Spark 3.0

Published 2020-07-29 by Kevin Feasel

Maxim Gekk, et al, look at the different date and time data types in Apache Spark 3.0:

The definition of a Date is very simple: It’s a combination of the year, month and day fields, like (year=2012, month=12, day=31). However, the values of the year, month and day fields have constraints, so that the date value is a valid day in the real world. For example, the value of month must be from 1 to 12, the value of day must be from 1 to 28/29/30/31 (depending on the year and month), and so on.
These constraints are defined by one of many possible calendars. Some of them are only used in specific regions, like the Lunar calendar. Some of them are only used in history, like the Julian calendar. At this point, the Gregorian calendar is the de facto international standard and is used almost everywhere in the world for civil purposes. It was introduced in 1582 and is extended to support dates before 1582 as well. This extended calendar is called the Proleptic Gregorian calendar.
Starting from version 3.0, Spark uses the Proleptic Gregorian calendar, which is already being used by other data systems like pandas, R and Apache Arrow. Before Spark 3.0, it used a combination of the Julian and Gregorian calendar: For dates before 1582, the Julian calendar was used, for dates after 1582 the Gregorian calendar was used. This is inherited from the legacy java.sql.Date API, which was superseded in Java 8 by java.time.LocalDate, which uses the Proleptic Gregorian calendar as well.

Even in this three-paragraph snippet, you can already get a feeling for how complex working with dates can be. Then throw in the complexities of time and you get a detailed post full of good information.

Comments closed

Transforming JSON to CSV: ADF vs Databricks

Published 2020-07-29 by Kevin Feasel

Rayis Imayev compares two methods of transforming a JSON-structured data set into a CSV:

There is a well known and broadly advertised message from Microsoft that Azure Data Factory (ADF) is a code-free environment to help you to create your data integration solutions – https://azure.microsoft.com/en-us/resources/videos/microsoft-azure-data-factory-code-free-cloud-data-integration-at-scale/. I agree and support this approach of using drag and drop visual UI to build and automate data pipelines without writing code. However, I’m also interested to try if I can recreate certain ADF operations by writing code, just out of my curiosity.

Rayis includes a link to the Azure Data Factory step-by-step demonstration and then kicks it up a notch with Databricks. Read on to see how the two compare.

Comments closed

Exactly-Once Semantics in Kafka

Published 2020-07-27 by Kevin Feasel

Boyang Chen and Bob Barrett look at some changes to exactly-once semantics in Apache Kafka:

When using EOS, the producer and broker both have logic to determine whether it is safe for a producer to continue to send data without violating the exactly-once guarantees. Prior to Kafka 2.5, if either the producer or broker was ever not able to make this determination, the producer would enter a fatal error state. The only way to continue processing was to close the producer and create a new one. This process is generally very disruptive to client applications. For example, if a producer fails in Kafka Streams, then the associated task needs to be migrated, which causes a rebalance of the full workload. This results in throughput drop until the rebalance is complete.
To address this issue, KIP-360 added a mechanism for producers to automatically recover when they encounter these cases and continue processing. To better understand how it works, the following describes some of the situations that can cause fatal errors.

There have been several improvements to the process. Though to be honest, when I hear someone mention exactly-once in a distributed system, it sets off my spidey senses.

Comments closed

Apache Kafka in the Gaming Industry

Published 2020-07-24 by Kevin Feasel

Kai Wähner walks us through a few use cases for Apache Kafka in online gaming:

This blog post explores how event streaming with Apache Kafka provides a scalable, reliable, and efficient infrastructure to make gamers happy and Gaming companies successful. Various use cases and architectures in the gaming industry are discussed, including online and mobile games, betting, gambling, and video streaming.
Learn about:
– Real-time analytics and data correlation of game telemetry
– Monetization network for real-time advertising and in-app purchases
– Payment engine for betting
– Detection of financial fraud and cheating
– Chat function in games and cross-games
– Monitor the results of live operations like weekend events or limited time offers
– Real-time analytics on metadata and chat data for marketing campaigns

It’s an interesting overview of where this platform fits in the industry.

Comments closed