Category: Hadoop

Apache Flink 1.16 Released

Godfrey He makes an announcement:

To reduce the cost of migrating Hive to Flink, we introduce the HiveServer2 endpoint and Hive syntax improvements in this version:

The HiveServer2 Endpoint allows users to interact with the SQL Gateway via Hive JDBC/Beeline and brings Flink into the Hive ecosystem (DBeaver, Apache Superset, Apache DolphinScheduler, and Apache Zeppelin). When users connect to the HiveServer2 endpoint, the SQL Gateway registers the Hive Catalog, switches to the Hive dialect, and uses batch execution mode to execute jobs. With these steps, users can have the same experience as with HiveServer2.

Read on for a pretty large hit list.
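If you want to kick the tires on the new endpoint, here is a minimal sketch in Python using the PyHive client; it assumes a local SQL Gateway running the HiveServer2 endpoint on port 10000, and the username is illustrative.

```python
# A minimal sketch: connect to Flink's SQL Gateway through its HiveServer2
# endpoint with PyHive, the same way you would connect to HiveServer2 itself.
# Host, port, and username are illustrative; adjust for your deployment.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="flink")
cursor = conn.cursor()

# The gateway registers the Hive catalog and switches to the Hive dialect,
# so ordinary HiveQL statements work here.
cursor.execute("SHOW TABLES")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```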

Choosing between Synapse Spark Notebooks or Job Definitions

Arun Sethia and Arshad Ali explain when you might use a Spark notebook versus a job definition:

Synapse Spark Notebook is a web-based (HTTP/HTTPS) interactive interface for creating files that contain live code, narrative text, and visualized output, with rich libraries for Spark-based applications. Data engineers can collaborate on, schedule, run, and test their Spark application code using Notebooks. Notebooks are a good place to validate ideas and run quick experiments to get insight into the data. You can also integrate a Synapse Notebook into a Synapse pipeline.

The Notebook allows you to combine programming code with markdown text and perform simple visualizations (using Synapse Notebook chart options and open-source libraries). In addition, running code supplies immediate feedback, output, and progress tracking within the Notebook.

Click through for the comparison.
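For a sense of the workflow, here is a hedged sketch of the kind of PySpark cell you might run in a Synapse Spark Notebook; the storage path and column names are made up, and inside Synapse the spark session is already provided for you.

```python
# A sketch of a typical notebook cell: read data, aggregate, inspect.
# The abfss:// path and column names are hypothetical. In a Synapse
# notebook the `spark` session already exists; building one explicitly
# is only needed when running this outside the notebook environment.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quick-experiment").getOrCreate()

df = spark.read.parquet("abfss://data@mylake.dfs.core.windows.net/sales/")
summary = df.groupBy("region").agg(F.sum("amount").alias("total_sales"))

# In Synapse, display(summary) also offers built-in chart options;
# show() prints a plain-text preview.
summary.show()
```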

Debugging Stream Table Joins

Philip Schmitt dives into a problem:

Joining two topics to aggregate the data is one of the fundamental operations in stream processing. But that’s not to say that it’s simple. Let me show you what can go wrong! This article chronicles my journey to join two Apache Kafka topics—stumbling into and out of various pitfalls. I’m going to show you…

– How to debug co-partitioning with kcat (formerly kafkacat)

– How to avoid the number one pitfall of using kcat

– Stream–table join semantics in action

There’s a lot of useful information in this post.
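The post does its debugging with kcat; as a rough Python analogue, here is a sketch of the first sanity check (whether the two topics even have the same number of partitions) using the confluent-kafka AdminClient. The broker address and topic names are illustrative.

```python
# A rough Python analogue of the kcat check: stream-table joins require
# co-partitioned topics, so start by comparing partition counts.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(timeout=10)

counts = {
    topic: len(metadata.topics[topic].partitions)
    for topic in ("orders", "customers")  # illustrative topic names
}
print(counts)

if len(set(counts.values())) > 1:
    print("Not co-partitioned: partition counts differ.")
else:
    # Equal counts are necessary but not sufficient: both topics must
    # also be keyed and partitioned with the same strategy.
    print("Partition counts match.")
```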

Designing Event Streams for Kafka

Dave Shook announces a new course:

Properly designing your events and event streams is essential for any event-driven architecture. Precisely how you design and implement them will significantly affect not only what you can do today, but what you can do tomorrow. For such a critical part of any data infrastructure, most event streaming tutorials gloss over event design.

In the new course on Confluent Developer, events and event streams are put front and center. We’re going to look at the dimensions of event and event stream design and how to apply them to real-world problems. But dimensions and theory are nothing without best practices, so we are also going to take a look at these to help keep you clear of pitfalls and set you up for success. This course also includes hands-on exercises, during which you will work through use cases related to the different dimensions of event design and event streaming.

Click through to learn more about what’s in the course and to check it out–it is free, after all.

Thoughts on Current 2022

Markos Sfikas, et al., recap the Current 2022 conference:

Throughout the conference, the theme of Batch vs. Streaming was apparent. Discussions covered how the two can be unified, how batch processing’s performance can (and must) be improved for real-time applications, and more. There was even a dedicated panel discussion with Adi Polak, Amy Chen, Eric Sammer, and Tyler Akidau on the state of streaming adoption today, debating whether streaming will ever fully replace batch. You can view some interesting points from the panel discussion in the Twitter thread from Robin Moffatt.

Click through for the full recap.

Using the Confluent Terraform Provider

Spencer Shumway has a tutorial:

As part of our recent Q3 Launch for Confluent Cloud we announced the general availability of the Confluent Terraform Provider. HashiCorp Terraform is an Infrastructure as Code tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, reuse, and share, using standardized resource management tooling, pipelines, and processes.

Click through for a getting started video as well as the tutorial.

Real-Time Streaming ETL with Kafka and Debezium

Dursun Koc doesn’t have time for batched ETL:

Debezium does not extract data using SQL. It uses database log files to track changes in the database, so it has minimal effect on the source system. For more information about Debezium, please visit their website.

After the data is extracted, we need Kafka Connect to stream it into Apache Kafka so that we can work with it and reshape it as required. We will be using ksqlDB to reshape the raw data into the form the target system requires. Let’s consider a simple ordering system database in which we have a customer table, a product table, and an orders table, as shown below.

Read on for an overview as well as a link to the GitHub repo where you can try this all out.
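To make the ksqlDB reshaping step concrete, here is a hedged sketch that submits statements to ksqlDB’s REST API from Python; the endpoint, topic, stream, and column names are hypothetical stand-ins for the article’s ordering-system schema.

```python
# Sketch: register a stream over the raw Debezium change topic, then
# derive a re-keyed stream for the target system. All names here are
# hypothetical stand-ins; adjust to your connector's topic naming.
import json
import urllib.request

KSQLDB_ENDPOINT = "http://localhost:8088/ksql"

statements = """
    CREATE STREAM orders_raw WITH (
        KAFKA_TOPIC = 'dbserver1.inventory.orders',
        VALUE_FORMAT = 'AVRO'
    );
    CREATE STREAM orders_by_customer AS
        SELECT customer_id, product_id, order_date
        FROM orders_raw
        PARTITION BY customer_id;
"""

request = urllib.request.Request(
    KSQLDB_ENDPOINT,
    data=json.dumps({"ksql": statements, "streamsProperties": {}}).encode(),
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```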

Kafka Advisory CVE-2022-34917

Debaditya Bhattacharyya reviews the impact of a Kafka security advisory:

The Apache Kafka® project announced on September 19, 2022, that a security vulnerability had been identified in Apache Kafka: CVE-2022-34917. After being informed of this, Instaclustr began investigating its potential impact on customers of our Apache Kafka offering. This vulnerability allows malicious, unauthenticated clients to allocate large amounts of memory on the brokers, which can lead to an OutOfMemoryException in the brokers, causing denial of service.

Read on to learn more about the impact and techniques for mitigation.

Event-Driven Microservices in Python with Kafka

Dave Klein demonstrates how event-driven microservices work:

Along came microservices: individual, smaller applications that could be changed, deployed, and scaled independently. After some initial skepticism, this architectural style took off. It truly did solve several significant problems. However, as is often the case, it brought new levels of complexity for us to deal with. We now had distributed systems that needed to communicate and depend on each other to accomplish the tasks at hand.

The most common approach to getting our applications talking to each other was to use what we were already using between our clients and servers: HTTP-based request/response communications, perhaps using REST or gRPC. This works, but it increases the coupling between our independent applications by requiring them to know about APIs, endpoints, request parameters, etc., making them less independent.

Read the whole thing.
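As a minimal sketch of the event-driven alternative, here is a producing service and a consuming service in Python with the confluent-kafka package; the broker address, topic, and group names are illustrative.

```python
# Instead of calling another service's HTTP endpoint, the producer
# publishes an event and moves on; any interested service consumes it
# independently, so neither side needs to know about the other's API.
import json
from confluent_kafka import Producer, Consumer

# Producing service: emits an event with no knowledge of its consumers.
producer = Producer({"bootstrap.servers": "localhost:9092"})
event = {"order_id": 42, "status": "created"}
producer.produce("orders", value=json.dumps(event).encode())
producer.flush()

# Consuming service: subscribes on its own schedule, fully decoupled.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fulfillment-service",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print("Received:", json.loads(msg.value()))
consumer.close()
```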
