Press "Enter" to skip to content

Category: Hadoop

Tumbling and Sliding Windows in Flink

Kundan Kumarr takes us through two different window types in Apache Flink:

In the previous blog, we talked about Flink’s windows operator, a heart of processing infinite streams. Generally in Flink, after specifying that the stream is keyed or non keyed, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. Flink provides some useful predefined window assigners like Tumbling windowsSliding windowsSession windows, Count windows, and Global windows. We can use any of them as per our use case or even we can create custom window assigners in Flink.

In this blog, we will learn about the first two window assigners i.e., Tumbling and sliding windows. These two window assigners, assign elements to windows based on time, which can either be processing time or event time.

Click through for a description of each.

Comments closed

ACID Transactions with Hive LLAP in ElasticMapReduce

Suthan Phillips and Chao Gao walk us through ACID transactions when using Hive on Amazon’s ElasticMapReduce platform:

ACID (atomicity, consistency, isolation, and durability) properties make sure that the transactions in a database are atomic, consistent, isolated, and reliable.

Amazon EMR 6.1.0 adds support for Hive ACID transactions so it complies with the ACID properties of a database. With this feature, you can run INSERT, UPDATE, DELETE, and MERGE operations in Hive managed tables with data in Amazon Simple Storage Service (Amazon S3). This is a key feature for use cases like streaming ingestion, data restatement, bulk updates using MERGE, and slowly changing dimensions.

This post demonstrates how to enable Hive ACID transactions in Amazon EMR, how to create a Hive transactional table, how it can achieve atomic and isolated operations, and the concepts, best practices, and limitations of using Hive ACID in Amazon EMR.

Click through for a demonstration.

Comments closed

Cloudera’s Not Dead Yet

Alex Woodie shares some good news about Cloudera:

Things are starting to look up for Cloudera, which beat analyst expectations with its second quarter results announced yesterday. The distributed computing platform maker also gave investors something to cheer about with an optimistic financial forecast for the rest of fiscal 2021.

Cloudera, which trades on the New York Stock Exchange under the symbol CLDR, reported a non-GAAP profit of $0.10 per share for the second quarter of fiscal year 2021 ended July 31, exceeding analyst expectations by three cents. A year ago, it reported a non-GAAP loss of $.02 per share a year ago.

The company isn’t in great shape, but this is a good sign.

Comments closed

Join Operations in Spark

Swantika Gupta compares hash and merge join operations in Apache Spark:

One of the most frequently used transformations in Apache Spark is Join operation. Joins in Apache Spark allow the developer to combine two or more data frames based on certain (sortable) keys. The syntax for writing a join operation is simple but some times what goes on behind the curtain is lost. Internally, for Joins Apache Spark proposes a couple of Algorithms and then chooses one of them. Not knowing what these internal algorithms are, and which one does spark choose might make a simple Join operation expensive.

While opting for a Join Algorithm, Spark looks at the size of the data frames involved. It considers the Join type and condition specified, and hint (if any) to finally decide upon the algorithm to use. In most of the cases, Sort Merge join and Shuffle Hash join are the two major power horses that drive the Spark SQL joins. But if spark finds the size of one of the data frames less than a certain threshold, Spark puts up Broadcast Join as it’s top contender.

Click through for the comparison, though do note that this comparison doesn’t include nested loop joins, which are possible in Spark as well.

Comments closed

Apache Flink 1.10.2

Zhu Zhu announces Apache Flink 1.10.2:

The Apache Flink community released the second bugfix version of the Apache Flink 1.10 series.

This release includes 73 fixes and minor improvements for Flink 1.10.1. The list below includes a detailed list of all fixes and improvements.

We highly recommend all users to upgrade to Flink 1.10.2.

There are a lot of bugfixes in this release.

Comments closed

Spark SQL and Delta Lake on Spark 3.0

Denny Lee, et al, walk us through some improvements in Spark 3.0 around Spark SQL’s usage of Delta Lake:

One of most frequent questions through our Delta Lake Tech Talks was when would DML operations such as delete, update, and merge be available in Spark SQL?  Wait no more, these operations are now available in SQL!  Below are example of how you can write delete, update, and merge (insert, update, delete, and deduplication operations using Spark SQL

Read on for the full update.

Comments closed

A Quick Demo: Kafka to Spark Streaming to Cassandra

Kundan Kumarr walks us through a simple data pipeline:

Spark Structured Streaming is a component of Apache Spark framework that enables scalable, high throughput, fault tolerant processing of data streams.
Apache Kafka is a scalable, high performance, low latency platform that allows reading and writing streams of data like a messaging system.
Apache Cassandra is a distributed and wide-column NoSQL data store.

As I’m reading through this, I enjoyed just how straightforward the whole process was.

Comments closed

The %tensorboard Magic Command in Databricks Notebooks

Jerry Liang and Hossein Falaki introduce a new magic command in Databricks Runtime 7.2:

In 2017, we released the  dbutils.tensorboard.start() API to manage and use TensorBoard inside Databricks python notebooks. This API only permits one active TensorBoard process on a cluster at any given time – which hinders multi-tenant use-cases. Early last year, TensorBoard released its own API for notebooks via the %tensorboard python magic command. This API not only starts TensorBoard processes but also exposes the TensorBoard’s command line arguments in the notebook environment. In addition, it embeds the TensorBoard UI inside notebooks, whereas the dbutils.tensorboard.start API prints a link to open TensorBoard in a new tab.

Read on to see how you can use it.

Comments closed

Using Flink Stateful Functions

Igal Shilman gives us an example of where Apache Flink’s Stateful Functions come in handy:

In this blog post, we’ll take a look at a class of use cases that is a natural fit for Flink Stateful Functions: monitoring and controlling networks of connected devices (often called the “Internet of Things” (IoT)).

IoT networks are composed of many individual, but interconnected components, which makes getting some kind of high-level insight into the status, problems, or optimization opportunities in these networks not trivial. Each individual device “sees” only its own state, which means that the status of groups of devices, or even the network as a whole, is often a complex aggregation of the individual devices’ state. Diagnosing, controlling, or optimizing these groups of devices thus requires distributed logic that analyzes the “bigger picture” and then acts upon it.

A powerful approach to implement this is using digital twins: each device has a corresponding virtual entity (i.e. the digital twin), which also captures their relationships and interactions. The digital twins track the status of their corresponding devices and send updates to other twins, representing groups (such as geographical regions) of devices. Those, in turn, handle the logic to obtain the network’s aggregated view, or this “bigger picture” we mentioned before.

Read on to see where stateful functions come into play.

Comments closed