Press "Enter" to skip to content

Category: Hadoop

Cloning Delta Lakes

Burak Yavuz and Pranav Anand show us how to clone Delta Lakes:

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: same schema, constraints, column descriptions, statistics, and partitioning. However, they behave as a separate table with a separate lineage or history. Any changes made to clones only affect the clone and not the source. Any changes that happen to the source during or after the cloning process also do not get reflected in the clone due to Snapshot Isolation. In Databricks Delta Lake we have two types of clones: shallow or deep.

Read on to learn the differences, as well as a few useful scenarios.
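For a flavor of the syntax, here is a minimal sketch of both clone types, assuming a Databricks cluster with Delta Lake; the table names are made up, and the VERSION AS OF clause is optional:

```scala
// Shallow clone: copies only the table metadata; data files are still
// read from the source table's location.
spark.sql("CREATE TABLE sales_dev SHALLOW CLONE sales")

// Deep clone: copies the metadata and the data files themselves.
spark.sql("CREATE TABLE sales_archive DEEP CLONE sales")

// Either type can also target a historical snapshot of the source.
spark.sql("CREATE TABLE sales_v5 DEEP CLONE sales VERSION AS OF 5")
```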


The Count Window in Flink

Kundan Kumarr takes us through an example of the count window type in Apache Flink:

In the previous blog, we learned about tumbling and sliding windows, which are based on time. In this blog, we are going to learn to define Flink’s windows on another property, i.e., the count window. As the name suggests, a count window is evaluated when the number of records received hits the threshold.

A count window sets the window size based on how many entities exist within that window. For example, if we fix the count at 4, every window will have exactly 4 entities. It doesn’t matter what the size of the window is in terms of time: window durations will differ, but the number of entities in each window will always be the same. Count windows can be overlapping or non-overlapping; both are possible. The count window in Flink is applied to keyed streams, meaning there is already a logical grouping of the stream based on all values associated with a certain key, so the entity count applies on a per-key basis.

I’m curious if there’s a combination of count + time, triggering when you hit X elements or Y seconds, whichever comes first.
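For reference, here is a minimal sketch of both count window flavors in Flink’s Scala API; the stream contents and names are made up. (As for the count-or-time question, Flink’s custom Trigger mechanism looks like the natural place to build that, firing on whichever condition hits first.)

```scala
import org.apache.flink.streaming.api.scala._

object CountWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A made-up keyed stream of (key, value) pairs.
    val stream = env.fromElements(
      ("a", 1), ("a", 2), ("b", 3), ("a", 4),
      ("a", 5), ("b", 6), ("a", 7), ("b", 9)
    )

    // Tumbling count window: fires once 4 elements arrive for a key.
    stream.keyBy(_._1)
      .countWindow(4)
      .sum(1)
      .print("tumbling")

    // Sliding count window: fires every 2 elements over the last 4.
    stream.keyBy(_._1)
      .countWindow(4, 2)
      .sum(1)
      .print("sliding")

    env.execute("count-window-example")
  }
}
```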


Renaming Cached DataFrames in Spark

Landon Robinson works around an annoyance:

But DataFrames have not been given the same clear route to convenient renaming of cached data. It has, however, been attempted and requested by the community:

https://forums.databricks.com/questions/6525/how-to-setname-on-a-dataframe.html
https://issues.apache.org/jira/browse/SPARK-8480

However, with the below approach, you can start naming your DataFrames all you want. It’s very handy.

Read on to see the solution in action.
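I won’t spoil the specifics of Landon’s solution, but as a sketch of the general idea that circulates in the community: register the DataFrame as a named temporary view and cache it through the catalog, so the Spark UI shows a readable name. The names here are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object NamedCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("named-cache")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(1000000).toDF("id")

    // Register under a readable name and cache via the catalog; the
    // Storage tab then shows "In-memory table my_df" instead of an
    // anonymous plan string.
    df.createOrReplaceTempView("my_df")
    spark.catalog.cacheTable("my_df")

    df.count() // materializes the cache
  }
}
```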


A Kafka Alternative for Message Prioritization

Ricardo Ferreira explains why Apache Kafka doesn’t support message prioritization and gives a second-best solution when you do need it:

Users of messaging technologies such as JMS and AMQP often use message prioritization so that messages can be processed in a different order based on their importance. It doesn’t take much imagination to see where this can be useful—call center companies will want to handle the most severe cases first, then the others; airline companies give preferential service to customers with higher status; and telecommunication companies would probably give their most loyal customers a better renewal promotion.

One of the misconceptions that developers have when they first encounter Apache Kafka® is that it is just another messaging system, just like the ones they’re familiar with. By extension, they get disappointed when they find that Kafka doesn’t offer message prioritization.

Read on for the answers.
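Ricardo’s own pattern is worth reading in full; as one rough illustration of the general idea, here is a sketch that routes messages to per-priority topics and lets consumers drain the high-priority topic first. The topic names and routing rule are my own invention, not necessarily his design.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PriorityRoutingSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Route by priority to separate topics; a consumer group on
    // "cases-high" can be scaled independently of "cases-low".
    def send(priority: String, key: String, value: String): Unit = {
      val topic = if (priority == "high") "cases-high" else "cases-low"
      producer.send(new ProducerRecord[String, String](topic, key, value))
    }

    send("high", "case-1", "severe outage")
    send("low", "case-2", "routine question")
    producer.close()
  }
}
```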


Indexing S3 Data with CDP Data Hub

Eva Nahari, et al., show how to index and serve S3 data in Cloudera Data Platform:

This blog post will present a simple “hello world” kind of example of how to get data that is stored in S3 indexed and served by an Apache Solr service hosted in a Data Discovery and Exploration cluster in CDP. For the curious: DDE is a pre-templated, Solr-optimized cluster deployment option in CDP, recently released in tech preview. We will only cover AWS and S3 environments in this blog. Azure and ADLS deployment options are also available in tech preview but will be covered in a future blog post.

We will depict the simplest scenario to make it easy to get started. There are, of course, more advanced data pipeline setups and richer schemas possible, but this is a good starting point for a beginner.

Read on for the instructions.
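To give a rough idea of the shape of such a pipeline, here is a hedged sketch using the spark-solr connector: read the raw data from S3, then write it into a Solr collection. The bucket, ZooKeeper host, and collection name are placeholders, and the post’s actual steps (which use DDE’s own tooling) may well differ.

```scala
import org.apache.spark.sql.SparkSession

object S3ToSolrSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-to-solr")
      .getOrCreate()

    // Read raw CSV data from a hypothetical S3 bucket.
    val docs = spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/sample-data/")

    // Index the rows into a Solr collection via the spark-solr connector.
    docs.write
      .format("solr")
      .option("zkhost", "zk1:2181/solr")
      .option("collection", "sample-collection")
      .save()
  }
}
```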


A Review of KaDeck: Kafka Management Tool

Guy Shilo reviews KaDeck:

If you want the real web version, then you need to register on their website and create a team first. Yes, KaDeck is aimed at teams of people working together. You create a team online, and shortly you receive an email with a team ID and a secret key that you will use when running KaDeck web.

It seems that there is no host installation of KaDeck right now (an RPM package, binaries, or the like), only a Docker image. This forces the user to run it in Docker or a Kubernetes cluster (or one of its commercial distributions, such as OpenShift). Another thing is that it checks your license online each time you start the container. If you want to use it offline, you have to go through an offline activation process. You can reach it from the administration menu.

Click through for Guy’s thoughts on the product.


Spark SQL in Delta Lake

Kundan Kumarr walks us through some of the basic SQL operations you can perform with Delta Lake in Apache Spark:

Nowadays Delta Lake is a buzzword in the Big Data world, especially among Spark developers, because it resolves lots of issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It is evolving day by day and adds cool features in every release. On June 19th, 2020, Delta Lake version 0.7.0 was released, and this is the first release on Spark 3.x. This release includes important features that can make Spark developers’ work easy.

One of the interesting key features in this release is the support for metastore-defined tables and SQL DDL. So now we can define Delta tables in the Hive metastore and use the table name in all SQL operations. We can run SQL DDL statements to create tables, insert into tables, explicitly alter the schema of the tables, and so on. So in this blog, we will learn how we can run SQL DDL/DML/DQL statements in Delta Lake 0.7.0.

Click through for the examples.
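As a taste of the new SQL support, here is a sketch assuming Delta Lake 0.7.0 on Spark 3.x with the Delta catalog extensions configured; the table and column names are made up.

```scala
// Assumes the session was started with:
//   spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
//   spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

// DDL: define a metastore-backed Delta table.
spark.sql("CREATE TABLE events (id INT, status STRING) USING DELTA")

// DML: insert and update rows transactionally.
spark.sql("INSERT INTO events VALUES (1, 'started')")
spark.sql("UPDATE events SET status = 'finished' WHERE id = 1")

// DDL again: explicitly evolve the schema.
spark.sql("ALTER TABLE events ADD COLUMNS (updated_at TIMESTAMP)")

// DQL: query the table by name.
spark.sql("SELECT * FROM events").show()
```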


Tumbling and Sliding Windows in Flink

Kundan Kumarr takes us through two different window types in Apache Flink:

In the previous blog, we talked about Flink’s windows operator, the heart of processing infinite streams. Generally in Flink, after specifying whether the stream is keyed or non-keyed, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. Flink provides some useful predefined window assigners: tumbling windows, sliding windows, session windows, count windows, and global windows. We can use any of them as per our use case, or we can even create custom window assigners in Flink.

In this blog, we will learn about the first two window assigners, i.e., tumbling and sliding windows. These two window assigners assign elements to windows based on time, which can be either processing time or event time.

Click through for a description of each.
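To make the two concrete, here is a minimal sketch in Flink’s Scala API; the socket source and names are made up (feed it with nc -lk 9999, one "key value" pair per line), and an unbounded source is used so the processing-time windows actually fire.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{SlidingProcessingTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

object TimeWindowExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Parse lines like "key 3" into (key, count) pairs.
    val pairs = env.socketTextStream("localhost", 9999)
      .map { line =>
        val parts = line.split(" ")
        (parts(0), parts(1).toInt)
      }

    // Tumbling window: non-overlapping 10-second buckets.
    pairs.keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .sum(1)
      .print("tumbling")

    // Sliding window: 10-second buckets re-evaluated every 5 seconds,
    // so each element lands in two windows.
    pairs.keyBy(_._1)
      .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
      .sum(1)
      .print("sliding")

    env.execute("time-window-example")
  }
}
```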


ACID Transactions with Hive LLAP in ElasticMapReduce

Suthan Phillips and Chao Gao walk us through ACID transactions when using Hive on Amazon’s ElasticMapReduce platform:

ACID (atomicity, consistency, isolation, and durability) properties make sure that the transactions in a database are atomic, consistent, isolated, and durable.

Amazon EMR 6.1.0 adds support for Hive ACID transactions so it complies with the ACID properties of a database. With this feature, you can run INSERT, UPDATE, DELETE, and MERGE operations in Hive managed tables with data in Amazon Simple Storage Service (Amazon S3). This is a key feature for use cases like streaming ingestion, data restatement, bulk updates using MERGE, and slowly changing dimensions.

This post demonstrates how to enable Hive ACID transactions in Amazon EMR, how to create a Hive transactional table, how it can achieve atomic and isolated operations, and the concepts, best practices, and limitations of using Hive ACID in Amazon EMR.

Click through for a demonstration.
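For a taste of what this looks like in practice, here is a hedged sketch that creates a transactional table and runs ACID DML over JDBC against HiveServer2; the hostname and table are placeholders, and it assumes the Hive JDBC driver is on the classpath. The post itself covers the EMR-side configuration in detail.

```scala
import java.sql.DriverManager

object HiveAcidSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder EMR master hostname and default credentials.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://emr-master:10000/default", "hive", "")
    val stmt = conn.createStatement()

    // Full ACID tables must be managed and stored as ORC, with the
    // transactional table property set.
    stmt.execute(
      """CREATE TABLE IF NOT EXISTS orders (id INT, status STRING)
        |STORED AS ORC
        |TBLPROPERTIES ('transactional'='true')""".stripMargin)

    // INSERT, UPDATE, and DELETE all run as ACID transactions.
    stmt.execute("INSERT INTO orders VALUES (1, 'new')")
    stmt.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
    stmt.execute("DELETE FROM orders WHERE id = 1")

    conn.close()
  }
}
```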
