You will start by learning the Microsoft Azure services required to deploy a secure, elastic Cloudera Enterprise cluster. These core services include security, networking, virtual machine management, and storage, just to name a few.
Then, you’ll learn best practices and patterns for cloud-based clusters, including tips and caveats for security and workload management.
Next, you’ll learn how to bootstrap a cluster using Cloudera Manager, which allows you to deploy a cluster on premises or in the cloud. The module covers how to deploy both development (Path A) and production-grade (Path B) clusters.
This is a free course, so if you’re looking for a way to fill your Thanksgiving weekend, this is definitely an option.
As most of our deployments use PowerShell, I wrote some cmdlets to work easily with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended up with 20+ individual scripts, which was not really maintainable any more. So I packed them into a PowerShell module and published it to the PowerShell Gallery (https://www.powershellgallery.com/packages/DatabricksPS) for everyone to use!
This looks like a pretty good module if you work with Databricks.
Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include:
- Connectors – the JAR files that define how to integrate with the data store itself
- Converters – handling serialization and deserialization of data
- Transforms – optional in-flight manipulation of messages
One of the more frequent sources of mistakes and misunderstanding around Kafka Connect involves the serialization of data, which Kafka Connect handles using converters. Let’s take a good look at how these work, and illustrate some of the common issues encountered.
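To make the converter piece concrete, here is a minimal sketch of the relevant settings, which can live in the worker configuration or be overridden per connector (the Avro lines assume the Confluent distribution and a locally running Schema Registry):

```properties
# Serialize record keys as plain strings.
key.converter=org.apache.kafka.connect.storage.StringConverter

# Serialize record values as JSON without an embedded schema.
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false

# Alternative: Avro with a Schema Registry (Confluent distribution).
#value.converter=io.confluent.connect.avro.AvroConverter
#value.converter.schema.registry.url=http://localhost:8081
```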
Read on for a good overview of the topic.
Data Serialization – Serialization plays an important role in the performance of any application. Spark provides two serialization libraries:
Java Serialization: By default, Spark uses Java’s ObjectOutputStream framework, which can work with any class that implements java.io.Serializable. This serialization is flexible but slow, and it creates large serialized formats for many classes.
Kryo Serialization: Spark can use the Kryo library to serialize objects. It is much faster and more compact, but it does not support all serializable types, so we must register the classes we want serialized. Kryo then uses indices instead of full class names to identify data types, which reduces the size of the serialized data and improves performance. We can enable it by setting the property spark.serializer to org.apache.spark.serializer.KryoSerializer in our Spark configuration. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To learn more about this serializer, refer to the Kryo documentation.
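As a quick sketch of what enabling Kryo looks like in practice (MyEvent is a hypothetical application class):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical application class we expect to shuffle or cache.
case class MyEvent(id: Long, payload: String)

// Switch the serializer to Kryo and register our classes up front,
// so Kryo can write small integer indices instead of full class names.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyEvent]))

val spark = SparkSession.builder().config(conf).getOrCreate()
```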
There are some good tips in here.
Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
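To make this concrete, here is a minimal, self-contained sketch (the lookup data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-example").getOrCreate()
val sc = spark.sparkContext

// A read-only lookup table we want cached on every executor
// rather than shipped with each task.
val countryNames = Map("us" -> "United States", "de" -> "Germany")
val broadcastNames = sc.broadcast(countryNames)

// Tasks read the cached copy via .value instead of closing over the map.
val codes = sc.parallelize(Seq("us", "de", "us"))
codes.map(code => broadcastNames.value.getOrElse(code, "unknown"))
  .collect()
  .foreach(println)
```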
There’s some good stuff on accumulators and the SparkSession object in there as well.
Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries (further details here).
Although the project’s GitHub page says that the 1.0.5 Magellan library is available for Apache Spark 2.3+ clusters, I learned through a very difficult process that the only way to make it work in Azure Databricks is with an Apache Spark 2.2.1 cluster running Scala 2.11. The cluster I used for this experiment consisted of a Standard_DS3_v2 driver type with 14 GB of memory, 4 cores, and autoscaling enabled.
In terms of datasets, I used the NYC Taxicab dataset to create the geometry points and the Magellan NYC Neighbourhoods GeoJSON dataset to extract the polygons. Both datasets were stored in blob storage and added to Azure Databricks as a mount point.
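For reference, mounting blob storage in a Databricks Scala notebook typically looks something like this sketch (the storage account, container, and secret scope names are all hypothetical):

```scala
// dbutils is available in Databricks notebooks; the account key is
// pulled from a (hypothetical) secret scope rather than hard-coded.
dbutils.fs.mount(
  source = "wasbs://datasets@mystorageaccount.blob.core.windows.net",
  mountPoint = "/mnt/datasets",
  extraConfigs = Map(
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net" ->
      dbutils.secrets.get(scope = "storage", key = "account-key"))
)
```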
It sounds like this is much faster than using U-SQL to perform the same task.
We recently implemented a Spark streaming application, which consumes data from multiple Kafka topics. The data consumed from Kafka comprises different types of telemetry events generated by mobile devices. We decided to host the Spark cluster using the Amazon EMR service, which manages a fleet of EC2 instances to run our data-processing pipelines.
As part of preparing the cluster and application for deployment to production, we needed to implement monitoring so we could track the streaming application and the Spark infrastructure itself. At a high level, we wanted to ensure that we could monitor the different components of the application, understand performance parameters, and get alerted when things go wrong.
In this post, we’ll walk through how we aggregated relevant metrics in Datadog from our Spark streaming application running on a YARN cluster in EMR.
Check it out. If this is interesting, Priya’s blog has the full series.
Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns.
The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as well. In this blog, using temperature recordings from Seattle, we’ll show how we can use this common SQL Pivot feature to achieve complex data transformations.
The syntax is quite similar to the PIVOT syntax that SQL Server uses.
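For comparison, the DataFrame-side pivot that has been available since Spark 1.6 looks like this rough sketch (the readings are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("pivot-example").getOrCreate()
import spark.implicits._

// Hypothetical monthly temperature readings.
val temps = Seq(
  (2018, "JUN", 74.0), (2018, "JUL", 82.0),
  (2018, "JUL", 78.0), (2018, "AUG", 80.0)
).toDF("year", "month", "temp")

// Rotate the unique values of `month` into their own columns.
temps.groupBy("year")
  .pivot("month")
  .agg(avg("temp"))
  .show()
```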
Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. There are two ways we can do this, both sketched after the list below.
- Category Indexing
- One-Hot Encoding
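As a rough sketch of both techniques with Spark ML (the column names are invented, and it assumes Spark 3.x, where OneHotEncoder is an estimator; on Spark 2.3/2.4 you would reach for OneHotEncoderEstimator instead):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("encoding-example").getOrCreate()
import spark.implicits._

// A hypothetical categorical column.
val df = Seq("checking", "savings", "checking", "none").toDF("account_type")

// 1) Category indexing: map each distinct string to a numeric index.
val indexed = new StringIndexer()
  .setInputCol("account_type")
  .setOutputCol("account_type_index")
  .fit(df)
  .transform(df)

// 2) One-hot encoding: expand the index into a sparse 0/1 vector so a
//    linear model does not read a false ordering into the indices.
new OneHotEncoder()
  .setInputCol("account_type_index")
  .setOutputCol("account_type_vec")
  .fit(indexed)
  .transform(indexed)
  .show(truncate = false)
```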
Click through for the code and explanation.
Roughly two years ago there was a spate of attacks against the open source database solution MongoDB, as well as Hadoop. These attacks were ransomware: the attacker wiped or encrypted data and then demanded money to restore that data. Just as with the recent attacks, the only Hadoop clusters affected were those that were directly connected to the internet and had no security features enabled. Cloudera published a blog post about this threat in January 2017. That blog post laid out how to ensure that your Hadoop cluster is not directly connected to the internet and encouraged the reader to enable Cloudera’s security and governance features.
That blog post has renewed relevance today with the advent of XBash and DemonBot.
The origin story of XBash and DemonBot illustrates how security researchers view the Hadoop ecosystem and the lifecycle of a vulnerability. Back in 2016 at the Hack.lu conference in Luxembourg, two security researchers gave a talk entitled Hadoop Safari: Hunting for Vulnerabilities. They described Hadoop and its security model and then suggested some “attacks” against clusters that had no security features enabled. These attacks are akin to breaking into a house while the front door is wide open.
Their advice is simple, but simple is good here: it means you should be able to implement the advice without much trouble.