Hadoop – Curated SQL

Virtualizing Hadoop Data into OneLake via Apache Ozone

Published 2025-03-21 by Kevin Feasel

James Morantus hits us with a blast from the past:

Microsoft Fabric OneLake shortcuts facilitate the virtualization of data from various cloud object stores and on-premises environments. For on-premises sources like Cloudera/Apache Ozone, the OneLake S3 Compatible Shortcut can be utilized to connect to these data sources. With OneLake Shortcuts, users can create a virtual reference to their Cloudera cluster data without moving or duplicating the data. To learn more about Fabric OneLake shortcuts, reference this blog OneLake with shortcuts.

If this was a decade ago, I’d be a lot more excited. But it is kind of wild how quickly the data landscape changed, with the adoption of Spark over classic Hadoop; cloud-based data lakes over HDFS; and more focused dataset sizes over “give me everything.”

Comments closed

Tracking Airport Traffic with Flink, Kafka, and NiFi

Published 2024-09-30 by Kevin Feasel

Tim Spann builds an app:

The above link utilizes the standard REST link and enhances it by setting the beginning date using NiFi’s Expression language to get the current time in UNIX format in seconds. In this example, I am looking at the last week of data for the airport departures and arrivals in the second URL.

We iterate through a list of the largest airports in the United States doing both departures and arrivals since they use the same format.

Read the article to learn more about how you can tie it all together. You can also check out Tim’s GitHub repo to grab the code.

Comments closed

Data Management with Open Table Formats

Published 2024-03-07 by Kevin Feasel

Anandaganesh Balakrishnan covers a few open-source products and formats:

Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here’s a deep dive into the core components and architectural design of Apache Iceberg:

Click through for a review of Iceberg, Hudi, and the Delta Lake format.

Comments closed

Combining Kafka and Flink

Published 2024-02-20 by Kevin Feasel

Gautam Goswami shares some thoughts:

In short, the process of collecting data in real-time as streams of events from event sources such as databases, sensors, and software applications is known as event streaming. With real-time data processing and analytics in mind, Apache Flink is a potent open-source program. For situations where quick insights and minimal processing latency are critical, it offers a consistent and effective platform for managing continuous streams of data.

I’ve found it interesting that Confluent people have spent a lot of time the past several months talking up Apache Flink and Kafka+Flink combinations.

Comments closed

Using Schema Registry for Data Quality in Apache Kafka

Published 2024-01-05 by Kevin Feasel

Kai Waehner talks data quality:

Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.

Click through to learn more about the topic. This focuses a lot on the “why” and “what” but does have an example of “how” in there as well.

Comments closed

Continuing the Advent of Fabric

Published 2023-12-13 by Kevin Feasel

Tomaz Kastrun has been busy. On day 9, we build a custom environment:

Microsoft Fabric provides you with the capability to create a new environment, where you can select different Spark runtimes, configure your compute resources, and create a list of Python libraries (public or custom; from Conda or PyPI) to be installed. Custom environments behave the same way as any other environment and can be used and attached to your notebook or used on a workspace. Custom environments can also be attached to Spark job definitions.

On day 10, we have Spark job definitions:

An Apache Spark job definition is a single computational action, that is normally scheduled and triggered. In Microsoft Fabric (same as in Synapse), you could submit batch/streaming jobs to Spark clusters.

By uploading a binary file, or libraries in any of the languages (Java / Scala, R, Python), you can run any kind of logic (transformation, cleaning, ingest, ingress, …) to the data that is hosted and server to your lakehouse.

Day 11 introduces us to data science in Fabric:

We have looked into creating the lakehouse, checked the delta lake and delta tables, got some data into the lakehouse, and created a custom environment and Spark job definition. And now we need to see, how to start working with the data.

Day 12 builds an experiment:

We have started working with the data and now, we would like to create and submit the experiment. In this case, MLFlow will be used here.

Create a new experiment and give it a name. I have named my “Advent2023_Experiment_v3”.

Click through to catch up with Tomaz.

Comments closed

Databricks Security Analysis Tool

Published 2023-12-11 by Kevin Feasel

Advait Bhadane takes a look at a tool:

In today’s data-driven world a cutting-edge platform is required that seamlessly integrates with the cloud, embraces open-source innovation and prioritises robust data security. Databricks is a pioneer in this field. Not only does it provide a unified lake house platform, but it takes data protection to the next level with its Security Analysis Tool (SAT).

In this blog, we will unravel the power of Databricks’ SAT, focusing on the pivotal role it plays in generating daily health reports for your workspaces. It will also walk you through the step-by-step process of setting SAT in your workspace.

Click through to see what this tool can do for you.

Comments closed

Producing Messages with librdkafka

Published 2023-12-06 by Kevin Feasel

Jakub Korab dives into a Kafka library:

In a previous blog post (How To Survive an Apache Kafka® Outage) I outlined the effects on applications during partial or total Kafka cluster outages and proposed some architectural strategies to handle these types of service interruptions. The applications most heavily impacted by this type of outage are external interfaces that receive data, do not control request flow, and possibly perform some form of business transaction with the outside world before producing to Kafka. These applications are most commonly found in finance and written in languages other than Java—mostly C and C++.

l ibrdkafka is the main underlying client library used in non-JVM environments and has wrapper libraries for Python, .Net, Go, and an ever-expanding list of clients. It has not been written about to the same extent as the Java client, and it is worth examining as its interface and underlying mechanics are fundamentally different.

This library is quite useful and versatile.

Comments closed

Key Constraints in Databricks Unity Catalog

Published 2023-11-24 by Kevin Feasel

Meagan Longoria gives us a warning:

I’ve been building lakehouses using Databricks Unity catalog for a couple of clients. Overall, I like the technology, but there are a few things to get used to. This includes the fact that primary key and foreign key constraints are informational only and not enforced.

If you come from a relational database background, this unenforced constraint may bother you a bit as you may be used to enforcing it to help with referential integrity.

Read on to see what is available and why it can nonetheless be useful in some circumstances.

Comments closed

Lakehouse Management in Fabric via mssparkutils

Published 2023-11-21 by Kevin Feasel

Sandeep Pawar scripts out some lakehouse work:

At MS Ignite, Microsoft unveiled a variety of new APIs designed for working with Fabric items, such as workspaces, Spark jobs, lakehouses, warehouses, ML items, and more. You can find detailed information about these APIs here. These APIs will be critical in the automation and CI/CD of Fabric workloads.

With the release of these APIs, a new method has been added to the mssparkutils library to simplify working with lakehouses. In this blog, I will explore the available options and provide examples. Please note that at the time of writing this blog, the information has not been published on the official documentation page, so keep an eye on the documentation for changes.

This looks to be quite useful for CI/CD work.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop