Press "Enter" to skip to content

Category: Hadoop

Querying the Power BI REST API from Fabric Spark

Gerhard Brueckl makes the call:

Microsoft Fabric has a lot of different components which usually work very well together. However, even though Power BI is a fundamental part of Fabric, there is not really a tight integration between the Data Engineering components and Power BI. In this blog post I will show you an easy and reusable way to query the Power BI REST API via Fabric Spark SQL in a very straightforward way. The extracted data can then be stored in the data lake, e.g. to create a history of your dataset refreshes, the state of your workspaces, or any other information that is provided by the REST API.

Click through for a list of operations, followed by the code you’ll need to pull this off.
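
If you just want a feel for the shape of it, here is a rough sketch (not Gerhard's code) of calling one REST endpoint from a Fabric PySpark notebook and landing the result in a table. The token acquisition is stubbed out and the table name is made up.

```python
# Rough sketch, not the code from the post: call the Power BI REST API from a
# Fabric (PySpark) notebook and persist the result so you can build a history.
# How you obtain the bearer token depends on your environment; it is a
# placeholder here.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

access_token = "<bearer token for the Power BI REST API>"  # placeholder

resp = requests.get(
    "https://api.powerbi.com/v1.0/myorg/groups",  # lists workspaces you can access
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
resp.raise_for_status()
workspaces = resp.json().get("value", [])

# Keep it simple for the sketch: just the id and name of each workspace.
df = spark.createDataFrame(
    [(w["id"], w["name"]) for w in workspaces],
    ["workspace_id", "workspace_name"],
)

# Appending on each run gives you a crude history table in the lakehouse.
df.write.mode("append").saveAsTable("powerbi_workspaces_history")
```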


Tuning Kafka Connect Source Connectors

Catalin Pop makes things faster:

Kafka Connect is an open source data integration tool that simplifies the process of streaming data between Apache Kafka® and other systems. Kafka Connect has two types of connectors: source connectors and sink connectors. Source connectors allow you to read data from various sources and write it to Kafka topics. Sink connectors send data from the topics to another endpoint. This blog post discusses how to tune your source connectors to help you get the best throughput out of your compute resources. 

This includes which elements are tunable, metrics you’ll want to pay attention to along the way, and a detailed example.
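
To make the tunables concrete, here is an illustrative registration of a JDBC source connector through the Connect REST API, with a few of the settings that commonly get tuned for throughput. This is not from the article; the values are placeholders rather than recommendations, and the producer.override.* settings only take effect if the worker allows client config overrides.

```python
# Illustrative only: register a JDBC source connector with a handful of
# throughput-related settings. Hostnames, table details, and values are
# placeholders.
import json
import requests

connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db:5432/orders",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "pg-",

        # Parallelism: more tasks can help, if the source can be split across them.
        "tasks.max": "4",

        # JDBC-source-specific batching: rows fetched per poll.
        "batch.max.rows": "500",

        # Override the embedded producer to batch and compress more aggressively.
        "producer.override.linger.ms": "100",
        "producer.override.batch.size": "262144",
        "producer.override.compression.type": "lz4",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
    timeout=30,
)
resp.raise_for_status()
```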


Flink Streaming Use Cases for Kafka Users

Jean-Sebastien Brunner gives us some use cases:

In Part One of our “Inside Flink” blog series, we explored the critical role of stream processing and why developers are increasingly choosing Apache Flink® over other frameworks. 

In this second installment, we’ll showcase how innovative teams across every industry and size are putting stream processing into practice, from streaming data pipelines that feed ML model training and more timely analytics, to fraud detection in finance and real-time inventory management in retail. We’ll also discuss how Flink is uniquely suited to support a wide spectrum of use cases and helps teams uncover immediate insights in their data streams and react to events in real time.

This article stays more at the “art of the possible” level rather than drilling into how we can do it.


Versioned State Store in Kafka Streams

Victoria Xia announces new functionality in Apache Kafka 3.5:

Since the introduction of stream processing, there have been three certainties in life: death, taxes, and out-of-order data. As a stream processing library built for Apache Kafka, Kafka Streams processes data in offset order. When out-of-order data is present, offset order differs from timestamp order and care must be taken to ensure that processing results respect timestamp order where appropriate. The introduction of versioned state stores to Kafka Streams in the Apache Kafka 3.5 release is a huge milestone in this direction.

In this blog post, I’ll address the what, why, and how of versioned stores in Kafka Streams, including what they are, why you might like to use them, how to get started, and a couple of things to watch out for when upgrading.

Read on to see what this entails and how you can try it out yourself.
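
The API itself is Java, so there is nothing to paste into a Python notebook, but here is a toy model (emphatically not the Kafka Streams API) of the semantics the post describes: keep several timestamped versions per key within a history retention window, and answer lookups as of a timestamp rather than only returning the latest value.

```python
# Toy illustration only, not the Kafka Streams API: model a versioned
# key-value store that retains multiple timestamped versions per key and
# answers "what was the value as of time t?"
from collections import defaultdict


class ToyVersionedStore:
    def __init__(self, history_retention_ms):
        self.history_retention_ms = history_retention_ms
        self._versions = defaultdict(list)  # key -> list of (timestamp, value)

    def put(self, key, value, ts):
        versions = self._versions[key]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0])  # out-of-order writes are allowed
        # Expire versions older than the retention window, relative to the
        # newest timestamp seen for this key.
        cutoff = versions[-1][0] - self.history_retention_ms
        self._versions[key] = [tv for tv in versions if tv[0] >= cutoff]

    def get_as_of(self, key, ts):
        """Latest value whose timestamp is <= ts, if still retained."""
        candidates = [v for t, v in self._versions.get(key, []) if t <= ts]
        return candidates[-1] if candidates else None


store = ToyVersionedStore(history_retention_ms=60_000)
store.put("user-1", "bronze", ts=1_000)
store.put("user-1", "gold", ts=5_000)
print(store.get_as_of("user-1", ts=3_000))  # "bronze": the value as of t=3000
```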


Stream Processing with Flink and Kafka

Konstantin Knauf starts a new series:

There was a huge amount of buzz about Apache Flink® at this year’s Kafka Summit London. From an action-packed keynote to standing-room only breakout sessions, it’s clear that the Apache Kafka® community is hungry to learn more about Flink and how the stream processing framework fits into the modern data streaming stack.

That’s why we’re excited to introduce our new “Inside Flink” blog series that takes a deeper look at why developers and organizations everywhere are shifting their stream processing technologies to Flink. Our first blog post explains what Flink is and how it can enhance your streaming use cases running on Kafka. Future topics will include common Flink use cases, an inside look at Flink SQL, and much more.

Click through for the first post in the series, which covers what Flink is and how the two products can interoperate.
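
If you want a taste of the interop before diving in, here is a hedged sketch of reading a Kafka topic with Flink SQL via PyFlink. It assumes the Kafka SQL connector jar is available to the job, and the broker address, topic, and columns are invented for the example.

```python
# Sketch: consume a Kafka topic with Flink SQL through PyFlink and run a
# continuous windowed aggregation over it. Names are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'broker:9092',
        'properties.group.id' = 'flink-orders',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Revenue per one-minute tumbling window, emitted continuously.
result = t_env.sql_query("""
    SELECT window_start, window_end, SUM(amount) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end
""")
result.execute().print()
```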


Adaptive Query Execution in Spark 3.0

The Big Data in Real World team talks on-the-fly execution plan changes:

Apache Spark, the popular distributed computing framework, has been widely adopted for processing large-scale data. With the release of Apache Spark 3.0, a groundbreaking feature called Adaptive Query Execution (AQE) was introduced. AQE addresses the limitations of traditional static execution plans by dynamically optimizing query execution based on runtime statistics. In this blog post, we will explore how AQE works and how it significantly improves the performance of Spark applications.

This sounds pretty similar to adaptive query processing in SQL Server, though a look at the Spark documentation shows that there are some practical differences in implementation versus what SQL Server does.
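
For reference, here is a minimal PySpark sketch of the main AQE switches, plus a toy join where AQE can change the plan at runtime. The configuration keys are the standard Spark 3.x settings; the data is made up.

```python
# Minimal sketch: the main AQE configuration switches in Spark 3.x. AQE
# re-optimizes the plan at shuffle boundaries using runtime statistics.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")                     # master switch (on by default since 3.2)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions in joins

big = spark.range(0, 10_000_000).withColumnRenamed("id", "k")
small = spark.range(0, 1_000).withColumnRenamed("id", "k")

# With AQE on, Spark can demote this sort-merge join to a broadcast join at
# runtime if the post-shuffle size of `small` turns out to be tiny.
big.join(small, "k").count()
```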


Microsoft Fabric Notebooks and Compute Limits

Reitse Eskens hits a wall:

In this case, my notebook threw an error at me but the command seemed to finish without any issue. Sounds vague? It did to me. The notebook cell I tried to run had a lot of stuff happening at the same time.

As you can see in the above screenshot, the status shows green checkmarks but there’s an error as well. The error message was not really clear to me, but that can really be my lack of deep-level experience. So, I logged a call with Microsoft Support to see what they could come up with.

I’ve had enough experience with Spark to see the issue and figure out the response, but click through for the screenshot and what Reitse did to resolve the issue.


Creating a Simple Date Dimension in Databricks

Chen Hirsh builds a table:

A date dimension is extremely useful and is required by most BI applications. This kind of dimension has a key at a particular time grain (day, month, etc.) and attributes that describe it, such as year, month, and so on. In your BI model, you join this dimension to facts on their date fields, to aggregate from the day level to week, month, and year.

In this post, I will demonstrate how to create a date dimension on Azure Databricks using Python. A link to the complete Databricks notebook is at the end of the post.

Check out the code, as well as explanation, in that post.
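
As a minimal sketch of the idea (Chen's notebook is more complete), you can generate one row per day with a date sequence and derive the attributes with built-in functions. Table and column names here are my own, not from the post.

```python
# Minimal sketch: build a date dimension in PySpark and save it as a Delta table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One row per day over the range we care about.
days = spark.sql("""
    SELECT explode(sequence(to_date('2015-01-01'), to_date('2030-12-31'),
                            interval 1 day)) AS date
""")

dim_date = (
    days
    .withColumn("date_key",    F.date_format("date", "yyyyMMdd").cast("int"))
    .withColumn("year",        F.year("date"))
    .withColumn("quarter",     F.quarter("date"))
    .withColumn("month",       F.month("date"))
    .withColumn("month_name",  F.date_format("date", "MMMM"))
    .withColumn("day",         F.dayofmonth("date"))
    .withColumn("day_of_week", F.dayofweek("date"))
    .withColumn("is_weekend",  F.dayofweek("date").isin(1, 7))  # Sunday=1, Saturday=7
)

dim_date.write.format("delta").mode("overwrite").saveAsTable("dim_date")
```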


Recommendations for Snapshotting HDFS

Tsz Sze provides some recommendations:

The snapshots feature of the Apache Hadoop Distributed Filesystem (HDFS) enables you to capture point-in-time copies of the file system and protect your important data against corruption, user errors, or application errors. This feature is available in all versions of Cloudera Data Platform (CDP), Cloudera Distribution for Hadoop (CDH), and Hortonworks Data Platform (HDP). Regardless of whether you’ve been using snapshots for a while or contemplating their use, this blog gives you the insights and techniques to get the most out of them.

Read on for those tips and tricks.
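
The snapshot operations themselves are HDFS shell commands, so there is not much code to write; here is a small sketch (paths and snapshot names invented) that wraps the usual allow/create/delete cycle from Python.

```python
# Sketch: drive the HDFS snapshot commands from Python. The path and snapshot
# names are placeholders.
import subprocess
from datetime import datetime

path = "/data/warehouse"
snapshot_name = "daily-" + datetime.now().strftime("%Y%m%d")

# A directory must be made snapshottable once (requires admin privileges).
subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", path], check=True)

# Create a point-in-time snapshot; it appears under <path>/.snapshot/<name>.
subprocess.run(["hdfs", "dfs", "-createSnapshot", path, snapshot_name], check=True)

# Old snapshots keep deleted blocks alive, so prune them when no longer needed.
subprocess.run(["hdfs", "dfs", "-deleteSnapshot", path, "daily-20240101"], check=True)
```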


Key Metrics for Kafka Clusters

Paul Brebner tracks some measurements:

Like a steam locomotive, Apache Kafka has lots of “moving parts”. A Kafka cluster is actually a distributed system which consists of multiple brokers (servers). Messages are sent to Kafka with producers, and consumers receive messages. Topics are used to direct messages between producers and consumers: producers write to selected topics, and consumers subscribe and read from selected topics. Topics are divided up into partitions, and the partitions are distributed over the available brokers for high availability and concurrency. Partitions are replicated from the leader partition to follower partitions (a replication factor of 3 is common for production clusters).

Click through for the description of metrics on brokers, topics, consumer groups, and more, as well as an extended analogy to steam locomotives which works surprisingly well.
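
As one concrete example of the sort of measurement Paul covers, consumer lag is the partition high watermark minus the group's committed offset. Here is a hedged sketch using the confluent-kafka client; the broker, topic, and group names are placeholders.

```python
# Sketch: compute consumer lag per partition as (high watermark - committed offset).
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "orders-consumers",
    "enable.auto.commit": False,
})

metadata = consumer.list_topics("orders", timeout=10)
partitions = [TopicPartition("orders", p) for p in metadata.topics["orders"].partitions]

for tp, committed in zip(partitions, consumer.committed(partitions, timeout=10)):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # If the group has never committed, fall back to the low watermark.
    offset = committed.offset if committed.offset >= 0 else low
    print(f"partition {tp.partition}: lag = {high - offset}")

consumer.close()
```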
