Press "Enter" to skip to content

Author: Kevin Feasel

Spring Cleaning for Lakehouse Tables with VACUUM

Chen Hirsh says it’s time to do a bit of cleanup:

Delta tables create new files for every change made to the table (insert, update, delete). You can use the old files to “time travel” – to query or restore older versions of your table. This is a wonderful feature, but over time, these files accumulate in your storage and will increase your storage costs.

Read on for a primer on the VACUUM command, how frequently you might want to run it, and how much history you might want to retain. The example is specific to Databricks, but the mechanism works the same way in other lakehouse platforms like Microsoft Fabric.
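
To get a feel for the command, here is a minimal sketch in PySpark, assuming a Databricks notebook (where `spark` is already defined) and a hypothetical table named sales:

```python
from delta.tables import DeltaTable

# VACUUM removes data files that are no longer referenced by the
# table's transaction log and are older than the retention window.
# The default retention is 7 days (168 hours).
table = DeltaTable.forName(spark, "sales")

# Keep 30 days of history for time travel, then clean up the rest.
table.vacuum(retentionHours=30 * 24)

# The SQL equivalent:
spark.sql("VACUUM sales RETAIN 720 HOURS")
```

Note the trade-off: the shorter the retention window, the less you pay for storage, but the less time travel history you keep.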

Comments closed

Building a Microsoft Fabric Notification Bot for Teams

Stepan Resl gets the message:

Power BI provided a wide range of notification options, each with its own advantages and disadvantages, often requiring us to combine them to achieve the desired result. Historically, we primarily relied on Subscriptions, Alerts, Outages Emails, and Refresh Error Emails. Combining these options was complex and inefficient. With the introduction of the ExecuteQuery endpoint in the Power BI REST API, entirely new possibilities opened up. By sending a DAX query to a selected dataset and processing the returned data, we can create virtually any custom notification, enabling data-driven notifications. This endpoint also came with built-in support for Microsoft Power Automate and Logic Apps, significantly simplifying the process due to their native connectors to other services like Exchange, SharePoint, and Microsoft Teams (including support for Adaptive Cards).

Read on for a bit of history, why a custom solution was necessary for Stepan, and the process of building out that solution.
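
To make the endpoint concrete, here is a minimal sketch in Python. The dataset ID, token, and FactSales table are placeholders, but the URL and response shape follow the Execute Queries operation in the Power BI REST API:

```python
import requests

# Placeholders: supply your own dataset ID and an Azure AD bearer
# token with rights to query the dataset (acquired via MSAL, say).
DATASET_ID = "00000000-0000-0000-0000-000000000000"
TOKEN = "<access token>"

url = f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}/executeQueries"
body = {
    "queries": [{"query": "EVALUATE TOPN(10, FactSales)"}],  # any DAX query
    "serializerSettings": {"includeNulls": True},
}

resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# The first query's first table holds the rows; from here you can
# build whatever notification you like, such as an Adaptive Card.
rows = resp.json()["results"][0]["tables"][0]["rows"]
print(rows)
```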

Comments closed

ReplaceInName in dbatools’ Backup-DbaDatabase

Jess Pomfret discovers a new flag:

Recently I was reading the docs for `Backup-DbaDatabase` and found a parameter I didn’t realise existed, but which is so useful when you want to automate backups while keeping control of the file names.

Click through to learn more about the feature, and consider it a reminder that it’s a good idea to read through the documentation. Not all documentation is good, but the work people have put into dbatools means there is usually a solid example for most of the available parameters. It turns out the documentation for Backup-DbaDatabase includes two examples that use -ReplaceInName, so you get not only a description of the parameter but also two concrete demonstrations of how to use it.
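
The cmdlet itself is PowerShell, but the idea behind -ReplaceInName is easy to picture: tokens such as dbname, servername, and timestamp in the name you supply get swapped for their runtime values. A rough illustration of the concept in Python (this is not dbatools code, just the pattern):

```python
from datetime import datetime

def expand_backup_name(template: str, dbname: str, servername: str) -> str:
    """Swap literal tokens in a backup file name for runtime values,
    mimicking the idea behind -ReplaceInName."""
    return (template
            .replace("dbname", dbname)
            .replace("servername", servername)
            .replace("timestamp", datetime.now().strftime("%Y%m%d%H%M%S")))

# "dbname-servername-timestamp.bak" -> "Sales-SQL01-20250101093000.bak"
print(expand_backup_name("dbname-servername-timestamp.bak", "Sales", "SQL01"))
```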

Comments closed

Troubleshooting a Slow Mapping Data Flow in Azure Synapse Analytics

Reitse Eskens has the need for speed:

The issue was quite straightforward. The client has a mapping data flow in Synapse that processes a few hundred to a few thousand rows but takes 15 minutes to complete. The low number of rows compared to the time necessary is a cause for concern.

The data extraction needs a staging storage account where the data is written into TXT files. The second step of the mapping data flow reads the TXT files and writes them out in Delta format, which consists of Parquet files.

The source is an S/4HANA CDC table; the target is a regular Azure storage account.

Read on for Reitse’s summary of the troubleshooting and testing process, as well as what ended up working for this customer.
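
For reference, the second hop of that flow (read staged text files, write Delta) looks roughly like this in PySpark. Mapping data flows generate Spark under the covers, so treat this as a conceptual sketch with placeholder paths rather than the client's actual pipeline:

```python
# Read the staged TXT files from the staging storage account.
staged = (spark.read
          .option("header", True)
          .option("delimiter", "\t")
          .csv("abfss://staging@mystorageaccount.dfs.core.windows.net/cdc/"))

# Write them back out in Delta format: Parquet data files plus a
# transaction log, landing in the target storage account.
(staged.write
       .format("delta")
       .mode("append")
       .save("abfss://target@mystorageaccount.dfs.core.windows.net/sales/"))
```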

Comments closed

Creating a Dashboard in Snowflake

Kevin Wilkie doesn’t just do e-mails:

If you haven’t had a chance to mess around in Snowflake, there are internal dashboards that you can provide to your users (or just use yourself) so that you can get the latest data.

Today, I want to quickly show a use case for these Dashboards and one way that you can get data quickly and easily in a nice format.

And right above Dashboards is Streamlit, one of my favorite libraries for building rapid application prototypes and even small data applications.
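
Since Streamlit gets a mention: a minimal data app really is just a few lines. This sketch assumes a configured Snowflake connection named "snowflake" (Streamlit in Snowflake wires this up for you) and a hypothetical ORDERS table:

```python
import streamlit as st

st.title("Daily orders")

# st.connection handles credentials and caching; .query returns a
# pandas DataFrame. Snowflake upper-cases unquoted column names.
conn = st.connection("snowflake")
df = conn.query(
    "SELECT order_date, COUNT(*) AS orders "
    "FROM orders GROUP BY order_date ORDER BY order_date"
)

st.line_chart(df, x="ORDER_DATE", y="ORDERS")
st.dataframe(df)
```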

Comments closed

Comparing Apache Iceberg to Delta Lake

Maria Zakourdaev compares technologies:

Public cloud blob storage has been a standard for data lakes for the last 10 years. Blob storage originally came about to solve data warehouse storage limitations: it is very cheap and has unlimited capacity. You can store any data format (structured, semi-structured, or unstructured) in a data lake located on blob storage and keep any amount of raw data for an unlimited time. When considering Apache Iceberg vs Delta Lake, both can manage data efficiently. Depending on access frequency, data can be stored on cold or warm tiers of cloud storage, saving even more in costs.

Read on to see how the two formats compare along several dimensions, as well as some general guidance at the end on which to choose.
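
From the Spark API surface, the two formats look quite similar; the differences Maria walks through live in the metadata layer, catalog integration, and surrounding ecosystem. A hedged sketch, assuming a Spark session configured with both the delta-spark and iceberg-spark-runtime packages and an Iceberg catalog named lake:

```python
# The same DataFrame written in each format. Both store data as
# Parquet; they differ in how they track metadata and snapshots.
df = spark.range(1000).withColumnRenamed("id", "order_id")

# Delta Lake: a JSON transaction log (_delta_log) next to the data.
df.write.format("delta").mode("overwrite").save("/lakehouse/orders_delta")

# Apache Iceberg: catalog-managed metadata and manifest files.
df.writeTo("lake.db.orders_iceberg").using("iceberg").createOrReplace()
```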

Comments closed

Troubleshooting an Apache Flink Job Not Producing Results

Wade Waldron digs in:

Imagine that you have built an Apache Flink® job. It collects records from Apache Kafka®, performs a time-based aggregation on those records, and emits a new record to a different topic. With your excitement high, you run the job for the first time, and are disappointed to discover that nothing happens. You check the input topic and see the data flowing, but when you look at the output topic, it’s empty.

In many cases, this is an indication that there is a problem with watermarks. But what is a watermark?

Read on for a primer on watermarks, followed by an explanation of the common solution to the problem Wade describes.
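
To show where watermarks enter the picture, here is a minimal PyFlink sketch. The event_time field is hypothetical; the idleness setting addresses the common case where quiet Kafka partitions keep the watermark (and therefore your windows) from ever advancing:

```python
from pyflink.common import Duration
from pyflink.common.watermark_strategy import WatermarkStrategy, TimestampAssigner

class EventTimeAssigner(TimestampAssigner):
    # Assumes each record carries an epoch-millisecond "event_time"
    # field; adjust the extraction to your actual record type.
    def extract_timestamp(self, value, record_timestamp):
        return value["event_time"]

# Tolerate events up to 5 seconds out of order, and keep the
# watermark moving even when a source partition goes quiet.
watermarks = (WatermarkStrategy
              .for_bounded_out_of_orderness(Duration.of_seconds(5))
              .with_timestamp_assigner(EventTimeAssigner())
              .with_idleness(Duration.of_seconds(60)))

# Then attach it to the stream:
# stream = stream.assign_timestamps_and_watermarks(watermarks)
```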

Comments closed

Hierarchy Management in PostgreSQL

Boris Novikov builds a list:

The Postgres database management system was originally designed as object-relational. We review why PostgreSQL is an object-relational, rather than just relational, database management system and briefly discuss what kinds of object-oriented features it offers. We demonstrate how these features can be used to manage hierarchies and how applications can benefit from using them.

Click through for a technique native to PostgreSQL.
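
As a taste of what "object-relational" means in practice, one such feature is table inheritance, where a child table inherits its parent's columns and queries against the parent also scan the children. A toy sketch via psycopg2 (connection details and schema are placeholders, and this may not be the exact feature Boris highlights):

```python
import psycopg2

ddl = """
CREATE TABLE employee (
    id   integer,
    name text NOT NULL
);
CREATE TABLE manager (
    department text
) INHERITS (employee);
"""

with psycopg2.connect("dbname=demo") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
        cur.execute("INSERT INTO manager (id, name, department) VALUES (1, 'Ada', 'R&D')")
        # Selecting from the parent also returns rows from the child.
        cur.execute("SELECT id, name FROM employee")
        print(cur.fetchall())
```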

Comments closed

A Primer on Wait Stats in SQL Server

Kevin Hill wants to know where your server hurts:

Everything in SQL Server is waiting for something…this is by design. And most people never think about it…

Until your application is waiting to complete its request, sell the product or provide “the report” to the CEO!

Most SQL Server performance issues come down to one thing: waiting. Every time a query slows down, SQL Server records what it was waiting on. These wait stats are one of the best diagnostic tools for DBAs, sysadmins, and developers – but only if you know how to use them.

Click through for the explanation. Kevin does a great job of providing the 100-level explanation of wait stats to new DBAs, giving them an understanding of what wait stats are, why they are important, and how we can use them to troubleshoot a performance issue. After really locking in the 100-level explanation, it’s a good idea to come back later and understand the difference between signal and non-signal waits (and when that distinction is relevant), how to track wait stats as part of a query execution plan, and how to correlate wait stats with other measures to find the root cause sooner.
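
If you want to take that first look yourself, the DMV behind most wait stats scripts is sys.dm_os_wait_stats. A small sketch in Python via pyodbc (the connection string is a placeholder, and a production query would also filter out benign waits like SLEEP_TASK):

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "Trusted_Connection=yes;TrustServerCertificate=yes"
)

# signal_wait_time_ms is time spent waiting for a CPU after the
# resource became available; subtracting it leaves the pure
# resource wait, which ties back to the signal vs. non-signal split.
query = """
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
"""

for row in conn.cursor().execute(query):
    print(row.wait_type, row.resource_wait_ms, row.signal_wait_time_ms)
```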

Comments closed