Press "Enter" to skip to content

Curated SQL Posts

Streaming Foreign Key Joins in Kafka Streams

John Roesler and Adam Bellemare take us in depth on a feature:

Before 2.4.0, the absence of foreign-key joins in Kafka Streams was palpable. As soon as you have a KTable abstraction, you start to think of relational-DB-esque things that you’d like to do with it, and joining two tables is near the top of the list. In addition, Kafka users often started out by implementing change data capture (CDC) of their main database tables, resulting in the production of normalized record streams reflecting the database model. These records often contain foreign-key references, requiring you to either denormalize entirely within your source database (which can be quite expensive), or handle them downstream in your consumer. The ability to compute denormalization on the fly is exactly in the sweet spot of use cases for Kafka Streams.

In versions prior to 2.4, there were workarounds available to compute a foreign-key join, using the ability to transform the table, filter it, aggregate on properties, and join on primary keys. But these workarounds were complex, prone to bugs, and not very efficient. A concrete plan to implement first-class support for this crucial operation was first put together when Jan Filipiak proposed KIP-213 in 2017. Adam Bellemare took over driving the proposal in 2018 and brought it to a conclusion in time for the 2.4.0 release.

Click through for examples of how it all works, as well as how you might optimize foreign key joins.


Optimizing BERT Models on Google Colab

Kevin Jacobs fine-tunes some NLP processes:

BERT is a language model and can thus be used for predicting the next word in a sentence. Furthermore, BERT can be used for automatic summarization, text classification and many more downstream tasks. Google Colab provides you with a cloud-based environment on which you can train your machine learning models on a GPU. The downside is that your data is uploaded to the Google cloud. Google Colab gives you the opportunity to fine-tune BERT.

Click through to see how.


Identifying Backpressure in Apache Flink

Piotr Nowojski explains an important concept in streaming (and ELT/ETL) products:

The backpressure topic was tackled from different angles over the last couple of years. However, when it comes to identifying and analyzing sources of backpressure, things have changed quite a bit in the recent Flink releases (especially with new additions to metrics and the web UI in Flink 1.13). This post will try to clarify some of these changes and go into more detail about how to track down the source of backpressure, but first…

Read on for the full story, including a review of the concept and its importance.


Reasons to Use Tidymodels

Roel Hogervorst explains when we may or may not want to use tidymodels versus rolling our own models in R:

When not

– You are always using GLM models (they are very flexible!). It makes no sense to me to go for the extra {parsnip} layer if you are always using the same models. You could still consider using {recipes} for feature engineering.

– You are familiar with the kind of data and what models will work on that data. Basically, you are an expert in this field and have worked in it for many years. There is no need to experiment.

Read on for concrete examples of when it does make sense. H/T R-Bloggers.


UI Patterns which Clash with Database Patterns

Michael J. Swart explains why we can’t have nice things:

I spend a large amount of time translating software requirements into schema and queries. These requirements are sometimes easy to implement but are often difficult. I want to talk about UI design choices that lead to data access patterns that are awkward to implement using SQL Server.

Read on for three such examples, including sorting, paging, and search.
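As a taste of the paging problem, here is the standard T-SQL pattern (table and column names are hypothetical). It looks innocent in the UI mockup, but SQL Server still has to read and discard every skipped row in sort order, so later pages get progressively more expensive:

```sql
-- Hypothetical example: page 50 of a product list, 25 rows per page.
-- The engine reads the first 1,225 rows just to throw them away.
SELECT p.ProductId, p.ProductName, p.Price
FROM dbo.Products AS p
ORDER BY p.ProductName, p.ProductId  -- tie-breaker keeps paging stable
OFFSET 1225 ROWS
FETCH NEXT 25 ROWS ONLY;
```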


Enabling Trace Flags in SQL Server

Robert Sheldon performs some level-setting:

SQL Server includes a set of configurable options known as trace flags. You can use trace flags to set server characteristics and control different types of operations. SQL Server offers a wide range of trace flags that let you modify the platform’s default behavior to meet specific requirements. Trace flags can help you when performing such tasks as testing stored procedures, diagnosing performance issues, or debugging complex computer systems. Microsoft Support might also recommend using certain trace flags to address behavior that’s impacting specific workloads. This article explains how to enable SQL Server trace flags.

Click through for the article.
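As a small illustration (not from the article itself), here is how a globally scoped trace flag is typically enabled and verified. Trace flag 3226 merely suppresses successful-backup messages in the error log, so it is a safe one to experiment with:

```sql
-- Enable trace flag 3226 globally; omit the -1 for session scope.
DBCC TRACEON (3226, -1);

-- Check which trace flags are currently active.
DBCC TRACESTATUS (-1);

-- Turn it back off when you are done experimenting.
DBCC TRACEOFF (3226, -1);
```

Note that flags enabled this way do not survive a restart; for that, you add -T3226 as a startup parameter instead.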


The Power BI Adoption Roadmap

Melissa Coates has a new document:

For the most part it’s targeted to orgs who have Power BI deployed to a certain extent, yet know there’s room for improvement. We focus mostly on the harder things that are more difficult to manage than the technology itself.

Although I did the writing and diagram creation, I did so with Matthew Roche’s direction, advice, and guidance. That man is a wealth of knowledge. If you’re reading this post, then you’re probably familiar with his series on building a data culture. If you haven’t reviewed that series thoroughly, please do. You’ll recognize a lot of common themes from his content in this new adoption roadmap.

Click through for some Q&A and information on where you can get the roadmap.


Using DBCC INPUTBUFFER

Monica Rathbun shows us how to use DBCC INPUTBUFFER:

A command I like to use when performance tuning is DBCC INPUTBUFFER. If you have ever run sp_whoisactive or sp_who2 to find out what sessions are executing when CPU is high, for instance, this can be a real quick life saver. At times, for me, those two options do not return enough information for what I’m looking for, which is the associated stored procedure or object. Using this little helper along with the session id can easily get you that information.

Let’s take a look.

Let’s.
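The pattern boils down to something like this (session id 62 is just a placeholder for whatever sp_whoisactive surfaced):

```sql
-- Find busy sessions first (sp_whoisactive, sp_who2, or a DMV query),
-- then feed the suspect session id to DBCC INPUTBUFFER.
DBCC INPUTBUFFER (62);

-- On SQL Server 2016 SP1 and later, there is a DMV equivalent
-- that you can query and join like any other table.
SELECT ib.event_info
FROM sys.dm_exec_input_buffer(62, NULL) AS ib;
```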


Having Fun with the QDS Toolbox

Jared Poche shares a few queries with us:

The QDS Toolbox is a set of tools that can help you review and store the performance-related data in Query Store. This was released by ChannelAdvisor last September thanks to the hard work of a number of my coworkers.

If you aren’t experienced with Query Store, this can provide a good starting point for getting familiar with data that is available and what you can do with it. If you are experienced with Query Store, this may give you an easy way to set up customizable reports that help you find issues and see trends.

The QDS Toolbox has several components, and I intend to post about each in turn. Two new components were added to this recently by @sqlozano (https://www.sqlozano.com/), bringing the current total to eight.

Click through for a deeper dive into the Server Top Queries report.
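If you are not yet familiar with the raw data the toolbox builds on, a quick toolbox-free look at the Query Store catalog views might be something like this (a minimal sketch, not one of the QDS Toolbox reports):

```sql
-- Top 10 queries by total CPU time recorded in Query Store.
SELECT TOP (10)
    q.query_id,
    qt.query_sql_text,
    SUM(rs.count_executions) AS executions,
    SUM(rs.avg_cpu_time * rs.count_executions) AS total_cpu_time_us
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt
    ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
    ON rs.plan_id = p.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_cpu_time_us DESC;
```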


Notes on Temporal Tables

Hugo Kornelis wraps up a discussion of temporal tables with miscellany:

Of course, it will be a quite common requirement to query products and their suppliers. When querying the present, you can just access the Suppliers and Products table without the “FOR SYSTEM_TIME” keyword, and the execution plan will unsurprisingly show that a regular join of the two “current” tables is used, with no reference to the history table. Nothing special. We already saw in the earlier parts that querying the present simply ignores the history table; joining does not make that different.

But what if, for instance, we want to show all data as it was valid on June 3 at noon? Well, that is also simple. We already know that we can use “FOR SYSTEM_TIME AS OF” to get the rows from a single temporal table as they were at a specific time. We can use that syntax for both tables, to get the data we need:

Read on as Hugo dives into some messy problems. Temporal table queries can expand out in complexity very quickly, as this post shows.
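A minimal sketch of that double AS OF pattern (the SupplierID join column and the exact column names are assumed; Hugo's post has the real queries):

```sql
-- Join both temporal tables as they looked on June 3 at noon.
-- FOR SYSTEM_TIME AS OF must be repeated per table; it is not inherited.
SELECT p.ProductName, s.SupplierName
FROM dbo.Products  FOR SYSTEM_TIME AS OF '2021-06-03T12:00:00' AS p
JOIN dbo.Suppliers FOR SYSTEM_TIME AS OF '2021-06-03T12:00:00' AS s
    ON s.SupplierID = p.SupplierID;
```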
