Press "Enter" to skip to content

Author: Kevin Feasel

Alternatives to the Dead Letter Queue in Apache Kafka

Kai Waehner can’t return to sender:

This article focuses on the data streaming platform Apache Kafka. The main reason for putting a message into a DLQ in Kafka is usually a bad message format or invalid/missing message content. For instance, an application error occurs if a value is expected to be an Integer, but the producer sends a String. In more dynamic environments, a “Topic does not exist” exception might be another reason why the message cannot be delivered.

Therefore, as is so often the case, don’t simply carry over knowledge from your existing middleware experience. Message queue middleware, such as JMS-compliant IBM MQ, TIBCO EMS, or RabbitMQ, works differently than a distributed commit log like Kafka. A DLQ in a message queuing system exists for many other reasons that do not map one-to-one to Kafka. For instance, a message in an MQ system expires because of per-message TTL (time to live).

Hence, the main reason for putting messages into a DLQ in Kafka is a bad message format or invalid/missing message content.

Read on to learn the Kafka-based approach to dealing with bad messages rather than using a Dead Letter Queue.
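
For context, the conventional dead-letter-topic pattern that Kai weighs alternatives against looks roughly like this in Python with the confluent-kafka client. This is a minimal sketch: the topic names, the expected integer field, and the error-header convention are all made up for illustration.

```python
# A minimal sketch of the classic dead-letter-topic pattern in Kafka, using
# the confluent-kafka Python client. Topic names and the expected schema
# (an integer "amount" field) are hypothetical.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-processor",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        order = json.loads(msg.value())
        amount = int(order["amount"])   # fails if the producer sent a string like "ten"
        # ... normal processing here ...
    except (ValueError, KeyError, TypeError) as exc:
        # Bad format or missing content: forward the raw bytes to a dead-letter
        # topic along with the error, so the main consumer keeps moving.
        producer.produce(
            "orders-dlq",
            value=msg.value(),
            headers=[("error", str(exc).encode("utf-8"))],
        )
        producer.flush()
```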

Comments closed

Understanding Missing Index Impact

Erik Darling delves into the depths of missing indexes:

Breaking each of those down, the only one that has a concrete meaning is Uses, but that of course doesn’t mean that a query took a long time or is even terribly inefficient.

That leaves us with Average Query Cost, which is the sum of each operator’s estimated cost in the query plan, and Impact.

But where does Impact come from?

Read on to learn where, as well as why you shouldn’t blindly trust that number.
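
If you want to see the raw ingredients behind these numbers, the missing index DMVs expose them. Here is a rough sketch using pyodbc; the connection string is illustrative, and the weighting formula is just one common heuristic for ranking suggestions rather than Erik's method, with all of the same caveats attached.

```python
# A rough sketch of pulling missing-index suggestions from the DMVs with
# pyodbc. The connection string and the "improvement measure" weighting are
# illustrative, not authoritative.
import pyodbc

QUERY = """
SELECT d.statement AS table_name,
       s.user_seeks, s.user_scans,
       s.avg_total_user_cost, s.avg_user_impact,
       s.avg_total_user_cost * (s.avg_user_impact / 100.0)
           * (s.user_seeks + s.user_scans) AS improvement_measure
FROM sys.dm_db_missing_index_group_stats AS s
JOIN sys.dm_db_missing_index_groups AS g ON s.group_handle = g.index_group_handle
JOIN sys.dm_db_missing_index_details AS d ON g.index_handle = d.index_handle
ORDER BY improvement_measure DESC;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes;TrustServerCertificate=yes"
)
for row in conn.cursor().execute(QUERY):
    # avg_user_impact is the "Impact" percentage; the product above is only a
    # ranking heuristic and deserves the same skepticism as the raw number.
    print(row.table_name, round(row.improvement_measure, 1))
```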

Comments closed

SQL Server 2022 Public Preview on Linux

Amit Khandelwal has notes on SQL Server 2022 on Linux:

In continuation of last week’s announcement of SQL Server 2022 public preview, we are pleased to announce availability of SQL Server 2022 on Linux/Containers for public preview. Here are the details for getting started with the SQL Server 2022 public preview packages on Linux/Containers.

As usual, the officially supported distributions are Red Hat Enterprise Linux and Ubuntu.

Comments closed

T-SQL Language Enhancements in SQL Server 2022

Chad Baldwin checks out what’s new:

I’ve been excited to play around with some of the new features and language enhancements that are available in SQL Server 2022, so I’ve been keeping an eye on the Microsoft Docker repository for a new 2022 image. Well, they finally added it to Docker Hub! I immediately pulled the image and started playing with it.

I want to focus on the language enhancements as those are the easiest to demonstrate, and I feel that’s what you’ll be able to take advantage of the quickest after upgrading.

Read on for a dozen or so language enhancements. This isn’t as big a change as what 2012 brought, but there is a lot of useful stuff in here, as well as more that has been publicly announced like APPROX_PERCENTILE_CONT() (and _DISC(), yeah, but bah humbug).
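
As a quick taste of the sort of thing Chad covers, here is a sketch that runs APPROX_PERCENTILE_CONT() against a 2022 instance (such as the Docker image mentioned above) from Python via pyodbc. The Sales.Orders table, its columns, and the connection string are hypothetical.

```python
# A small sketch of one 2022 addition, APPROX_PERCENTILE_CONT, executed via
# pyodbc. The table, columns, and credentials below are made up.
import pyodbc

QUERY = """
SELECT o.CustomerID,
       APPROX_PERCENTILE_CONT(0.5)
           WITHIN GROUP (ORDER BY o.OrderTotal) AS median_order_total
FROM Sales.Orders AS o
GROUP BY o.CustomerID;
"""

with pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost,1433;"
    "UID=sa;PWD=yourStrong(!)Password;TrustServerCertificate=yes"
) as conn:
    for customer_id, median_total in conn.cursor().execute(QUERY):
        print(customer_id, median_total)
```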

Comments closed

Data Modeling with Spark–Breaking Data into Multiple Tables

Landon Robinson tokenizes data:

The result of joining the 2 DataFrames – pets and colors – displays the nickname, color and age of the pets. We went from a normalized dataset where common & recurring values were substituted for numeric representations to a slightly more denormalized dataset. Let’s keep going!

This is an interesting example of a useful technique but I strongly disagree with Landon about whether this is normalization. Translating a natural key to a surrogate key is not normalizing the data and translating a surrogate key to a natural key (which is what the example does) is not denormalizing the data. A really simplified explanation of the process is that normalization is ensuring that like things are grouped together, not that we build key-value lookup tables for everything. That’s why Landon’s “denormalized” example is just as normalized as the original: each of those attributes describes a unique thing about the pet identified by its (unique) nickname. This would be different if we included things like owner’s name (which could still be on that table), owner’s age, owner’s height, a list of visits to the vet for each pet, when the veterinarians received their licenses, etc.
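
To make the join itself concrete, here is a rough PySpark reconstruction of the pets and colors example; the values are invented since the post's data isn't reproduced here, and whether or not you call the result denormalized, the mechanics are the same.

```python
# A quick reconstruction of the pets/colors join from the post, in PySpark.
# The rows and the color_id surrogate key are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pets-colors").getOrCreate()

pets = spark.createDataFrame(
    [("Rex", 1, 3), ("Mittens", 2, 5), ("Bubbles", 3, 1)],
    ["nickname", "color_id", "age"],
)
colors = spark.createDataFrame(
    [(1, "brown"), (2, "black"), (3, "orange")],
    ["color_id", "color"],
)

# Joining on the numeric color_id brings the human-readable color value back
# onto each pet row, which is the step the post describes.
pets.join(colors, on="color_id", how="inner") \
    .select("nickname", "color", "age") \
    .show()
```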

Comments closed

ML Algorithms a Poor Fit for Predictive Caches

Pete Warden describes an interesting phenomenon:

I’ve been working on a new research paper, and a friend gave me the feedback that he was confused by the statement “memory accesses can be accurately predicted at the compilation stage” for machine learning workloads, and that this made them a poor fit for conventional processor architectures with predictive caches. I realized that this was received wisdom among the ML engineers I know, but I wasn’t aware of any papers that discuss this point. I put out a request for help on Twitter, but while there were a lot of interesting resources in the answers, I still couldn’t find any papers that focused on what feels like an important property for machine learning systems. With that in mind, I wanted to at least describe the issue as best as I can in this blog post, so there’s a trail of breadcrumbs for anyone else interested in how system designs might need to change to accommodate ML.

Read on for the explanation. My reading here is that this is a downside to having general-purpose compute: you run the risk of sub-optimal performance in certain circumstances, like training models using certain types of ML algorithms.
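
As a toy illustration of the property in question (my own sketch, not anything from Pete's paper): for a fixed-shape dense layer, the full sequence of element accesses is a pure function of the shapes, so it can be enumerated before any data exists, while a lookup-heavy workload's accesses depend on the input.

```python
# Toy contrast between a statically-predictable access pattern and a
# data-dependent one. Purely illustrative.
def dense_layer_trace(rows, cols):
    """Offsets a naive matrix-vector product touches; knowable before runtime."""
    trace = []
    for r in range(rows):
        for c in range(cols):
            trace.append(("weight", r * cols + c))
            trace.append(("input", c))
        trace.append(("output", r))
    return trace

def embedding_lookup_trace(token_ids, embedding_dim):
    """Offsets depend on runtime token ids, so a cache has to predict them."""
    return [("table", t * embedding_dim + d) for t in token_ids for d in range(embedding_dim)]

print(dense_layer_trace(2, 3))            # identical on every run
print(embedding_lookup_trace([7, 2], 4))  # changes with the input data
```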

Comments closed

Inferring Data from Its Absence

John Cook lays out an important insight:

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

Read the whole thing. This also leads to a swath of security attacks based around unions of information, in which each query may return data only when at least X people match (to prevent us narrowing down to one person), but based on some information I know about the person, I can write a combination of queries to elicit more info about that person. As an example, if I know that a person is left-handed (1/9 of the population), has red hair (around 2% of people), etc., I can find ways to combine these traits so that no individual query returns fewer than X results, yet have reasonably high confidence that I can identify the individual with enough queries.
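
Here is a small synthetic sketch of that kind of differencing attack using pandas. All of the data and the threshold K are made up, but it shows how three individually "safe" counting queries recover a suppressed one.

```python
# Synthetic demonstration of an intersection/differencing attack: each query
# clears a minimum-row-count safeguard, yet combining the allowed answers
# reveals the blocked count. All data here is invented.
import pandas as pd

people = pd.DataFrame({
    "name":        [f"person_{i}" for i in range(1000)],
    "left_handed": [i % 9 == 0 for i in range(1000)],   # roughly 1 in 9
    "red_hair":    [i % 50 == 0 for i in range(1000)],  # roughly 2%
})

K = 20  # naive safeguard: no query may return a count below K

def count_query(mask, k=K):
    """Answer a counting query only if at least k rows match."""
    n = int(mask.sum())
    return n if n >= k else None

lefties = count_query(people["left_handed"])                        # 112, allowed
gingers = count_query(people["red_hair"])                           # 20, allowed
either  = count_query(people["left_handed"] | people["red_hair"])   # 129, allowed
blocked = count_query(people["left_handed"] & people["red_hair"])   # None: too few rows

# Inclusion-exclusion over three allowed answers recovers the suppressed count:
print(blocked, lefties + gingers - either)   # None 3
```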

Comments closed

Cancelling a Cosmos DB Query

Hasan Savran pushes the big red button:

Sometimes you go to a website and the page does not want to open for whatever reason. You might just click Stop and move on to another page. The Stop button does not really stop the request; the server still tries to complete the request and send a response to the client, even if the client does not exist anymore. Rather than clicking Stop, maybe you click Refresh, which triggers another request while the first one has not completed yet.

This scenario applies to all database calls. A query might take longer to run for any number of reasons, and you simply need to wait until you get a response from the database server. If it takes too much time, you might want to cancel the request and look at your query to make it faster. You can do that programmatically by using CancellationTokens.

Read on for more information about cancellation tokens and how you can use them.
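
Hasan's examples use .NET CancellationTokens; the closest analogue in Python is asyncio task cancellation. This sketch simulates the long-running query with a sleep rather than calling any particular Cosmos DB API, purely to show the cancel-on-timeout pattern.

```python
# Conceptual Python analogue of cancelling a slow query: asyncio.wait_for
# cancels the underlying task when the timeout expires. The "query" is a
# stand-in sleep, not a real Cosmos DB call.
import asyncio

async def run_slow_query():
    # Stand-in for awaiting a long-running database call.
    await asyncio.sleep(30)
    return ["row 1", "row 2"]

async def main():
    try:
        # Give up after 2 seconds; cancellation propagates to the task,
        # much like signalling a CancellationToken in .NET.
        rows = await asyncio.wait_for(run_slow_query(), timeout=2)
        print(rows)
    except asyncio.TimeoutError:
        print("Query cancelled on the client; time to go tune it.")

asyncio.run(main())
```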

Comments closed

Unit Testing ADX Functions

David Giard builds some tests:

Our application contains many functions that return data stored in Azure Data Explorer (ADX). We wrote these functions in Kusto Query Language (KQL) and each function returns a set of data based on the arguments passed. Although developers tested these functions as they wrote them, we needed a way to validate that the functions continued to work as the code and the data changed.

Automated Unit testing is an essential part of any application development life cycle. It validates that code works properly and minimizes the risk that future code changes will break existing functionality.

In this article, I will discuss the approach we took in automating the testing of ADX functions.

Click through to see how to use the assert() function and build some tests.
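
For a flavor of what automating this from Python might look like, here is a rough pytest-style sketch using the azure-kusto-data package. The cluster URL, database, and GetRecentOrders() function are hypothetical, and where David's article keeps the check inside KQL with assert(), this version pulls the count back and asserts in Python instead.

```python
# A rough pytest-style test of an ADX function via azure-kusto-data.
# Cluster, database, and the GetRecentOrders function are hypothetical.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://mycluster.westus.kusto.windows.net"   # hypothetical
DATABASE = "MyDatabase"                                  # hypothetical

def test_get_recent_orders_returns_rows():
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER)
    client = KustoClient(kcsb)
    # GetRecentOrders(days) stands in for one of the KQL functions under test.
    response = client.execute(DATABASE, "GetRecentOrders(7) | count")
    row_count = response.primary_results[0].rows[0]["Count"]
    assert row_count > 0, "GetRecentOrders(7) returned no rows"
```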

Comments closed

Language Translation via Power BI Field Parameters

Gerhard Brueckl shows off a great use of Power BI Field Parameters:

The current approaches when it comes to data and value translations are more workarounds than actual solutions. They probably work fine for small data models and very specific use-cases, but they usually fall short in performance, usability, or maintainability when implemented on larger-scale enterprise models.

The recently introduced Field Parameters in Power BI give us a bit more flexibility here and another potential solution to implement data and value translations in Power BI.

Click through for an example which shows data in English, Spanish, and French.

Comments closed