Press "Enter" to skip to content

Day: February 18, 2021

The Data Mesh

James Serra explains what a data mesh is:

Its goal is to treat data as a product, with each source having its own data product manager/owner (who are part of a cross-functional team of data engineers) and being its own clearly-focused domain that has an autonomous offering, becoming the fundamental building blocks of a mesh, leading to a domain-driven distributed architecture. Note that for performance reasons, you could have a domain that aggregates data from multiple sources. Each domain should be discoverable, addressable, self-describing, secure (governed by global access control), trustworthy, and interoperable (governed by an open standard). Each domain will store its data in a data lake and in many cases will also have a copy of some of the data in a relational database (see Data Lakehouse defined for why you still want a relational database in most cases).

I’ll have to think more about this before I’m convinced. I’ll also need to think about the Aristotelian opposite of the data mesh.

Kafka in .NET

Diogo Souza walks us through building an application which produces and consumes messages using Apache Kafka:

Kafka is just the broker, the stage in which all the action takes place. The producers send messages to the world while the consumers read specific chunks of data. How do you differentiate one specific portion of data from the others? How do consumers know what data to consume? To understand this, you need a new actor in the play: the topics.

Kafka topics are the channels, the carriages that transport messages around. Kafka records produced by producers are organized and stored in topics.

This is a nice overview of Kafka, followed by the basics of building a consumer and a producer in C#. I just wish there were more community usage of Kafka in the .NET world, so that the Confluent .NET driver would include some of the really cool stuff added to Kafka over the past couple of years.

How to Update Statistics Manually

Matthew McGiffen takes us through the process of updating statistics:

At the heart of all the methods we’ll look at is the UPDATE STATISTICS command. There are a lot of options for using this command, but we’ll just focus on the ones you’re most likely to use. For full documentation here is the official reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/update-statistics-transact-sql

Even if you have an automated system in place, it's worth knowing how to update statistics by hand: you might need to run a one-off update to help a poorly performing query, or you might be using PolyBase, which can't update statistics automatically because the data isn't actually in SQL Server.
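
For reference, here is a minimal sketch of the forms you're most likely to reach for (the dbo.Sales table and the statistics name are placeholders, not taken from Matthew's post):

-- Update every statistics object on the table, using the default sampling rate:
UPDATE STATISTICS dbo.Sales;

-- Update one named statistics object with a full scan of the table:
UPDATE STATISTICS dbo.Sales IX_Sales_OrderDate WITH FULLSCAN;

-- Update all statistics on the table with a fixed sample:
UPDATE STATISTICS dbo.Sales WITH SAMPLE 25 PERCENT;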

Building a Function to Get the Next Date by Date Name or Offset

Louis Davidson has a function for us:

As I have been building my Twitter management software, I have been doing a lot more ad-hoc, repetitive coding using T-SQL directly. When I was generating new tweets for upcoming days, a bit of the process that got old quick was getting the date for an upcoming day (the primary key for my tweet table is the date, the type of the tweet, and a sequence number). After having to pick the date of next Tuesday… I had to write some more code (because a true programmer doesn’t do repetitive work when code can be written… even if sometimes the code doesn’t save you time for days or weeks).

So the following function was born from that need, and it is something I could imagine most anyone using semi-regularly, especially when testing software.

This is definitely fancy. My inclination would be to create a calendar table, as that solves this particular issue as well as more complex variants (like finding the next Tuesday which doesn’t fall on a holiday).
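
To sketch the calendar table idea, assuming a hypothetical dbo.Calendar table with one row per date (the table and column names here are mine, not Louis's):

-- Find the next Tuesday after today.
SELECT TOP (1) c.CalendarDate
FROM dbo.Calendar AS c
WHERE c.DayName = 'Tuesday'
  AND c.CalendarDate > CAST(GETDATE() AS date)
ORDER BY c.CalendarDate;

The holiday variant then becomes one more predicate (something like AND c.IsHoliday = 0) rather than more date math.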

Searching Text which Begins with a Wildcard

Chad Callihan is looking for some data:

Searching for a value or group of values with a wildcard is more than just putting a % on both sides of a text string. If you know you’re looking for all strings in a name field that start with the name “Chad”, then you are really shooting yourself in the foot by using ‘%Chad%’ instead of ‘Chad%’ in your query. SQL Server is going to be scanning the table instead of being able to use an index to seek to the data. While that may work to get your result, it’s likely going to take longer and be more invasive than needed. I want to go through an example of how using a reversed column can improve SQL Server’s ability to build a better execution plan.

Read on to see how a computed reversal column can help. For more complex scenarios, an n-gram table (for example, a trigram) might help, though there’s a lot of setup involved there.
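
As a rough sketch of the technique, assuming a hypothetical dbo.Person table with a LastName column (none of this comes from Chad's post):

-- Add a computed column holding the reversed string and index it.
ALTER TABLE dbo.Person
    ADD LastNameReversed AS REVERSE(LastName);

CREATE INDEX IX_Person_LastNameReversed
    ON dbo.Person (LastNameReversed);

-- A leading-wildcard search (names ending in 'son') can now seek
-- against the reversed column instead of scanning LastName:
SELECT p.LastName
FROM dbo.Person AS p
WHERE p.LastNameReversed LIKE REVERSE('son') + '%';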

So You’ve Run Out of Memory

Randolph West explains how the buffer pool handles low-memory situations:

One of the bigger clichés in the data professional vocabulary (behind “it depends”) is that you always give SQL Server as much RAM as you can afford, because it’s going to use it. But what happens when SQL Server runs out of memory?

Recently a question appeared on my post about how the buffer pool works, asking the following (paraphrased):

What happens if a data page doesn’t exist in the buffer pool, and the buffer pool doesn’t have enough free space? Does the buffer pool use TempDB, [and] does TempDB put its dirty pages into the buffer pool?

This is an excellent question (thank you for asking!). I spent 30 minutes writing my reply and then figured it would make a good blog post this week if I fleshed it out a little.

Read the whole thing.
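
Not from Randolph's post, but if you want a quick look at what the buffer pool is holding right now, a query along these lines against sys.dm_os_buffer_descriptors works:

-- One row per 8 KB page in the buffer pool, aggregated by database.
SELECT d.name AS database_name,
       COUNT_BIG(*) * 8 / 1024 AS buffer_pool_mb
FROM sys.dm_os_buffer_descriptors AS bd
    JOIN sys.databases AS d
        ON d.database_id = bd.database_id
GROUP BY d.name
ORDER BY buffer_pool_mb DESC;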

Index Creation with DROP_EXISTING

Monica Rathbun takes us through the DROP_EXISTING option when modifying an index:

When you are making changes to an existing Non-Clustered index, SQL Server provides a wide variety of options. One of the more frequently used methods is DROP EXISTING; in this post you will learn all about that option. This option drops and recreates an existing index in a single statement, without you having to drop the index explicitly. Let us take a moment to understand the behavior of this choice.

What I really want is DROP_IF_EXISTS. I want idempotent commands: whether I run the statement once or a thousand times, I end up in the same state, regardless of whether there was an index there at the start (or whether attempt #793 failed after running out of sort space in tempdb, leaving me with no index at all). DROP_EXISTING is only idempotent if the index already exists, which raises the question: why should it matter whether an index of that name is already there? The important part of the statement is the end state: I want this index, in this form, to exist once the command completes.
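
To make the distinction concrete, here is a hedged sketch (the table and index names are made up): DROP_EXISTING redefines an index in a single statement, but it fails if the index isn't already there, while the drop-then-create pattern is the closest thing to idempotence available today.

-- Fails if IX_Orders_CustomerID does not already exist:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate)
    WITH (DROP_EXISTING = ON);

-- Idempotent alternative: drop the index if it is present, then create it.
DROP INDEX IF EXISTS IX_Orders_CustomerID ON dbo.Orders;

CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate);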
