Curated SQL – Page 281 – A Fine Slice Of SQL Server

String Regularization and Tokenization in SQL Server

Published 2023-10-04 by Kevin Feasel

The Stack Exchange network logs a lot of web traffic – even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, Migrating a 40TB SQL Server Database.)

It’s no surprise that our log data is massive, but could it be smaller? Let’s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why 50 million visitors a month on Stack Overflow alone can add up quickly and punish our storage:

Click through for one technique Aaron has to tighten things up a bit.

Comments closed

Distinct Counts in Power BI and KQL

Published 2023-10-04 by Kevin Feasel

Dany Hoter needs a distinct count:

Calculating distinct counts on massive distributed datasets is not trivial.

Kusto (aka Azure Data Explorer/KQL database in Fabric) dcount and dcountif functions use a special algorithm to return an estimate of distinct counts

The new functions count_distinct and count_distinctif were recently added to calculate exact distinct counts. These two functions are much more expensive than the original ones.

Read on for more details on how this all works.

Comments closed

Apache Kafka Consumer Group Strategy

Published 2023-10-03 by Kevin Feasel

Lucia Cerchie gives us some advice:

Ever dealt with a misbehaving consumer group? Imbalanced broker load? This could be due to your consumer group and partitioning strategy!

Once, on a dark and stormy night, I set myself up for this error. I was creating an application to demonstrate how you can use Apache Kafka® to decouple microservices. The function of my “microservices” was to create latte objects for a restaurant ordering service. It was set up a little like this:

I wanted to implement this in Kafka by using consumers, each reading from a common coffee topic, but with their own partition. Now this was a naive approach. Why?

Click through to learn the reason, as well as some of the mechanics of how consumer groups work.

Comments closed

Plotting Decision Trees in R

Published 2023-10-03 by Kevin Feasel

Steven Sanderson builds a tree:

Decision trees are a powerful machine learning algorithm that can be used for both classification and regression tasks. They are easy to understand and interpret, and they can be used to build complex models without the need for feature engineering.

Once you have trained a decision tree model, you can use it to make predictions on new data. However, it can also be helpful to plot the decision tree to better understand how it works and to identify any potential problems.

In this blog post, we will show you how to plot decision trees in R using the rpart and rpart.plot packages. We will also provide an extensive example using the iris data set and explain the code blocks in simple to use terms.

Read on to see an example of how to do this.

Comments closed

Data Center Staffing Disasters

Published 2023-10-03 by Kevin Feasel

Steve Jones reads an after-action report:

There was a failure recently at an Azure data center in Australia when a utility power sag caused equipment to trip offline at one of the Azure data centers in Australia. You can read about it here, but essentially the headline is that there were only three people on site when the incident occurred, and that caused them to be unable to restart the equipment in time before an outage occurred.

Read on to learn more about why this failed and what Steve has seen in the wild.

Comments closed

Options for Running Jobs against Azure SQL DB

Published 2023-10-03 by Kevin Feasel

Anthony Norwood replaces on-prem SQL Agent jobs:

Both SQL Server on Azure VM and Azure SQL Managed Instance provide you with SQL Server Agent and therefore the capability to run scheduled tasks against your databases, so when we’re talking about being able to run jobs we’re only considering Azure SQL Database as needing guidance – some of the suggestions in the following paragraphs can also apply to all these options of SQL Server, but perhaps not as necessary.

We’re going to provide you with four options for how you might be able to still run your favourite SQL Agent Jobs against an Azure SQL Database, each of which come with their own advantages and disadvantages – one not mentioned is Data Factory, sometimes referred to as SSIS in the cloud, and this is because we’re trying to focus on some options that may be more comfortable to people who have never built an SSIS package before.

Read on for the four options Anthony has for us.

Comments closed

MERGE is (Kinda) Okay

Published 2023-10-03 by Kevin Feasel

Hugo Kornelis performs a survey:

The MERGE statement compares source and target data, and then inserts into, updates, and deletes from the target table, all in a single statement. This statement was introduced in SQL Server 2008. I liked it, because it allows you to replace a set of multiple queries with just one single query. And while a statement with that many options necessarily has a more complex syntax, I still believe that, in most cases, a single MERGE statement is easier to read, write, and maintain, than a combination of at least an INSERT and an UPDATE, often a DELETE, and sometimes first a SELECT into a temporary table if the source is complex.

Click through for a review of a variety of problems people have had in the past. It surprised me a bit when I learned how few of these issues were still active problems caused by MERGE.

Comments closed

Ordering Boxplots in Base R

Published 2023-10-02 by Kevin Feasel

Steven Sanderson wants things in the right order:

The first example shows how to order the boxplots based on a specific order for the variable being plotted. We will use the built-in airquality dataset in R. The following code shows how to order the boxplots based on the following order for the Month variable: 5, 8, 6, 9, 7.

Click through to see that result, as well as two other examples.

Comments closed

Tiered Storage in Apache Kafka

Published 2023-10-02 by Kevin Feasel

Matthew de Detrich explains the value behind tiered storage in Apache Kafka:

Tiered Storage is arguably one of the most sought-after features of Kafka 3.6, allowing Kafka’s core data to be stored in other locations, such as object storage, in addition to hard disks in a transparent manner, without any changes to Kafka’s producers or consumers. The Kafka brokers control whether the data is stored on local disks, fast but expensive and limited, or in alternative storage places, such as Amazon S3. When Tiered Storage is properly configured, it means you can have the best of both worlds: recent data is stored on local fast disks (as is currently), and older, less frequently accessed data can be stored elsewhere where it’s cheaper and space requirements are less of a concern (sometimes unlimited!)

Read on to learn more about the official version of tiered storage, as well as a forward port of two prior implementation attempts to Kafka version 3.3.

Comments closed

The Net Effect of Pausing Microsoft Fabric Capacity

Published 2023-10-02 by Kevin Feasel

Nicky van Vroenhoven calls a timeout:

After a while it got me thinking:

What does it actually mean when I pause a Fabric capacity?

What will stop working?

What can I still do and won’t stop working?

Read on for the answer to this question.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Curated SQL Posts