Press "Enter" to skip to content

Curated SQL Posts

Storing the Basis of Calculations, Not Results

Vladimir Khorikov hits on a topic of particular interest:

On the front-end, the duration is represented as 1h 47m. But how should we store it in our database? What data type should we use?

We could keep the duration as a 1h 47m string, just as we render it on the screen, but what if we later decide to change the string format? We might need to display it as 1:47 or 107 minutes instead. Do we then parse all the existing strings and convert them into a new format?

I like the thrust of this article. Two things I think people forget about in data modeling are ledgers and events, instead trying to store the results of these. Ledgers and events can be slower—though there are things we can do to pre-calculate more user-friendly results and speed up the process—but they provide you auditability, flexibility in how you want to calculate and display the information, and the ability to correct errors over time.

Comments closed

SQL Alerts on Potential Attacks

David Fowler keeps his head on a swivel:

But let’s be honest, the above attacks are just going to be opportunistic hackers probably using an automated tool to hunt out open servers and try to exploit them, a bit like the guy walking down the road and trying all the car doors in the hope that someone will have left their unlocked. It someone’s specifically targeting your servers, the chances are that they’ll be a bit more sophisticated.

This is where things get a little more troublesome and where SQL doesn’t really give us any tools to help.

But there are warning signs that we can look out for and these can be certain error messages in the logs. There are three in particular that I see as red flags that something amiss may be going on.

Click through for those three warning signs and how you an create SQL Server alerts upon logging these messages. The errors aren’t guaranteed to be an attacker, though in reading them, I’d expect a high signal to noise ratio on them.

Comments closed

Testing Performance of File Formats in R

Steven Sanderson performs some tests:

We can save the generated matrix in different file formats using different functions in R. Here are the functions we will use for each file format:

  • CSV: write.csv()
  • RDS: saveRDS()
  • FST: write_fst()
  • Arrow: write_feather()

Steve then has a follow-up around compressed data:

In this post I create a square matrix and then convert it to a data.frame (2,000 rows by 2,000 columns) and then saved it as a gz compressed csv file. The benchmark compares different R packages and functions, including base Rdata.tablevroom, and readr, and measures their relative speeds based on the time it takes to read in the .csv.gz file.

There’s not a direct comparison between the two posts, as the second matrix is larger than the first, though even with that caveat in mind, this post lets you see how much extra processing occurs to gunzip the data before reading it.

Comments closed

Generating Artificial Data in Databricks

Ben Hazan needs some fake data:

While attending the SQLBits 2023, I took part in André Kamman’s session about “Generate test data quick, easy and lots of it with the Databricks Labs Data Generator”.

In this blog, I will share with you my insights about the DataBricks Data Generator library and I’ll give an example.

Synthetic data is a valuable resource for data scientists, engineers, and analysts who need to test, benchmark, or demonstrate their solutions without compromising sensitive or confidential information. However, generating realistic and relevant synthetic data can be challenging and time-consuming.

That’s why Databricks Labs has developed a Python library called dbldatagen that can help you create large-scale synthetic data sets using Spark.

Click through to learn more about the library and see how you can use to to generate arbitrary amounts of artificial data following certain constraints.

Comments closed

Accelerated Database Recovery in SQL Server 2022

Perry Skountrianos takes us through some recent changes:

In SQL Server 2019 (15.x), the ADR cleanup process is single threaded within a SQL Server instance. Beginning with SQL Server 2022 (16.x), this process uses multi-threaded version cleanup (MTVC), that allows multiple databases under the same SQL Server instance to be cleaned in parallel.

MTVC is enabled by default in SQL Server 2022 and uses one thread per SQL instance.

Read on to see how you can change that, as well as additional product updates.

Comments closed

Rewriting Scalar UDFs and NULL Results

Erik Darling jumps out of the time machine to warn us, but we have no idea what he’s talking about so we ignore the warning and end up doing the thing he warned us not to do, causing us to need to send him into a time machine to warn us not to do it:

I think I have probably spent 500 hours of my life rewriting T-SQL Scalar UDFs to avoid all the performance problems associated with them.

The obvious choice is the Inline Table Valued Function, which has fewer performance issues baked in. For the kids out there: they don’t spill trauma.

But getting the rewrite right can be tricky, especially when it’s possible for the function to return NULL values.

Click through for the example.

Comments closed

Changing the Browser SSMS Uses

Meagan Longoria doesn’t want the built-in browser:

Did you know that you can change the browser used by SQL Server Management Studio to authenticate using Azure Active Directory to a SQL database in Azure?

I had been experiencing serious delays with the window that pops up to accept my credentials taking 30 seconds or more to populate. I also once got a warning that the browser I was using was old.

Click through to see how to resolve annoyances around SSMS’s built-in browser and change to the default you use for everything else.

Comments closed

Choosing a SKU for Azure Data Explorer

Brian Bønk makes a choice:

When creating the clusters from the Azure portal, you are presented with 3 options when choosing the compute specification.

The compute specification is the method of setting up the clusters for the specific workload you are planning to put on the Kusto cluster.

The portal gives you these three options:

Read on for the options, as well as some recommendations on when you might choose each.

Comments closed

Restoring an Azure SQL Database

Andrea Allred recovers from a mistake:

Recently, the wrong table got dropped and we needed to bring it back. I had never done a restore in an Azure Managed Database before so I learned something really fast.

Click through for the process. And yeah, it is quite easy, though I’ve noticed that restore times are a bit slower than if you were using local hardware on-premises.

One neat trick with database restores in Azure SQL DB: you can’t restore over an existing database, something a client wanted me to do last week. What you can do, however, is restore the database under a new name, so we might have messedupdb and then messedupdb_restore. Well, in this case, messedupdb had no changes since “the incident,” so what we were able to do was rename messedupdb to messedupdb_dropme and rename messedupdb_restore to messedupdb. Azure SQL DB happily rolls on with this and after ensuring that the database was now in prime condition, we could drop the old version. It’s a little more complex than simply restoring over the existing database, but all the relevant metadata Azure SQL DB needs stayed in sync along the way, so the process was smooth.

Comments closed