Press "Enter" to skip to content

Day: March 29, 2023

Testing Performance of File Formats in R

Steven Sanderson performs some tests:

We can save the generated matrix in different file formats using different functions in R. Here are the functions we will use for each file format:

  • CSV: write.csv()
  • RDS: saveRDS()
  • FST: write_fst()
  • Arrow: write_feather()

Steve then has a follow-up around compressed data:

In this post I create a square matrix and then convert it to a data.frame (2,000 rows by 2,000 columns) and then saved it as a gz compressed csv file. The benchmark compares different R packages and functions, including base Rdata.tablevroom, and readr, and measures their relative speeds based on the time it takes to read in the .csv.gz file.

There’s not a direct comparison between the two posts, as the second matrix is larger than the first, though even with that caveat in mind, this post lets you see how much extra processing occurs to gunzip the data before reading it.

Comments closed

Generating Artificial Data in Databricks

Ben Hazan needs some fake data:

While attending the SQLBits 2023, I took part in André Kamman’s session about “Generate test data quick, easy and lots of it with the Databricks Labs Data Generator”.

In this blog, I will share with you my insights about the DataBricks Data Generator library and I’ll give an example.

Synthetic data is a valuable resource for data scientists, engineers, and analysts who need to test, benchmark, or demonstrate their solutions without compromising sensitive or confidential information. However, generating realistic and relevant synthetic data can be challenging and time-consuming.

That’s why Databricks Labs has developed a Python library called dbldatagen that can help you create large-scale synthetic data sets using Spark.

Click through to learn more about the library and see how you can use to to generate arbitrary amounts of artificial data following certain constraints.

Comments closed

Accelerated Database Recovery in SQL Server 2022

Perry Skountrianos takes us through some recent changes:

In SQL Server 2019 (15.x), the ADR cleanup process is single threaded within a SQL Server instance. Beginning with SQL Server 2022 (16.x), this process uses multi-threaded version cleanup (MTVC), that allows multiple databases under the same SQL Server instance to be cleaned in parallel.

MTVC is enabled by default in SQL Server 2022 and uses one thread per SQL instance.

Read on to see how you can change that, as well as additional product updates.

Comments closed

Rewriting Scalar UDFs and NULL Results

Erik Darling jumps out of the time machine to warn us, but we have no idea what he’s talking about so we ignore the warning and end up doing the thing he warned us not to do, causing us to need to send him into a time machine to warn us not to do it:

I think I have probably spent 500 hours of my life rewriting T-SQL Scalar UDFs to avoid all the performance problems associated with them.

The obvious choice is the Inline Table Valued Function, which has fewer performance issues baked in. For the kids out there: they don’t spill trauma.

But getting the rewrite right can be tricky, especially when it’s possible for the function to return NULL values.

Click through for the example.

Comments closed

Changing the Browser SSMS Uses

Meagan Longoria doesn’t want the built-in browser:

Did you know that you can change the browser used by SQL Server Management Studio to authenticate using Azure Active Directory to a SQL database in Azure?

I had been experiencing serious delays with the window that pops up to accept my credentials taking 30 seconds or more to populate. I also once got a warning that the browser I was using was old.

Click through to see how to resolve annoyances around SSMS’s built-in browser and change to the default you use for everything else.

Comments closed

Choosing a SKU for Azure Data Explorer

Brian Bønk makes a choice:

When creating the clusters from the Azure portal, you are presented with 3 options when choosing the compute specification.

The compute specification is the method of setting up the clusters for the specific workload you are planning to put on the Kusto cluster.

The portal gives you these three options:

Read on for the options, as well as some recommendations on when you might choose each.

Comments closed

Restoring an Azure SQL Database

Andrea Allred recovers from a mistake:

Recently, the wrong table got dropped and we needed to bring it back. I had never done a restore in an Azure Managed Database before so I learned something really fast.

Click through for the process. And yeah, it is quite easy, though I’ve noticed that restore times are a bit slower than if you were using local hardware on-premises.

One neat trick with database restores in Azure SQL DB: you can’t restore over an existing database, something a client wanted me to do last week. What you can do, however, is restore the database under a new name, so we might have messedupdb and then messedupdb_restore. Well, in this case, messedupdb had no changes since “the incident,” so what we were able to do was rename messedupdb to messedupdb_dropme and rename messedupdb_restore to messedupdb. Azure SQL DB happily rolls on with this and after ensuring that the database was now in prime condition, we could drop the old version. It’s a little more complex than simply restoring over the existing database, but all the relevant metadata Azure SQL DB needs stayed in sync along the way, so the process was smooth.

Comments closed

Concatenation per Group in MySQL

Rahul Mehta asks the big questions:

In this article, we are going to discuss how to aggregate the article at a row level. MySQL provides a function called “group_concat” to perform row-level concatenation. Before we go ahead and learn how to do so, let us first understand:

  • Why do we need it?
  • Where do we need it?
  • When to use it?

Fortunately, Rahul then answers these questions; otherwise, it’d be a pretty short article. The T-SQL analog to this is STRING_AGG(), though the syntax and behavior is not precisely the same.

Comments closed