Storage – Page 2 – Curated SQL

Using PolyBase for Archiving Data

Published 2025-01-23 by Kevin Feasel

One of SQL Server 2022’s new features is something called Data Virtualization. It enables T-SQL to directly query files that reside in Azure object storage or S3-compatible object storage. In my opinion, since SQL Server 2022’s release, it’s one of those underrated capabilities that I think many have glossed over. But I strongly believe that it is insanely useful and you should take a few minutes to learn more!

Read on to learn more. Also, Andy mentions using PolyBase with S3-compatible local storage. As a spoiler, I have a video coming out on January 28th that covers exactly that same topic, though without the benefit of snappy all-flash storage arrays.

Speed Differences with Separating Data and Log Files

Published 2025-01-16 by Kevin Feasel

Brent Ozar performs a test:

I’ve already explained that no, it doesn’t make your database server more reliable – and in fact, it’s the exact opposite. But what about performance?

The answer is going to depend on your hardware and workload, but let’s work through an example. I’ll take the first lab workload from the Mastering Server Tuning class and set it up on an AWS i3en.2xlarge VM, which has 8 cores, 64GB RAM, and two 2.5TB NVMe SSDs. (This was one of the cheapest SQL-friendly VM types with two SSDs, but of course there are any number of ways you could run a test like this, including EBS volumes.)

I would expect cloud versus on-premises answers to be quite different, because cloud services tend to throttle you hard on how much storage throughput you’re allowed to have. For that reason, the results make perfect sense in AWS (or Azure or GCP for that matter), but unless your on-prem solution has hard throttles on IOPS or throughput because your sysadmins are monsters, the performance limit would be how hard you can push the drives or your storage controllers.

Ultimately, the most appropriate answer is to test your systems and not rely on expectations, especially if you’re shifting from on-premises to a cloud (or vice versa).

It’s Probably Not Data Corruption on Disk

Published 2025-01-15 by Kevin Feasel

Andy Yun talks storage:

I cannot tell you how many times I’ve encountered scenarios where “this data looks wrong.” Well… can one ensure that it is being retrieved and displayed correctly from the storage media that it resides on in the first place? Are you viewing/validating the data in question correctly? Whatever client/method you are using to review your data – that is suspect and its integrity is in question.

It is technically possible for bits to flip, but that’s also why we have checksums on disk. I’m sure there are people who have experienced storage corruption that changed just enough to cause problems but not enough to be noticeable, but Andy is right on the money.
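
As a rough illustration of why a checksum catches a flipped bit, here is a minimal Python sketch. It uses CRC32 from the standard library as a stand-in; SQL Server's page checksum uses its own algorithm, and the page contents and function names below are made up for the example.

```python
import zlib


def store(page: bytes) -> tuple[bytes, int]:
    """Return the page together with a checksum computed at write time."""
    return page, zlib.crc32(page)


def is_intact(page: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum at read time and compare it with the stored one."""
    return zlib.crc32(page) == stored_checksum


def flip_bit(page: bytes, bit_index: int) -> bytes:
    """Simulate a single flipped bit on the storage media."""
    damaged = bytearray(page)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


if __name__ == "__main__":
    # Hypothetical page contents, purely for demonstration.
    page, checksum = store(b"order_id=1001;amount=250.00" * 100)
    damaged = flip_bit(page, 12345)

    print("clean page passes:  ", is_intact(page, checksum))
    print("damaged page passes:", is_intact(damaged, checksum))
```

The point is that the check happens on read: the data may still be wrong on disk, but the engine notices and complains instead of quietly returning bad values.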

Azure SQL Managed Instance Extreme Storage Latency

Published 2024-12-19 by Kevin Feasel

Kendra Little has another caveat emptor message:

What are your stories of unbelievably bad performance from cloud vendors? I’ll go first. For years, Azure SQL Managed Instance’s General Purpose Tier has documented “approximate” storage latency as being “5-10 ms.” This week they added a footnote: “This is an average range. Although the vast majority of IO request durations will fall under the top of the range, outliers which exceed the range are possible.”

How approximate is that 5-10 milliseconds, you might wonder? If you use Azure SQL Managed Instance these days, you will regularly find messages in your SQL Server Error log indicating that all data and log files have experienced latency of up to 60 seconds. At least, 60 seconds is the maximum I’ve observed personally, looking in the logs of several customers’ Managed Instances. Could it be worse? Microsoft hasn’t documented a ceiling. My testing shows that this latency occurs randomly to your workload and is not related to your resource usage: using less IO will not make the errors less likely. You have no way to avoid these storage failures (I don’t see how 15-60 second latency is not a failure), and they can occur anytime.

This is a major strike against SQL Managed Instance General Purpose. Considering the cost of the product, you could buy a new server with direct-attached NVMe storage, have it paid off after one year, have better performance, and get to depreciate the entire expense over a 3-5 year window (something you cannot do with the hardware side of cloud services–you can only depreciate the cost of licensing, assuming you have a 3-year reservation).

Viewing Storage Consumption in Microsoft Fabric

Published 2024-10-30 by Kevin Feasel

Gilbert Quevauvilliers wants to know about storage utilization in Microsoft Fabric:

This blog post will show you how to understand what is consuming your Fabric Storage.

If you want to know how I got this data, please read my previous blog post, “View all your Storage consumed in Microsoft Fabric – Lakehouse Files, Tables and Warehouses” (FourMoo).

With this Semantic model below, I could also create alerts to notify based on certain thresholds. For example, if total storage in a single App workspace is more than 100GB send me an alert (This could be done using Power Automate). Or it could be on too many files being stored, or even looking at the Parquet file sizes and if they are too small they would then need to be optimised (for better performance).

Click through for the report.

Viewing Total Storage Consumption in Microsoft Fabric

Published 2024-10-23 by Kevin Feasel

Gilbert Quevauvilliers builds a report:

One of the things I have found when working with my customers in Microsoft Fabric is that there is currently no way to easily view the total storage for the entire tenant.

Not only that, but it would also be time consuming and quite a challenge to then find out what is consuming the storage. Could it be large files or tables or warehouse tables?

In this blog post I will show you how using a Notebook you can get details of the storage across your Microsoft Fabric Tenant.

Click through for an image of the Power BI report and how you can get there.

Ingesting Blob Storage Data into SQL Server

Published 2024-10-16 by Kevin Feasel

Andy Brownsword brings in some data:

We may associate consuming data from Azure Storage with tools like Data Factory or even SSIS as we saw recently. We don’t always need the middle man though.

Here we’ll demonstrate how to use an External Data Source to perform the ingestion directly into SQL Server.

Click through for the solution. As a quick note, the TYPE attribute that Andy uses in CREATE EXTERNAL DATA SOURCE was necessary from SQL Server 2016 through SQL Server 2019, but no longer exists for SQL Server 2022. Instead, for SQL Server 2022, you’d switch the LOCATION to start with abs:// for Azure Blob Storage and PolyBase would infer the type from the protocol.
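
To make the difference concrete, the sketch below builds both shapes of the statement as strings so they can be compared side by side. Everything in it is an assumption for illustration: the data source, storage account, container, and credential names are placeholders, and the pre-2022 form assumes TYPE = BLOB_STORAGE with an https:// location, which may not match Andy's exact code.

```python
def legacy_blob_source(name: str, account: str, container: str, credential: str) -> str:
    """Pre-2022 shape (assumed): explicit TYPE plus an https:// location."""
    return (
        f"CREATE EXTERNAL DATA SOURCE [{name}]\n"
        f"WITH (\n"
        f"    TYPE = BLOB_STORAGE,\n"
        f"    LOCATION = 'https://{account}.blob.core.windows.net/{container}',\n"
        f"    CREDENTIAL = [{credential}]\n"
        f");"
    )


def sql2022_blob_source(name: str, account: str, container: str, credential: str) -> str:
    """SQL Server 2022 shape: no TYPE; the abs:// prefix identifies the storage."""
    return (
        f"CREATE EXTERNAL DATA SOURCE [{name}]\n"
        f"WITH (\n"
        f"    LOCATION = 'abs://{container}@{account}.blob.core.windows.net',\n"
        f"    CREDENTIAL = [{credential}]\n"
        f");"
    )


if __name__ == "__main__":
    # Placeholder names; substitute your own account, container, and credential.
    args = ("IngestSource", "mystorageaccount", "landing", "BlobSasCredential")
    print(legacy_blob_source(*args))
    print()
    print(sql2022_blob_source(*args))
```

In both cases the credential would be a database scoped credential created beforehand, typically from a SAS token.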

Reading Parquet Files in R with nanoparquet

Published 2024-10-10 by Kevin Feasel

Stephen Turner reads some data:

In these slides I also learned about the nanoparquet package — a zero dependency package for reading and writing parquet files in R. Besides all the benefits noted above, parquet is much faster to read and write. And, as opposed to saving as .rds, parquet can easily be passed back and forth between R, Python, and other frameworks.

Let’s take a look at how reading and writing parquet files compares with CSV, either with base R or readr.

Stephen shows one of the best-case scenarios for Parquet: lots of data (100 million rows), relatively few columns, no long strings, etc. That leads to a massive improvement over using CSVs, even if you ignore the metadata and formatting benefits. I wouldn’t expect the benefits to be nearly as significant with wide text columns and very little value overlap, but that’s also pretty uncommon for the type of dataset we’re analyzing in R.

Reading Data from Azure Blob Storage in Snowflake

Published 2024-10-07 by Kevin Feasel

Arun Sirpal explains a common architectural pattern:

Let’s go back to data platforms today and I want to talk about a very common integration I see nowadays, Azure Blob Storage linked to Snowflake via a storage integration which then we can access semi structured files via external tables, it is a good combination of technology I have to say.

Click through for an architecture diagram and example of the code you’d need.

Connecting to Azure Storage from SSIS

Published 2024-09-27 by Kevin Feasel

Andy Brownsword makes a connection:

Migrating to the cloud can be disruptive to existing processes. Moving storage to Azure isn’t a simple configuration change for SSIS packages.

SSIS doesn’t have native connections for Azure. That doesn’t mean we need to completely re-engineer the process or change technology though.

How can we take the simple package below and move to using Azure storage?

Read on for the answer. Also, I am 100% on Team SAS Token. They are easy to create and give you a lot of control over who gets access to what.

Category: Storage