Storage – Page 8 – Curated SQL

Using S3 Object Storage in MinIO with SQL Server 2022

In this post, I will walk you through how to set up MinIO so that you can use it with SQL Server 2022’s S3 object storage integration. Working with S3 and SQL Server requires a valid and trusted TLS certificate, which can be a pain for some users and environments. I’m writing this post so you can hit the ground running with this new feature set in SQL Server 2022. The certificate we’re working with here is self-signed. You could get a real certificate for your environment, and that’s encouraged, but this walk-through is intended to get you up and running fast so that you can test SQL Server’s S3 object integration. For our S3-compatible object storage, we’re using MinIO’s free GNU AGPL v3 edition running in a Docker container, alongside SQL Server 2022 CTP 2.0, which is also running in a container.

Click through for the demo, in which Anthony sets everything up and then backs up a database in SQL Server 2022 to MinIO.

The Benefits of Parquet

Published 2022-05-30 by Kevin Feasel

Maria Zakourdaev explains why the Parquet file format is so useful:

Parquet files organize data in columns, while CSV files organize data in rows.

Columnar storage allows much better compression, so Parquet data files need less storage: 1 TB of CSV files can be converted into 100 GB of Parquet files, which can be a huge money saver when cloud storage is used. This also means that scanning Parquet files is much faster than scanning CSV files: less data is scanned, there is no need to load unneeded columns into memory, and aggregations run faster. Parquet files contain both data and metadata, that is, information about the data’s schema and structure. When you load the file, having metadata helps the querying tool define proper data types.

Click through for an example of when Parquet makes sense. It’s not the best format for everything—it’s a columnar file format, so writes are typically slower than row-store formats like CSV or Avro—but it and ORC are outstanding for analytical processing, not least because of the metadata these formats contain.
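
To make the row-versus-column point concrete, here is a toy sketch in plain Python. It does not read or write real Parquet files (Parquet is a binary format with row groups, encodings, and compression codecs that this ignores), and the table, column names, and `run_length_encode` helper are all made up for illustration. It only shows why a columnar layout lets a query skip unneeded columns and why a low-cardinality column can shrink dramatically.

```python
# Toy illustration only: this is NOT the Parquet format, just the layout idea.
# The table contents and column names below are invented for the example.
rows = [
    {"order_id": i, "country": "US" if i < 500 else "DE", "amount": 10 + i % 7}
    for i in range(1000)
]

# Row-oriented layout (CSV-like): all values of one row are stored together.
row_store = "\n".join(f'{r["order_id"]},{r["country"]},{r["amount"]}' for r in rows)

# Column-oriented layout: each column is stored contiguously on its own.
column_store = {
    name: [r[name] for r in rows] for name in ("order_id", "country", "amount")
}


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


# Summing one column from the row store means splitting every line into
# all three fields, even though two of them are never used.
row_total = sum(int(line.split(",")[2]) for line in row_store.split("\n"))
fields_parsed_row_store = len(rows) * 3

# The column store reads only the one column the query needs.
column_total = sum(column_store["amount"])
values_read_column_store = len(column_store["amount"])

# A low-cardinality column with long runs collapses to a handful of entries.
country_runs = run_length_encode(column_store["country"])

print(row_total == column_total)  # same answer either way
print(fields_parsed_row_store, "fields parsed vs", values_read_column_store, "values read")
print(country_runs)
```

Both layouts return the same total, but the row-oriented scan handles three times as many fields, and the 1,000-value `country` column reduces to two run-length pairs. Real engines get their gains from much more sophisticated machinery, so treat this as intuition rather than a benchmark.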

Seeding AG Replicas from Snapshots in SQL Server 2022

Published 2022-05-27 by Kevin Feasel

Anthony Nocentino is excited about using storage snapshots in SQL Server 2022:

But what if I told you that you could seed your Availability Group from a storage-based snapshot and that the re-seeding process can be nearly instantaneous?
In addition to saving you time, this process saves your database systems from the CPU, network, and disk consumption that comes with direct seeding and using backups and restores to seed.
The process described in this post is implemented on Pure Storage’s FlashArray and also works in cloud scenarios on Pure’s Cloud Block Store.

Click through to see how.

Azure Shared Disk with Zone-Redundant Storage

Published 2022-05-12 by Kevin Feasel

Dave Bermingham runs some tests:

What makes this interesting is that you can now build shared-storage-based failover cluster instances that span Availability Zones (AZs). With cluster nodes residing in different AZs, users can now qualify for the 99.99% availability SLA. Prior to support for ZRS, Azure Shared Disks only supported Locally Redundant Storage (LRS), limiting cluster deployments to a single AZ and leaving users susceptible to outages should an AZ go offline.
There are, however, a few limitations to be aware of when deploying an Azure Shared Disk with ZRS.

Dave also checks to see how ZRS performance compares to locally redundant storage.
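
For a sense of what a 99.99% figure means in practice, here is a quick back-of-the-envelope calculation. It assumes a 365-day year and simply annualizes the percentage. The actual measurement window and terms are whatever Azure's SLA documents say, so treat this as arithmetic rather than a statement of the contract. The function name is my own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def max_downtime_minutes_per_year(sla_percent: float) -> float:
    """Return the yearly downtime that a given availability percentage still allows."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (100 - sla_percent) / 100 * MINUTES_PER_YEAR


four_nines = max_downtime_minutes_per_year(99.99)
print(f"99.99% allows about {four_nines:.2f} minutes of downtime per year")
```

That works out to roughly 52.56 minutes per year, which is not much room for an entire zone to be offline.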

Building S3 Data Pipelines — The Tools

Published 2022-04-01 by Kevin Feasel

Chris Adkin continues a series:

In my last post I outlined a number of architectural options for solutions that could be implemented in light of Microsoft retiring SQL Server 2019 Big Data Clusters, one of which was data pipelines that leverage Python and Boto 3. Before diving into these things in greater detail, let’s recap what S3 is.

Click through for a simple data pipeline example.
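
If you want a feel for the shape of such a pipeline before clicking through, here is a minimal sketch that runs without an S3 endpoint. It is not Chris's code and it does not use Boto 3: `ToyBucket` is a hypothetical in-memory stand-in whose method names only loosely echo S3 operations, and the keys, columns, and filter rule are made up. The one S3 trait it tries to capture is that a bucket is a flat key-to-object store, with "folders" being nothing more than key prefixes.

```python
import csv
import io


class ToyBucket:
    """Hypothetical in-memory stand-in for an S3 bucket (not the Boto 3 API)."""

    def __init__(self):
        self._objects = {}  # flat mapping: key -> bytes

    def put_object(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def get_object(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Folders" are only a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


def run_pipeline(bucket: ToyBucket, source_prefix: str, target_key: str) -> int:
    """Read every CSV under source_prefix, keep rows with qty > 0, write one CSV."""
    kept = []
    for key in bucket.list_keys(source_prefix):
        text = bucket.get_object(key).decode("utf-8")
        for row in csv.DictReader(io.StringIO(text)):
            if int(row["qty"]) > 0:
                kept.append(row)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["sku", "qty"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    bucket.put_object(target_key, out.getvalue().encode("utf-8"))
    return len(kept)


# Made-up sample objects standing in for raw files landed in a bucket.
bucket = ToyBucket()
bucket.put_object("raw/2022-04-01.csv", b"sku,qty\nA,3\nB,0\n")
bucket.put_object("raw/2022-04-02.csv", b"sku,qty\nC,5\n")
rows_written = run_pipeline(bucket, "raw/", "clean/orders.csv")
print(rows_written, bucket.list_keys())
```

Swapping the toy class for a real client is where the actual work lives (credentials, endpoints, pagination, error handling), and that is the part the linked post covers.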

Power BI Dataflows and Storage Considerations

Published 2022-03-24 by Kevin Feasel

Teo Lachev has some things for us to consider:

Over the past few years, the BI industry has come up with new file formats, such as Parquet, ORC, and Avro, which are widely used today. To facilitate its vision for cross-industry data integration, Microsoft introduced the Common Data Model (CDM) and CDM folders a few years ago. Power BI dataflows output CSV files to CDM folders, and each table is saved in its own folder. You can bring your own data lake to directly access these files. If you do so, you’ll find the following folder structure:
Although accessing the dataflow files might open all sorts of data integration scenarios, here are some things to watch for concerning the dataflow output:

Read on for five things.

Zero Records but Lots of Space Used

Published 2022-03-01 by Kevin Feasel

Jeff Iannucci solves a riddle:

Anyhow, it’s worthwhile to occasionally review the tables in a database to see which ones are growing every day, using the most space.
But what if during a review you see the largest table looks like this?
That’s around 24 GB of sweet drive space allocated for 0 records. But…how?
Let me show you how.

Click through to see how. My initial thought was LOB craziness but Jeff’s example doesn’t even need that.

Storage Pools and Volumes

Published 2022-02-21 by Kevin Feasel

John Morehouse enlightens us on storage:

I think there are a couple of lines of thought related to this. I’m one person with a NAS so I don’t need multiple volumes. I can certainly get by with a single volume on each storage pool and this will simplify management of things.
If you were working with enterprise-grade storage in a corporate environment, having multiple volumes would make sense. I think of this as carving up disk space for production SQL Servers where each drive letter corresponds to a given volume which resides on a given storage pool. A volume can serve multiple folders.

You know a blog post is going to be good when it starts with “In hindsight, I should have done this differently.”

Addressable Disk Space and File Counts in SQL MI General Purpose

Published 2022-01-06 by Kevin Feasel

Niko Neugebauer has been busy:

In the previous blog posts in the SQL MI How-Tos we have already touched on the aspect of SQL MI reserved and available disk space, but as with everything, there are so many things to add and expand on. In this post we shall focus on the General Purpose service tier and the remote disk storage that is used in this service tier. Besides the explicit limits on addressable space that are tied to the number of CPU vCores, there are important aspects of the remote storage that will limit the number of database files that can be located there.
If you are interested in other posts on how to discover different aspects of SQL MI, please visit http://aka.ms/sqlmi-howto, which serves as a placeholder for the series.

Click through to see how it all fits together with Managed Instances.

Parallel Scans and Blob Storage Slowness

Published 2021-12-23 by Kevin Feasel

Joe Obbish goes beyond the obvious reason:

Upon reading the title, you may be thinking that of course parallel scans will be slow in the cloud. Cloud storage simply isn’t very fast. I would argue that there’s a bit more to it.

Click through for a deep dive with some advice on what you might do to fix a specific (but common) scenario.
