Press "Enter" to skip to content

CETAS to Parquet Files in Azure SQL Managed Instance

Michael Bourgon gives CETAS a chance:

TL;DR – the below lines will allow you to query a table on your MI, creating Parquet files in Azure blob storage. And you can query it! Next up is partitioning over time, etc., etc. But this is freaking fantastic. I have a python script I wrote that does it, but it’s nowhere near as nice/easy as this.

Why do you care? Because it’s a fantastically easy way to archive older data to blob storage, and I suspect (need to test) that if you do it right, you can then have it go to cool/archive storage via a lifecycle setup, so that if you need it much later, you can.

Yep, this is historically one of the best use cases for PolyBase. Unfortunately, we can’t do this in SQL Server 2022, though you can in pre-2022 versions using the Hadoop process. Given that it’s now available in SQL Managed Instance, I wouldn’t be too shocked to see it arrive on-premises at some point, with the big question being whether that happens in SQL Server 2022 or vNext.
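
For reference, here is a minimal sketch of what a CETAS setup on a Managed Instance might look like. Every object name, the storage URL, and the archival query are illustrative placeholders, and the instance may first need CETAS enabled (the “allow polybase export” configuration); check the current documentation for the exact prerequisites in your environment.

    -- All names below (credential, data source, storage account, tables) are placeholders.
    -- A database master key is required before creating a database scoped credential.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    -- Authenticate to blob storage; a SAS token credential works here as well.
    CREATE DATABASE SCOPED CREDENTIAL BlobCredential
    WITH IDENTITY = 'Managed Identity';

    -- Point at the target container in Azure blob storage.
    CREATE EXTERNAL DATA SOURCE ArchiveStorage
    WITH (
        LOCATION = 'abs://archive@mystorageaccount.blob.core.windows.net/',
        CREDENTIAL = BlobCredential
    );

    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);

    -- CETAS: write the selected rows out as Parquet files and register
    -- an external table over them in a single statement.
    CREATE EXTERNAL TABLE dbo.SalesArchive
    WITH (
        LOCATION = 'sales/2022/',
        DATA_SOURCE = ArchiveStorage,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT *
    FROM dbo.Sales
    WHERE OrderDate < '20230101';

    -- The archived data is immediately queryable through the external table.
    SELECT COUNT(*) FROM dbo.SalesArchive;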

2 Comments

  1. George Walkey 2023-04-27

    Couldn’t get CETAS to work in 2022.
    Querying Parquet files in Storage works fine, kinda slow.

    I don’t see the appeal of Parquet other than it’s the “CSV” of detachable columnstore data.

    • Kevin Feasel 2023-04-27

      Here’s the quick list:
      – That it is columnar is part of the benefit: you can get significant compression savings over CSV. For normal data, it’s about a 2-3x improvement; for well-designed fact data, it can be a 10x improvement.
      – Parquet is an open format, so your data can move between platforms without vendor lock-in.
      – Parquet includes metadata in the file header: row counts, data types, data lengths. This means that some operations can happen without scanning the data. Also, it means you don’t have to worry about delimiter overlap (oh, I have a comma in my dataset, how droll!) or how different applications treat quoted identifiers differently.
      – Because Parquet is columnar, aggregation operations can be much faster, especially if you only need a limited number of columns from the file. If I need to get a sum of dollar amounts from a CSV or rowstore file, I need to scan each file and for each row, break out each column (which may involve setting data types) to find the one column I need, and finally sum. With Parquet, the columns are already discrete, so no need to read unnecessary columns.
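
      To make that last point concrete, here is a rough sketch of that kind of column-pruned aggregation, assuming an engine that supports OPENROWSET over Parquet (for example, Synapse serverless SQL or SQL Server 2022) and a hypothetical ArchiveStorage external data source (as in the sketch above); the file path and column name are placeholders:

          -- Sum a single column across a folder of Parquet files; because Parquet is
          -- columnar, only that column's data needs to be read, unlike a full CSV scan.
          SELECT SUM(s.SaleAmount) AS TotalSales
          FROM OPENROWSET(
                   BULK 'sales/2022/*.parquet',
                   DATA_SOURCE = 'ArchiveStorage',
                   FORMAT = 'PARQUET'
               ) AS s;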

      One of the biggest downsides to Parquet is that updates aren’t easy, like they are in CSV or other rowstore formats. I can’t just add a new row and be done with it. That’s where the concept of delta tables comes into play: Parquet base files, JSON difference files acting as a transaction log, and a process to merge changes periodically.

      When we’re talking about data stored in files, Parquet is one of the best options, particularly for large amounts of data.
