Press "Enter" to skip to content

Curated SQL Posts

Working with Parquet Files in Postgres

Craig Kerstiens announces an extension:

Today, we’re excited to release pg_parquet – an open source Postgres extension for working with Parquet files. The extension reads and writes parquet files to local disk or to S3 natively from Postgres. With pg_parquet you’re able to:

  • Export tables or queries from Postgres to Parquet files
  • Ingest data from Parquet files to Postgres
  • Inspect the schema and metadata of existing Parquet files

Code is available at: https://github.com/CrunchyData/pg_parquet/.

Read on for more background on why we built pg_parquet or jump below to get a walkthrough of working with it.

Hey, that’s my job to tell people to read on to learn more!
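If you'd like a taste before clicking through, here is a rough sketch of the workflow based on the project's documentation. The products table and file paths are hypothetical stand-ins, so treat this as illustrative rather than authoritative:

```sql
CREATE EXTENSION pg_parquet;

-- Export a table (or any query) to a Parquet file; s3:// URIs also work
COPY products TO '/tmp/products.parquet' (format 'parquet');

-- Ingest the Parquet file back into a table
COPY products FROM '/tmp/products.parquet' (format 'parquet');

-- Inspect the schema and metadata of an existing Parquet file
SELECT * FROM parquet.schema('/tmp/products.parquet');
SELECT * FROM parquet.metadata('/tmp/products.parquet');
```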


Avoid Storing Files in Databases

Joey D’Antoni explains why you almost never want to use FILESTREAM:

Yesterday, I accidentally walked into a discussion on LinkedIn about the merits of Filestream in SQL Server. If you aren’t familiar with Filestream, consider yourself lucky. It was a feature added to SQL Server in the short timeframe when people thought it was a good idea to use databases for file storage, and before the enlightened times when object storage became a thing. I first remember blobs being a thing in Oracle 8i, where at least you had the ability to store them in a tablespace with a larger block size than 8k, and had a dedicated area of the buffer pool with that larger block size that you could dedicate to that blob tablespace.

Joey has the right of things. There are rare exceptions where it could make sense to store files in databases. My best example involves storing ML models that we use in SQL Server ML Services, simply because of how difficult it is to read anything off of disk via ML Services. But that’s a real edge case.


Migrating to Azure PostgreSQL Flexible Server from Single Server

Josephine Bush performs a migration:

Why Migrate to Flexible Server?

  • High availability and disaster recovery: Flexible Server provides higher availability with zone-redundant architecture.
  • Customizable maintenance windows: More control over when updates and maintenance tasks occur.
  • Performance improvements: Fine-tuned scaling and performance adjustments without downtime.
  • Enhanced security: VNet integration and more advanced networking options.

Read on to learn by when you have to migrate and how you can perform the migration.


Data Masking in Snowflake: Tagging Policies

Kevin Wilkie creates some tags:

In our last post, we discussed data masking and setting it up for specific columns. Now, I don’t know about y’all, but I deal with hundreds of tables with several columns in each on a daily basis that truly need data masking. If I have to go through and set up a masking policy one by one, I might go crazier than I normally am.

Let’s say that I have an existing table, Employee, in my database to which I want to apply a policy. As I’m sure many of you know, any table with Employee data should have at least some of its data masked. The first thing that we need to do is to set up the tag that we’ll be using for, in our case, a column named Salary.

Read on to learn more about how to create tags and why you might want to.
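As a rough sketch of how the pieces fit together in Snowflake (object names are hypothetical, and this isn't necessarily how Kevin structures it), tag-based masking looks something like this:

```sql
-- Create a tag to mark columns holding salary data
CREATE TAG pii_salary COMMENT = 'Columns containing salary data';

-- Create a masking policy for numeric values
CREATE MASKING POLICY salary_mask AS (val NUMBER) RETURNS NUMBER ->
  CASE WHEN CURRENT_ROLE() IN ('HR_ADMIN') THEN val ELSE NULL END;

-- Attach the policy to the tag, so every tagged column gets masked
ALTER TAG pii_salary SET MASKING POLICY salary_mask;

-- Tag the Salary column on the Employee table
ALTER TABLE Employee MODIFY COLUMN Salary SET TAG pii_salary = 'salary';
```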


Troubleshooting Chains of Common Table Expressions

Jared Westover calls me out:

I started using Common Table Expressions (CTEs) in Oracle before Microsoft SQL Server and found them easier to read than a derived table. Something about reading from the top down just clicked in my brain. As the years have passed, I’m less of a fan for one reason: troubleshooting a chained CTE with several links is hard—especially when it’s someone else’s code. Even the ones I wrote years ago fill me with unease when something breaks. Is there an easier way to break down each link in the chain when things go wrong?

This is, I would argue, the biggest inconvenience around using common table expressions. The fact that SQL Server cannot materialize the contents of CTEs is likely a more prevalent problem, but the challenge of troubleshooting the individual parts of a common table expression is real.
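One low-tech coping mechanism is to point the final SELECT at an intermediate link instead, materializing one step of the chain at a time. A sketch with hypothetical tables:

```sql
WITH orders_filtered AS (
    SELECT order_id, customer_id, order_total
    FROM dbo.Orders
    WHERE order_date >= '2024-01-01'
),
customer_totals AS (
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM orders_filtered
    GROUP BY customer_id
)
-- To troubleshoot, swap the final SELECT for an intermediate link:
-- SELECT * FROM orders_filtered;
SELECT customer_id, total_spend
FROM customer_totals
WHERE total_spend > 1000;
```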


Thoughts on Blobs in PostgreSQL

Stefanie Janine Stölting shares some information on using blobs in PostgreSQL:

PostgreSQL does not have a BLOB data type as specified in the SQL standard. The nearest implementation is the data type BYTEA. Since PostgreSQL 9.0, it handles that data as hexadecimal by default.

Click through to learn more about how that data type works, some of the limitations around it, and why it’s not a smart idea to store blob-like data in PostgreSQL. The arguments are very similar to those against doing this in SQL Server, and the counter-arguments likewise exist but are exceptional in nature.
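For a quick illustration of the hexadecimal handling the post describes:

```sql
-- The default output format for BYTEA has been 'hex' since PostgreSQL 9.0
SHOW bytea_output;

-- Hex input: \x followed by pairs of hex digits
SELECT '\xDEADBEEF'::bytea;

-- Text cast to BYTEA comes back in hex form, e.g. \x616263
SELECT 'abc'::bytea;
```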


Configuring the Fabric Service Principal to Support Storage APIs

Gilbert Quevauvilliers grants some permissions:

This blog post explains how to configure access for my Service Principal to interact with the Azure Storage API, using that API to get details for Microsoft Fabric Storage.

This is part of a blog post series where I am going to show you how to “View Total Storage consumed in Microsoft Fabric”

When I started this blog post, I realized that I first needed to explain how to configure the Service Principal authentication to interact with the Azure Storage API permissions, because these steps are required for my notebook to run successfully.

Read on to find out how.


Regarding pg_dump as a Backup Tool

Robert Haas shares some thoughts:

Recently, I’ve been hearing a lot of experienced PostgreSQL users reiterate this line: “pg_dump is not a backup tool.” In fact, the documentation has recently been updated to avoid saying that it is a backup tool, to widespread relief. Experienced PostgreSQL users and developers have been publicly called out for having the temerity to assert that pg_dump is, in fact, a backup tool. I find this narrative deeply frustrating, for two reasons.

Click through for those reasons.


A Primer on Pandas

Rajendra Gupta talks about Pandas:

Have you heard about Pandas in Python? It is a widely used open-source library for analyzing and manipulating data in the Python programming language. Let’s explore it with use cases and examples.

Click through for an overview of the library. Pandas isn’t the quickest performer as your datasets get large, but for ease of use on moderately sized datasets (up to hundreds of thousands of rows, or maybe millions if you manage things well), it does a good job.
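If you want a minimal taste of the library before clicking through, a filter-and-aggregate round trip looks like this (data invented for illustration):

```python
import pandas as pd

# A small DataFrame built in memory
df = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "sales": [100, 150, 200, 50],
})

# Filter rows, then aggregate by group: bread-and-butter pandas
east_only = df[df["region"] == "East"]
print(east_only)

total_by_region = df.groupby("region")["sales"].sum()
print(total_by_region)
```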
