Thinking About The Data Lake

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer

  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here.  My addition would be to reiterate the importance of a good data governance policy.

Data Wrangling With Power Query

Eugene Meidinger parses a complex report using Power Query:

Hmm, so it looks like I made a mistake. I hope my honesty won’t lose me any izzat, or ability to command respect. I think it’s important to see how people really learn and really solve problems. So, I’m including my screw ups in this post.

Apparently, I created a linked table and I can’t see how to edit the Power Query portion for that. A linked table is a nice way to pull raw data from the Excel workbook. It’s great for reference tables, but doesn’t solve our problem.

Come for the data analysis, stay for the spelling bee.  This is part one of a two-parter, focusing on techniques to get the data in a digestible format; part two will do interesting things with the data.

Using Azure Data Factory With Biml

Meagan Longoria has a multi-part series on using Biml to script Azure Data Factory tasks to migrate data from an on-prem SQL Server instance to Azure Data Lake Store.  Here’s part 1:

My Azure Data Factory is made up of the following components:

  • Gateway – Allows ADF to retrieve data from an on premises data source

  • Linked Services – define the connection string and other connection properties for each source and destination

  • Datasets – Define a pointer to the data you want to process, sometimes defining the schema of the input and output data

  • Pipelines – combine the data sets and activities and define an execution schedule

Click through for the Biml.

BCP For UTF-8

Sanjay Mishra notes that SQL Server 2016 and 2014 SP2 support UTF-8 for BCP and BULK INSERT:

This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from http://www.columbia.edu/~fdc/utf8/. The dataset is a translation of the sentence “I can eat glass and it doesn’t hurt me” in several languages. A few lines of sample data are shown here:

(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)

I ran into this problem before, where developers wanted to bulk load UTF-8 but had to settle for an inferior solution.
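For reference, here is a minimal sketch of the improvement (the path, table, and column names are hypothetical): on SQL Server 2016, or 2014 SP2 and later, CODEPAGE = '65001' tells BULK INSERT to read the source file as UTF-8, and bcp accepts the same code page through its -C argument.

CREATE TABLE dbo.UTF8Demo (Sentence nvarchar(400));

BULK INSERT dbo.UTF8Demo
FROM 'C:\SampleFiles\utf8_sentences.txt'
WITH
(
    CODEPAGE = '65001',      -- UTF-8; rejected by versions before this improvement
    ROWTERMINATOR = '0x0a'   -- adjust to the file's actual line endings
);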

Copying A File Using SQL Server

Slava Murygin makes me want to add a “wacky ideas” category with this one:

At first, you have to read the file you want to copy into a SQL Server. You have to choose a database to perform that action. It can be Test database or you can create a new database to perform that action or it can be even TempDB. There are only two requirements for the database:
– It must not be a production Database;
– Database should have enough of space to accommodate the file you want to copy.

The idea is that if the database engine’s service account has rights to a file that you yourself don’t have permission to read, you can bulk load the file’s contents into a table as a binary blob and then use bcp to write those contents back out to your local system.  Sure, it becomes your company’s most expensive file copy tool, but I love the mad ingeniousness behind it.
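If you want to see the shape of the trick, a rough sketch (with hypothetical paths and table names, and very much not a recommended file copy mechanism) looks something like this:

-- The Database Engine's service account reads the file, not the caller.
CREATE TABLE dbo.FileCopy (FileContents varbinary(max));

INSERT INTO dbo.FileCopy (FileContents)
SELECT BulkColumn
FROM OPENROWSET(BULK N'C:\RestrictedShare\report.xlsx', SINGLE_BLOB) AS src;

From there, something like bcp with queryout and a format file that strips the length prefix can write the varbinary back out to a file you can actually reach.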

Data Cleansing In SQLite

Allison Tharp wants to clear out kinda-sorta duplicates from a SQLite table:

However, now I have a lot of database entries that are unneeded.  I thought I would take the time to clean this up (even though I’ll no longer use the data and could easily just delete the tables).  For the BGG Hotness, I have the tables: hotgame, hotperson, and hotcompany.  I have 7,350 rows in each of those tables, since I collected data on 50 rankings every hour for just over 6 days.  However, since the BGG hotness rankings only update daily, I really only need 300 rows (50 rankings * 6 days = 300 rows).

I now think the rankings update between 3 and 4, so I want to only keep the entries from 4:00 AM.  I use the following SELECT statement to make sure I’m in the ballpark with where the data is that I want to keep:

There are several ways to solve this problem; this one is easy and works.  The syntax won’t work for all database platforms, but does the trick for SQLite.
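As a sketch of the idea (the timestamp column name is hypothetical, and Allison’s actual schema may differ), the cleanup in SQLite could look like this:

-- Keep only the rows captured during the 4 AM collection run.
DELETE FROM hotgame    WHERE strftime('%H', collected_at) <> '04';
DELETE FROM hotperson  WHERE strftime('%H', collected_at) <> '04';
DELETE FROM hotcompany WHERE strftime('%H', collected_at) <> '04';

This assumes the timestamps are stored in a format strftime() understands, such as ISO 8601 text.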

Bulk Loading Text Files With Line Feeds

Steve Jones runs into a scenario in which he wants to bulk load a file not in standard Windows CRLF format:

That’s not good. I suspected this was because of the format of the file, so I added a row terminator.

BULK insert MyTable
from 'C:\SampleFiles\input.txt'
with ( ROWTERMINATOR = '\r')

That didn’t help. I suspected this was because of the terminators for some reason. I also tried the newline (\n) terminator, and both, but nothing worked.

Since I was worried about formatting, I decided to look at the file. My first choice here is XVI32, and when I opened the file, I could see that only a line feed (0x0A) was used.
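For files like this, one common workaround (not necessarily the one Steve landed on) is to pass the line feed explicitly as a hex byte so BULK INSERT stops assuming CRLF:

BULK INSERT MyTable
FROM 'C:\SampleFiles\input.txt'
WITH ( ROWTERMINATOR = '0x0a' );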

Little annoyances like this make me more appreciative of Integration Services (and its mess of little annoyances…).

Scrubbing Data

Tom Norman has a series going on scrubbing data before moving it to lower environments.

Part 1:

Have you ever heard, “but it works on my machine”? Is this because of data perfection in Development and QA or having specific failure conditions? Can you think of all the data scenarios that accompany Production data? What about performance? Why did the application fail? What happens if I add this index?

Here are the reasons I believe you should get a scrubbed version of your production database into your Development, QA and UAT environments.

Part 2:

All of us have Production database servers and hopefully you also have additional database servers for Development, QA and UAT. Some IT shops will also have a Continuous Integration server and maybe other servers. If you only have Production servers this needs to be addressed and is outside the scope of this post. In the locations where I have worked, we also have a Scrub server. The question is, when a script executes, do you know what environment the query is executing in? Most scripts will not care what environment the script executes in but other scripts could cause damage in a Production environment. For example, if the script is removing email addresses so you don’t spam your clients with automated email messages, you would not want the script to execute in a Production environment.

So how do you make your database server environmentally aware?


The concept of a dedicated scrub server is interesting; it’s not something I’ve thought about before.  I’m looking forward to seeing the rest of the series.
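As a purely illustrative answer to Tom’s question (the server names and table below are made up, not his implementation), one simple pattern is to guard scrub scripts with an explicit server-name check before doing anything destructive:

-- Refuse to run anywhere that isn't a known non-production server.
IF @@SERVERNAME NOT IN (N'DEVSQL01', N'QASQL01', N'UATSQL01', N'SCRUBSQL01')
BEGIN
    RAISERROR(N'This scrub script may only run on non-production servers.', 16, 1);
    RETURN;
END;

-- Safe to proceed: for example, neutralize client email addresses.
UPDATE dbo.Client
SET EmailAddress = CONCAT(N'client', ClientID, N'@example.invalid');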

Check Bulk Insert Errors

Tom Staab points out bulk insert allows up to 10 errors by default:

The issue was that the last row in a text file contained the row count, so he needed to bulk import all but that last row.

My solution was to set maxerrors to 1 so the import would ignore the last row due to the error. Any other row with an error would still fail the import. This reminded me of one of my least favorite defaults in SQL Server, so I decided to write about it here as well. A lot of people don’t realize this, but by default a bulk insert will only fail after 10 errors. Why not 0? I’ve never understood that. If you don’t change the default and then bulk insert 20 rows of data from a file, it will only fail if over half of the rows cause an error.

Keep track of error incidence and what that means for your data.  The default of 10 errors does seem rather strange.
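As a sketch of the fix described in the post (the file and table names are hypothetical): the trailing row-count line fails to parse, MAXERRORS = 1 tolerates exactly that one bad row, and any additional error still fails the load.

BULK INSERT dbo.ImportTarget
FROM 'C:\Imports\data_with_trailer.txt'
WITH
(
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    MAXERRORS = 1   -- the default is 10, which can quietly swallow real data problems
);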
