ETL / ELT – Curated SQL

Data Cleansing Tips in Pandas

Published 2025-07-15 by Kevin Feasel

Data preparation is one of the most time-consuming parts of any data science or analytics project, but it doesn’t have to be. With the proper techniques, Pandas can help you quickly transform messy and complex datasets into clean, ready-to-analyze formats. From handling missing data to reshaping and optimizing your DataFrames, a few tricks can save you hours of work.

In this article, you will discover seven practical Pandas tips that can speed up your data prep process and help you focus more on analysis and less on cleanup.

Two of the tips are basically “use functional programming techniques,” and I’m okay with that.
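The linked article's seven tips aren't reproduced here, so what follows is only a sketch of what the "functional programming" style might look like in Pandas: small functions that each take a DataFrame and return a new one, chained with `DataFrame.pipe`. The sample data, column names, and helper functions are all hypothetical.

```python
import pandas as pd

# Hypothetical messy input, made up for illustration.
raw = pd.DataFrame(
    {
        "Order ID": [1, 2, 2, 3],
        "Amount": ["10.50", "n/a", "n/a", "7"],
        "Region": [" east", "West ", "West ", None],
    }
)


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case, underscore-separated column names."""
    return df.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))


def parse_amount(df: pd.DataFrame) -> pd.DataFrame:
    """Convert amount to a number; values that cannot be parsed become NaN."""
    return df.assign(amount=pd.to_numeric(df["amount"], errors="coerce"))


def tidy_region(df: pd.DataFrame) -> pd.DataFrame:
    """Trim whitespace, title-case the region, and label missing values."""
    return df.assign(region=df["region"].str.strip().str.title().fillna("Unknown"))


# Each step returns a new DataFrame, so `raw` is left untouched.
clean = (
    raw
    .pipe(normalize_columns)
    .drop_duplicates()
    .pipe(parse_amount)
    .pipe(tidy_region)
    .reset_index(drop=True)
)

print(clean)
```

Because every step is a plain function, each one can be tested on its own and reordered or reused in another pipeline.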

The Small Data Showdown in Microsoft Fabric

Published 2025-07-01 by Kevin Feasel

Miles Cole does a bit of testing:

First, let’s revisit the purpose of the benchmark: The objective is to explore data engineering engines available in Fabric to understand whether Spark with vectorized execution (the Native Execution Engine) should be considered in small data architectures.

Beyond refreshing the benchmark to see if any core findings have changed, I do want to expand in a few areas where I got great feedback from the community:

I really appreciate the approach behind this, both for sticking to data sizes that are realistic for many operations and for rerunning the test in light of the recent improvements in each engine.

Incremental Copy Job in Microsoft Fabric now GA

Published 2025-07-01 by Kevin Feasel

Ye Xu has an announcement:

Copy job has been a go-to tool for simplified data ingestion in Microsoft Fabric, offering a seamless data movement experience from any source to any destination. Whether you need batch or incremental copying, it provides the flexibility to meet diverse data needs while maintaining a simple and intuitive workflow.

We continuously refine Copy job based on customer feedback, enhancing both functionality and user experience. In this update, we’re introducing several key improvements designed to streamline your workflow and boost efficiency.

Click through to see what’s new.

Optimizing Multiple Lookup Transformations in SSIS

Published 2025-06-30 by Kevin Feasel

Andy Brownsword doesn’t want to keep hitting the database:

Lookup transformations give us a way to access related values from another source, such as retrieving surrogate keys in data warehousing. When we need multiple lookups to the same reference data, we can improve performance through the use of a Cache.

If we consider data warehousing, a prime example of this would be an order table which has values for Order Date, Dispatch Date, Delivery Date, etc. All of these would require a lookup to a calendar dimension.

This is a perfect use case for a cache.

Read on to see how the cache connector works.
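To make the idea concrete outside of SSIS, here is a rough Python analogy rather than anything SSIS actually runs: the reference data is loaded once into an in-memory dictionary, and all three date columns are resolved against that single copy. The calendar contents, the yyyymmdd-style surrogate key, and every name below are assumptions for illustration.

```python
from datetime import date

query_log: list[str] = []  # records each simulated trip to the reference source


def load_calendar_dimension() -> dict[date, int]:
    """Stand-in for one query against a calendar dimension (June 2025 only)."""
    query_log.append("calendar")
    # Assumed convention: the surrogate key is the date written as yyyymmdd.
    return {date(2025, 6, day): 20250600 + day for day in range(1, 31)}


# Hypothetical order rows, each with three dates that need a calendar key.
orders = [
    {
        "order_id": 1,
        "order_date": date(2025, 6, 2),
        "dispatch_date": date(2025, 6, 3),
        "delivery_date": date(2025, 6, 5),
    },
    {
        "order_id": 2,
        "order_date": date(2025, 6, 28),
        "dispatch_date": date(2025, 6, 30),
        "delivery_date": date(2025, 7, 1),  # not in the cached calendar
    },
]

DATE_COLUMNS = ("order_date", "dispatch_date", "delivery_date")

# Load the reference data once, then reuse it for every lookup.
calendar_cache = load_calendar_dimension()


def add_date_keys(order: dict, cache: dict[date, int]) -> dict:
    """Return a copy of the order with a *_key entry for each date column."""
    keyed = dict(order)
    for column in DATE_COLUMNS:
        keyed[column + "_key"] = cache.get(order[column])  # None when no match
    return keyed


keyed_orders = [add_date_keys(order, calendar_cache) for order in orders]

print(f"reference loads: {len(query_log)}")
```

Six lookups are served by a single load of the reference data, which is the saving the quoted passage describes.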

Shortcut Transformations in Microsoft Fabric

Published 2025-06-30 by Kevin Feasel

Miquella de Boer has an announcement:

Shortcut transformations is a new capability in Microsoft Fabric that simplifies the process of converting raw files, starting with .CSV files, into Delta tables. This feature eliminates the need for traditional ETL pipelines, enabling users to transform data directly on top of files with minimal setup.

Click through to see how it all works.

Copying Data in dbatools

Published 2025-06-30 by Kevin Feasel

Haripriya Naidu makes a copy:

If you want to copy a large volume of data from one SQL Server instance to another, try dbatools, which is a PowerShell module.

In the demo here, I’ve compared two dbatools commands for moving data from one SQL Server instance to another:

Write-DbaDbTableData vs Copy-DbaDbTableData

Click through to see which one wins the speed challenge.

Azure Data Factory Publishing Everything instead of Incremental Changes

Published 2025-06-16 by Kevin Feasel

Ed Elliott troubleshoots an issue:

I recently encountered an interesting issue with ADF where the publish feature suddenly attempted to republish every single object, claiming they were new, despite having incrementally published changed objects for some time.

We were using the publish feature where you work on a branch until you are happy, then you raise a PR to main, merge to main, and then switch back to ADF and click publish to push the changes to the adf_publish branch.

Click through for the answer. I also love how Ed’s tl;dr is “too bad, read it anyhow.”

Azure Data Factory Data Flow Logging

Published 2025-06-16 by Kevin Feasel

Rayis Imayev does a bit of logging:

Azure Data Factory is no exception when it comes to logging options. All your debug or triggered pipeline executions (the parameters passed during execution, statuses, timings, durations, and more) can be monitored natively in Azure Data Factory Studio. Once you immerse yourself in the realm of previously executed pipelines and start seeing all the activities, input values, output results, and variables being transformed in ways that can only be understood by examining internal expressions and many other details, you begin to feel like an investigator meticulously analyzing everything.

Read on to see what kinds of logging options are available and how you can work with them.

Invoking Child Pipelines in Microsoft Fabric

Published 2025-06-04 by Kevin Feasel

Meagan Longoria spots the fork in the road:

At the moment there are two activities in Fabric pipelines that allow you to execute a “child” pipeline. They are both named “Invoke Pipeline” but are differentiated by the labels “Legacy” and “Preview” in parentheses.

Read on to learn more about these two and why choosing the new one may not always be the best option for you, at least not yet.

Salesforce to Purchase Informatica for $8 Billion

Published 2025-05-28 by Kevin Feasel

Alex Woodie prints the news:

It’s been 13 months since Salesforce and Informatica called off their first attempt at an acquisition. But the second time appears to be the charm, as Informatica today announced that Salesforce will buy it for $8 billion.

Informatica was founded in 1993 to serve the burgeoning market for data integration tools, in particular the need for extract, transform, and load (ETL) tools for early data warehouses. Companies at the time needed to pull transactional data out of mainframes, midrange, and Unix systems, transform the data into a suitable format, and then load it into their analytical database.

It will be interesting to see what comes out of this.


Category: ETL / ELT