Press "Enter" to skip to content

Category: ETL / ELT

The Small Data Showdown in Microsoft Fabric

Miles Cole does a bit of testing:

First, let’s revisit the purpose of the benchmark: The objective is to explore data engineering engines available in Fabric to understand whether Spark with vectorized execution (the Native Execution Engine) should be considered in small data architectures.

Beyond refreshing the benchmark to see if any core findings have changed, I do want to expand in a few areas where I got great feedback from the community:

I really appreciate the approach behind this, both in terms of sticking to more realistic data sizes for many operations as well as performing this test given all of the recent improvements in each engine.

Leave a Comment

Incremental Copy Job in Microsoft Fabric now GA

Ye Xu has an announcement:

Copy job has been a go-to tool for simplified data ingestion in Microsoft Fabric, offering a seamless data movement experience from any source to any destination. Whether you need batch or incremental copying, it provides the flexibility to meet diverse data needs while maintaining a simple and intuitive workflow.

We continuously refine Copy job based on customer feedback, enhancing both functionality and user experience. In this update, we’re introducing several key improvements designed to streamline your workflow and boost efficiency.

Click through to see what’s new.

Leave a Comment

Optimizing Multiple Lookup Transformations in SSIS

Andy Brownsword doesn’t want to keep hitting the database:

Lookup transformations provide us a way to access related values from another source, such as retrieving surrogate keys in data warehousing. When we need multiple lookups to the same reference data we can improve performance through the use of a Cache.

If we consider data warehousing, a prime example of this would be an order table which has values for Order Date, Dispatch Date, Delivery Date, etc. All of these would require a lookup to a calendar dimension.

This is a perfect use case for a cache.

Read on to see how the cache connector works.

Leave a Comment

Azure Data Factory Publishing Everything instead of Incremental Changes

Ed Elliott troubleshoots an issue:

I recently encountered an interesting issue with ADF where the publish feature suddenly attempted to republish every single object, claiming they were new, despite having incrementally published changed objects for some time.

We were using the publish feature where you work on a branch until you are happy, then you raise a PR to main, merge to main, and then switch back to ADF and click publish to push the changes to the adf_publish branch.

Click through for the answer. I also love how Ed’s tl;dr is “too bad, read it anyhow.”

Leave a Comment

Azure Data Factory Data Flow Logging

Rayis Imayev does a bit of logging:

Azure Data Factory is no exception when it comes to logging options. All your debug or triggered pipeline executions—their parameters passed during execution, statuses, timings, durations, and more, can be monitored natively in Azure Data Studio. Once you immerse yourself in the realm of previously executed pipelines and start seeing all activities, passed input values, processed output results, and variables being transformed into something else that can only be understood by examining internal expressions and many other details, you begin to feel like an investigator meticulously analyzing everything.

Read on to see what kinds of logging options are available and how you can work with them.

Leave a Comment

Invoking Child Pipelines in Microsoft Fabric

Meagan Longoria spots the fork in the road:

At the moment there are two activities in Fabric pipelines that allow you to execute a “child” pipeline. They are both named “Invoke Pipeline” but are differentiated by the labels “Legacy” and “Preview” in parentheses.

Read on to learn more about these two and why choosing the new one may not always be the best option for you, at least not yet.

Comments closed

Salesforce to Purchase Informatica for $8 Billion

Alex Woodie prints the news:

It’s been 13 months since Salesforce and Informatica called off their first attempt at an acquisition. But the second time appears to be the charm, as Informatica today announced that Salesforce will buy it for $8 billion.

Informatica was founded in 1993 ago to serve the burgeoning market for data integration tools, in particular the need for extract, transformation, and load (ETL) tools for early data warehouses. Companies at the time needed to pull transactional data out of mainframes, midrange, and Unix systems, transform the data into a suitable format, and then load it into their analytical database.

It will be interesting to see what comes out of this.

Comments closed

Common Data Transformations in Microsoft Fabric

Nikola Ilic takes us through several data transformations:

In the lakehouse, for example, you can transform the data by using PySpark, but also Spark SQL, which is VERY similar to Microsoft’s dialect of SQL, called Transact-SQL (or T-SQL, abbreviated). In the warehouse, you can apply transformations using T-SQL, but Python is also an option by leveraging a special pyodbc library. Finally, in the KQL database, you can run both KQL and T-SQL statements. As you may rightly assume, the lines are blurred, and sometimes the path is not 100% clear.

Therefore, in this article, I’ll explore five common data transformations and how to perform each one using three Fabric languages: PySpark, T-SQL, and KQL.

Click through for those transformations, such as extracting date parts, fixing casing, and pivoting data.

Comments closed