Category: ETL / ELT

Using a Child Pipeline Variable in a Parent Pipeline in Fabric Data Factory

Justin Bird passes back some information:

I answered a question on the Fabric community on return variables recently and thought I would expand upon it in a blog post. The question was how to use a variable derived in a child pipeline downstream in the parent pipeline. The person was specifically deriving a json object and wanted to iterate on the values in the parent pipeline.

Click through for the solution.

Optimizing Multi-Notebook Jobs in Microsoft Fabric and AWS Glue

Daniel Janik flips a switch:

Are your Azure Fabric pipelines with multiple notebooks running slower than you’d like? Are you paying for more Spark compute time than you should be? The culprit might be a simple setting that’s easy to miss. In this blog post, we’ll dive into the “For pipeline running multiple notebooks” setting in Azure Fabric and explain why enabling it can significantly improve your pipeline’s performance and reduce your costs.

Click through for this, as well as a comparison with AWS Glue and ways to perform something similar there.

Data Cleansing Tips in Pandas

Jayita Gulati shares some tips:

Data preparation is one of the most time-consuming parts of any data science or analytics project, but it doesn’t have to be. With the proper techniques, Pandas can help you quickly transform messy and complex datasets into clean, ready-to-analyze formats. From handling missing data to reshaping and optimizing your DataFrames, a few tricks can save you hours of work.

In this article, you will discover seven practical Pandas tips that can speed up your data prep process and help you focus more on analysis and less on cleanup.

Two of the tips are basically “use functional programming techniques,” and I’m okay with that.
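
As a rough illustration of what "functional programming techniques" can look like in Pandas, here is a minimal sketch that chains small, pure cleanup steps with `pipe` and `assign`. The helper names (`drop_incomplete`, `normalize_text`) and the sample data are hypothetical and are not taken from the linked article.

```python
import pandas as pd


def drop_incomplete(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Return a new frame without rows that lack any required column."""
    return df.dropna(subset=required)


def normalize_text(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new frame with one text column trimmed and lower-cased."""
    return df.assign(**{column: df[column].str.strip().str.lower()})


# Hypothetical messy input, invented for illustration only.
raw = pd.DataFrame(
    {
        "city": [" Raleigh", "DURHAM ", None, "Cary"],
        "sales": [100.0, None, 50.0, 75.0],
    }
)

# Each step takes a DataFrame and returns a new one, so the steps compose
# in a single chain and the original frame is left untouched.
clean = (
    raw
    .pipe(drop_incomplete, required=["city"])
    .pipe(normalize_text, column="city")
    .assign(sales=lambda d: d["sales"].fillna(0.0))
    .reset_index(drop=True)
)
```

Because no step mutates its input, each helper can be tested on its own and reordered within the chain.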

The Small Data Showdown in Microsoft Fabric

Miles Cole does a bit of testing:

First, let’s revisit the purpose of the benchmark: The objective is to explore data engineering engines available in Fabric to understand whether Spark with vectorized execution (the Native Execution Engine) should be considered in small data architectures.

Beyond refreshing the benchmark to see if any core findings have changed, I do want to expand in a few areas where I got great feedback from the community:

I really appreciate the approach behind this, both in terms of sticking to more realistic data sizes for many operations as well as performing this test given all of the recent improvements in each engine.

Incremental Copy Job in Microsoft Fabric now GA

Ye Xu has an announcement:

Copy job has been a go-to tool for simplified data ingestion in Microsoft Fabric, offering a seamless data movement experience from any source to any destination. Whether you need batch or incremental copying, it provides the flexibility to meet diverse data needs while maintaining a simple and intuitive workflow.

We continuously refine Copy job based on customer feedback, enhancing both functionality and user experience. In this update, we’re introducing several key improvements designed to streamline your workflow and boost efficiency.

Click through to see what’s new.

Optimizing Multiple Lookup Transformations in SSIS

Andy Brownsword doesn’t want to keep hitting the database:

Lookup transformations provide us a way to access related values from another source, such as retrieving surrogate keys in data warehousing. When we need multiple lookups to the same reference data we can improve performance through the use of a Cache.

If we consider data warehousing, a prime example of this would be an order table which has values for Order Date, Dispatch Date, Delivery Date, etc. All of these would require a lookup to a calendar dimension.

This is a perfect use case for a cache.

Read on to see how the cache connector works.
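
The idea in the excerpt can be sketched outside of SSIS. The following is a plain Python analogy, not SSIS code: the calendar dimension is read once into an in-memory dictionary, and every date column on an order is then resolved against that one cache rather than querying the reference source per lookup. The table contents and the names (`CALENDAR_ROWS`, `load_calendar_cache`, `resolve_order_keys`) are hypothetical.

```python
from datetime import date

# Hypothetical calendar dimension rows: (surrogate date key, calendar date).
CALENDAR_ROWS = [
    (20240101, date(2024, 1, 1)),
    (20240102, date(2024, 1, 2)),
    (20240105, date(2024, 1, 5)),
]


def load_calendar_cache(rows):
    """Read the reference data once and index it by calendar date."""
    return {calendar_date: date_key for date_key, calendar_date in rows}


def resolve_order_keys(order, cache):
    """Resolve every date column on an order against the same cache."""
    return {
        "order_date_key": cache.get(order["order_date"]),
        "dispatch_date_key": cache.get(order["dispatch_date"]),
        "delivery_date_key": cache.get(order["delivery_date"]),
    }


# One load of the reference data serves all three lookups.
calendar_cache = load_calendar_cache(CALENDAR_ROWS)

order = {
    "order_date": date(2024, 1, 1),
    "dispatch_date": date(2024, 1, 2),
    "delivery_date": date(2024, 1, 5),
}
keys = resolve_order_keys(order, calendar_cache)
```

A date missing from the cache resolves to `None` here. How unmatched rows should be handled in a real package is a separate design decision.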

Copying Data in dbatools

Haripriya Naidu makes a copy:

If you want to copy a huge amount of data from one SQL Server to another, try using dbatools, which has a PowerShell module underneath.

In the demo here, I’ve compared 2 dbatools commands to move data from one SQL Server to another:

Write-DbaDbTableData vs Copy-DbaDbTableData

Click through to see which one wins the speed challenge.

Azure Data Factory Publishing Everything instead of Incremental Changes

Ed Elliott troubleshoots an issue:

I recently encountered an interesting issue with ADF where the publish feature suddenly attempted to republish every single object, claiming they were new, despite having incrementally published changed objects for some time.

We were using the publish feature where you work on a branch until you are happy, then you raise a PR to main, merge to main, and then switch back to ADF and click publish to push the changes to the adf_publish branch.

Click through for the answer. I also love how Ed’s tl;dr is “too bad, read it anyhow.”

Azure Data Factory Data Flow Logging

Rayis Imayev does a bit of logging:

Azure Data Factory is no exception when it comes to logging options. All your debug or triggered pipeline executions (their parameters passed during execution, statuses, timings, durations, and more) can be monitored natively in Azure Data Studio. Once you immerse yourself in the realm of previously executed pipelines and start seeing all activities, passed input values, processed output results, and variables being transformed into something else that can only be understood by examining internal expressions and many other details, you begin to feel like an investigator meticulously analyzing everything.

Read on to see what kinds of logging options are available and how you can work with them.
