Press "Enter" to skip to content

Category: ETL / ELT

Optimizing Multiple Lookup Transformations in SSIS

Andy Brownsword doesn’t want to keep hitting the database:

Lookup transformations provide us a way to access related values from another source, such as retrieving surrogate keys in data warehousing. When we need multiple lookups to the same reference data, we can improve performance through the use of a cache.

If we consider data warehousing, a prime example of this would be an order table which has values for Order Date, Dispatch Date, Delivery Date, etc. All of these would require a lookup to a calendar dimension.

This is a perfect use case for a cache.

Read on to see how the cache connector works.
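The configuration lives in the SSIS designer rather than in code, but the underlying idea is simple: load the calendar dimension into memory once, then probe that same cache for each date column instead of issuing a fresh database lookup per column. Here is a rough conceptual sketch of that pattern in Python (hypothetical column names; an analogy for what the cache connection manager does, not SSIS itself):

```python
# Load the calendar dimension once into an in-memory cache
# (date value -> surrogate key), then reuse it for every lookup.
calendar_cache = {}

def load_cache(rows):
    for date_value, date_key in rows:
        calendar_cache[date_value] = date_key

def resolve_order(order):
    # Three lookups, one cache: no extra round trips to the database.
    return {
        "OrderDateKey": calendar_cache.get(order["OrderDate"]),
        "DispatchDateKey": calendar_cache.get(order["DispatchDate"]),
        "DeliveryDateKey": calendar_cache.get(order["DeliveryDate"]),
    }

load_cache([("2025-05-01", 20250501), ("2025-05-03", 20250503)])
print(resolve_order({
    "OrderDate": "2025-05-01",
    "DispatchDate": "2025-05-03",
    "DeliveryDate": "2025-05-03",
}))
```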


Copying Data in dbatools

Haripriya Naidu makes a copy:

If you want to copy a large amount of data from one SQL Server instance to another, try dbatools, which is a PowerShell module under the hood.

In the demo here, I’ve compared two dbatools commands to move data from one SQL Server instance to another:

Write-DbaDbTableData vs Copy-DbaDbTableData

Click through to see which one wins the speed challenge.


Azure Data Factory Publishing Everything instead of Incremental Changes

Ed Elliott troubleshoots an issue:

I recently encountered an interesting issue with ADF where the publish feature suddenly attempted to republish every single object, claiming they were new, despite having incrementally published changed objects for some time.

We were using the publish feature where you work on a branch until you are happy, then you raise a PR to main, merge to main, and then switch back to ADF and click publish to push the changes to the adf_publish branch.

Click through for the answer. I also love how Ed’s tl;dr is “too bad, read it anyhow.”


Azure Data Factory Data Flow Logging

Rayis Imayev does a bit of logging:

Azure Data Factory is no exception when it comes to logging options. All of your debug or triggered pipeline executions (their parameters passed during execution, statuses, timings, durations, and more) can be monitored natively in Azure Data Factory Studio. Once you immerse yourself in the realm of previously executed pipelines and start seeing all activities, passed input values, processed output results, and variables transformed into something that can only be understood by examining internal expressions and many other details, you begin to feel like an investigator meticulously analyzing everything.

Read on to see what kinds of logging options are available and how you can work with them.
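Rayis focuses on what the Studio surfaces natively. As a complementary sketch (not from the post), the same run history is also reachable programmatically, for example through the azure-mgmt-datafactory Python SDK; the subscription, resource group, and factory names below are placeholders:

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholder identifiers -- substitute your own.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Pull pipeline runs from the last 24 hours.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
runs = client.pipeline_runs.query_by_factory(
    "<resource-group>", "<factory-name>", filters
)

for run in runs.value:
    print(run.pipeline_name, run.status, run.duration_in_ms)
```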


Invoking Child Pipelines in Microsoft Fabric

Meagan Longoria spots the fork in the road:

At the moment there are two activities in Fabric pipelines that allow you to execute a “child” pipeline. They are both named “Invoke Pipeline” but are differentiated by the labels “Legacy” and “Preview” in parentheses.

Read on to learn more about these two and why choosing the new one may not always be the best option for you, at least not yet.


Salesforce to Purchase Informatica for $8 Billion

Alex Woodie prints the news:

It’s been 13 months since Salesforce and Informatica called off their first attempt at an acquisition. But the second time appears to be the charm, as Informatica today announced that Salesforce will buy it for $8 billion.

Informatica was founded in 1993 to serve the burgeoning market for data integration tools, in particular the need for extract, transform, and load (ETL) tools for early data warehouses. Companies at the time needed to pull transactional data out of mainframe, midrange, and Unix systems, transform the data into a suitable format, and then load it into their analytical database.

It will be interesting to see what comes out of this.


Common Data Transformations in Microsoft Fabric

Nikola Ilic takes us through several data transformations:

In the lakehouse, for example, you can transform the data by using PySpark, but also Spark SQL, which is VERY similar to Microsoft’s dialect of SQL, called Transact-SQL (T-SQL for short). In the warehouse, you can apply transformations using T-SQL, but Python is also an option by leveraging a special pyodbc library. Finally, in the KQL database, you can run both KQL and T-SQL statements. As you may rightly assume, the lines are blurred, and sometimes the path is not 100% clear.

Therefore, in this article, I’ll explore five common data transformations and how to perform each one using three Fabric languages: PySpark, T-SQL, and KQL.

Click through for those transformations, such as extracting date parts, fixing casing, and pivoting data.
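Nikola walks through all five in each of the three languages. As a quick taste of the PySpark side, here is a minimal sketch of two of those transformations, extracting date parts and fixing casing, against a toy DataFrame (the column names are made up for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data with a string date and inconsistent name casing.
orders = spark.createDataFrame(
    [("2025-05-27", "alice SMITH"), ("2025-06-01", "bob jones")],
    ["order_date", "customer_name"],
)

transformed = (
    orders
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("order_year", F.year("order_date"))    # extract date parts
    .withColumn("order_month", F.month("order_date"))
    .withColumn("customer_name",                       # fix casing
                F.initcap(F.lower("customer_name")))
)
transformed.show()
```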


Shortcut Caching in Microsoft Fabric now GA

Trevor Olson announces a feature has become generally available:

Shortcuts in OneLake allow you to quickly and easily source data from external cloud providers and use it across all Fabric workloads such as Power BI reports, SQL, Spark, and Kusto. However, each time these workloads read data from cross-cloud sources, the source provider (AWS, GCP) charges additional egress fees on the data. Thankfully, shortcut caching allows the data to be sourced only once and then used across all Fabric workloads without additional egress fees.

This is useful for data that hardly ever changes, and Trevor also shows you who can control the cache length and reset the cache. In addition, the on-premises gateway for shortcuts is now generally available, so you can take shortcuts of certain on-prem file systems.


Kafka Data Exploration with Tableflow

Robin Moffatt does some exploratory data analysis:

One of the challenges that I’d always had when it came to building streaming data pipelines is that once data is in a Kafka topic, it becomes trickier to query. Whether limited by the available tools to do this or the speed of access, querying Kafka is just not a smooth experience.

This blog post will show you a really nice way of exploring and validating data in Apache Kafka®. We’ll use Tableflow to expose the Kafka topics as Apache Iceberg™️ tables and then query them using standard SQL tools.

Click through for the demonstration using a real dataset.
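Robin’s demo runs through Tableflow and standard SQL tooling. If you would rather poke at an Iceberg table from Python, one adjacent option (not the tooling from the post) is DuckDB’s iceberg extension; in this sketch, the table location and column name are placeholders:

```python
import duckdb

con = duckdb.connect()
# The iceberg extension provides iceberg_scan() for reading Iceberg tables.
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Placeholder location -- point this at your Tableflow-exposed table.
rows = con.execute("""
    SELECT some_column, count(*) AS row_count
    FROM iceberg_scan('s3://bucket/path/to/iceberg-table')
    GROUP BY some_column
    ORDER BY row_count DESC
""").fetchall()
print(rows)
```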
