
Category: ETL / ELT

Spark ELT in Synapse Notebooks

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples and useful tips to help you build your solution on top of a data lake. Most of the examples in this post use pandas, and I hope they will be as helpful for you as they were for me.

Please note that all examples in this post use pyspark.

In my scenario, I exported multiple tables from SQLDB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.
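As a rough illustration of that parallel-export pattern, something like the following pyspark sketch would work in a Synapse notebook. The JDBC connection details, table list, and lake path are all hypothetical, and `spark` is the session the notebook provides for you:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source and sink details -- adjust to your environment.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
tables = ["dbo.Customers", "dbo.Orders", "dbo.OrderLines"]

def export_table(table: str) -> None:
    # Read one table from the SQL DB over JDBC
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", table)
          .option("user", "etl_user")
          .option("password", "<secret>")
          .load())
    # Land it in the lake as parquet, one folder per table
    df.write.mode("overwrite").parquet(
        f"abfss://raw@mylake.dfs.core.windows.net/sqldb/{table}")

# Spark accepts job submissions from multiple threads, so a small
# thread pool is enough to run the exports in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(export_table, tables))
```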

Comments closed

Automating Self-Hosted Integration Runtime Deployment

Jonathan D’Aloia doesn’t want to click next-next-next:

Welcome to my blog on how to fully automate the deployment of a Self-Hosted Integration Runtime using Terraform!

The title of this blog is very much self-explanatory, but I hope you find the contents useful and are able to apply this to your own projects in some way.

Click through for a brief overview of self-hosted integration runtimes, the process to follow, and a link to the repo.
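The post builds this in Terraform, but for a sense of what the automation has to do, here is a minimal Python sketch of the same two steps using the Azure SDK (all resource names are invented): register the self-hosted IR, then pull the authentication key you would hand to the installer on the VM.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Step 1: register the self-hosted IR in the data factory
client.integration_runtimes.create_or_update(
    resource_group_name="my-rg",
    factory_name="my-adf",
    integration_runtime_name="my-shir",
    integration_runtime=IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="Scripted SHIR")
    ),
)

# Step 2: grab an auth key to feed to the SHIR installer on the VM
keys = client.integration_runtimes.list_auth_keys("my-rg", "my-adf", "my-shir")
print(keys.auth_key1)
```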

Comments closed

Data Pipelines and Data Mesh

Jean-Georges Perrin answers a burning question:

I keep having questions about data pipelines. Data pipelines in Data Mesh is a topic I should tackle. So… Is the data pipeline the root of all evil?

Jean-Georges’s answer is quite in line with one of my favorite phrases: “Short answer: no, with an ‘if’; long answer: yes, with a ‘but.'” Read on for some thoughts on data pipelines and what the data mesh concept does to minimize harm.

Comments closed

Trying out Azure Synapse Link for SQL Server 2022

Kevin Chant looks at Azure Synapse Link for SQL Server 2022:

My first topic is a new feature that covers both SQL Server 2022 and Azure: Azure Synapse Link, or to be more precise, Azure Synapse Link for SQL Server 2022.

I have been doing various tests with this feature recently, which has led to some interesting blog posts about Azure Synapse Link for SQL Server 2022.

Read on for a few more thoughts, as well as deployment scripts via Azure DevOps and GitHub Actions.

Comments closed

Orchestrating Synapse Notebooks and Spark Jobs from ADF

Abhishek Narain has an announcement:

Today, we are introducing support for orchestrating Synapse notebooks and Synapse spark job definitions (SJD) natively from Azure Data Factory pipelines. This immensely helps customers who have invested in ADF and Synapse Spark, without requiring them to switch to Synapse Pipelines for orchestrating Synapse notebooks and SJDs.

NOTE: Synapse notebook and SJD activities were previously available only in Synapse Pipelines.

If you’re familiar with Synapse Pipelines, the equivalent ADF operations are extremely similar, as you’d probably expect.
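If you deploy your factories as code, the new activities should slot into pipeline JSON like any other. Here is a hypothetical sketch via the ARM REST API; note that the `SynapseNotebook` activity type and the linked service shape are my assumptions based on the Synapse Pipelines equivalent, so check the JSON the ADF designer generates for the authoritative form.

```python
import requests
from azure.identity import DefaultAzureCredential

sub, rg, factory = "<subscription-id>", "my-rg", "my-adf"
url = (
    f"https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}"
    f"/providers/Microsoft.DataFactory/factories/{factory}"
    f"/pipelines/RunSynapseNotebook?api-version=2018-06-01"
)

pipeline = {
    "properties": {
        "activities": [{
            "name": "RunNotebook",
            "type": "SynapseNotebook",  # assumed activity type name
            "linkedServiceName": {
                "referenceName": "SynapseWorkspaceLS",  # hypothetical linked service
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "notebook": {"referenceName": "my_notebook", "type": "NotebookReference"}
            },
        }]
    }
}

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
resp = requests.put(url, json=pipeline,
                    headers={"Authorization": f"Bearer {token}"}, timeout=30)
resp.raise_for_status()
```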

Comments closed

Limiting Data Factory Users to Trigger Pipelines

Koen Verbeeck doesn’t want people running amok:

Typically you have a bunch of pipelines that are started by one or more triggers. Sometimes, a pipeline needs to be manually triggered. For example, when the finance department is closing the fiscal year, they probably want to run the ETL pipeline a couple of times on-demand, to make sure their latest changes are reflected in the reports. Since you don’t want them to contact you every time to start a pipeline, it might be an idea to give them permission to start the pipeline themselves.

This can obviously be done by tools such as Azure Logic Apps or a Power App, but in my case the users also wanted to view the progress of the pipeline (did something crash? Why is it taking so long? etc.), and developing a Power App with all those features seemed a bit cumbersome to me. Instead, we gave them permission on ADF itself so they can start the pipelines. There's one problem, though: there's only one built-in role for ADF in Azure, and it's the Contributor role. That's a bit too much permission, as anyone with that role can change anything in ADF. You don't want that.

So what can you do? Click through to find out.
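Koen's fix centers on a custom role that can start and watch pipeline runs and nothing more. As a sketch of creating such a role with the Azure SDK (the role name and the precise set of read actions are my guesses; `createrun` is the action that starts a pipeline):

```python
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import Permission, RoleDefinition

scope = "/subscriptions/<subscription-id>/resourceGroups/my-rg"
client = AuthorizationManagementClient(DefaultAzureCredential(), "<subscription-id>")

role = RoleDefinition(
    role_name="Data Factory Pipeline Operator",  # hypothetical name
    description="Start pipeline runs and monitor them; change nothing",
    permissions=[Permission(actions=[
        "Microsoft.DataFactory/factories/read",
        "Microsoft.DataFactory/factories/pipelines/read",
        "Microsoft.DataFactory/factories/pipelines/createrun/action",
        "Microsoft.DataFactory/factories/pipelineruns/read",
    ])],
    assignable_scopes=[scope],
)

# Custom role definitions are keyed by a GUID of your choosing
client.role_definitions.create_or_update(scope, str(uuid.uuid4()), role)
```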

Comments closed

Error Handling Patterns in ADF Pipelines

Chenye Charlie Zhu begins a new series:

Orchestration allows conditional logic and enables users to take different paths based upon the outcome of a previous activity. Building upon the concept of conditional paths, ADF and Synapse pipelines allow users to build versatile and resilient workflows that can handle unexpected errors and still run smoothly in auto-pilot mode.

This is an ongoing series that will gradually level up, helping you build ever more complicated logic to handle more scenarios. We will walk through examples for some common use cases and help you build functional and useful workflows.

Read on for a few error-handling patterns.
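The core mechanism behind these patterns is the dependency condition between activities: Succeeded, Failed, Skipped, and Completed. A minimal sketch with the Python SDK, using Wait activities as stand-ins for real work:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    PipelineResource,
    WaitActivity,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

main = WaitActivity(name="MainWork", wait_time_in_seconds=1)

# Runs only when MainWork fails -- the error-handling branch
on_error = WaitActivity(
    name="LogFailure", wait_time_in_seconds=1,
    depends_on=[ActivityDependency(activity="MainWork",
                                   dependency_conditions=["Failed"])],
)

# Runs whether MainWork succeeds or fails -- a try/finally-style cleanup
cleanup = WaitActivity(
    name="Cleanup", wait_time_in_seconds=1,
    depends_on=[ActivityDependency(activity="MainWork",
                                   dependency_conditions=["Completed"])],
)

client.pipelines.create_or_update(
    "my-rg", "my-adf", "ErrorHandlingDemo",
    PipelineResource(activities=[main, on_error, cleanup]),
)
```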

Comments closed

Logic App Errors with Variables in Sharepoint Actions

Koen Verbeeck troubleshoots an issue:

I have a Logic App that reads out a SharePoint library and stores all the documents found into Azure Blob Storage (ADF only supports Lists). I was trying to make this Logic App “generic”, meaning I could change the source folder and the destination container by using variables. That way, I have one single Logic App which can read out any SharePoint library, instead of creating a new Logic App for each library.

So I adapted my HTTP trigger to accept a JSON payload, which contains the name of the folder on SharePoint and the name of the blob container.

Read on to see the error message, as well as how Koen resolved the problem.
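For what it's worth, the payoff of a generic Logic App like this is that callers only need to post the right JSON body to the HTTP trigger. A hypothetical invocation (the URL and field names are invented for illustration):

```python
import requests

# The HTTP trigger URL comes from the Logic App designer; this one is fake.
trigger_url = ("https://prod-00.westeurope.logic.azure.com/workflows/"
               "<workflow-id>/triggers/manual/paths/invoke?<sas-params>")

payload = {
    "sharePointFolder": "Shared Documents/Finance",  # source library folder
    "blobContainer": "finance-docs",                 # destination container
}

resp = requests.post(trigger_url, json=payload, timeout=30)
resp.raise_for_status()
```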

Comments closed

Cosmos DB to Data Explorer Synapse Link

Vincent-Philippe Lauzon makes an announcement:

We recently made our new Kusto data connection available in public preview: Cosmos DB to Azure Data Explorer Synapse Link.

This does look like a marketing-heavy announcement, but the short version is that you can ingest data from Cosmos DB into Data Explorer pools via Synapse Link rather than creating your own ETL process. The previous Cosmos DB connector for Synapse Link was tied to a dedicated SQL pool.

Comments closed

Row-Level Security and Data Migration

Forrest McDaniel shares an interesting case of using row-level security:

This was the situation I found myself in earlier this year – our company had absorbed another, and it was time to slurp up their tables. There were a lot of decisions to make and tradeoffs to weigh, and we ended up choosing to trickle-insert their data, but make it invisible to normal use until the moment of cutover.

The way we implemented this was with Row Level Security. Using an appropriate predicate, we could make sure ETL processes only saw migrated data, apps saw unmigrated data, and admins saw everything. To give a spoiler: it worked, but there were issues.

I would not have thought of this scenario. And given the difficulties Forrest & crew ran into, it might be for the best…
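To make the technique concrete, here is a minimal sketch of an RLS setup along these lines, driven from Python via pyodbc. Every object and role name is hypothetical, and the predicate is far simpler than whatever Forrest actually used.

```python
import pyodbc

# Hypothetical connection details; autocommit so the DDL takes effect
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=deploy_user;PWD=<secret>",
    autocommit=True,
)

# Predicate: ETL sees migrated rows, apps see unmigrated rows,
# and a data-admin role sees everything.
conn.execute("""
CREATE FUNCTION dbo.fn_MigrationPredicate(@IsMigrated bit)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT 1 AS AccessGranted
WHERE (@IsMigrated = 1 AND IS_ROLEMEMBER('etl_users') = 1)
   OR (@IsMigrated = 0 AND IS_ROLEMEMBER('app_users') = 1)
   OR IS_ROLEMEMBER('data_admins') = 1
""")

# Bind the predicate to the table as a filter
conn.execute("""
CREATE SECURITY POLICY dbo.MigrationFilter
ADD FILTER PREDICATE dbo.fn_MigrationPredicate(IsMigrated) ON dbo.Customers
WITH (STATE = ON)
""")
```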

Comments closed