Press "Enter" to skip to content

Category: Synapse Analytics

From Azure Data Factory to Synapse Pipelines

Kevin Chant copies and pastes:

In this post I want to share an alternative way to copy an Azure Data Factory pipeline to Synapse Studio. Because I think it can be useful.

For those who are not aware, Synapse Studio is the frontend that comes with Azure Synapse Analytics. You can find out more about it in another post I did, which was a five minute crash course about Synapse Studio.

By the end of this post, you will know one way to copy objects used for an Azure Data factory pipeline to Synapse Studio. Which works as long as both are configured to use Git.

Click through to see how.

Comments closed

Spark Performance Improvements in Azure Synapse

Balaji Sankaran shows improvements Microsoft has made over open-source Apache Spark 3 in Azure Synapse Analytics:

Azure Synapse Analytics is continually focused on delivering a highly performant and scalable platform for supporting Spark Workload. We are focused on improving the query performance for the typical workload patterns that we see with our customers. By combining the latest open-source updates in Apache Spark with our team’s focus on performance updates we have made significant performance gains in standard TPC-DS benchmarking tests.

I expect it will never be as fast as what Databricks can do, but getting a 2x performance improvement over the open source version of Spark is nothing to sneeze at.

Comments closed

Distribution Techniques in Azure Synapse Analytics

Gauri Mahajan takes us through three distribution techniques when working with Azure Synapse Analytics dedicated SQL pool tables:

Data warehouses host much larger volumes of data compared to transactional databases, the volume of reads is much more compared to writes and queries tend to result in much larger result sets compared to queries that retrieve scalar values or paginated record sets from transactional databases. Due to this nature of data warehouses, there is a higher impetus on the server to perform faster. Modern data warehouses like AWS Redshift, Azure Synapse, Snowflake and others employ approaches like data sharding where data is distributed horizontally on multiple nodes which process data in parallel. This approach is highly scalable as nodes can be easily added to a data cluster as the storage and performance need increases. One key aspect that is different for tables hosted on such data warehouses is that tables are distributed horizontally using different distribution algorithms, so that all the nodes in an Azure Synapse cluster have an equal share of responsibility for hosting, processing, and delivering data for any given query to maximize performance.

In this article, we will learn about the table distribution styles supported in an Azure Synapse and how to use them for creating distributed tables.

Read on to learn more. This is an example of something we don’t think about on the SQL Server side, so when moving to Azure Synapse Analytics dedicated SQL pools, it can be easy to get this wrong and end up with sub-optimal performance.

Comments closed

Interchangability between ADF and Synapse Integration Pipelines

Paul Andrew makes a discovery:

Inspired by an earlier blog where we looked at ‘How Interchangeable Delta Tables Are Between Databricks and Synapse‘ I decided to do a similar exercise, but this time with the integration pipeline components taking centre stage.

As I said in my previous blog post, the question in the heading of this blog should be incredibly pertinent to all solution/technical leads delivering an Azure based data platform solution so to answer it directly:

Read on to learn the answer.

Comments closed

Configuring Classifiers for Workload Management

Reiss McSporran continues a series on workload management in Azure Synapse Analytics:

So in part 1 we found out what Workload Management was and why we wanted to use it. (If you haven’t read that part, please click here and read that post before you carry on, as we’re jumping in at the deep end!) In this part we’ll be learning how to configure it.

Continuing from the end of the previous post, you have two main options to configure this. First is to assign a user or role to one of the existing system defined workload groups, second is to create a custom workload group and assign a user or role to this instead.

Read on to learn how to create classifiers, what the rules look like for them, and their importance.

Comments closed

Querying Private Blob Storage Containers with Azure Synapse Analytics

Dennes Torres looks at some private information:

The queries from the previous article were made against the public container in the blob storage. However, if the container is private, you will need to authenticate with the container. In this article, you’ll learn how to query private blob storage with SQL.

NOTE: Be sure that the Azure Synapse Workspace and the storage account with the sample files are set up before following along with this article. You will also need to replace your storage account URL each time that a storage account URL is used in the article.

There are three possible authentication methods, and these methods may have some variation according to the type of storage account and the access configuration. I will not dig into details about storage here and leave that for a future article.

Read on for the three authorization methods and a lot of detail on using SAS tokens (the preferred method) to access this data.

Comments closed

A Review of Serverless SQL Pools

Teo Lachev lays out a review of serverless SQL pools in Azure Synapse Analytics:

Being able to query files using SQL is great. Some of you might remember the U-SQL language that was introduced a few years ago alongside Azure Data Lake Storage (ADLS) Gen 1 which is now deprecated. It never caught up because it looked like SQL but it wasn’t (it was actually closer to C#). Now we’re talking about real SQL. To query files! This opens the possibility to implement a logical warehouse (the emphasis is on logical as everyone to my knowledge who tried to replace a data warehouse with a data lake has failed). Or, you can connect Power BI to the serverless endpoint and start querying all these files in DirectQuery mode. So, this enables real-time BI on top of file extracts.

Click through for the pros and cons of using serverless SQL pools today.

Comments closed

Moving Artifacts between Folders in Synapse Studio

Wolfgang Strasser looks at a recent update:

Another small but very powerful usability extension in Azure Synapse Studio was added at the beginning of June: Move artifacts across folders in Synapse Studio (without extra clicks but with drag&drop)

Once again, the release notes list contained the short sentence that made me curious… hmm… that sound nice… In one of my previous post, I described the “old” way of moving artifacts around in Synapse Studio.

Click through for a demonstration.

Comments closed

Workload Management in Azure Synapse Analytics

Reiss McSporran starts a new series on workload management in Azure Synapse Analytics:

So you’ve started using Synapse SQL Pools, you’ve scaled up to improve performance, but your queries aren’t going as fast as you’d hoped. You take a look at resource utilisation and you see that as you scale up, your total resource utilisation per query goes down. But didn’t we scale up to allow our query to use more resource and run quicker? What’s going on here? A little thing called Workload Management.

What is Workload Management? At a very high level, Workload Management allows you to control how your workload uses system resources. The main problem here is defaults!

Click through for the full story. Or at least the full part 1 of the even fuller story.

Comments closed