Press "Enter" to skip to content

Category: Synapse Analytics

An Overview of Azure Synapse Analytics

Kevin Chant offers a primer on Azure Synapse Analytics:

In reality, there are a lot of features within Azure Synapse Analytics where your SQL Server background can prove to be useful.

By the end of this post, you will have a good overview of Azure Synapse Analytics. In addition, where your SQL Server background can prove to be useful. Plus, there are plenty of links included in this post.

This is not the slimmest of primers, which makes sense given how broad Synapse is.

Comments closed

Slowly-Changing Dimensions in the Serverless SQL Pool

Lilliam Leme is building a serverless warehouse:

As organizations continue to collect and store large volumes of data in their data lakes, managing this data effectively becomes increasingly important. One key aspect of this is implementing Slow Change Dimension type 2, which allows organizations to track historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. In this blog post we will address the following scenario: a customer wants to implement Slow Change Dimension type 2 on top of their data lake.

For this example, we will use Serverless SQL Pool to demonstrate how this can be done. Additionally, in the next post, we will explore how the same approach can be used with Spark.

This turns out to be more work than a classic SQL Server-based solution because of the fact that the serverless SQL pool is read-only, save for CETAS statements.

Comments closed

Azure Synapse Analytics April 2023 Updates

Ryan Majidimehr has an update for us:

Low Shuffle Merge optimization for Delta tables is now available in Apache Spark 3.2 and 3.3 pools. You can now update a Delta table with advanced conditions using the Delta Lake MERGE command. It can update data from a source table, view, or DataFrame into a target table. The current algorithm of the MERGE command is not optimized for handling unmodified rows. With Low Shuffle Merge optimization, unmodified rows are excluded from expensive shuffling execution and written separately.

To learn more about this new command, read Low Shuffle Merge optimization on Delta tables

Looks like a bit of work on Data Explorer pools and a little bit on Spark pools and Synapse Link to Cosmos DB to round out the month.

Comments closed

Converting JSON to a Relational Schema with KQL

Devang Shah does some flattening and moving:

In the world of IoT devices, industrial historians, infrastructure and application logs, and metrics, machine-generated or software-generated telemetry, there are often scenarios where the upstream data producer produces data in non-standard schemas, formats, and structures that often make it difficult to analyze the data contained in these at scale. Azure Data Explorer provides some useful features to run meaningful, fast, and interactive analytics on such heterogenous data structures and formats. 

In this blog, we’re taking an example of a complex JSON file as shown in the screenshot below. You can access the JSON file from this GitHub page to try the steps below.

Click through for the example, which is definitely non-trivial.

Comments closed

Spark ELT in Synapse Notebooks

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples to help you build your solution on top of a data lake using useful tips. Most of the examples in this post use pandas, and I hope they will be helpful for you as they were for me.

Please note that all examples in this post use pyspark.

In my scenario, I exported multiple tables from SQLDB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.

Comments closed

Azure Synapse Analytics March 2023 Update

Ryan Majidimehr has an update for us:

We are excited to announce that Multi-Column Distribution (MCD) for Azure Synapse Dedicated SQL pools is now Generally Available! MCD is highly desirable for easing migrations, promotes faster query performance, and reduces data skew. 

You can choose to distribute data on multiple columns to balance the data distribution in your tables and reduce data movement during query execution. Multi-Column distribution will allow you to choose up to eight columns for distribution. 

Click through for the full list of changes. It seems like there’s a decent spread between the four major pool types, with the emphasis on Data Explorer pools this month.

Comments closed

Troubleshooting Azure Synapse Link for SQL Server Issues

Kevin Chant diagnoses an issue:

In this post I want to cover common Azure Synapse Link for SQL storage permission issues. Since I helped a fellow MVP out with this recently.

To be more precise, I want to show how you can fix one of the most common issues I tend to encounter with Azure Synapse Link for SQL. Which is access to the Data Lake Storage Gen2 account.

I have encountered issues like this a few times now. For example, when I was performing my file tests for Azure Synapse Link for SQL Server 2022.

Click through to learn more about a couple of common issues, their causes, and resolutions.

Comments closed

Tips for Using a Data Lakehouse

James Serra shares some advice:

As I mentioned in my Data Mesh, Data Fabric, Data Lakehouse presentation, the data lakehouse architecture, where you use a data lake with delta lake as a software layer and skip using a relational data warehouse, is becoming more and more popular. For some customers, I will recommend “Use a data lake until you can’t”. What I mean by this is to take the following steps when building a new data architecture in Azure with Azure Synapse Analytics:

Click through for six notes.

Comments closed

Performance Tuning a Dedicated SQL Pool

Sarath Sasidharan has some guidance for us:

Synapse Dedicated pools have been battle tested at enterprise customers across the globe. We deal with data in the magnitude of PetaBytes. Synapse can provide you with the scale of the cloud and the high performance required for your enterprise-grade requirements.  The key to maximizing your performance is to follow best practices, check out best practices for dedicated SQL pools in Azure Synapse Analytics

Failure to do so causes performance issues. In such scenarios, is it important to understand where the bottlenecks are. This blog focuses on the different steps a query goes through; from the time the query is fired from the client until it returns back.  Delay caused in any of the steps would impact the overall run-time of the query and hence indicate degraded performance.

Click through for a walkthrough of each step along the way, potential problems you could run into, and remediations for those problems. Much of the advice is similar to what you’d get with SQL Server, though there are differences interspersed throughout each level.

Comments closed

February 2023 Updates for Azure Synapse Analytics

Ryan Majidimehr has a new round-up for us:

Azure Synapse Runtime for Apache Spark 3.3 has been in Public Preview since November 2022. We are excited to announce that after notable improvements in performance and stability, Azure Synapse Runtime for Apache Spark 3.3 now becomes Generally Available and ready for production workloads.   

The essential changes include features that come from upgrading Apache Spark to version 3.3.1, Delta Lake to version 2.2.0, and Python to 3.10. 

This month’s set of changes isn’t quite as big as some prior months, though there are a couple items of great importance to make up for it.

Comments closed