Press "Enter" to skip to content

Category: Synapse Analytics

Comparing the Dedicated and Serverless SQL Pools

Liliam Leme compares SQL pools in Azure Synapse Analytics:

Two questions I frequently get from customers are: “What is the difference between the Synapse dedicated SQL pool (formerly SQL DW) and the serverless SQL pool?” and “Which one should I choose for my business?”

This post explains the basic concepts of the dedicated SQL pool and the serverless SQL pool, and helps you understand how they work and how to use them based on your business needs.

Click through for the comparison.
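
If you want a feel for the serverless half of that comparison before clicking, here is a minimal sketch (the workspace name and storage path are placeholders) of the serverless model from Python: pay-per-query T-SQL run directly over files in the lake, with no pre-provisioned compute. A dedicated SQL pool would instead query tables loaded into its own distributed storage.

```python
# Minimal sketch of querying a serverless SQL pool from Python.
# The server name and storage path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"  # serverless endpoint
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# OPENROWSET reads the parquet files in place; you pay per TB processed
# rather than for provisioned compute.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""
for row in conn.cursor().execute(query):
    print(row)
```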


Spark Query Optimization in Synapse

Daniel Coelho lays out a few optimizations in Azure Synapse Analytics Spark pools:

The Azure Synapse Analytics team has prominent engineers enhancing and contributing back to the Apache Spark project. One of our focus areas is Spark query optimization techniques, where Microsoft has decades of experience and is making significant contributions to the Apache Spark open source engine.

The attachment at the bottom of this blog post will be presented at the 48th International Conference on Very Large Databases (#VLDB2022) and covers the latest developments in query optimization for Apache Spark 3. Those optimizations were developed by Microsoft engineers and are available today in the Azure Synapse runtime for Apache Spark versions 3.1 and 3.2.

Check out the high-level updates as well as a complete technical paper laying out the changes.
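
The paper’s optimizations live inside the engine rather than behind user-facing switches, but if you want to poke at Spark 3’s query optimizer yourself in a Synapse notebook, here is a quick, generic sketch:

```python
# Observe Spark 3's optimizer at work: adaptive query execution (AQE)
# re-plans at runtime based on observed statistics.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

df = spark.range(1_000_000).withColumnRenamed("id", "k")
joined = df.join(df.sample(0.01).hint("broadcast"), "k")

# 'formatted' mode prints the physical plan the optimizer chose,
# including any AQE adjustments.
joined.explain(mode="formatted")
```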


Building a Lakehouse with Azure Synapse Analytics

Arshad Ali does a bit of construction:

Data lakehouse architecture has become the de facto standard for designing and building data platforms for analytics, as it bridges the gap and breaks the silos created by the traditional/modern data warehouse and the data lake. This blog post introduces you to the world of the data lakehouse and goes into detail on how to implement it successfully in Azure with Azure Synapse Analytics.

Read the whole thing.
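
As a taste of the building blocks involved, here is a minimal sketch, assuming a Synapse Spark pool (Delta Lake ships in the Synapse runtime) and a placeholder abfss path: write a Delta table to the lake, then read it back with ACID guarantees.

```python
# Write and read a Delta table in the lake -- the basic lakehouse move.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "abfss://lake@mystorage.dfs.core.windows.net/silver/customers"  # placeholder

df = spark.createDataFrame(
    [(1, "Adventure Works"), (2, "Contoso")], ["customer_id", "name"]
)
df.write.format("delta").mode("overwrite").save(path)

# Spark pools and serverless SQL pools can both read these same files,
# which is the silo-breaking point of the architecture.
spark.read.format("delta").load(path).show()
```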


Data Modification with Synapse Link for SQL Server 2022

Kevin Chant changes some data:

In this post I want to cover some things that happen internally when you do updates and deletes with Azure Synapse Link for SQL Server 2022 whilst it is running.

Recently somebody who had read a previous post of mine, where I covered my initial tests for Azure Synapse Link for SQL Server 2022, asked whether it captures updates and deletes.

Anyway, the short answer is that Azure Synapse Link for SQL Server 2022 does capture updates and deletes. In this post I will go into more detail about some of the things that appear to happen along the way.

Click through for Kevin’s tests and what the results look like.
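
If you want to reproduce something in the spirit of those tests, a rough harness might look like the following. This is not Kevin’s exact setup; the connection strings and table are placeholders.

```python
# Change rows on the SQL Server 2022 source, then poll the dedicated SQL
# pool target to watch the update flow through Synapse Link.
import time
import pyodbc

SOURCE_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=sql2022box;Database=SourceDb;Trusted_Connection=yes;"  # placeholder
TARGET_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=myworkspace.sql.azuresynapse.net;Database=SQLPool1;Authentication=ActiveDirectoryInteractive;"  # placeholder

source = pyodbc.connect(SOURCE_CONN_STR, autocommit=True)
target = pyodbc.connect(TARGET_CONN_STR, autocommit=True)

source.execute("UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderID = 42;")
source.execute("DELETE FROM dbo.Orders WHERE OrderID = 99;")

# Replication is asynchronous, so give the change a while to land.
for _ in range(30):
    row = target.execute(
        "SELECT Status FROM dbo.Orders WHERE OrderID = 42;"
    ).fetchone()
    if row and row.Status == "Shipped":
        print("update replicated")
        break
    time.sleep(10)
```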


Azure Synapse Analytics August 2022 Updates

Ryan Majidimehr has a changelog for us:

Full support for MLflow

MLflow is a platform for managing the machine learning lifecycle and streamlining machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models. We are very happy to announce that SynapseML models now integrate with MLflow, with full support for saving, loading, deployment, and autologging!

To learn more, read the MLflow in SynapseML getting started guide and the SynapseML Autologging documentation.

There are quite a few changes on this list, so they’ve definitely been busy.
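
As a minimal sketch of the MLflow piece, assuming a Synapse Spark pool with SynapseML and MLflow available (the data, column names, and experiment name are invented here):

```python
# Train a SynapseML LightGBM model and log it to MLflow.
import mlflow
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier

spark = SparkSession.builder.getOrCreate()
raw = spark.createDataFrame(
    [(0.0, 1.2, 0), (1.0, 0.3, 1), (0.5, 0.8, 0), (1.5, 0.1, 1)],
    ["f1", "f2", "label"],
)
train_df = VectorAssembler(
    inputCols=["f1", "f2"], outputCol="features"
).transform(raw)

mlflow.set_experiment("synapseml-demo")
mlflow.pyspark.ml.autolog()  # per the announcement, autologging now covers SynapseML estimators

with mlflow.start_run():
    model = LightGBMClassifier(
        labelCol="label", featuresCol="features"
    ).fit(train_df)
    mlflow.spark.log_model(model, "model")  # save here; load/deploy through MLflow as usual
```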


Azure Synapse Link for SQL Server 2022 and File Analysis

Kevin Chant digs into Azure Synapse Link for SQL Server 2022:

In this post I want to cover some file tests for Azure Synapse Link for SQL Server 2022 that I performed.

A while back I spotted something interesting whilst doing some initial tests for Azure Synapse Link for SQL Server 2022.

When you add new data after the initial load, a new folder called ‘ChangeData’ appears in the storage account container. I noticed that the new file containing the insert was a comma-separated value (CSV) file, whereas the table used for the initial load was stored as a Parquet file.

Is there a method to this madness? Click through to see Kevin’s tell-all story.
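
To see it for yourself, here is a small sketch of inspecting both file formats from a Synapse notebook; the container path is a placeholder, and the exact landing-zone layout may differ from yours:

```python
# The initial load lands as parquet; later changes show up as CSV under
# the 'ChangeData' folder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base = "abfss://synapselink@mystorage.dfs.core.windows.net/dbo.Orders"  # placeholder

initial = spark.read.parquet(f"{base}/*.parquet")   # the snapshot
changes = spark.read.csv(f"{base}/ChangeData/")     # changes captured after it

initial.printSchema()
changes.show()
```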


What’s New in SynapseML

Nellie Gustafsson and Mark Hamilton share an update:

SynapseML is a massively scalable (feel free to spin up hundreds of machines!) machine learning library built on Apache Spark. SynapseML makes it easy to train production-ready models to solve problems from simple classification and regression to anomaly detection, translation, image analysis, speech-to-text, and just about any ML challenge you are facing. Under the hood, SynapseML integrates a wide array of ML technologies, such as LightGBM, Vowpal Wabbit, ONNX, and Cognitive Services, into a single easy-to-use API compatible with MLflow. We know, we know, everyone hates when developers invent new APIs, but you can rest easy because SynapseML integrates cleanly into existing Spark ML APIs, so you can embed models directly into existing pipelines. We strive to make SynapseML available to developers wherever they work, and the library is available in a variety of languages, including Python, Scala, Java, and R. As of this release, SynapseML is also usable from .NET languages such as C# and F#.

Saving the best language for last, I see. Click through for the list of updates.
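
One concrete reading of “integrates cleanly into existing Spark ML APIs” is a SynapseML estimator dropped into a standard Pipeline next to stock Spark stages. The data and column names below are invented for the sketch.

```python
# A SynapseML estimator behaves like any other Spark ML pipeline stage.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 3.1), (2.0, 1.0, 4.2), (3.0, 0.5, 5.0), (4.0, 0.2, 6.1)],
    ["x1", "x2", "y"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
    LightGBMRegressor(labelCol="y", featuresCol="features"),
])
model = pipeline.fit(df)
model.transform(df).show()
```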


Database-Driven Parameterization for Synapse Pipelines

Paul Hernandez does some configuring:

Particularly in Synapse, there are not even global parameters like in Azure Data Factory.

When you want to move your development to another environment, CI/CD pipelines are typically used. These pipelines consume an ARM template together with its parameter file to create a workspace in a target environment. The parameters can be overridden in the CD pipeline, as explained here: https://techcommunity.microsoft.com/t5/data-architecture-blog/ci-cd-in-azure-synapse-analytics-part-4-the-release-pipeline/ba-p/2034434

Even so, I have not found a proper way to change the values of a pipeline parameter (and the same goes for data flow and dataset parameters). I have seen some custom parameter manipulation to set the default value of a parameter and then deploy it without any value, and even JSON manipulation with PowerShell (the dark side, for me).

Read on for an alternative solution which does the job well.
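
The general shape of database-driven configuration, though not necessarily Paul’s exact design, is a per-environment lookup table queried at runtime. A sketch with invented table and column names:

```python
# Fetch per-environment parameter values from a config table instead of
# baking them into the ARM template.
import pyodbc

CONFIG_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=configdb;Database=Config;Trusted_Connection=yes;"  # placeholder
conn = pyodbc.connect(CONFIG_CONN_STR)

def get_parameter(environment: str, name: str) -> str:
    row = conn.execute(
        "SELECT ParameterValue FROM dbo.PipelineConfig "
        "WHERE Environment = ? AND ParameterName = ?;",
        environment, name,
    ).fetchone()
    if row is None:
        raise KeyError(f"{name} is not configured for {environment}")
    return row.ParameterValue

# In a Synapse pipeline, the same query would typically sit behind a
# Lookup activity feeding parameters to downstream activities.
print(get_parameter("prod", "SourceStoragePath"))
```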


Feeding Synapse Spark Info to On-Prem Kafka Clusters

Bhadreshkumar Shiyal finds a solution:

Microsoft’s official documentation for Azure Data Factory contains a tutorial which explains how to access an On-Premises SQL Server from Azure Data Factory which is inside a Managed Vnet. You can go through that article here: Access on-premises SQL Server from Data Factory Managed Vnet using Private Endpoint – Azure Data Fac….

Our approach is based upon the article’s solution, but to meet our requirements we substituted On-Prem Apache Kafka for On-Prem SQL Server and, instead of an Azure Data Factory inside a Managed Vnet, used a Synapse Workspace inside a Managed Vnet. The “Forwarding Vnet” concept explained in the above tutorial remains as-is in our approach.

As soon as you turn on Data Exfiltration Protection (DEP), the lockdown is real. Click through to see what the process of exfiltrating data through an approved mechanism looks like.
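
Once the private endpoint plumbing is in place, the Spark side is ordinary Structured Streaming. A minimal sketch, assuming the Kafka connector is available on the pool and using a placeholder broker address (reachable through the forwarding Vnet) and topic:

```python
# Stream events from a Synapse Spark pool to an on-prem Kafka topic.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("rate").load()   # stand-in source for the demo
    .select(to_json(struct("timestamp", "value")).alias("value"))
)

query = (
    events.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "10.1.0.4:9092")  # placeholder on-prem broker
    .option("topic", "synapse-events")                   # placeholder topic
    .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
    .start()
)
query.awaitTermination(30)  # run briefly for the sketch
```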


Azure Synapse Analytics July 2022 Updates

Ryan Majidimehr notes that the Azure Synapse Analytics team has been busy:

Azure Synapse Link for SQL is an automated system for replicating data from your transactional databases into a dedicated SQL pool in Azure Synapse Analytics. Starting this month, you can make trade-offs between cost and latency in Synapse Link for SQL by selecting the continuous or batch mode to replicate your data.  

If you select “continuous mode,” the runtime runs continuously, so any changes applied to the SQL database or SQL Server are replicated to Synapse with low latency. Alternatively, if you select “batch mode” with a specified interval, changes applied to the SQL database or SQL Server are accumulated and replicated to Synapse in batches at that interval. This can save cost, as you are only charged for the time the runtime is required to replicate data. After each batch of data is replicated, the runtime is shut down automatically.

Click through for the complete list.
