Synapse Analytics – Page 10

In this post I want to cover some file tests for Azure Synapse Link for SQL Server 2022 that I performed.
Because a while back I spotted something interesting whilst I was doing some initial tests for Azure Synapse Link for SQL Server 2022.
Which is when you add new data after the initial load that a new folder called ‘ChangeData’ appears in the storage account container. I noticed that the new file containing the insert was a comma separated value (csv) file. Whereas the table used for the initial load was a parquet file.

Is there a method to this madness? Click through to see Kevin’s tell-all story.

Comments closed

What’s New in SynapseML

Published 2022-08-11 by Kevin Feasel

Nellie Gustafsson and Mark Hamilton share an update:

SynapseML is a massively scalable (feel free to spin up hundreds of machines!) machine learning library built on Apache Spark. SynapseML makes it easy to train production-ready models to solve problems from simple classification and regression to anomaly detection, translation, image analysis, speech to text, and just about any ML challenge you are facing. Under the hood, SynapseML integrates a wide array of ML technologies such as LightGBM, Vowpal Wabbit, ONNX, and the Cognitive Services into a single easy to use API compatible with MLFlow. We know, we know, everyone hates when developers invent new APIs, but you can rest easy because SynapseML integrates cleanly into existing Spark ML APIs so you can embed models directly into existing pipelines. We strive to make SynapseML available to developers wherever they work, and the library is available in a variety of languages like Python, Scala, Java, R. As of this release SynapseML is also usable from .NET, C#, F#.

Saving the best language for last, I see. Click through for the list of updates.

Comments closed

Database-Driven Parameterization for Synapse Pipelines

Published 2022-08-11 by Kevin Feasel

Paul Hernandez does some configuring:

Particularly in Synapse, there are even no global parameters like in Azure Data Factory.
When you want to move your development to another environment, typically CI/CDs pipelines are used. These pipelines consume an ARM template together with its parameter file to create a workspace in a target environment. The parameters can be overriding in the CD pipeline as explain here: https://techcommunity.microsoft.com/t5/data-architecture-blog/ci-cd-in-azure-synapse-analytics-part-4-the-release-pipeline/ba-p/2034434
Even so, I have not found a proper way to change the values of a pipeline parameter (the same for data flows and datasets parameters). I saw some custom parameters manipulation to set the default value of a parameter and then deploy it without any value, or even JSON manipulation with PowerShell (the dark side for me).

Read on for an alternative solution which does the job well.

Comments closed

Feeding Synapse Spark Info to On-Prem Kafka Clusters

Published 2022-08-04 by Kevin Feasel

Bhadreshkumar Shiyal finds a solution:

Microsoft’s official documentation for Azure Data Factory contains a tutorial which explains how to access an On-Premises SQL Server from Azure Data Factory which is inside a Managed Vnet. You can go through that article here: Access on-premises SQL Server from Data Factory Managed Vnet using Private Endpoint – Azure Data Fac….
Although based upon the article’s solution, to meet our requirements we needed to substitute On-Prem Apache Kafka for On-Prem SQL Server and instead of an Azure Data Factory inside a Managed Vnet, we used a Synapse Workspace inside a Managed Vnet. The “Forwarding Vnet” concept explained in the above tutorial remains as-is in our approach.

As soon as you turn on Data Exfiltration Protection (DEP), the lockdown is real. Click through to see what the process of exfiltrating data through an approved mechanism looks like.

Comments closed

Azure Synapse Analytics July 2022 Updates

Published 2022-08-02 by Kevin Feasel

Ryan Majidimehr notes that the Azure Synapse Analytics team has been busy:

Azure Synapse Link for SQL is an automated system for replicating data from your transactional databases into a dedicated SQL pool in Azure Synapse Analytics. Starting this month, you can make trade-offs between cost and latency in Synapse Link for SQL by selecting the continuous or batch mode to replicate your data.
By selecting “continuous mode”, the runtime will be running continuously so that any changes applied to the SQL database or SQL Server will be replicated to Synapse with low latency. Alternatively, when you select “batch mode” with a specified interval, the changes applied to the SQL database or SQL Server will be accumulated and replicated to Synapse in batch mode with the specified interval. This can save cost as you are only charged for the time the runtime is required to replicate data. After each batch of data is replicated, the runtime will be shut down automatically.

Click through for the complete list.

Comments closed

Updating Synapse Linked SQL Servers with Azure DevOps

Published 2022-07-27 by Kevin Feasel

Kevin Chant makes a change:

This post covers how to update both ends of Azure Synapse Link for SQL Server 2022 using Azure DevOps. As shown at the Data Toboggan conference.
By the end of this post you will know how to deploy database updates to both the SQL Server database and the Azure Synapse dedicated SQL Pool that are used as part of Azure Synapse Link for SQL Server 2022, using a pipeline in Azure DevOps. To keep them consistent.

Click through for the process.

Comments closed

Power BI and Synapse Book Roundup

Published 2022-07-25 by Kevin Feasel

Chris Webb checks out some books:

I like free stuff and I like books, so of course I like free books – and it seems that the more I provide free publicity for relevant books here the more free books I get sent. I’ve now got enough to merit writing another post covering those I’ve received recently from various publishers and authors. As always these are not reviews, just short summaries of books you might want to check out.

Read on for the not-reviews.

Comments closed

Pre-Processing Data Explorer Data with Spark

Published 2022-07-22 by Kevin Feasel

Hauke Mallow does some data engineering:

We often see customer scenarios where historical data has to be migrated to Azure Data Explorer (ADX). Although ADX has very powerful data-transformation capabilities via update policies, sometimes more or less complex data engineering tasks must be done upfront. This happens if the original data structure is too complex or just single data elements being too big, hitting data explorer limits of dynamic columns of 1 MB or maximum ingest file-size of 1 GB for uncompressed data (see also Comparing ingestion methods and tools) .
Let’s think about an Industrial Internet-of-Things (IIoT) use-case where you get data from several production lines. In the production line several devices read humidity, pressure, etc. The following example shows a scenario where a one-to-many relationship is implemented within an array. With this you might get very large columns (with millions of device readings per production line) that might exceed the limit of 1 MB in Azure Data Explorer for dynamic columns. In this case you need to do some pre-processing.

Click through to see how you can do this with an Azure Synapse Analytics Spark pool prior to ingesting it with a Data Explorer pool.

Comments closed

CETAS and the Serverless SQL Pool

Published 2022-07-22 by Kevin Feasel

Liliam Leme takes us through the Create External Table as Select operation in the Azure Synapse Analytics serverless SQL pool:

Serverless SQL pool has a very cool way to handle CREATE EXTERNAL TABLE AS SELECT (CETAS). I mean, you can use a serverless SQL pool to create a CETAS which will materialize the query results. It means a heavy query in which, for example, the results would be part of future large join with other queries or aggregations that can be consolidated for reporting purposes. Those scenarios are examples that could be used as a CETAS. CETAS will be stored as a parquet file accessed by an external table in your storage and the performance is awesome. Later I am suggesting how to recreate the CETAS using pipeline against serverless SQL Pool.

Click through to see it in action.

Comments closed

Mounting Data Lake Storage from a Spark Pool

Published 2022-07-19 by Kevin Feasel

Kamil Nowinski runs into some trouble:

Last weekend, I played a bit with Azure Synapse from a way of mounting Azure Data Lake Storage (ADLS) Gen2 in Synapse notebook within API in the Microsoft Spark Utilities (MSSparkUtils) package. I wanted to just do a simple test, hence I followed the documentation from Microsoft: How to use file mount/unmount API in Synapse.
Having an ADLS Account already created in a subscription – should be easy peasy, right?

Read on to understand when things might be a little more complicated than they seem. And more frustrating, once you see the cause of the problem.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: Synapse Analytics

Azure Synapse Link for SQL Server 2022 and File Analysis

What’s New in SynapseML

Database-Driven Parameterization for Synapse Pipelines

Feeding Synapse Spark Info to On-Prem Kafka Clusters

Azure Synapse Analytics July 2022 Updates

Updating Synapse Linked SQL Servers with Azure DevOps

Power BI and Synapse Book Roundup

Pre-Processing Data Explorer Data with Spark

CETAS and the Serverless SQL Pool

Mounting Data Lake Storage from a Spark Pool