Press "Enter" to skip to content

Category: Synapse Analytics

Named Entity Encryption in Spark

Arshad Ali wants to secure some data being used in a Synapse Spark pool:

As data engineers, we often get requirements to encrypt, decrypt, mask, or anonymize certain columns of data in files sitting in the data lake when preparing and transforming data with Apache Spark. The extensibility of Spark allows us to leverage a library which is not native to Spark. One such library is Microsoft Presidio, which provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It facilitates both fully automated and semi-automated PII (Personally Identifiable Information) de-identification and anonymization flows on multiple platforms.

In this blog post, I am going to demonstrate step by step how to download and use this library to meet the above requirements with a Spark pool in Azure Synapse Analytics.

Read on to see how it works.
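
If you’re curious what that looks like in code, here is a minimal sketch of the general pattern (not Arshad’s exact script): wrap Presidio’s analyzer and anonymizer in a Spark UDF and apply it to the sensitive columns. The storage paths and column names are placeholders, and the presidio-analyzer and presidio-anonymizer packages would need to be installed on the pool.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

_analyzer = None
_anonymizer = None

def anonymize_text(text):
    """Detect PII entities in a string and replace them with placeholders."""
    global _analyzer, _anonymizer
    if _analyzer is None:
        # Build the engines lazily so each Spark executor creates its own copy.
        from presidio_analyzer import AnalyzerEngine
        from presidio_anonymizer import AnonymizerEngine
        _analyzer = AnalyzerEngine()
        _anonymizer = AnonymizerEngine()
    if text is None:
        return None
    results = _analyzer.analyze(text=text, language="en")
    return _anonymizer.anonymize(text=text, analyzer_results=results).text

anonymize_udf = udf(anonymize_text, StringType())

# 'spark' is the session a Synapse notebook provides; paths are hypothetical.
df = spark.read.parquet("abfss://data@mylake.dfs.core.windows.net/raw/customers")
df = df.withColumn("notes", anonymize_udf("notes"))
df.write.mode("overwrite").parquet("abfss://data@mylake.dfs.core.windows.net/curated/customers")
```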


Interacting with Microsoft Graph API via Synapse

Paul Hernandez starts a new series:

In this and the next post, I want to show you how to connect to the Microsoft Graph API, request some data, process it, and store it in a database using Synapse Analytics.

This first post presents a sample use case, briefly introduces the Graph API, and shows how to create a linked service to it and how to start querying data. In the next post, a sample Synapse pipeline will be described. The pipeline grabs some data and copies it into some target tables. Finally, I will create a sample query to showcase the newly imported data.

Because this is a common point of confusion: the Graph API is completely different from the idea of graph databases.
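
If you want to poke at the Graph API before building the linked service, a minimal sketch with the msal and requests libraries shows the shape of the data a pipeline would pull. The tenant, client ID, and secret are placeholders, and the app registration needs application permissions for whichever endpoint you query (User.Read.All for this one).

```python
import msal
import requests

TENANT_ID = "<tenant-id>"
CLIENT_ID = "<app-registration-client-id>"
CLIENT_SECRET = "<client-secret>"

# Client-credentials flow: authenticate as the app itself, no user sign-in.
app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

resp = requests.get(
    "https://graph.microsoft.com/v1.0/users",
    headers={"Authorization": f"Bearer {token['access_token']}"},
)
resp.raise_for_status()
for user in resp.json()["value"]:
    print(user["displayName"], user.get("mail"))
```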


Azure Synapse Analytics September 2022 Update

Ryan Majidimehr has an update for us:

Serverless SQL pool relies on statistics to generate an optimal query execution plan and resource allocation. Previously, you had to manually create statistics for your CSV datasets when querying with OPENROWSET to get optimal performance, due to the lack of a CSV auto-statistics feature. With this improvement, serverless SQL pool will automatically create statistics for CSV datasets when needed to ensure an optimal query execution plan is generated. This feature is currently enabled for OPENROWSET only.

How much performance improvement CSV statistics will bring depends on query complexity and the amount of data processed.

Click through for the full list of updates.
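
For reference, the queries that benefit are ordinary OPENROWSET reads over CSV files; no syntax change is needed to pick up the automatic statistics. Here is a sketch run from Python via pyodbc, with a placeholder workspace endpoint and file path:

```python
import pyodbc

# Connect to the serverless (on-demand) endpoint of the workspace.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mylake.dfs.core.windows.net/data/sales/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',  -- required for HEADER_ROW
    HEADER_ROW = TRUE
) AS rows;
"""
for row in conn.execute(query):
    print(row)
```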


MERGE in Dedicated SQL Pools

Emily Tehrani notes an addition to Azure Synapse Analytics:

We are thrilled to announce that the MERGE T-SQL command for Azure Synapse Dedicated SQL pools is now Generally Available! MERGE has been a highly requested addition to the Synapse T-SQL library that encapsulates INSERTs/UPDATEs/DELETEs into a single statement, drastically easing migrations and making table synchronization a breeze.

If you do decide to use this, I’d expect it to have the same bugs which make its use on-premises a mess. As always, MERGE responsibly.
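
For anyone who hasn’t written one in a while, the statement looks the same as it does in SQL Server; here is a hypothetical example run against a dedicated SQL pool via pyodbc. Note that the dedicated pool documentation lists restrictions (for example, around the target table’s distribution), so check those before leaning on it.

```python
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database=mypool;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Upsert staged customers into the dimension and prune rows that disappeared.
merge_sql = """
MERGE dbo.DimCustomer AS tgt
USING stg.Customer AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED THEN
    UPDATE SET tgt.Email = src.Email, tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, Email, City)
    VALUES (src.CustomerKey, src.Email, src.City)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
"""
conn.execute(merge_sql)
conn.commit()
```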


CI/CD for Synapse Link for SQL Server 2022

Kevin Chant makes some changes:

In another post, I showed how you can use CI/CD to update both ends of Azure Synapse Link for SQL Server 2022 using Azure DevOps, allowing you to update both a SQL Server 2022 database and an Azure Synapse Analytics dedicated SQL pool in the same deployment pipeline.

By my own admission, that method can become complex. Plus, I showed some more advanced concepts in that post. With this in mind, I have decided to cover an easier way in this post.

Read on for the simpler technique.


Rebuilding a Dedicated SQL Pool via Azure DevOps

Sarath Sasidharan clones an Azure Synapse Analytics dedicated SQL pool:

There are many scenarios where you want to create a new Synapse dedicated SQL pool environment based on an existing Synapse dedicated SQL pool environment. This may be required when you need to create a development or test environment based on your production environment by copying complete schemas without copying data.

Note that this process won’t move the data itself; given that an effective dedicated SQL pool starts in the terabytes, trying to create a bacpac would be an exercise in misery.
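
Sarath’s walkthrough runs through Azure DevOps; as a rough sketch of the underlying idea (and not necessarily his exact method), you can extract a schema-only .dacpac from the source pool with SqlPackage and publish it to the target. The connection strings below are placeholders.

```python
import subprocess

SOURCE = "Server=prod-ws.sql.azuresynapse.net;Database=prodpool;<auth settings>"
TARGET = "Server=dev-ws.sql.azuresynapse.net;Database=devpool;<auth settings>"

# Extract the schema from the production pool; no table data is included.
subprocess.run([
    "SqlPackage", "/Action:Extract",
    f"/SourceConnectionString:{SOURCE}",
    "/TargetFile:pool_schema.dacpac",
    "/p:ExtractAllTableData=false",
], check=True)

# Publish the schema to the new development/test pool.
subprocess.run([
    "SqlPackage", "/Action:Publish",
    "/SourceFile:pool_schema.dacpac",
    f"/TargetConnectionString:{TARGET}",
], check=True)
```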


Tips for using Synapse Database Templates

James Serra provides some guidance:

I had previously blogged about Azure Synapse Analytics database templates, and wanted to follow up with some notes and tips on that feature, as I have been involved in a project that is using it:

– Purview does not yet pull in the metadata for database templates (table/field descriptions and table relationships). Right now it pulls in the metadata as if it were a SQL table or a file in ADLS. Both just have the basic information supported by those types; the SQL one is probably preferred.

– Power BI does not import the table and field descriptions when connecting to a lake database (where the database templates are stored), but it does import the table relationships. You can see the table descriptions by hovering over the table names in the navigator when importing tables using the “Azure Synapse Analytics workspace (Beta)” connector. Note that you are not able to see the table descriptions when hovering over the table names using the “Azure Synapse Analytics SQL” connector. Also note that the “Select Related Tables” button does not work in the navigator.

Click through for more notes from the field.


Loading the Synapse Data Explorer Pool

Gauri Mahajan loads some event data for analysis:

In my previous article, Getting started with Data Explorer pools in Azure Synapse, we learned how to create Data Explorer pools in Azure Synapse and the unique value that Data Explorer brings to semi-structured and free-text data. The creation of the Data Explorer pool is the first step in the process. After the pool is created, one can create data structures, ingest data, and then use it for consumption. The Data Explorer pool interface provides different ways of ingesting data into the pool, including one-click ingestion.

Some of these techniques are really straightforward. Others—especially if you’re talking about large amounts of data—do require installing and working with local tooling.
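
For the programmatic route, a hedged sketch with the azure-kusto-data and azure-kusto-ingest Python packages looks like this; the cluster URI, database, and table names are placeholders, and the target table is assumed to exist already:

```python
from azure.kusto.data import KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import IngestionProperties, QueuedIngestClient

# Data Explorer pools expose a separate ingestion endpoint ("ingest-" prefix).
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://ingest-mypool.myworkspace.kusto.azuresynapse.net"
)
client = QueuedIngestClient(kcsb)

props = IngestionProperties(
    database="eventsdb",
    table="RawEvents",
    data_format=DataFormat.CSV,
)
client.ingest_from_file("events.csv", ingestion_properties=props)
```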


Importing Delta Tables into a Synapse Dedicated SQL Pool

Mark Pryce-Maher does a bit of integration:

In June, Databricks announced that they are open sourcing Delta Lake 2.0. Delta Lake is quickly becoming the format of choice in data science and data engineering.

To import Delta Lake into a Synapse dedicated SQL pool, you would need Azure Data Factory/Synapse Pipelines or Spark to handle the Delta Lake files.

This is not ideal because it adds the extra overhead of complexity, time, and cost.

As an intellectual challenge, I wondered if it’s possible to import Delta Lake files directly into the dedicated SQL Pool and support features like time-travel. It turned out to be a great little project and a great way of learning about Delta Lake.

This turned out to be a bit more difficult than I would have imagined. Click through for the script and check the comments as well for a preview of upcoming attractions.
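
For contrast, the supported Spark route the post mentions is short; this sketch reads a Delta table (optionally time-traveling to an older version) and lands it in the dedicated pool with the built-in synapsesql connector. Paths and the three-part table name are placeholders, and spark is the session a Synapse notebook provides.

```python
# Read the current snapshot of the Delta table from the lake.
df = spark.read.format("delta").load(
    "abfss://data@mylake.dfs.core.windows.net/delta/sales"
)

# Or time-travel to an earlier version of the table:
# df = (spark.read.format("delta")
#       .option("versionAsOf", 3)
#       .load("abfss://data@mylake.dfs.core.windows.net/delta/sales"))

# Write into the dedicated SQL pool via the Synapse Spark connector.
df.write.mode("overwrite").synapsesql("mypool.dbo.Sales")
```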


Using a Service Principal Account for Power BI + Dedicated SQL Pool

Dan English provides a hookup:

In this post, I will go over a topic that is frequently asked about: using a Service Principal account with Power BI when connecting to data sources. Currently, none of the built-in connectors support this capability natively, but the SQL Server ODBC driver does support the use of a Service Principal account. The one caveat with using an ODBC driver with Power BI is that a gateway is required once the report is published to the service.

Read on for the step-by-step process.
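
As a taste of what’s involved, here is a hedged pyodbc sketch of the connection-string shape: recent Microsoft ODBC drivers accept the service principal’s client ID and secret as UID/PWD alongside Authentication=ActiveDirectoryServicePrincipal. All values below are placeholders.

```python
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database=mypool;"
    "UID=<service-principal-client-id>;"
    "PWD=<service-principal-secret>;"
    "Authentication=ActiveDirectoryServicePrincipal;"
)
# Confirm we connected as the service principal.
print(conn.execute("SELECT SUSER_SNAME();").fetchone()[0])
```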
