
Category: Synapse Analytics

Data Exfiltration Protection and Pip

I have a post born of frustration:

I have an Azure Synapse Analytics workspace which uses a managed virtual network and includes data exfiltration protection. I also have a Spark pool. My goal is to import a few packages and use them in a Spark notebook.

Doing so is pretty easy from the Synapse workspace. I navigate to the Manage hub and then choose Apache Spark pools from the Analytics pools menu. I select the ellipsis for my Spark pool and then choose Packages.

From there, because I plan to update Python packages, I can upload a requirements.txt file and have Pip do its job.

But then it doesn’t… Click through to learn why, as well as the workaround for this. It’s stuff like this which makes me say data exfiltration protection is a feature administrators will (mostly) like and developers will hate, especially because the error message itself gives no obvious indication of why this is happening.


Creating a Synapse Workspace with Data Exfiltration Protection

I have a post on creating a new Azure Synapse Analytics workspace:

As a quick upshot, having a managed VNet set up means that any Spark pools you create will have subnet segregation, meaning that the Spark machines will be in their own subnet, away from everything else. This provides a bit of cross-pool protection for you automatically. It also performs similar network isolation for your Synapse workspace, keeping it separated from other workspaces. The other big thing it does is create managed private endpoints to the serverless and dedicated SQL pools, which means that any network traffic between these pools and resources in the Synapse workspace will be guaranteed to transit over Azure networks and not the public internet, at least until it gets to you hitting the web.azuresynapse.net URL (and there are additional methods to lock down that part of it that we won’t cover today).

By default, the portal will not create a managed virtual network, so you’ll need to enable it at creation time. You cannot enable or disable the managed virtual network setting after a workspace has been created, so if you make a mistake, you’d need to rebuild the workspace, though you can at least use the same storage account.

One last thing that managed virtual networks offer you is the ability to enable data exfiltration protection.

Click through to see how it all works. Data exfiltration protection can limit you a bit, and that can be quite frustrating, but it does what it says…in the same way that Draconis did what he said.


Azure Synapse Analytics: Success by Design

Wolfgang Strasser digs up some documents:

Today, I stumbled upon a very interesting link – the Azure Synapse Analytics – Success by Design site (follow this link).

If you need guidance, best practices links, POC playbooks, links to blogs & videos, tools, ... THIS is the site you need to bookmark.

Click through for a bit more information, as well as links to other relevant Azure Synapse Analytics resources.


Dedicated SQL Pool Index, Distribution, and Partition Guidance

I have a write-up on the specific value of distributions, indexes, and partitions in Azure Synapse Analytics dedicated SQL pools:

Not too long ago, I ended up taking the DP-203 certification exam for sundry reasons. On that exam, they ask a lot about Azure Synapse Analytics, including indexing, distribution, and partitioning strategies. Because these can be a bit different from on-premises SQL Server, I wanted to cover what options are available and when you might choose them. Let’s start with distributions, as that’s the biggest change in thought process.
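To give a feel for why distributions are the biggest change in thought process, here is a minimal Python sketch of how rows might land across the 60 distributions of a dedicated SQL pool. This is a toy model, not Synapse's actual placement logic: the SHA-256 hash, the `hash_distribution` and `place_rows` helpers, and the sample keys are all assumptions made up for illustration.

```python
import hashlib
from collections import Counter

# A dedicated SQL pool spreads table data across 60 distributions.
DISTRIBUTIONS = 60


def hash_distribution(key):
    """Map a distribution key to a bucket (hypothetical hash, not Synapse's own)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def place_rows(keys, strategy="hash"):
    """Count how many rows each distribution receives under a strategy."""
    counts = Counter()
    for position, key in enumerate(keys):
        if strategy == "hash":
            # Rows with the same key always land in the same distribution.
            counts[hash_distribution(key)] += 1
        elif strategy == "round_robin":
            # Rows are dealt out evenly, ignoring the key entirely.
            counts[position % DISTRIBUTIONS] += 1
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return counts


if __name__ == "__main__":
    low_cardinality = ["A", "B", "C"] * 200  # 600 rows, 3 distinct keys
    hashed = place_rows(low_cardinality, "hash")
    dealt = place_rows(low_cardinality, "round_robin")
    print("hash: distributions used =", len(hashed))
    print("round robin: distributions used =", len(dealt))
```

In this model, hashing on a key with only three distinct values can never use more than three distributions, while round robin touches all 60. That skew is the sort of thing to watch for when picking a distribution column.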

Read on for the guidance.


Azure Synapse Analytics Updates

Saveen Reddy catalogs what’s new in Azure Synapse Analytics:

Quick Reuse of Spark clusters

By default, every data flow activity spins up a new Spark cluster based upon the Azure Integration Runtime (IR) configuration. Cold cluster start-up time takes a few minutes. If your pipelines contain multiple sequential data flows, you can enable a time-to-live (TTL) value, which keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR during the TTL duration, it will reuse the existing cluster and start-up time will be greatly reduced.
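As a rough illustration of the TTL behavior described above, here is a small Python model that adds up cold start-up waits for a series of sequential data flows. It is a sketch under stated assumptions, not how the service actually schedules clusters: the `startup_overhead` function, the 240-second cold start, and the sample gaps are hypothetical, and the model treats a reused cluster as starting instantly.

```python
def startup_overhead(gaps, cold_start=240.0, ttl=0.0):
    """Total seconds spent waiting on cold cluster starts.

    gaps[i] is the idle time in seconds between the previous data flow
    finishing and data flow i starting. The first entry is ignored because
    the first data flow always needs a cold start.
    """
    overhead = 0.0
    for index, gap in enumerate(gaps):
        # A cluster is reused only if TTL is enabled and the next data flow
        # starts before the idle cluster's time-to-live runs out.
        reused = index > 0 and ttl > 0 and gap <= ttl
        if not reused:
            overhead += cold_start  # assumed cold start cost, in seconds
    return overhead


if __name__ == "__main__":
    gaps = [0, 30, 30, 900]  # hypothetical idle gaps between four data flows
    print("no TTL:", startup_overhead(gaps, ttl=0))
    print("10-minute TTL:", startup_overhead(gaps, ttl=600))
```

With these made-up numbers, a 10-minute TTL saves two of the four cold starts. The last data flow still pays full price because it arrives after the cluster has expired.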

Read on for the full list of updates.


A Primer on the Serverless SQL Pool

Tino Zishiri has some tips for people trying out the Azure Synapse Analytics serverless SQL pool:

Serverless SQL Pools or SQL on-demand is a serverless distributed data processing service offered by Microsoft. The service is comparable to Amazon Athena. The serverless nature of the service means that there is no infrastructure to manage, and you only pay for what you use (pay-per-query model).

Through Serverless SQL pools, you query the data in your data lake using T-SQL. The architecture behind the service is optimized for querying and analyzing big data by running queries in parallel.

Read on to understand where the serverless SQL pool fits, as well as some tips about data transformation with this pool.


Microsoft.DataFactory and Storage Event Triggers in Synapse

Cathrine Wilhelmsen troubleshoots an Azure issue:

I ran into an issue today while trying to publish a storage event trigger in Azure Synapse Analytics. After publishing, I got error messages that said “failed to subscribe” and “failed to activate”. The storage event trigger had been published, but it wouldn’t start. Help!

Click through for some resources on documentation, a few things which didn’t work, and what finally resolved the issue.


Azure Synapse Analytics November Updates

James Serra keeps us up to date on Synapse:

Delta Lake support for serverless SQL is generally available: Azure Synapse has had preview-level support for serverless SQL pools querying the Delta Lake format. This enables BI and reporting tools to access data in Delta Lake format through standard T-SQL. With this latest update, the support is now Generally Available and can be used in production. See How to query Delta Lake files using serverless SQL pools.

Click through for the full list of what James likes.


Updates in Azure Synapse Analytics

Saveen Reddy shows how the Synapse product team has been busy this year:

Previously, Synapse workspaces had a kind of database called a Spark Database. Spark databases had two key characteristics:

– Tables in Spark databases kept their underlying data in Azure Storage accounts (i.e. data lakes)

– Tables in Spark databases could be queried by both Spark pools and by serverless SQL pools.

To help make it clear that these databases are supported by both Spark and SQL and to clarify their relationship to data lakes, we have renamed Spark databases to Lake databases. Lake databases work just like Spark databases did before. They just have a new name.

Okay, this is the kind of change I can do without. That’s a really dumb name. Spark databases tell you what a thing is. It’s a database which lives in Apache Spark. Lake databases run what? Apache Spark. But if anything really should be called a Lake database, it’d be a serverless SQL pool’s database because everything in there is built on top of the data lake—it’s all external tables pointing to a lake. So calling a Spark database a Lake database brings more confusion than elucidation.

Most of the other changes on that list? Really cool. This one? Not at all.
