Press "Enter" to skip to content

Category: Synapse Analytics

Azure Synapse Analytics: Success by Design

Wolfgang Strasser digs up some documents:

Today, I stumbled upon a very interesting link – the Azure Synapse Analytics – Success by Design site (follow this link).

If you need guidance, best practices links, POC playbooks, links to blogs & videos, tools, .. THIS is the site you need to bookmark.

Click through for a bit more information, as well as links to other relevant Azure Synapse Analytics resources.

Comments closed

Dedicated SQL Pool Index, Distribution, and Partition Guidance

I have a write-up on the specific value of distributions, indexes, and partitions in Azure Synapse Analytics dedicated SQL pools:

Not too long ago, I ended up taking the DP-203 certification exam for sundry reasons. On that exam, they ask a lot about Azure Synapse Analytics, including indexing, distribution, and partitioning strategies. Because these can be a bit different from on-premises SQL Server, I wanted to cover what options are available and when you might choose them. Let’s start with distributions, as that’s the biggest change in thought process.

Read on for the guidance.

Comments closed

Azure Synapse Analytics Updates

Saveen Reddy catalogs what’s new in Azure Synapse Analytics:

Quick Reuse of Spark clusters

By default, every data flow activity spins up a new Spark cluster based upon the Azure Integration Runtime (IR) configuration. Cold cluster start-up time takes a few minutes. If your pipelines contain multiple sequential data flows, you can enable a time-to-live (TTL) value, which keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR during the TTL duration, it will reuse the existing cluster and start up time will be greatly reduced.

Read on for the full list of updates.

Comments closed

A Primer on the Serverless SQL Pool

Tino Zishiri has some tips for people trying out the Azure Synapse Analytics serverless SQL pool:

Serverless SQL Pools or SQL on-demand is a serverless distributed data processing service offered by Microsoft. The service is comparable to Amazon Athena. The serverless nature of the service means that there is no infrastructure to manage, and you only pay for what you use (pay-per-query model).
Through Serverless SQL pools, you query the data in your data lake using T-SQL. The architecture behind the service is optimized for querying and analyzing big data by running queries in parallel.

Read on to understand where the serverless SQL pool fits, as well as some tips about data transformation with this pool.

Comments closed

Microsoft.DataFactory and Storage Event Triggers in Synapse

Cathrine Wilhelmsen troubleshoots an Azure issue:

I ran into an issue today while trying to publish a storage event trigger in Azure Synapse Analytics. After publishing, I got error messages that said “failed to subscribe” and “failed to activate”. The storage event trigger had been published, but it wouldn’t start. Help!

Click through for some resources on documentation, a few things which didn’t work, and what finally resolved the issue.

Comments closed

Azure Synapse Analytics November Updates

James Serra keeps us up to date on Synapse:

Delta Lake support for serverless SQL is generally available: Azure Synapse has had preview-level support for serverless SQL pools querying the Delta Lake format. This enables BI and reporting tools to access data in Delta Lake format through standard T-SQL. With this latest update, the support is now Generally Available and can be used in production. See How to query Delta Lake files using serverless SQL pools

Click through for the full list of what James likes.

Comments closed

Updates in Azure Synapse Analytics

Saveen Reddy shows how the Synapse product team has been busy this year:

Previously, Synapse workspaces had a kind of database called a Spark Database. Spark databases had two key characteristics:

– Tables in Spark databases kept their underlying data in Azure Storage accounts (i.e. data lakes)

– Tables in Spark databases could be queried by both Spark pools and by serverless SQL pools.

To help make it clear that these databases are supported by both Spark and SQL and to clarify their relationship to data lakes, we have renamed Spark databases to Lake databases. Lake databases work just like Spark databases did before. They just have a new name.

Okay, this is the kind of change I can do without. That’s a really dumb name. Spark databases tell you what a thing is. It’s a database which lives in Apache Spark. Lake databases run what? Apache Spark. But if anything really should be called a Lake database, it’d be a serverless SQL pool’s database because everything in there is built on top of the data lake—it’s all external tables pointing to a lake. So calling a Spark database a Lake database brings more confusion than elucidation.

Most of the other changes on that list? Really cool. This one? Not at all.

Comments closed

Organizing Synapse Workspaces and Lakehouses

Jovan Popovic confirms that Microsoft is using the term “Lakehouse” like Databricks does:

The lakehouse pattern enables you to keep a large amount of your data in Data Lake and to get the analytic capabilities without a need to move your data to some data warehouse to start an analysis. A lakehouse represents a good trade-off between query performance and the ability to access the latest version of data without the need to wait for data to be reloaded.

Azure Synapse Analytics workspace enables you to implement the Lakehouse pattern on top of Azure Data Lake storage.

When you think about your lakehouse solution, be aware that there are two options for creating databases over the lake:

Lake databases that are created using Spark or database template

SQL databases that are created using serverless SQL pools on top of data lake.

Although you might use different tools and languages to create these types of databases, the principles described in this article apply to both types. I will use the term “lakehouse” whenever i reference Spak Lake database or SQL database created using the serverless SQL pools.

Click through for Jovan’s guidance.

Comments closed

Data Types Matter, Even in the Serverless SQL Pool

Jovan Popovic has a public service announcement for us:

The serverless SQL pool is a distributed computing system that executes concurrent queries on a set of distributed compute nodes. Multiple compute nodes are running the parts of a distributed query plan that read the underlying files, join the data sets, group, and aggregate results. Different queries might try to use the same compute nodes to execute the parts of the queries.

The oversized column types like VARCHAR(MAX) might trick the compute node to allocate more resources than is needed. However, the allocation is based on the estimate, but these over-allocated resources will not be used in actual execution because they are not needed. If a compute node needs 100MB to sort the results it will use these 100MB although the query optimizer allocated 4GB of memory for the task on the compute node.

Read the whole thing.

Comments closed