Press "Enter" to skip to content

Author: Kevin Feasel

Lessons Learned from the Serverless SQL Pool

Teo Lachev has some thoughts for us:

I’ve architected and currently implementing a solution that uses Synapse (my last newsletter has the details, plus the architecture diagram). Synapse Serverless is the Microsoft answer to Amazon Athena but instead of using open-source tools like Presto, it’s built on SQL Server. In this project we extract many tables from 1,500 on-prem SQL Server databases and stage them in ADLS.

Read on for Teo’s notes on the topic.

Comments closed

Azure Synapse Analytics Database Templates

Santosh Balasubramanian shows off database templates in Azure Synapse Analytics:

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources—at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.

One of the challenges that users in key industry areas face is how to describe and shape the mass of data that they are gathering. Most of this data is currently stored in data lakes or in application-specific data silos. The challenge is to bring all this data together in a standardized format enabling it to be more easily analyzed and understood and for ML and AI to be applied to it.

Azure Synapse solves this problem by introducing industry-specific templates for your data, providing a standardized way to store and shape data. These templates provide schemas for predefined business areas, enabling data to be loaded into a database in a structured way.

Read on to see what they can do, and try them out in a Synapse workspace.

Comments closed

Finding Queries with Index Hints

Aaron Bertrand wants to find queries using index hints:

Index hints should be a last resort, but we all know how that tends to go in SQL Server. If you have committed to using index hints, be aware that at some point you may have to update those explicit references. A specific index may get renamed or dropped, making the hints invalid, or a new and better index may be created. How do we find these references to remove, update, or at least document them? I see a lot of queries out there that will search the plan cache for index hints (often using fuzzy matching), or that just happen to use a specific index, but what about queries with explicit index hints that aren’t currently in the plan cache, and is index usage really what we’re after?

Click through for a detailed investigation, and also congratulations to Aaron for landing at Stack Overflow.

Comments closed

Using the Dedicated Admin Connection

Reitse Eskens has the key for the back door:

I got a call this morning from a coworker. One of the database instances was unreachable with the message that the TempDB log file was full. No processes could login and the only way to get things going again might be to restart either server or instance. Restarting the instance should be the last resort, because after a restart SQL Server will plough through all the logfiles to find transactions to either roll forward or roll back. That will take more time that you want. But, we had the famous DAC backdoor installed.

Read on to see how the DAC can save the day.

Comments closed

Finding Processes Waiting on Spinlocks

David Fowler helps us find processes waiting on spinlocks:

Spinlock contention is always a real headache to deal with. I recently saw an issue when spinlock contention on SOS_CACHESTORE was making the server virtually unresponsive. The issue was very intermittent with no obvious pattern but the assumption is that it was caused by a particular process in the application. Finding what that process was the tricky part, they don’t show up as waiting tasks so your usual scripts for looking for waiting processes may not work here.

It’s almost never spinlocks, but when it is, this is how you figure it out.

Comments closed

Thinking like an Escalation Engineer

Stacy Gray shares stories:

“You new?” asked with an amused grin.

“Yes,” I replied floating 2 inches off the ground with a huge, toothy smile.

“Which team?”

“SQL!”

“Good luck.”

I glanced at the badge.  It was blue.  My opportunity to get some secret, inside wisdom!

“I want to become a blue badge.  Do you have any advice on that?” The elevator doors opened.

“Solve your own cases,” was the reply.

Read on for stories, advice, and more.

Comments closed

Azure Synapse Analytics Shared Security

Hiram Fleitas explains the value of workspace and storage account segregation in Azure Synapse Analytics:

Well, why?… perhaps you prefer not spinning more resources to segmentate the environment or decouple the workloads, but you still need to enforce data security across domains.

Lets look at how to secure an HR container in a shared Azure Synapse Analytics workspace that serves mixed workloads by using only RBAC permissions at the storage, and at container level.

It’s recommended to use a separate storage account. I will explain and demo why.

Click through for the demo and explanation.

Comments closed

Sort-Based Blocking Shuffle in Flink

Yingjie Cao and Daisy Tsang have a multi-part series on sort-based blocking shuffles in Apache Flink. Part 1 acts as an overview:

The hash-based blocking shuffle has been supported in Flink for a long time. However, compared to the sort-based approach, it can have several weaknesses:

1. Stability: For batch jobs with high parallelism (tens of thousands of subtasks), the hash-based approach opens many files concurrently while writing or reading data, which can give high pressure to the file system (i.e. maintenance of too many file metas, exhaustion of inodes or file descriptors). We have encountered many stability issues when running large-scale batch jobs via the hash-based blocking shuffle.

2. Performance: For large-scale batch jobs, the hash-based approach can produce too many small files: for each data shuffle (or connection), the number of output files is (producer parallelism) * (consumer parallelism) and the average size of each file is (shuffle data size) / (number of files). The random IO caused by writing/reading these fragmented files can influence the shuffle performance a lot, especially on spinning disks. See the benchmark results section for more information.

By introducing the sort-based blocking shuffle implementation, fewer data files will be created and opened, and more sequential reads are done. As a result, better stability and performance can be achieved.

Part 2 provides some design considerations:

As discussed above, the hash-based blocking shuffle would produce too many small files for large-scale batch jobs. Producing fewer files can help to improve both stability and performance. The sort-merge approach has been widely adopted to solve this problem. By first writing to the in-memory buffer and then sorting and spilling the data into a file after the in-memory buffer is full, the number of output files can be reduced, which becomes (total data size) / (in-memory buffer size). Then by merging the produced files together, the number of files can be further reduced and larger data blocks can provide better sequential reads.

Flink’s sort-based blocking shuffle adopts a similar logic. A core difference is that data spilling will always append data to the same file so only one file will be spilled for each output, which means fewer files are produced.

Check it out for a behind-the-scenes look at

Comments closed

Azure Synapse Data Explorer Pools

Manoj Raheja tries announces another pool type:

At Ignite, we announced the public preview of Azure Synapse data explorer that makes it possible to query huge amounts of structured, semi-structured, and free-text telemetry and time-series data. The following are some of the key capabilities that make this possible:

– Powerful distributed query engine that indexes all data including free text and semi-structured data. The data is automatically compressed, indexed, auto-optimized, and cached on local SSDs and persisted on storage. Compute and storage are decoupled that gives you full elasticity to auto scale in/out without a downtime.

– Intuitive Kusto Query Language (KQL) that is highly optimized for exploring raw telemetry and time series data using Synapse data explore’s best-in-class text indexing for efficient free-text search, regex, and parsing on traces\text data.

– Comprehensive JSON parsing capabilities for querying semi-structured data including arrays and nested structure.

– Native, advanced time series support for creation, manipulation, and analysis of multiple time series with in-engine Python and R execution support for model scoring.

Click through for a demonstration, showing that this is for more than just logs.

Comments closed

Restoring a Database from Backup in Docker

Chad Baldwin has a container in search of a database:

Yesterday, I was watching a Pluralsight course which provided a database .bak file to follow along with the examples. I generally like to use Docker when working with SQL Server locally…but as a somewhat novice user, I have found it to be a bit of a pain if you need to deal with restoring or attaching a database.

When I run into these scenarios, I usually spin up an AWS EC2 instance, install SQL server, and work with it that way. There’s probably a simpler way to do it using RDS or Azure, but I’m not familiar with those just yet. The other option is if I have a Linux machine at hand, I will use that with Docker and mapped volumes work great.

I do happen to have a Linux machine ready to use…but I was determined to figure out how to get this working on Windows.

Bonus points for using RESTORE DATABASE syntax. Every SQL Server user should know how to back up and restore a database using only T-SQL. That’s a skill which will definitely pay dividends.

Comments closed