Press "Enter" to skip to content

Category: Cloud

Approximate Percentiles in SQL DB and SQL MI

Balmukund Lakhani has an announcement:

Approximate query processing was introduced to enable operations across large data sets where responsiveness is more critical than absolute precision. Approximate operations can be used effectively for scenarios such as KPI and telemetry dashboards, data science exploration, anomaly detection, and big data analysis and visualization. The approximate query processing family has enabled a new market of big data HTAP customer scenarios, including fast-performing dashboard and data science exploration requirements.

Today, we are announcing the preview of a native implementation of APPROX_PERCENTILE in Azure SQL Database and Azure SQL Managed Instance. This function calculates an approximate value at a provided percentile from a distribution of numeric values.

This is way faster than using the PERCENTILE_CONT() or PERCENTILE_DISC() window functions. For a decent-sized query, I was getting anywhere from 5-20x performance improvements, and the larger the dataset, the bigger the gains. It is important to note that the approximate percentiles are not window functions, so you don’t get one row back per row of input.
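
To see the difference in shape, here is a minimal sketch; the table and column names (dbo.Sales, Region, SaleAmount) are made up for illustration. The approximate version is an ordinary aggregate, while the exact version is a window function:

-- Approximate: an aggregate, so one row comes back per group
SELECT Region,
       APPROX_PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SaleAmount) AS MedianSale
FROM dbo.Sales
GROUP BY Region;

-- Exact: a window function, so one row comes back per input row,
-- and DISTINCT is needed to collapse the results
SELECT DISTINCT Region,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SaleAmount)
           OVER (PARTITION BY Region) AS MedianSale
FROM dbo.Sales;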

Comments closed

Monitoring the Serverless SQL Pool via Log Analytics

Sidney Cirqueira shows how to monitor SQL requests in Azure Synapse Analytics:

Today I would like to share a scenario that I was working on in one of my serverless SQL Pool support cases. The customer asked for advice on how to monitor serverless SQL requests by using Log Analytics.

The intention of this guide is to help you choose the configuration required to easily set up Synapse Analytics workspace monitoring, along with all other considerations about how to monitor serverless SQL requests with Azure Monitor. Spoiler: At the end of this article, I will share the latest version of the serverless workbook posted on the Azure_Synapse_Tool_Box. This includes a really cool way to see query execution information.

Read on for that and definitely check out the Azure Synapse Toolbox if you’re a Synapse user.

Comments closed

Pattern Learning in Amazon SageMaker

Vishaal Kapoor, et al, take us through an example of pattern learning in Amazon SageMaker:

Pattern learning automatically analyzes your data and surfaces textual constraints that may apply to your dataset. For the example with phone numbers, pattern learning can analyze the data and identify that the vast majority of phone numbers follow the textual constraint [1-9][0-9]{2}-[0-9]{4}. It can also alert you that there are examples of invalid data so that you can exclude or correct them.

In the following sections, we demonstrate how to use pattern learning in Data Wrangler using a fictional dataset of product categories and SKU (stock keeping unit) codes.

Read on for the scenario.
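
To make that constraint concrete: a value like 555-0199 satisfies [1-9][0-9]{2}-[0-9]{4}, while values like 55-019 or 555-01AB would surface as violations to exclude or correct.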

Comments closed

Transferring Data between Dedicated SQL and Spark Pools in Synapse

Sidney Cirqueira shows off a connector available to us in Azure Synapse Analytics:

Usually, customers do this kind of operation using Synapse Apache Spark to load data to a dedicated pool within the Azure Synapse workspace, but today, I would like to reproduce a different scenario that I was working on in one of my support cases. Consider a scenario where you are trying to load data from Synapse Spark to a dedicated pool (formerly SQL DW) using Synapse Pipelines, and additionally you are using a Synapse workspace deployed with a managed virtual network.

The intention of this guide is to help you understand which configuration is required if you need to load data from Azure Synapse Apache Spark to a dedicated SQL pool (formerly SQL DW). If you prefer to take advantage of the new feature-rich capabilities now available via the Synapse workspace and Studio, and to load data directly from Apache Spark to a dedicated pool in the Azure Synapse workspace, it is recommended that you enable Synapse workspace features on an existing dedicated SQL pool (formerly SQL DW).

Read on for a few tips and a step-by-step walkthrough of the process.

Comments closed

Azure SQL DB GP Compute Optimized Performance Comparison

Reitse Eskens continues a series on comparing the performance of different Azure SQL DB tiers:

The compute optimized tier starts at 8 cores as a minimum and goes up to 72.
The maximum storage starts at 1 TB and goes up to 4 TB. The number of TempDB files is undisclosed but the size starts at 37 GB and goes up to 333 GB. A huge difference from the regular provisioned one! The disk limitations are disclosed as well. Log rate starts at 36 Mbps, maxing out at 50. The same goes for the data: starting at 2,560 IOPS and maxing out at 12,800. One IOP (or Input Output oPeration) is connected to the disk cluster size. As these are 4 KB, reading or writing one data page (8 KB) equals two IOPS. 2,560 IOPS equals roughly 10 MB per second. The top end goes to about 50 MB per second. Keep this in mind when you're working out which tier you need, because disk performance is often essential.
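
A quick check on that arithmetic: 2,560 IOPS × 4 KB comes to 10,240 KB (about 10 MB) per second, and the 12,800 IOPS ceiling × 4 KB comes to 51,200 KB (about 50 MB) per second, which lines up with the quoted top end.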

Read on to see how it performs on Reitse’s standardized test workload.

Comments closed

Granular Billing for Azure Data Factory

Chenye Charlie Zhu announces a new feature:

By default, Azure Data Factory reports lump sum charges for billing, meaning that at the factory level, we add up charges across all pipelines within a factory, and tell you how much you have spent on these pipelines. In many cases, these aggregate numbers should suffice. But in others, these numbers lack the clarity and transparency that we strive to provide customers. For instance, if you are running data pipelines for multiple teams, you may want to determine the cost for each pipeline, for proper bookkeeping and/or chargebacks.

Now, Azure Data Factory will help you with this endeavor, with a built-in, per-pipeline detailed billing view. Moreover, we built the feature on top of the Azure Billing and Cost Analysis platform, allowing you to stay with the cost and budget management tool that you are familiar with to identify spending trends and spot where overspending might have occurred.

Great if you have half a dozen pipelines. Probably less great if you have 500.

Comments closed

Geo-Zone Redundant Storage for SQL MI Backups

Niko Neugebauer moves the backups pretty far away:

The new Geo-Zone Redundant Storage (GZRS) backup storage option combines the best of two worlds – Geo-Redundant and Zone-Redundant storage, keeping backups safe from both regional (Geo-Redundant) and Data Center (Zone-Redundant) failures. It provides the highest availability for storage currently offered on Azure, improving recovery speed and enabling Point-In-Time Restore (PITR) of backups in the event of a zone failure.  

Geo-Zone Redundant Storage for Azure SQL Managed Instance backups provides 3 synchronous copies in different availability zones within the same primary region, plus an additional asynchronous copy within a single availability zone in the paired secondary region, as shown in the following picture:

Click through for that picture and what it does for expected availability. Basically, a whole bunch of data centers would need to fail before you lose a backup. Or someone messes up DNS and makes everything unavailable for a day, not that that’s ever happened before with a large cloud service provider…

Comments closed

Running Jobs in Azure SQL Database

Etienne Lopes gets a job:

In SQL Server, many automated tasks are implemented via SQL Server Agent jobs. If you're used to it, you may get a bit sad when you learn that SQL Server Agent is not present in Azure SQL Database. However, if those automated tasks rely on T-SQL, then they can still be easily implemented in Azure SQL Database using elastic jobs, which provide the ability to run T-SQL scripts on a schedule or on-demand.

Read on to see how they work.
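
For a flavor of what that looks like, here is a minimal sketch of defining and running a job, assuming an Elastic Job agent and its job database already exist; the server, database, credential, and object names below are placeholders.

-- Run these in the job database associated with the Elastic Job agent
EXEC jobs.sp_add_target_group @target_group_name = 'MyTargets';

EXEC jobs.sp_add_target_group_member
    @target_group_name = 'MyTargets',
    @target_type = 'SqlDatabase',
    @server_name = 'myserver.database.windows.net',
    @database_name = 'MyDatabase';

EXEC jobs.sp_add_job
    @job_name = 'NightlyCleanup',
    @description = 'Example T-SQL task';

EXEC jobs.sp_add_jobstep
    @job_name = 'NightlyCleanup',
    @command = N'EXEC dbo.PurgeStagingTables;',
    @credential_name = 'MyJobCredential',
    @target_group_name = 'MyTargets';

-- Run on demand; a schedule can be attached via jobs.sp_update_job
EXEC jobs.sp_start_job @job_name = 'NightlyCleanup';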

Comments closed

General Purpose Tier Azure SQL DB Performance

Reitse Eskens continues a series on comparing tiers of Azure SQL Database:

In my previous blog, I wrote about the serverless tier, the one that can go to sleep if you’re not using it for more than one hour (minimum). That tier is cheaper as long as you’re not running it for more than 25% of the time. If you need more time, go provisioned.
Another difference between serverless and provisioned is that the provisioned one gets a set number of cores whereas the serverless one has a minimum and a maximum number of cores. So this time, the blog is about the provisioned tier, where you choose a fixed number of CPUs with a fixed monthly cost.

Click through for the analysis. I’ll reiterate here that I really hope Reitse has some graphics at the end (or at least tables) which sort of lay out where the boundaries between tiers are and what the performance and cost profiles look like between them.

Comments closed