Press "Enter" to skip to content

Curated SQL Posts

Decile Analysis and Logistic Regression

Ridhima Kumar (re-)introduces us to decile analysis:

Decile analysis was once a popular technique; however, the convention of teaching and bucketing machine learning problems as either ‘classification’ or ‘regression’ types has led people to forget decile-style analyses. I am pretty sure most freshly minted data scientists have not even heard of decile analysis. So, coming back to what decile analysis is.

Decile analysis is used to categorize a dataset from highest to lowest values, or vice versa, based on predicted probabilities.

As is obvious from the name, the analysis involves dividing the dataset into ten equal groups. Each group should have the same number of observations/customers.

It ranks customers in the order from most likely to respond to least likely to respond.

Read on to learn the steps and how this ties in with the fact that logistic regression is regression.
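
To make the mechanics concrete, here is a minimal T-SQL sketch of bucketing scored customers into deciles with NTILE; the table and column names (dbo.ScoredCustomers, PredictedProbability, Responded) are hypothetical stand-ins for whatever your scoring model produces.

-- Assign each scored customer to a decile (1 = most likely to respond)
SELECT
    CustomerID,
    PredictedProbability,
    NTILE(10) OVER (ORDER BY PredictedProbability DESC) AS Decile
INTO #Deciles
FROM dbo.ScoredCustomers;

-- Summarize the actual response rate per decile to build the decile / lift table
SELECT
    Decile,
    COUNT(*) AS Customers,
    AVG(CAST(Responded AS float)) AS ResponseRate
FROM #Deciles
GROUP BY Decile
ORDER BY Decile;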


When the Version Store Fills tempdb

David Fowler takes us through a mental exercise:

Well, there is something else that I’ve seen have a habit of filling TempDB. If you’re using Read Committed Snapshot Isolation (RCSI) then you’ll also have a version store in your database. I’m not going to go into the details of exactly how the version store works; there is plenty of documentation on it out there if you’re interested (perhaps I’ll write a post on it sometime).

The key bit that you need to know is that although SQL will keep the version store trimmed down and only keep the rows that are needed, it can only clear rows that are older than the oldest transaction. This is because SQL has no way of knowing which rows are going to be needed by that transaction. Do you see a potential issue here?

Read on for enlightenment.
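
If you suspect the version store is what is eating your tempdb, a quick sketch against the built-in DMVs (the first one requires SQL Server 2017 or later) looks something like this:

-- Space the version store is holding in tempdb, per database
SELECT
    DB_NAME(database_id) AS database_name,
    reserved_page_count,
    reserved_space_kb
FROM sys.dm_tran_version_store_space_usage
ORDER BY reserved_space_kb DESC;

-- Long-running snapshot/RCSI transactions that prevent version cleanup
SELECT
    transaction_id,
    transaction_sequence_num,
    elapsed_time_seconds
FROM sys.dm_tran_active_snapshot_database_transactions
ORDER BY elapsed_time_seconds DESC;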


TDE and Backup Compression

Andy Levy learns the truth:

For years, I thought that native backups of databases using Transparent Data Encryption (TDE) couldn’t be compressed. Between TDE being limited to Enterprise Edition until SQL Server 2019 and my own lack of experience with TDE in prior positions, I hadn’t really experimented with this myself. Some people have even gone so far as to skip compression in their backup jobs for TDE-enabled databases because there’s no need to burn those CPU cycles if you won’t get any compression, right?

But a curious thing happened after I upgraded a portion of my environment to SQL Server 2019 in late 2020. I observed that scheduled backups were compressing for some of my TDE-enabled databases, most notably the newer instances. And when I took ad hoc backups in any environment, they were compressed. So why wasn’t it working everywhere?

Read on for the explanation, though one correction: MAXTRANSFERSIZE is 1MB by default only when the database is not encrypted using TDE (and you aren’t backing up to a tape drive). If the database is encrypted using TDE, the default max transfer size is 64KB, and I think that’s what got Andy.
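
To make that concrete, here is a minimal sketch of a compressed backup of a TDE-enabled database (the database name and path are placeholders). From SQL Server 2016 up through the early SQL Server 2019 builds, explicitly raising MAXTRANSFERSIZE above 64KB is what lets compression actually kick in for TDE databases; later SQL Server 2019 cumulative updates handle this for you, which fits what Andy observed.

-- Without MAXTRANSFERSIZE > 64KB, older builds quietly skip compression
-- for TDE-encrypted databases even when COMPRESSION is specified
BACKUP DATABASE [YourTDEDatabase]
TO DISK = N'X:\Backups\YourTDEDatabase.bak'
WITH COMPRESSION,
     MAXTRANSFERSIZE = 1048576, -- 1MB
     STATS = 10;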


Bug with Filtered Index on Computed Column

Erik Darling points out a weird bug:

At some point in the past, I blogged about a silent bug with computed columns and clustered column store indexes.

In this post, I’m going to take a quick look at a very loud bug.

Normally, you can’t add a filtered index to a computed column. I’ve always hated that limitation. How nice would that be for so many currently difficult tasks?

Click through to see how you can create a filtered index against a computed column, as well as all of the pain it provides.
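
For context on the limitation itself, here is a minimal, hypothetical sketch: the filtered index below is normally rejected because the filter predicate references a computed column. The way around it, and the pain that follows, is what Erik's post covers.

CREATE TABLE dbo.Orders
(
    OrderID  int IDENTITY PRIMARY KEY,
    Quantity int NOT NULL,
    Price    money NOT NULL,
    Total    AS (Quantity * Price) -- computed column
);
GO

-- Normally fails: the filter expression of a filtered index
-- cannot reference a computed column
CREATE INDEX IX_Orders_BigTotals
ON dbo.Orders (OrderID)
WHERE Total > 1000.00;
GO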


Nullable Columns and Power Query

Chris Webb gives us another reason to curse NULL:

Recently I’ve been asked by colleagues with various types of performance problems why Power BI is generating SQL in a particular way, and the answer has been the presence of nullable columns in the underlying database – whether it’s SQL Server, Snowflake or Databricks. Now, I’m not a DBA or any kind of database tuning expert, so I can’t comment on why a SQL query performs the way it does on any given platform, but what I can do is show you two examples of how the presence of nullable columns changes the way Power BI and Power Query generate SQL.

Click through to see what happens.
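
As a rough, hypothetical illustration of the underlying issue (not the exact SQL Power BI emits; that is what Chris demonstrates): when join or filter columns are nullable, a query generator has to emit NULL-aware comparisons so that blank matches blank, and those extra predicates tend to be harder on the optimizer than plain equality.

-- Plain equality: simple for the optimizer when CustomerKey is NOT NULL on both sides
SELECT COUNT(*)
FROM dbo.Sales AS s
JOIN dbo.Customers AS c
    ON s.CustomerKey = c.CustomerKey;

-- NULL-aware form a generator may fall back to when the columns are nullable,
-- so that a NULL key on one side matches a NULL key on the other
SELECT COUNT(*)
FROM dbo.Sales AS s
JOIN dbo.Customers AS c
    ON s.CustomerKey = c.CustomerKey
    OR (s.CustomerKey IS NULL AND c.CustomerKey IS NULL);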


Reinvestment Risk and Yield to Maturity

Sang-Heon Lee looks at reinvestment risk:

From this post, we can learn about the reinvestment risk of a coupon bond. It is worth noting that 1) the YTM is attainable when the roll rate (reinvestment rate) is the same as the YTM, and 2) the argument that the coupon rate is equal to the YTM at issuance (par yield) applies only to a standard coupon bond with an in-arrears interest payment schedule. Unlike a standard coupon bond, a coupon bond with in-advance interest payments has a higher YTM than its coupon rate at issuance.

Click through for the explanation as well as the R code used. H/T R-Bloggers.
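
As a tiny worked example of the first point (my numbers, not from the post): take a two-year bond with a face value of 100 and a 5% annual coupon, bought at par, so the YTM is 5%. If the first coupon can be reinvested (rolled) at 5%, the terminal wealth is 5 x 1.05 + 5 + 100 = 110.25 = 100 x 1.05^2, so the realized yield is exactly the 5% YTM. If it can only be reinvested at 3%, the terminal wealth is 5 x 1.03 + 105 = 110.15, and the realized yield is (110.15 / 100)^(1/2) - 1, roughly 4.95%, which falls short of the quoted YTM. That gap is the reinvestment risk.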


The Enterprise Eats Software

Jessica Kerr explains why software from large firms is so often terrible:

Software is hard to get right. And every time we don’t, customers leave.

Appointment scheduling that sends a calendar invitation “Join the Zoom” without a link. Checkout screens that delete my credit card number when I change the shipping address. Complete Order that comes back with “Please try again later.” Items can’t all be shipped, can’t all be picked up, and this is a maze to figure out (also Lowe’s). A shopping cart that pops up a generic error modal when any single call to the server fails.

Read the whole thing. A lot of this sounds like an incentive alignment problem: each sub-group within a large firm optimizes for its own benefits, but the sum total of those choices leads to a sub-optimal result for the firm itself, as in the case of bad software driving customers to Amazon.


Spark SQL and Merge Errors from Multiple Source Rows Matched

Manoj Pandey explains an error message in Spark SQL:

UnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous as it is unclear which source row should be used to update or delete the matching target row. You can preprocess the source table to eliminate the possibility of multiple matches. Please refer to https://docs.microsoft.com/azure/databricks/delta/delta-update#upsert-into-a-table-using-merge

The above error says that, while doing a MERGE operation on the target table, there shouldn’t be any duplicates in the source table (that is, no two source rows may match the same target row). This check is applied implicitly by the SQL engine to avoid unnecessary updates and inconsistent data.

Read on for a reproduction and what you can do to resolve the issue.
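
The usual remedy is to collapse the source to at most one row per join key before the MERGE runs. A hedged Spark SQL sketch (all table and column names here are made up) might look like this:

MERGE INTO target AS t
USING (
  -- keep only the latest source row per id so each target row has at most one match
  SELECT id, amount, updated_at
  FROM (
    SELECT id, amount, updated_at,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM source
  ) ranked
  WHERE rn = 1
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, amount, updated_at) VALUES (s.id, s.amount, s.updated_at)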


Foreign Key Constraints and Blocking

Paul White takes a look at blocking due to foreign key checks:

This article covers one such consideration that does not receive much publicity: To minimize blocking, you should think carefully about the indexes used to enforce uniqueness on the parent side of those foreign key relationships.

This applies whether you are using locking read committed or the versioning-based read committed snapshot isolation (RCSI). Both can experience blocking when foreign key relationships are checked by the SQL Server engine.

Under snapshot isolation (SI), there is an extra caveat. The same essential issue can lead to unexpected (and arguably illogical) transaction failures due to apparent update conflicts.

This article is in two parts. The first part looks at foreign key blocking under locking read committed and read committed snapshot isolation. The second part covers related update conflicts under snapshot isolation.

Definitely worth reading the whole thing.
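
If you want a quick way to see the blocking for yourself, here is a minimal, hypothetical repro sketch; run the two batches in separate sessions, and the child insert waits on the foreign key check against the locked parent row even with READ_COMMITTED_SNAPSHOT enabled for the database:

-- Setup (run once)
CREATE TABLE dbo.Parent
(
    ParentID int NOT NULL CONSTRAINT PK_Parent PRIMARY KEY,
    Padding  varchar(100) NOT NULL
);
CREATE TABLE dbo.Child
(
    ChildID  int IDENTITY PRIMARY KEY,
    ParentID int NOT NULL CONSTRAINT FK_Child_Parent REFERENCES dbo.Parent (ParentID)
);
INSERT dbo.Parent (ParentID, Padding) VALUES (1, 'x');

-- Session 1: modify the parent row and leave the transaction open
BEGIN TRANSACTION;
UPDATE dbo.Parent SET Padding = 'y' WHERE ParentID = 1;
-- (no COMMIT yet)

-- Session 2: blocks on the foreign key existence check
INSERT dbo.Child (ParentID) VALUES (1);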


Reasons to Switch to Power BI Gen2

Teo Lachev gives us five reasons to switch to Power BI Gen2:

More memory
Imported models are memory-resident, so memory is usually the most constraining factor. With Gen2, the capacity maximum memory applies to each resource itself and not collectively across all resources in the capacity. Let’s say you are on a P1 plan, which has a maximum memory capacity of 25GB. With Gen1, you won’t be able to have two datasets of, say, 20GB and 10GB loaded at the same time. However, Gen2 will apply the 25GB limit to each dataset. So each resource (dataset, report, dataflow) will be boxed within 25GB. This feat is possible because Gen2 uses a SaaS approach, which means datasets are scattered across multiple cluster nodes instead of being associated with a dedicated capacity. A potential downside, however, could be a “noisy neighbor” problem, because a P3 cluster node may co-host datasets from different customers.

Read on for the full list of reasons, as well as three things Teo would like to see improved in a future release.
