
Curated SQL Posts

The User-Assigned Managed Identity in ADF

Asanka Padmakumara takes a look at the user-assigned managed identity:

If you are familiar with Managed Identity concepts in ADF, you know that each ADF instance comes with its own System Assigned Managed Identity (MI). We can use that MI to control ADF’s access to any data sources which support Azure AD-based authentication. This is considered the most secure and recommended way of authenticating ADF with cloud systems. If not, you can use Azure Key Vault to store credentials. Let’s take an example to discuss how a User Assigned Managed Identity helps manage access across multiple ADF environments.

Click through to see how the user assigned managed identity makes life better.
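As a rough illustration from the client side (my sketch, not part of Asanka’s post), here is how you would pick a specific user-assigned identity when requesting a token with the azure-identity Python library. The client ID and scope are placeholders, and the code only works when run from an Azure resource that has the identity attached.

```python
# Sketch: system-assigned vs. user-assigned managed identities with azure-identity.
# Must run on an Azure resource that has the identity attached; the client ID
# below is a placeholder.
from azure.identity import ManagedIdentityCredential

# System-assigned: the identity belongs to the single hosting resource.
system_mi = ManagedIdentityCredential()

# User-assigned: the same identity (and therefore the same grants) can be
# attached to multiple resources, e.g. dev, test, and prod ADF instances.
user_mi = ManagedIdentityCredential(client_id="00000000-0000-0000-0000-000000000000")

# Request a token for an Azure AD-protected resource such as Azure SQL Database.
token = user_mi.get_token("https://database.windows.net/.default")
print(token.expires_on)
```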


SQL Server Backend for Django

Warren Chu announces a new version of the SQL Server 3rd Party Backend for Django:

We have released version 1.1 of the SQL Server 3rd Party Backend for Django. This release contains support for the upcoming release of Django 4.0, as well as a number of issue fixes.

Our plan is to time releases to coincide with major releases of Django and SQL Server, to ensure users of this project can keep up to date with Django while continuing to use SQL Server as a backend.

Read on to see what this entails.
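If you haven’t wired the backend up before, it is a normal DATABASES entry in settings.py. Here is a minimal sketch; the server, database, and credentials are placeholders, and the driver line assumes the Microsoft ODBC driver is installed.

```python
# settings.py (sketch): pointing Django at SQL Server via the mssql-django backend.
# Install the backend first: pip install mssql-django
DATABASES = {
    "default": {
        "ENGINE": "mssql",
        "NAME": "my_django_db",          # placeholder database name
        "HOST": "sqlserver.example.com", # placeholder server
        "USER": "django_app",            # placeholder login
        "PASSWORD": "********",
        "OPTIONS": {
            "driver": "ODBC Driver 18 for SQL Server",
        },
    }
}
```

From there, the usual Django workflow (migrate, ORM queries, and so on) applies as normal.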


Choosing a Statistical Test

Antoine Soetewey has a handy chart for us:

Being a teaching assistant in statistics for students with diverse backgrounds, I have the chance to see what is globally not well understood by students.

I have realized that it is usually not a problem for students to do a specific statistical test when they are told which one to use (as long as they have good resources and they have been attentive during classes, of course). However, it appears that the task is much more difficult for them when they need to choose what test to do.

Click through for the chart, as well as a PDF version. H/T R-Bloggers.
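As a tiny, simplified illustration of the sort of branch that chart encodes (my sketch, not from Antoine’s post): comparing two independent groups, with a normality check steering you toward a t-test or its non-parametric counterpart.

```python
# Simplified sketch of one branch of "which test do I run?":
# compare two independent samples, checking normality first.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=40)

# Shapiro-Wilk as a rough normality check on each group.
looks_normal = all(stats.shapiro(g).pvalue > 0.05 for g in (group_a, group_b))

if looks_normal:
    # Parametric route: independent two-sample t-test.
    result = stats.ttest_ind(group_a, group_b)
else:
    # Non-parametric fallback: Mann-Whitney U test.
    result = stats.mannwhitneyu(group_a, group_b)

print(result)
```

The real chart covers far more situations than this one branch, which is why it’s worth keeping around.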


Installing Apache Spark

Tomaz Kastrun continues a series on Apache Spark:

Installing Apache Spark on a Windows computer requires a preinstalled Java JDK (Java Development Kit), version 8 or later; the current version is 17. Download Java from the Oracle website and install it on your system. The easiest way is to download the x64 MSI installer. Run the file and follow the instructions. The installer will create a folder like “C:\Program Files\Java\jdk-17.0.1”.

Read on for instructions for both Windows and macOS. You can also create a container running Spark, which is another helpful method.
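Once Java and Spark are in place (and JAVA_HOME and SPARK_HOME are set), a quick PySpark smoke test will tell you whether the installation actually works. A minimal sketch:

```python
# Minimal smoke test for a local Spark installation using PySpark.
# Assumes JAVA_HOME and SPARK_HOME are already configured.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")            # run locally on all available cores
    .appName("install-smoke-test")
    .getOrCreate()
)

# If this prints 100, Spark and the Java runtime are wired up correctly.
print(spark.range(100).count())

spark.stop()
```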


Locking Issue with Columnstore Indexes

Joe Obbish troubleshoots an issue on tables with columnstore indexes:

I recently ran into a production issue where a SELECT query that referenced a NOLOCK-hinted table was hitting a 30 second query timeout. Query store wait stats suggested that the issue was blocking on a table with a nonclustered columnstore index (NCCI). This was quite unexpected to me and I was eventually able to produce a reproduction of the issue. I believe this to be a bug in SQL Server that’s present in both RTM and the current CU as of this blog post (CU14). The issue impacts CCIs as well, but I did significantly less testing with that index type.

Read on for the issue, how you can replicate it, and a couple ways to work around it.


Why the Optimizer Doesn’t Look at Buffer Pool Data

Paul Randal has an explanation for us:

SQL Server has a cost-based optimizer that uses knowledge about the various tables involved in a query to produce what it decides is the most optimal plan in the time available to it during compilation. This knowledge includes whatever indexes exist and their sizes and whatever column statistics exist. Part of what goes into finding the optimal query plan is trying to minimize the number of physical reads needed during plan execution.

One thing I’ve been asked a few times is why the optimizer doesn’t consider what’s in the SQL Server buffer pool when compiling a query plan, as surely that could make a query execute faster. In this post, I’ll explain why.

This is an interesting post because it explains why the developers of the database engine would purposefully ignore something that could make things faster, but at a potentially devastating cost.


AD Authentication with SQL Server on Linux

Anthony Nocentino will have none of your SQL authentication:

In this post, we’re going to walk through configuring Active Directory authentication for SQL Server on Linux. We will start by joining the Linux server to the domain, configuring SQL Server on Linux to communicate to the domain, and then use adutil to create our AD users and set up Kerberos for SQL Server login authentication.

This does take a bit more effort than using Windows authentication, but if you want to use SQL Server on Linux, I’d consider it a worthwhile investment of time.
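Once Kerberos is configured, client connections stop needing SQL logins at all. As a hedged sketch of what that looks like from a Linux client (assuming a valid ticket from kinit and the Microsoft ODBC driver installed; the server name is a placeholder), here is an integrated-auth connection via pyodbc.

```python
# Sketch: Kerberos/AD-authenticated connection to SQL Server from a Linux client.
# Assumes `kinit user@CONTOSO.COM` has already been run and the Microsoft
# ODBC Driver 18 for SQL Server is installed; the server name is a placeholder.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=sql01.contoso.com;"
    "DATABASE=master;"
    "Trusted_Connection=yes;"      # use the Kerberos ticket instead of a SQL login
    "TrustServerCertificate=yes;"  # lab setting; use a trusted certificate in production
)

# Should print the AD principal the connection authenticated as.
print(conn.execute("SELECT SUSER_SNAME();").fetchone()[0])
conn.close()
```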


ElasticMapReduce Serverless

Damon Cortesi, et al, announce serverless EMR is now in preview:

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark, Hive, and Presto, without having to configure, manage, optimize, or secure clusters. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that your applications use.

In this post, we discuss the benefits of EMR Serverless, walk you through the core concepts of EMR Serverless and how you can use it, and show you a quick demo.

If you’re already using EMR for ephemeral work—that is, using a Spark cluster to perform data transformations and then shutting it down—this makes a lot of sense as long as there’s not a major difference in cost.
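As a rough sketch of the workflow (mine, not from the AWS post), submitting a Spark job with boto3 looks something like the following. The release label, role ARN, and S3 path are placeholders, and a real pipeline would wait for the application to be ready before starting a job run.

```python
# Sketch: creating an EMR Serverless application and submitting a Spark job
# with boto3. Release label, role ARN, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr-serverless", region_name="us-east-1")

app = emr.create_application(
    name="demo-spark-app",
    releaseLabel="emr-6.6.0",   # placeholder EMR release
    type="SPARK",
)

job = emr.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",   # placeholder script
        }
    },
)

print(job["jobRunId"])
```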


A Primer on Apache Spark

Tomaz Kastrun has started a new series:

Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally it was developed at UC Berkeley’s AMPLab and later donated to the Apache Software Foundation, which has maintained it since.

Click through to learn more about the product.
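To make “implicit data parallelism” a bit more concrete, here is a toy PySpark word count: you declare the transformations once and Spark spreads the work across whatever cores or executors are available, retrying failed tasks for you.

```python
# Toy word count in PySpark: transformations are declared once and Spark
# parallelizes them across the available executors (or local cores).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-primer").getOrCreate()

lines = spark.sparkContext.parallelize([
    "apache spark is a unified analytics engine",
    "spark programs get implicit data parallelism and fault tolerance",
])

counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

print(sorted(counts.collect()))
spark.stop()
```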
