Press "Enter" to skip to content

Curated SQL Posts

Building a Google Analytics Dashboard using RShiny

Pascal Schmidt builds a Shiny dashboard:

I participated in the R Shiny 2021 contest and published an application similar to the Google Analytics dashboard app. For that, I used the Google Analytics API and the Google Search Console API to pull my own data from my blog directly into the application.

The application uses the shinyauthr library, so the dashboard can be adjusted for each user, each of whom has their own username and password. On the first page, there are some visualizations for page views, devices used, etc. On the second page, there is a time-series model that tries to predict my page views two months in advance.

In particular, this post covers a couple of the foundational pieces, with future posts getting into more detail on other components. H/T R-Bloggers.
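
For a flavor of the authentication piece, here is a minimal shinyauthr sketch, assuming a hypothetical in-memory credential table (a real deployment would store hashed passwords and set sodium_hashed = TRUE):

```r
library(shiny)
library(shinyauthr)

# Hypothetical credential table; the user name and password are invented.
user_base <- data.frame(
  user     = "pascal",
  password = "analytics",
  stringsAsFactors = FALSE
)

ui <- fluidPage(
  shinyauthr::logoutUI("logout"),
  shinyauthr::loginUI("login"),
  uiOutput("dashboard")
)

server <- function(input, output, session) {
  # Returns a reactive holding user_auth (TRUE after login) and user info
  credentials <- shinyauthr::loginServer(
    id       = "login",
    data     = user_base,
    user_col = user,
    pwd_col  = password,
    log_out  = reactive(logout_init())
  )

  logout_init <- shinyauthr::logoutServer(
    id     = "logout",
    active = reactive(credentials()$user_auth)
  )

  # Only render the dashboard once the user has authenticated
  output$dashboard <- renderUI({
    req(credentials()$user_auth)
    h3(paste("Analytics dashboard for", credentials()$info$user))
  })
}

shinyApp(ui, server)
```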


Choosing between Neural Network Types

Jason Brownlee takes us through three common classes of neural network and explains when each is useful:

In this post, you will discover the suggested use for the three main classes of artificial neural networks.

After reading this post, you will know:

– Which types of neural networks to focus on when working on a predictive modeling problem.

– When to use, not use, and possibly try using an MLP, CNN, and RNN on a project.

– To consider the use of hybrid models and to have a clear idea of your project goals before selecting a model.

Read the whole thing.
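
As a rough companion sketch, the three classes look quite different even at the model-definition stage. This is a hedged Keras illustration; the input shapes and layer sizes are invented, not taken from the post:

```python
# Contrasting the three classes of network in Keras.
from tensorflow import keras
from tensorflow.keras import layers

# MLP: fixed-length tabular inputs (here, 20 hypothetical features)
mlp = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])

# CNN: grid-structured inputs such as images (here, 28x28 grayscale)
cnn = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# RNN: sequences, possibly of variable length (here, 8 features per step)
rnn = keras.Sequential([
    keras.Input(shape=(None, 8)),
    layers.LSTM(32),
    layers.Dense(1),
])
```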


Azure Synapse Link for SQL Server 2022 and File Analysis

Kevin Chant digs into Azure Synapse Link for SQL Server 2022:

In this post, I want to cover some file tests that I performed for Azure Synapse Link for SQL Server 2022.

A while back, I spotted something interesting whilst doing some initial tests for Azure Synapse Link for SQL Server 2022: when you add new data after the initial load, a new folder called ‘ChangeData’ appears in the storage account container. I noticed that the new file containing the insert was a comma-separated value (CSV) file, whereas the table used for the initial load was a parquet file.

Is there a method to this madness? Click through to see Kevin’s tell-all story.


sp_prepare and Parameter Sensitive Plan Optimization

Erik Darling is a bit surprised:

I admit that sp_prepare is an odd bird, and thankfully one that isn’t used a ton. I still run into applications that are unfortunate enough to have been written by people who hate bloggers and continue to use it, though, so here goes.

When you use sp_prepare, parameterized queries behave differently from normal: the parameters don’t get histogram cardinality estimates, they get density vector cardinality estimates.

That part’s not the surprise. You’ll have to click through for that.
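
If you have never seen sp_prepare in the wild, the pattern looks like this; the table and parameter here are stand-ins, not Erik's demo:

```sql
DECLARE @handle int;

-- The statement is compiled before any parameter value is known, which is
-- why the optimizer uses density-vector estimates instead of the histogram.
EXEC sp_prepare
    @handle OUTPUT,
    N'@Reputation int',
    N'SELECT COUNT(*) FROM dbo.Users WHERE Reputation = @Reputation;';

EXEC sp_execute @handle, 2;

EXEC sp_unprepare @handle;
```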


Data-Driven Report Subscriptions with Power Automate

Dan English puts together a workflow:

Being able to do a data-driven report subscription with Power BI and Paginated reports is a common request we hear from many customers. Let’s say you want to send a PDF version of a report to each of your store or department managers, using a set of parameter values specific to each person. That is not an option in the Power BI service, but you can do it using Power Automate.

In this post, I will be using a Paginated report that references data in an Azure Analysis Services database, along with an Excel file in OneDrive for Business that includes the information needed for the data-driven subscription: 2,000 records. The Paginated report is in a workspace backed by a Power BI Embedded A-SKU (A4, the equivalent of a P1 SKU) for testing purposes, and the AAS tier is an S1 (100 QPU).

Read on to see how.


DAX EvaluateAndLog()

Chris Webb is excited:

I think this is the most exciting thing to happen in DAX since the introduction of variables. Why? Well, we all know that writing DAX can be hard, and part of why it’s hard is that it’s hard to debug.

Read on to see how EvaluateAndLog() makes life easier, as well as a couple of links to Jeffrey Wang’s blog for further detail.
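
The gist, with a hypothetical model and measure name: EVALUATEANDLOG() returns its first argument unchanged, so query results are unaffected, while the intermediate value is written to a trace event you can inspect (the optional second argument is a label for the logged output):

```dax
Total Sales (debug) =
EVALUATEANDLOG ( SUM ( Sales[Amount] ), "total sales check" )
```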


tapply() and Ragged Arrays in R

Benjamin Smith explains how tapply() works:

While I saw other programmers use this function, I found myself unsure of how it worked or when I would need to use it. In this blog I attempt to change that and explain the cryptic description by showing some applications with my commentary and how it compares to using the “tidy” approach with the tidyverse.

My inspiration for writing this blog was from seeing Dr. Norm Matloff’s blog where he mentions the use of tapply() and his thoughts on the tidyverse. For a more thorough treatment on his critique of the tidyverse and “tidy” methods, check out his formal essay here.

Read on to learn the benefits of using tapply().
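
As a quick taste, using the built-in mtcars dataset: tapply() splits a vector by one or more factors and applies a function to each group.

```r
# Mean miles-per-gallon by cylinder count
tapply(mtcars$mpg, mtcars$cyl, mean)
#>        4        6        8
#> 26.66364 19.74286 15.10000

# The groups have different sizes (11, 7, and 14 cars), which is the
# "ragged array" part; with two grouping factors the result is a matrix,
# and any empty combination becomes NA.
tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean)
```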


Querystrings and R Shiny

Thomas Williams passes along querystring data:

As background, a query string is part of a web page address. Query strings are used to pass information to web pages, in name/value pairs separated by an equals sign – for instance, user=Andrew or country=au. Name/value pairs are themselves separated by ampersands, so passing multiple values looks like user=Andrew&country=au.

Click through for an example of how it all works.
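
Shiny has this built in; here is a minimal sketch (the output ID is my own) using parseQueryString() and session$clientData$url_search:

```r
library(shiny)

ui <- fluidPage(textOutput("greeting"))

server <- function(input, output, session) {
  output$greeting <- renderText({
    # For a URL ending in ?user=Andrew&country=au, this returns
    # list(user = "Andrew", country = "au")
    query <- parseQueryString(session$clientData$url_search)
    paste0("user: ", query[["user"]], ", country: ", query[["country"]])
  })
}

shinyApp(ui, server)
```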


Case-Sensitive String Comparisons and Case-Insensitive Tables

Meagan Longoria reminds us that case sensitivity was a huge mistake:

Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.

Click through for an example and how to extricate yourself from this scenario. Python is certainly not the only language that compares strings case-sensitively, so this is good to know even if you don’t plan on using or supporting Python.
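
A small illustration of the trap, with a made-up product list; casefold() is Python's recommended normalization for caseless comparisons:

```python
# Python deduplicates case-sensitively, so this set has three members,
# and inserting all of them trips a case-insensitive unique constraint.
products = ["Widget", "widget", "WIDGET"]
print(set(products))  # {'Widget', 'widget', 'WIDGET'}

# One fix: dedupe case-insensitively before the insert,
# keeping the first casing seen.
unique_products = {}
for p in products:
    unique_products.setdefault(p.casefold(), p)
print(list(unique_products.values()))  # ['Widget']
```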


OPENJSON Performance and Schemas

Dave Mason has a new blog theme and a post on OPENJSON performance:

Support for JSON data has been around in SQL Server for a while now, starting with SQL 2016. The OPENJSON rowset function is the built-in function that allows you to natively convert JSON text into a set of rows and columns. There are two options for using OPENJSON: with the default schema or with an explicit schema. There are performance implications for each, which I’ll review with some examples.

Dave has some nice tips for people working with JSON data in SQL Server.
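
For reference, the two options look like this (the sample JSON is mine); Dave's post compares how each performs:

```sql
DECLARE @json nvarchar(max) = N'[
    {"id": 1, "name": "Anna"},
    {"id": 2, "name": "Ben"}
]';

-- Default schema: one key/value/type row per array element or property.
SELECT [key], [value], [type]
FROM OPENJSON(@json);

-- Explicit schema: a WITH clause maps JSON paths to typed columns.
SELECT id, name
FROM OPENJSON(@json)
WITH (
    id   int           '$.id',
    name nvarchar(50)  '$.name'
);
```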
