Press "Enter" to skip to content

Curated SQL Posts

Decile Analysis and Logistic Regression

Ridhima Kumar (re-)introduces us to decile analysis:

Decile analysis was once a popular technique; however, the convention of teaching and bucketing machine learning problems as either ‘classification’ or ‘regression’ types has led people to forget decile-style analyses. I am pretty sure most freshly minted data scientists have not even heard of decile analysis. So, coming back to what decile analysis is.

Decile analysis is used to categorize a dataset from highest to lowest values, or vice versa, based on predicted probabilities.

As is obvious from the name, the analysis involves dividing the dataset into ten equal groups. Each group should have the same number of observations/customers.

It ranks customers in the order from most likely to respond to least likely to respond.

Read on to learn the steps and how this ties in with the fact that logistic regression is regression.
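
To make the mechanics concrete, here is a minimal T-SQL sketch of bucketing scored customers into deciles with NTILE; the table and column names (dbo.ScoredCustomers, PredictedProbability, Responded) are hypothetical stand-ins for whatever your scoring model produces.

-- Assign each scored customer to a decile (1 = most likely to respond)
SELECT
    CustomerID,
    PredictedProbability,
    NTILE(10) OVER (ORDER BY PredictedProbability DESC) AS Decile
INTO #Deciles
FROM dbo.ScoredCustomers;

-- Summarize the actual response rate per decile to build the decile / lift table
SELECT
    Decile,
    COUNT(*) AS Customers,
    AVG(CAST(Responded AS float)) AS ResponseRate
FROM #Deciles
GROUP BY Decile
ORDER BY Decile;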


When the Version Store Fills tempdb

David Fowler takes us through a mental exercise:

Well, there is something else that I’ve seen have a habit of filling TempDB. If you’re using Read Committed Snapshot Isolation (RCSI) then you’ll also have a version store in your database. I’m not going to go into the details of exactly how the version store works; there is plenty of documentation on it out there if you’re interested (perhaps I’ll write a post on it sometime).

The key bit that you need to know is that although SQL will keep the version store trimmed down and only keep the rows that are needed, it can only clear rows that are older than the oldest transaction. This is because SQL has no way of knowing which rows are going to be needed by that transaction. Do you see a potential issue here?

Read on for enlightenment.
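
If you suspect the version store is what is eating your tempdb, a quick sketch against the built-in DMVs (the first one requires SQL Server 2017 or later) looks something like this:

-- Space the version store is holding in tempdb, per database
SELECT
    DB_NAME(database_id) AS database_name,
    reserved_page_count,
    reserved_space_kb
FROM sys.dm_tran_version_store_space_usage
ORDER BY reserved_space_kb DESC;

-- Long-running snapshot/RCSI transactions that prevent version cleanup
SELECT
    transaction_id,
    transaction_sequence_num,
    elapsed_time_seconds
FROM sys.dm_tran_active_snapshot_database_transactions
ORDER BY elapsed_time_seconds DESC;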


TDE and Backup Compression

Andy Levy learns the truth:

For years, I thought that native backups of databases using Transparent Data Encryption (TDE) couldn’t be compressed. Between TDE being limited to Enterprise Edition until SQL Server 2019 and my own lack of experience with TDE in prior positions, I hadn’t really experimented with this myself. Some people have even gone so far as to skip compression in their backup jobs for TDE-enabled databases because there’s no need to burn those CPU cycles if you won’t get any compression, right?

But a curious thing happened after I upgraded a portion of my environment to SQL Server 2019 in late 2020. I observed that scheduled backups were compressing for some of my TDE-enabled databases, most notably the newer instances. And when I took ad hoc backups in any environment, they were compressed. So why wasn’t it working everywhere?

Read on for the explanation, though one correction: MAXTRANSFERSIZE is 1MB by default only when the database is not encrypted using TDE (and you aren’t backing up to a tape drive). If the database is encrypted using TDE, the default max transfer size is 64KB, and I think that’s what got Andy.
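
To make that concrete, here is a minimal sketch of a compressed backup of a TDE-enabled database (the database name and path are placeholders). From SQL Server 2016 up through the early SQL Server 2019 builds, explicitly raising MAXTRANSFERSIZE above 64KB is what lets compression actually kick in for TDE databases; later SQL Server 2019 cumulative updates handle this for you, which fits what Andy observed.

-- Without MAXTRANSFERSIZE > 64KB, older builds quietly skip compression
-- for TDE-encrypted databases even when COMPRESSION is specified
BACKUP DATABASE [YourTDEDatabase]
TO DISK = N'X:\Backups\YourTDEDatabase.bak'
WITH COMPRESSION,
     MAXTRANSFERSIZE = 1048576, -- 1MB
     STATS = 10;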


Bug with Filtered Index on Computed Column

Erik Darling points out a weird bug:

At some point in the past, I blogged about a silent bug with computed columns and clustered column store indexes.

In this post, I’m going to take a quick look at a very loud bug.

Normally, you can’t add a filtered index to a computed column. I’ve always hated that limitation. How nice would that be for so many currently difficult tasks?

Click through to see how you can create a filtered index against a computed column, as well as all of the pain it provides.
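
For context on the limitation itself, here is a minimal, hypothetical sketch: the filtered index below is normally rejected because the filter predicate references a computed column. The way around it, and the pain that follows, is what Erik's post covers.

CREATE TABLE dbo.Orders
(
    OrderID  int IDENTITY PRIMARY KEY,
    Quantity int NOT NULL,
    Price    money NOT NULL,
    Total    AS (Quantity * Price) -- computed column
);
GO

-- Normally fails: the filter expression of a filtered index
-- cannot reference a computed column
CREATE INDEX IX_Orders_BigTotals
ON dbo.Orders (OrderID)
WHERE Total > 1000.00;
GO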


Nullable Columns and Power Query

Chris Webb gives us another reason to curse NULL:

Recently I’ve been asked by colleagues with various types of performance problems why Power BI is generating SQL in a particular way, and the answer has been the presence of nullable columns in the underlying database – whether it’s SQL Server, Snowflake or Databricks. Now, I’m not a DBA or any kind of database tuning expert, so I can’t comment on why a SQL query performs the way it does on any given platform, but what I can do is show you two examples of how the presence of nullable columns changes the way Power BI and Power Query generate SQL.

Click through to see what happens.
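
As a rough, hypothetical illustration of the underlying issue (not the exact SQL Power BI emits; that is what Chris demonstrates): when join or filter columns are nullable, a query generator has to emit NULL-aware comparisons so that blank matches blank, and those extra predicates tend to be harder on the optimizer than plain equality.

-- Plain equality: simple for the optimizer when CustomerKey is NOT NULL on both sides
SELECT COUNT(*)
FROM dbo.Sales AS s
JOIN dbo.Customers AS c
    ON s.CustomerKey = c.CustomerKey;

-- NULL-aware form a generator may fall back to when the columns are nullable,
-- so that a NULL key on one side matches a NULL key on the other
SELECT COUNT(*)
FROM dbo.Sales AS s
JOIN dbo.Customers AS c
    ON s.CustomerKey = c.CustomerKey
    OR (s.CustomerKey IS NULL AND c.CustomerKey IS NULL);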


Reinvestment Risk and Yield to Maturity

Sang-Heon Lee looks at reinvestment risk:

From this post, we can learn about the reinvestment risk of a coupon bond. It is worth noting that 1) the YTM is attainable when the roll rate (reinvestment rate) is the same as the YTM, and 2) the argument that the coupon rate is equal to the YTM at issuance (par yield) applies only to a standard coupon bond with an in-arrears interest payment schedule. Unlike a standard coupon bond, a coupon bond with in-advance interest payments has a higher YTM than its coupon rate at issuance.

Click through for the explanation as well as the R code used. H/T R-Bloggers.
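
As a tiny worked example of the first point (my numbers, not from the post): take a two-year bond with a face value of 100 and a 5% annual coupon, bought at par, so the YTM is 5%. If the first coupon can be reinvested (rolled) at 5%, the terminal wealth is 5 x 1.05 + 5 + 100 = 110.25 = 100 x 1.05^2, so the realized yield is exactly the 5% YTM. If it can only be reinvested at 3%, the terminal wealth is 5 x 1.03 + 105 = 110.15, and the realized yield is (110.15 / 100)^(1/2) - 1, roughly 4.95%, which falls short of the quoted YTM. That gap is the reinvestment risk.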


The Enterprise Eats Software

Jessica Kerr explains why software from large firms is so often terrible:

Software is hard to get right. And every time we don’t, customers leave.

Appointment scheduling that sends a calendar invitation “Join the Zoom” without a link. Checkout screens that delete my credit card number when I change the shipping address. Complete Order that comes back with “Please try again later.” Items can’t all be shipped, can’t all be picked up, and this is a maze to figure out (also Lowe’s). A shopping cart that pops up a generic error modal when any single call to the server fails.

Read the whole thing. A lot of this sounds like an incentive alignment problem: each sub-group within a large firm optimizes for its own benefits, but the sum total of those choices leads to a sub-optimal result for the firm itself, as in the case of bad software driving customers to Amazon.


Spark SQL and Merge Errors from Multiple Source Rows Matched

Manoj Pandey explains an error message in Spark SQL:

UnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous as it is unclear which source row should be used to update or delete the matching target row. You can preprocess the source table to eliminate the possibility of multiple matches. Please refer to https://docs.microsoft.com/azure/databricks/delta/delta-update#upsert-into-a-table-using-merge

The above error says that, while doing a MERGE operation on the target table, there shouldn’t be any duplicates in the source table (that is, no two source rows may match the same target row). This check is applied implicitly by the SQL engine to avoid unnecessary updates and inconsistent data.

Read on for a reproduction and what you can do to resolve the issue.
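
The usual remedy is to collapse the source to at most one row per join key before the MERGE runs. A hedged Spark SQL sketch (all table and column names here are made up) might look like this:

MERGE INTO target AS t
USING (
  -- keep only the latest source row per id so each target row has at most one match
  SELECT id, amount, updated_at
  FROM (
    SELECT id, amount, updated_at,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM source
  ) ranked
  WHERE rn = 1
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, amount, updated_at) VALUES (s.id, s.amount, s.updated_at)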


Foreign Key Constraints and Blocking

Paul White takes a look at blocking due to foreign key checks:

This article covers one such consideration that does not receive much publicity: To minimize blocking, you should think carefully about the indexes used to enforce uniqueness on the parent side of those foreign key relationships.

This applies whether you are using locking read committed or the versioning-based read committed snapshot isolation (RCSI). Both can experience blocking when foreign key relationships are checked by the SQL Server engine.

Under snapshot isolation (SI), there is an extra caveat. The same essential issue can lead to unexpected (and arguably illogical) transaction failures due to apparent update conflicts.

This article is in two parts. The first part looks at foreign key blocking under locking read committed and read committed snapshot isolation. The second part covers related update conflicts under snapshot isolation.

Definitely worth reading the whole thing.
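
If you want a quick way to see the blocking for yourself, here is a minimal, hypothetical repro sketch; run the two batches in separate sessions, and the child insert waits on the foreign key check against the locked parent row even with READ_COMMITTED_SNAPSHOT enabled for the database:

-- Setup (run once)
CREATE TABLE dbo.Parent
(
    ParentID int NOT NULL CONSTRAINT PK_Parent PRIMARY KEY,
    Padding  varchar(100) NOT NULL
);
CREATE TABLE dbo.Child
(
    ChildID  int IDENTITY PRIMARY KEY,
    ParentID int NOT NULL CONSTRAINT FK_Child_Parent REFERENCES dbo.Parent (ParentID)
);
INSERT dbo.Parent (ParentID, Padding) VALUES (1, 'x');

-- Session 1: modify the parent row and leave the transaction open
BEGIN TRANSACTION;
UPDATE dbo.Parent SET Padding = 'y' WHERE ParentID = 1;
-- (no COMMIT yet)

-- Session 2: blocks on the foreign key existence check
INSERT dbo.Child (ParentID) VALUES (1);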


Reasons to Switch to Power BI Gen2

Teo Lachev gives us five reasons to switch to Power BI Gen2:

More memory
Imported models are memory-resident, so memory is usually the most constraining factor. With Gen2, the capacity maximum memory applies to each resource itself and not collectively across all resources in the capacity. Let’s say you are on a P1 plan, which has a maximum memory capacity of 25GB. With Gen1, you won’t be able to have two datasets of, say, 20GB and 10GB loaded at the same time. However, Gen2 will apply the 25GB limit to each dataset. So each resource (dataset, report, dataflow) will be boxed within 25GB. This feat is possible because Gen2 uses a SaaS approach, which means datasets are scattered across multiple cluster nodes instead of being associated with a dedicated capacity. A potential downside, however, could be a “noisy neighbor” problem, because a P3 cluster node may co-host datasets from different customers.

Read on for the full list of reasons, as well as three things Teo would like to see improved in a future release.
