Measuring Correlation In SQL

Phil Factor shows how to calculate Kendall’s Tau and Spearman’s Rho in SQL:

Kendall’s Tau rank correlation is a handy way of determining how correlated two variables are, and whether this is more than chance. If you just want a measure of the correlation then you don’t have to assume very much about the distribution of the variables. Kendall’s Tau is popular for calculating correlations with non-parametric data. Spearman’s Rho is possibly more popular for the purpose, but Kendall’s Tau has a distribution with better statistical properties (the sample estimate is close to a population variance) so confidence levels are more reliable. In general, though, Kendall’s Tau and Spearman’s rank correlation coefficient are very similar. The obvious difference between them is that, for the standard method of calculation, Spearman’s rank correlation requires ranked data as input, whereas the algorithm to calculate Kendall’s Tau does this for you. Kendall’s Tau consumes any non-parametric data with equal relish.

Kendall’s Tau is easy to calculate on paper, and makes intuitive sense. It deals with the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs of rankings. All observations are paired with each of the others. A concordant pair is one whose members of one observation are both larger than their respective members of the other paired observation, whereas discordant pairs have numbers that differ in opposite directions. Kendall’s Tau-b takes tied rankings into account.
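To make the concordant/discordant idea concrete, here is a minimal T-SQL sketch of my own, not Phil’s code. It computes the simpler Tau-a variant, which makes no correction for ties, and the table variable `@Obs` and its sample values are made up purely for illustration.

```sql
-- Hypothetical sample data: one row per observation, two measured variables.
DECLARE @Obs TABLE (Id int PRIMARY KEY, X float NOT NULL, Y float NOT NULL);

INSERT INTO @Obs (Id, X, Y)
VALUES (1, 1, 2), (2, 2, 1), (3, 3, 4), (4, 4, 3), (5, 5, 5);

-- Pair every observation with each later one (a.Id < b.Id avoids double counting).
-- SIGN(dx) * SIGN(dy) is +1 for a concordant pair, -1 for a discordant pair,
-- and 0 when either variable is tied.
SELECT
    SUM(CASE WHEN SIGN(a.X - b.X) * SIGN(a.Y - b.Y) > 0 THEN 1 ELSE 0 END) AS Concordant,
    SUM(CASE WHEN SIGN(a.X - b.X) * SIGN(a.Y - b.Y) < 0 THEN 1 ELSE 0 END) AS Discordant,
    -- Tau-a = (concordant - discordant) / total number of pairs
    SUM(SIGN(a.X - b.X) * SIGN(a.Y - b.Y)) / CAST(COUNT(*) AS float) AS TauA
FROM @Obs AS a
JOIN @Obs AS b
    ON a.Id < b.Id;
```

With these five rows there are 10 pairs, 8 concordant and 2 discordant, so Tau-a comes out at 0.6. The self-join is quadratic in the number of rows, so treat this as a teaching aid rather than something to point at a large table.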

I appreciate Phil putting this series together.  I’d probably stick with R, but it’s good to have options.

Trick Co-Workers With This Extended Property

Kenneth Fisher shows how to use extended properties to hide a table from SQL Server Management Studio:

FYI I’ve tried this at the column and schema levels and it didn’t work.

Using this you can hide the object from SSMS object explorer without restricting its use in any way.

I’m curious if there are any other hidden uses of extended properties. I haven’t been able to find any documentation so if you’ve seen any please let me know!

I don’t think I’ve ever had cause to hide objects from Management Studio, but if you’re looking for next year’s April Fools prank, maybe?
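For the curious, here is a rough sketch of what such a call might look like. I am assuming the property in question is `microsoft_database_tools_support`, the one Management Studio uses to hide its own diagram support objects; the excerpt above does not name it, so check Kenneth’s post before relying on this. The table `dbo.HideMe` is hypothetical.

```sql
-- Assumption: microsoft_database_tools_support is the property that hides the table.
-- dbo.HideMe is a made-up table name for illustration.
EXEC sys.sp_addextendedproperty
    @name = N'microsoft_database_tools_support',
    @value = 1,
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'HideMe';

-- Dropping the property should make the table visible again.
EXEC sys.sp_dropextendedproperty
    @name = N'microsoft_database_tools_support',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'HideMe';
```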

Using OUTPUT To Get Change Counts

Manoj Pandey shows how to use the OUTPUT clause to determine the number of records inserted, updated, or deleted after a DML statement:

–> Question:

How can I get the number of records affected by the MERGE statement, for INSERT, UPDATE, and DELETE separately, and store it in a variable so I can get it on the application side?

Thanks!

–> My Answer:

You need to use the OUTPUT clause with the MERGE statement.

Click through for a code sample.  The OUTPUT clause also works for non-MERGE statements like INSERT, UPDATE, and DELETE, though the “get changes by type” problem is really limited to the MERGE statement.
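As a rough idea of the pattern (this is my own sketch, not Manoj’s sample), you can capture `$action` from the OUTPUT clause into a table variable and then count by action type. The table variables and their data below are hypothetical.

```sql
-- Hypothetical tables for illustration only.
DECLARE @Target  TABLE (Id int PRIMARY KEY, Amount int NOT NULL);
DECLARE @Source  TABLE (Id int PRIMARY KEY, Amount int NOT NULL);
DECLARE @Changes TABLE (MergeAction nvarchar(10) NOT NULL);

INSERT INTO @Target (Id, Amount) VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO @Source (Id, Amount) VALUES (2, 25), (3, 35), (4, 40);

MERGE @Target AS t
USING @Source AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Amount = s.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Amount) VALUES (s.Id, s.Amount)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
-- $action is 'INSERT', 'UPDATE', or 'DELETE' for each affected row.
OUTPUT $action INTO @Changes (MergeAction);

DECLARE @Inserted int, @Updated int, @Deleted int;

-- COUNT ignores NULLs, so each CASE counts only its own action type.
SELECT
    @Inserted = COUNT(CASE WHEN MergeAction = N'INSERT' THEN 1 END),
    @Updated  = COUNT(CASE WHEN MergeAction = N'UPDATE' THEN 1 END),
    @Deleted  = COUNT(CASE WHEN MergeAction = N'DELETE' THEN 1 END)
FROM @Changes;

SELECT @Inserted AS InsertedRows, @Updated AS UpdatedRows, @Deleted AS DeletedRows;
```

With this sample data the counts are 1 inserted, 2 updated, and 1 deleted. From there the variables could be returned to the application as output parameters or a result set, whichever suits the caller.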

Don’t Hard-Code Values

Kevin Feasel



Jana Sattainathan argues against hard-coding values in queries:

I have heard these arguments for writing this type of source code:

  • This is a one-time thing. We do not have the need to do it anywhere else

  • We are on a deadline

  • We do not have the ability to test if this was not done this way

  • My program is going away in a week

  • We do not have the time to correct this

  • I am just following the existing pattern

  • Unofficially (not) said – “This is my job security”

I’m with Jana in principle, but there are performance costs at the margin, making this less of a hard-and-fast rule than I’d like.

Random Number Lists

Kenneth Fisher shows one method of creating lists of random numbers:

I was working on a blog post this weekend that required a list of random numbers. Now, this isn’t exactly an uncommon task, but it’s not as easy as it would seem. The random number function RAND() has a few inherent flaws. And from what I can remember the random functions from most languages have the same issue.

First a few quotes from BOL about RAND()

Returns a pseudo-random float value from 0 through 1, exclusive.

Note: pseudo-random

If a seed is not specified, the SQL Server Database Engine assigns a seed value at random. For a specified seed value, the result returned is always the same.

If you don’t specify the seed it gets selected at random. But that’s only once per query run, not once per row of the output.

Read on for Kenneth’s solution.
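To see the once-per-query behavior for yourself, here is a small sketch. The second column uses one commonly seen workaround, seeding RAND() with CHECKSUM(NEWID()); I am not claiming this is the approach Kenneth lands on.

```sql
-- RAND() with no seed is evaluated once per query, so every row gets the same value.
-- NEWID() is evaluated per row, so seeding RAND() with its checksum varies by row.
SELECT
    v.n,
    RAND() AS SameForEveryRow,
    RAND(CHECKSUM(NEWID())) AS DiffersPerRow
FROM (VALUES (1), (2), (3), (4), (5)) AS v (n);
```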

T-SQL Variable Scoping

Bill Fellows points out an oddity of T-SQL variable declarations and scoping:

Crazy enough, the last two are correct. It seems that unlike every other language I’ve worked with, all variables are scoped to the same local scope regardless of where in the script they are defined. Demo the first

Wanna see something even more crazy? Check this version out

I should note that I wanted a submit button.
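Bill’s demos are in his post. As a separate, minimal illustration of the same batch-level scoping (my own sketch, with a made-up variable name), a variable declared inside a branch that never executes is still usable afterward:

```sql
IF 1 = 0
BEGIN
    -- This branch never runs, yet the declaration still applies to the whole batch.
    DECLARE @NeverAssigned int = 42;
END;

-- No "Must declare the scalar variable" error here.
-- The initializer never ran, so the value is NULL rather than 42.
SELECT @NeverAssigned AS NeverAssigned;
```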

Multi-Joins In SQL

Vladimir Oselsky hits on something I dislike:

My first gut reaction was that this code was broken and would not run. To my amazement, the code ran just fine. Now came the hard part, which was to figure out what the code was doing, because I had never seen this syntax before. Since I did not understand what I was looking at, I could not BING “weird join syntax” to get an answer. As a developer, I learned a long time ago to break down code into the smallest possible chunks to get the answer.

After I had figured out the relationship between the tables, I was able to understand what the query was doing. To make the query easier to read, it can be rewritten in the following way.

Do read Vlad’s post.  I’ve seen terrible misuse of this plus right outer joins, and I don’t know if I’ve ever seen a case where using this syntax made the code easier to understand.
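For anyone who has not run into it, I am assuming the syntax in question is the form where several joins are listed first and their ON clauses follow afterward. Here is a sketch with hypothetical table variables, followed by an equivalent that uses a derived table; neither query is taken from Vlad’s post.

```sql
-- Hypothetical tables for illustration only.
DECLARE @Customer  TABLE (CustomerId int PRIMARY KEY);
DECLARE @Orders    TABLE (OrderId int PRIMARY KEY, CustomerId int NOT NULL);
DECLARE @OrderLine TABLE (OrderId int NOT NULL, Sku varchar(20) NOT NULL);

INSERT INTO @Customer (CustomerId) VALUES (1), (2);
INSERT INTO @Orders (OrderId, CustomerId) VALUES (10, 1), (11, 2);
INSERT INTO @OrderLine (OrderId, Sku) VALUES (10, 'WIDGET');

-- Deferred ON clauses: the inner join between orders and lines is resolved first,
-- and its result is then left-joined to customers.
SELECT c.CustomerId, o.OrderId, ol.Sku
FROM @Customer AS c
LEFT JOIN @Orders AS o
    INNER JOIN @OrderLine AS ol
        ON ol.OrderId = o.OrderId
    ON o.CustomerId = c.CustomerId;

-- The same logic with the inner join made explicit as a derived table.
SELECT c.CustomerId, ool.OrderId, ool.Sku
FROM @Customer AS c
LEFT JOIN (
    SELECT o.CustomerId, o.OrderId, ol.Sku
    FROM @Orders AS o
    INNER JOIN @OrderLine AS ol
        ON ol.OrderId = o.OrderId
) AS ool
    ON ool.CustomerId = c.CustomerId;
```

Both queries keep customer 2 with NULLs in the order columns, because order 11 has no lines and drops out of the inner join before the left join happens. The derived table spells out that order of operations, which is the part the first form hides.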

Helpful Scripts

Rolf Tesmer has published a set of helpful scripts to his GitHub repo:

The below is a link to my GitHub repo of my personal collection of scripts.  BTW this isn’t all of the scripts, but is probably 50% of the ones I have lurking around.

These are either scripts I have developed in my own time over the years, or adapted from various other websites which I found handy at the time for whatever reason.

Where I have remembered the web site (which, like most people, is unfortunately almost never), it was typically a site like Stack Overflow, and if so I have tried to cite it to give credit.

Click through for the link to Rolf’s repo.

Frequency Tables

Mala Mahadevan shows how to generate a frequency table in T-SQL and in R:

My results are as below. I have 1000 records in the table. This tells me that I have 82 occurrences of age cohort 0-5, 8.2% of my dataset is from this bracket, 82 again is the cumulative frequency since this is the first record, and 8.2 is the cumulative percent. For the next bracket, 06-12, I have 175 occurrences, 17.5%, 257 occurrences of age 12 or below, and 25.7% of my data is in or below this age bracket. And so on.

Click through for the T-SQL and R scripts.
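If you want a feel for the shape of such a query before clicking through, here is a minimal sketch using window functions over a grouped result. It is not Mala’s script; the table variable `@People`, its column names, and its six rows are hypothetical, and it assumes SQL Server 2012 or later.

```sql
-- Hypothetical data: one row per person, already bucketed into an age cohort.
DECLARE @People TABLE (PersonId int IDENTITY PRIMARY KEY, AgeCohort varchar(10) NOT NULL);

INSERT INTO @People (AgeCohort)
VALUES ('00-05'), ('00-05'), ('06-12'), ('06-12'), ('06-12'), ('13-17');

SELECT
    AgeCohort,
    COUNT(*) AS Frequency,
    -- Share of all rows: group count divided by the grand total.
    CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS decimal(5, 1)) AS Pct,
    -- Running total of group counts in cohort order.
    SUM(COUNT(*)) OVER (ORDER BY AgeCohort ROWS UNBOUNDED PRECEDING) AS CumulativeFrequency,
    CAST(100.0 * SUM(COUNT(*)) OVER (ORDER BY AgeCohort ROWS UNBOUNDED PRECEDING)
         / SUM(COUNT(*)) OVER () AS decimal(5, 1)) AS CumulativePct
FROM @People
GROUP BY AgeCohort
ORDER BY AgeCohort;
```

The cohort labels are zero-padded so that ordering by the string also orders by age; with labels like 0-5 and 13-17 you would want a separate sort column.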

OBJECT_ID() In Cross-Server Queries

Denis Gobo ran into a problem with a linked server query he ran:

This past week I needed to run some queries on production to verify there were indexes added on a table. There were several scripts that needed to be run and the last one was the addition of the indexes.  The query given to me was something like the following

SELECT *
FROM LinkedServerName.DatabaseName.sys.indexes
WHERE object_id =(OBJECT_ID('TableName'))

So I ran the query..nothing. Aha maybe they are still running the scripts before that, setting up replication, snapshotting the table etc etc. I will check again in a bit I thought.

Click through for the full reason and how to fix your code in this situation.
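The short version is that OBJECT_ID() is evaluated on the local server, so it returns either NULL or an ID that means nothing in the remote database. One way around it, which may or may not be the fix Denis describes, is to look the table up by name in the remote catalog views instead. The three names below are the placeholders from the quoted query.

```sql
-- Resolve the table by name on the linked server rather than calling OBJECT_ID() locally.
SELECT i.name AS IndexName, i.type_desc
FROM LinkedServerName.DatabaseName.sys.indexes AS i
JOIN LinkedServerName.DatabaseName.sys.tables AS t
    ON t.object_id = i.object_id
WHERE t.name = N'TableName';
```

If the same table name exists in more than one schema, the join would also need a schema filter to narrow it down.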

