Category: T-SQL

Breaking Up Queries with UNION ALL

Bert Wagner takes us through a scenario where it can be faster to combine queries with UNION ALL rather than using IN:

Even though this query reads the whole clustered index to get the Benefactor rows, the total number of logical reads is still smaller than the seek/key lookup pattern seen in the combined query with IN(). This UNION ALL version gives SQL Server the ability to build a hybrid execution plan, combining two different techniques to generate a plan with fewer overall reads.

Click through for the example.
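To get a feel for the shape of the rewrite (the table and column names here are made up, not Bert's), the idea is roughly this:

-- Hypothetical example. The IN() version forces a single plan shape for both values:
SELECT DonationID, DonorType, Amount
FROM dbo.Donations
WHERE DonorType IN ('Benefactor', 'Member');

-- The UNION ALL version lets the optimizer choose a different access path per branch,
-- for example a scan for the common value and a seek plus key lookup for the selective one:
SELECT DonationID, DonorType, Amount
FROM dbo.Donations
WHERE DonorType = 'Benefactor'
UNION ALL
SELECT DonationID, DonorType, Amount
FROM dbo.Donations
WHERE DonorType = 'Member';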

The Importance of Aliasing in Subqueries

Gail Shaw explains an unexpected result when writing a statement with a subquery:

The column name in the temp table is missing an I, probably just a typo, but it has some rather pronounced effects.

The obvious next question is why the select with the subquery in it didn’t fail; after all, the query asks for ClientID from #TempClients, and there’s no such column. However, there is a ClientID column available in that query, and it’s in the Orders table. And that’s a valid column for the subquery, because column binding order, when we have subqueries, is first to tables within the subquery and then, if no match is found, to tables in the outer query.

I know that the first time I experienced this, I thought it was a bug as well. But no, it’s by design and Gail explains why.
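Here’s a contrived sketch of the behavior Gail describes; the dbo.Orders table is hypothetical, and the temp table column carries the same missing-letter typo:

CREATE TABLE #TempClients (ClentID INT);   -- note the missing I
INSERT INTO #TempClients (ClentID) VALUES (42);

-- This does not fail. ClientID in the subquery binds to Orders.ClientID in the outer
-- query, so the filter effectively becomes WHERE o.ClientID = o.ClientID and every
-- order with a non-NULL ClientID comes back (as long as #TempClients has any rows).
SELECT o.OrderID
FROM dbo.Orders AS o
WHERE o.ClientID IN (SELECT ClientID FROM #TempClients);

-- Qualify the column with the subquery's alias and the typo surfaces immediately.
SELECT o.OrderID
FROM dbo.Orders AS o
WHERE o.ClientID IN (SELECT tc.ClientID FROM #TempClients AS tc);   -- error: invalid column name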

Multi-Pattern Replacement with SQL Server

Hugo Kornelis has a pattern matching problem to solve:

The actual use case and the list of patterns that I had to remove are considered a confidential competitive advantage by my client, so I will just make up a list here to illustrate the problem. In this fake requirement, the following patterns must all be removed if they appear anywhere in the string: “-@@%”, “@@%”, “-@%”, “@%”, “No.X#X”, and “^^^”. Also, “@X” needs to be removed when it is at the end of the string. In these patterns, @ is a placeholder for a numeric symbol, X for an alphabetic symbol, and # for an alphabetic or numeric symbol. All other symbols are to be taken literally. (So “%” is the percent mark, not the LIKE pattern for any character!)

This is a problem for regular expressions, but without built-in regular expressions (and I’d guess no desire to use the CLR), Hugo gives us a workable solution.
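Just to illustrate the kind of machinery involved (this is not Hugo's solution, merely a one-pattern sketch), removing the “-@@%” pattern with PATINDEX might look like this:

-- "-@@%" means dash, two digits, then a literal percent sign, always 4 characters long.
-- In PATINDEX/LIKE syntax that is '-[0-9][0-9][%]', with [%] escaping the percent sign.
DECLARE @s VARCHAR(100) = 'Product A-12% and B-34% remain',
        @i INT;

SET @i = PATINDEX('%-[0-9][0-9][%]%', @s);
WHILE @i > 0
BEGIN
    SET @s = STUFF(@s, @i, 4, '');                -- remove the 4-character match
    SET @i = PATINDEX('%-[0-9][0-9][%]%', @s);    -- look for the next occurrence
END;

SELECT @s;   -- Product A and B remain

Multiply that by a list of patterns of varying lengths and you can see why Hugo's post is worth a read.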

Forensic Accounting: Cohort Analysis

I continue my series on forensic accounting techniques with cohort analysis:

In the last post, we focused on high-level aggregates to gain a basic understanding of our data. We saw some suspicious results but couldn’t say much more than “This looks weird” due to our level of aggregation. In this post, I want to dig into data at a lower level of detail. My working conception is the cohort, a broad-based comparison of data sliced by some business-relevant or analysis-relevant component.

Those familiar with Kimball-style data warehousing already understand where I’m going with this. In the basic analysis, we essentially look at fact data with a little bit of disaggregation, such as looking at data by year. In this analysis, we introduce dimensions (sort of) and slice our data by dimensions.

Click through for some fraud-finding fun.
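If you want a picture of what slicing by a cohort looks like in T-SQL, a rough sketch with made-up table and column names would be something like this:

-- Hypothetical sketch: compare expense totals by vendor and year instead of one
-- top-level aggregate, so an outlier vendor-year combination stands out.
SELECT
    v.VendorName,
    YEAR(e.ExpenseDate) AS ExpenseYear,
    COUNT(*)            AS NumberOfInvoices,
    SUM(e.Amount)       AS TotalAmount
FROM dbo.Expenses AS e
    INNER JOIN dbo.Vendors AS v
        ON v.VendorID = e.VendorID
GROUP BY
    v.VendorName,
    YEAR(e.ExpenseDate)
ORDER BY
    v.VendorName,
    ExpenseYear;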

Basic Forensic Accounting Techniques

I continue my series on forensic accounting techniques:

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.

For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

I get into quite a bit in this post, including a quick tour of multicollinearity, which is only my second-favorite of the three linear regression amigos (heteroskedasticity being my favorite and autocorrelation the hanger-on).
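The bus-count query itself isn't in the excerpt above; a hypothetical stand-in (the real table and column names differ) looks something like this:

-- Count the buses in service on January 1 of each year, using a calendar table.
-- dbo.Calendar and dbo.Bus are stand-ins here, not the actual objects from the post.
SELECT
    c.CalendarYear,
    COUNT(*) AS NumberOfBuses
FROM dbo.Calendar AS c
    INNER JOIN dbo.Bus AS b
        ON b.DateFirstInService <= c.[Date]
        AND ISNULL(b.DateRetired, '9999-12-31') > c.[Date]
WHERE c.[Date] = DATEFROMPARTS(c.CalendarYear, 1, 1)
GROUP BY c.CalendarYear
ORDER BY c.CalendarYear;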

Rollback’s Effect on Identity Columns

Adrian Buckman explains that rolling back an insert still burns the identity values it used:

As I say – this is just what I have seen people do, and it was only the other day that I saw a similar situation, but with an insert instead. The user believed that because the changes were made within a transaction, this would roll back EVERYTHING; however, they did not consider the impact on the identity column of the table they made the insert into.

Here is an example to demonstrate how a rollback on an insert will not roll back the identity seed on your table.

Click through for the demo. Sequences behave the same way in practice: once you pull that next sequence ticket, you can’t put it back into the machine just by rolling back the transaction. That’s why identity columns and sequences aren’t good for situations where you absolutely need contiguous values, such as invoice numbers or check numbers.
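Adrian's demo is at the link; a stripped-down version of the idea (the table is mine) looks like this:

CREATE TABLE dbo.IdentityTest
(
    Id  INT IDENTITY(1, 1) NOT NULL,
    Val VARCHAR(20) NOT NULL
);

INSERT INTO dbo.IdentityTest (Val) VALUES ('first');       -- Id = 1

BEGIN TRANSACTION;
    INSERT INTO dbo.IdentityTest (Val) VALUES ('second');  -- Id = 2, burned on rollback
ROLLBACK TRANSACTION;

INSERT INTO dbo.IdentityTest (Val) VALUES ('third');       -- Id = 3, not 2

SELECT Id, Val FROM dbo.IdentityTest;   -- rows 1 and 3; the gap stays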

Matrix Operations with JSON

Phil Factor takes a look at using JSON to perform memoization:

For the SQL Server developer, matrices are probably most valuable for solving more complex string-searching problems, using Dynamic Programming. Once you get into the mindset of this sort of technique, a number of seemingly intractable problems become easier. Here are fifty common data structure problems that can be solved using Dynamic Programming. Until SQL Server 2017, these were hard to do in SQL because of the lack of support for this style of programming. Memoization, one of the principles behind the technique, is easy to do in SQL, but it is very tricky to convert existing procedural algorithms to use table variables. It is usually easier and quicker to use strings as pseudo-variables, as I did with Edit Distance and the Levenshtein algorithm, the longest common subsequence, and the Longest Common Substring. The problem with doing this is that the code to fetch the array values can be very difficult to decipher or debug. JSON can do it very easily with path array references.

The results aren’t fantastic, but the code is at least easier to follow.
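Here's a minimal sketch of the path-based access Phil describes; his post applies it inside dynamic programming algorithms, but the core trick is just this:

-- Keep a matrix as a JSON array of arrays and address cells by path.
DECLARE @matrix NVARCHAR(MAX) = N'[[1, 2, 3], [4, 5, 6], [7, 8, 9]]';

-- Read the cell at row 1, column 2 (zero-based): returns 6.
SELECT JSON_VALUE(@matrix, '$[1][2]') AS CellValue;

-- Write a new value into the same cell.
SET @matrix = JSON_MODIFY(@matrix, '$[1][2]', 60);

SELECT @matrix AS UpdatedMatrix;   -- [[1, 2, 3], [4, 5, 60], [7, 8, 9]]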

Generating Sketchy Data

I have a post on building up a data set for my forensic accounting series:

This is where stuff gets crazy. First, I created a table named #ValuePerCategory, which has the mean price and the price standard deviation for each expense category. To get this information, I trawled through the catalog and picked reasonable-enough values for each of the categories. This is my level of commitment to getting things right(ish). The standard deviations, though, I just made up. I didn’t look at huge numbers of products and calculate these values myself. That’s the limit of my commitment to excellence and why I don’t have a giant banner on my stadium.

It’s also why John Madden never coached me.
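For a sense of the shape of that table, here is a sketch; the categories, means, and standard deviations below are invented for illustration, just as mine were:

CREATE TABLE #ValuePerCategory
(
    ExpenseCategory VARCHAR(50)    NOT NULL,
    MeanPrice       DECIMAL(10, 2) NOT NULL,
    PriceStdDev     DECIMAL(10, 2) NOT NULL
);

INSERT INTO #ValuePerCategory (ExpenseCategory, MeanPrice, PriceStdDev)
VALUES
    ('Office Supplies',      45.00,  15.00),
    ('Vehicle Maintenance', 350.00, 120.00),
    ('Catering',            220.00,  80.00);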

Finding High-Cardinality Columns

Constantine Kokkinos shows how you can find the cardinality of each column in a SQL Server table:

Today I was diving into some extremely wide tables, and I wanted to take a quick look at things like “How many unique values does this table have in every column?”.

This can be super useful if you have a spreadsheet of results or a schema without effective normalization and you want to determine which columns are the “most unique” – or have high cardinality.

The GitHub gist is embedded at the bottom of the page, but I will run you through the code in case you want an explanation of how it works.

Click through for the script.
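Constantine's script is at the link; as a rough sketch of the general approach (not his exact code), you can build one COUNT(DISTINCT) per column with dynamic SQL. This assumes SQL Server 2017 or later for STRING_AGG, and dbo.SomeWideTable is a placeholder:

DECLARE @TableName SYSNAME = N'dbo.SomeWideTable',   -- placeholder table name
        @sql       NVARCHAR(MAX);

-- Build one COUNT(DISTINCT <column>) per column of the target table.
SELECT @sql = N'SELECT '
            + STRING_AGG(
                  CAST(N'COUNT(DISTINCT ' + QUOTENAME(c.name) + N') AS ' + QUOTENAME(c.name) AS NVARCHAR(MAX)),
                  N', ')
            + N' FROM ' + @TableName + N';'
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(@TableName);

EXEC sys.sp_executesql @sql;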
