T-SQL – Page 74 – Curated SQL

Basic Forensic Accounting Techniques

Published 2019-04-17 by Kevin Feasel

I continue my series on forensic accounting techniques:

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.
For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

I get into quite a bit in this post, including a quick tour of multicollinearity, which is only my second-favorite of the three linear regression amigos (heteroskedasticity being my favorite and autocorrelation the hanger-on).

Rollback’s Effect on Identity Columns

Published 2019-04-12 by Kevin Feasel

Adrian Buckman explains that rolling back an insert still burns the identity values it consumed:

As I say, this is just what I have seen people do, and it was only the other day when I saw a similar situation but with an insert instead. The user believed that because the changes were made within a transaction this would roll back EVERYTHING; however, they did not consider the impact on the identity column on the table they made the insert in.
Here is an example to demonstrate how a rollback on an insert will not roll back your identity seed on your table.

Click through for the demo. In practice, sequences behave the same way: once you pull that next sequence ticket, you can’t put it back into the machine just by rolling back the transaction. That’s why identity columns and sequences aren’t good for situations where you absolutely need contiguous values, such as invoice numbers or check numbers.
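
For a quick feel for the behavior without leaving this page, here is a minimal sketch. It is not Adrian's demo: the temp table and its columns are made up for illustration.

```sql
-- Hypothetical temp table for illustration; not taken from the linked post.
CREATE TABLE #Invoice
(
    InvoiceID INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    Amount DECIMAL(10, 2) NOT NULL
);

BEGIN TRANSACTION;
    -- This insert consumes identity value 1.
    INSERT INTO #Invoice (Amount) VALUES (10.00);
-- The row is undone, but the identity value is not handed back.
ROLLBACK TRANSACTION;

-- This insert receives identity value 2, leaving a gap at 1.
INSERT INTO #Invoice (Amount) VALUES (20.00);

SELECT InvoiceID, Amount
FROM #Invoice;

DROP TABLE #Invoice;
```

The final SELECT returns a single row whose InvoiceID is 2, which is the gap you would not want in an invoice or check number.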

Matrix Operations with JSON

Published 2019-04-10 by Kevin Feasel

Phil Factor takes a look at using JSON to perform memoization:

For the SQL Server developer, matrices are probably most valuable for solving more complex string-searching problems, using Dynamic Programming. Once you get into the mindset of this sort of technique, a number of seemingly-intractable problems become easier. Here are fifty common data structure problems that can be solved using Dynamic Programming. Until SQL Server 2017, these were hard to do in SQL because of the lack of support for this style of programming. Memoization, one of the principles behind the technique, is easy to do in SQL, but it is very tricky to convert existing procedural algorithms to use table variables. It is usually easier and quicker to use strings as pseudo-variables, as I did with Edit Distance and the Levenshtein algorithm, the longest common subsequence, and the Longest Common Substring. The problem with doing this is that the code to fetch the array values can be very difficult to decipher or debug. JSON can do it very easily with path array references.

The results aren’t fantastic but the code is easier at least.
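
As a rough illustration of the path array references mentioned in the excerpt (this is not Phil's code), a small matrix can be held as a JSON array of arrays and read or updated by zero-based index. The variable name and values are made up, and the sketch assumes SQL Server 2016 or later, where JSON_VALUE and JSON_MODIFY are available.

```sql
-- Hypothetical 2x3 matrix stored as a JSON array of arrays.
DECLARE @Matrix NVARCHAR(MAX) = N'[[1,2,3],[4,5,6]]';

-- Read row 1, column 2 (both zero-based): the value 6, returned as a string.
SELECT JSON_VALUE(@Matrix, '$[1][2]') AS Cell;

-- Overwrite row 0, column 1 with 9 and keep the updated document.
SET @Matrix = JSON_MODIFY(@Matrix, '$[0][1]', 9);

SELECT @Matrix AS UpdatedMatrix;
```

Compared with packing values into a delimited string, the path expression says which cell is being touched, which is the readability gain the excerpt describes.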

Generating Sketchy Data

Published 2019-04-10 by Kevin Feasel

I have a post on building up a data set for my forensic accounting series:

This is where stuff gets crazy. First, I created a table named #ValuePerCategory, which has the mean price and the price standard deviation for each expense category. To get this information, I trawled through the catalog and picked reasonable-enough values for each of the categories. This is my level of commitment to getting things right(ish). The standard deviations, though, I just made up. I didn’t look at huge numbers of products and calculate these values myself. That’s the limit of my commitment to excellence and why I don’t have a giant banner on my stadium.

It’s also why John Madden never coached me.

Finding High-Cardinality Columns

Published 2019-04-08 by Kevin Feasel

Constantine Kokkinos shows how you can find the cardinality of each column on a SQL table:

Today I was diving into some extremely wide tables, and I wanted to take a quick look at things like “How many unique values does this table have in every column?”.
This can be super useful if you have a spreadsheet of results or a schema without effective normalization and you want to determine which columns are the “most unique” – or have high cardinality.
The GitHub gist is embedded at the bottom of the page, but I will run you through the code in case you want an explanation of how it works.

Click through for the script.
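
The core idea can be sketched with COUNT(DISTINCT ...) per column. This is a hand-written version against a made-up temp table, not Constantine's script, which you would want for tables with many columns.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE #WideTable
(
    CustomerID INT NULL,
    Region VARCHAR(10) NULL,
    AccountStatus VARCHAR(10) NULL
);

INSERT INTO #WideTable (CustomerID, Region, AccountStatus)
VALUES (1, 'East', 'Active'),
       (2, 'East', 'Active'),
       (3, 'West', 'Closed'),
       (4, NULL, 'Active');

-- One distinct count per column, next to the total row count for comparison.
-- COUNT(DISTINCT ...) ignores NULLs, so the NULL Region is not counted as a value.
SELECT
    COUNT(*) AS TotalRows,
    COUNT(DISTINCT CustomerID) AS DistinctCustomerID,
    COUNT(DISTINCT Region) AS DistinctRegion,
    COUNT(DISTINCT AccountStatus) AS DistinctAccountStatus
FROM #WideTable;

DROP TABLE #WideTable;
```

A column whose distinct count is close to TotalRows has high cardinality; one with only a handful of values does not.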

Finding and Removing Bad Characters Using Tally Tables

Published 2019-04-05 by Kevin Feasel

Louis Davidson shows how you can use a tally table to find and remove invalid characters from strings:

Now, the idea is that we will join the Application.People table to the Numbers table for a number of rows. We will do this for all of the numbers that are from 1 to the length of the name. Then use that value to get the substring of the value for that 1 character. I also include the Unicode value in the output to allow for some case-sensitive operations, since UNICODE(‘a’) <> UNICODE(‘A’).

This is an example of how powerful tally tables can be.
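
The join described in the excerpt looks roughly like the following. This is a sketch under assumptions: the table variable stands in for Application.People, the sample names are invented, and the Numbers CTE is one common way to build a small tally on the fly rather than Louis's permanent Numbers table.

```sql
-- Hypothetical stand-in for Application.People.
DECLARE @People TABLE (PersonID INT NOT NULL, FullName NVARCHAR(50) NOT NULL);

INSERT INTO @People (PersonID, FullName)
VALUES (1, N'Ann'),
       (2, N'Bo#b');

-- A small tally of 1..50 built from a system view; 50 matches the column width.
WITH Numbers AS
(
    SELECT TOP (50) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
    FROM sys.all_objects
)
SELECT
    p.PersonID,
    n.I AS Position,
    SUBSTRING(p.FullName, n.I, 1) AS Letter,
    UNICODE(SUBSTRING(p.FullName, n.I, 1)) AS UnicodeValue
FROM @People p
    INNER JOIN Numbers n
        ON n.I <= LEN(p.FullName)   -- one row per character position
ORDER BY p.PersonID, n.I;
```

Each name fans out into one row per character, so a filter on Letter or UnicodeValue is enough to locate the characters you consider invalid.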

Approaches to Deleting Data in Batches

Published 2019-04-05 by Kevin Feasel

Andy Mallon shares a couple of approaches to deleting data in batches:

In this scenario, we’re going to keep the data for X days after it’s created. Then we delete it. That’s it. X could be 3 days or 3 years–it doesn’t matter, we’ll follow the same design pattern.
In today’s world, we generate loads of log data, sensor data, telemetry data, etc. All that data is super duper valuable. But only for a while. Eventually, all that granular data becomes less useful, and isn’t worth keeping around. Maybe it gets aggregated, summarized, or maybe it just gets thrown out.
You’ll have a lot of data with more complex requirements, but I think you’ll also be surprised at how much data has simple date-based retention based on its creation.

Also read the comments, as they include additional techniques.
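
For readers who have not written one of these before, here is a generic sketch of date-based retention with a batched delete loop. It is not necessarily either of Andy's approaches; the table, column names, retention period, and batch size are all assumptions.

```sql
-- Hypothetical log table with a few sample rows.
CREATE TABLE #EventLog
(
    EventID INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    CreatedOn DATETIME2(0) NOT NULL,
    Payload NVARCHAR(100) NOT NULL
);

INSERT INTO #EventLog (CreatedOn, Payload)
VALUES (DATEADD(DAY, -90, SYSUTCDATETIME()), N'old'),
       (DATEADD(DAY, -45, SYSUTCDATETIME()), N'old'),
       (DATEADD(DAY, -1, SYSUTCDATETIME()), N'recent');

DECLARE @RetentionDays INT = 30,   -- assumed retention period ("X days")
        @BatchSize INT = 5000,     -- assumed batch size; tune for your workload
        @RowsDeleted INT = 1;

DECLARE @Cutoff DATETIME2(0) = DATEADD(DAY, -@RetentionDays, SYSUTCDATETIME());

-- Delete in small batches until nothing older than the cutoff remains.
WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM #EventLog
    WHERE CreatedOn < @Cutoff;

    SET @RowsDeleted = @@ROWCOUNT;
END;

SELECT EventID, CreatedOn, Payload
FROM #EventLog;

DROP TABLE #EventLog;
```

Keeping each DELETE small limits how long locks are held and how much log each statement generates, which is the main reason to batch at all.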

Auto-Escaping XML Characters

Published 2019-03-27 by Kevin Feasel

Emanuele Meazzo shows how you can auto-escape XML characters using T-SQL:

Recently I had to look up the definition for a bunch of SQL objects and didn’t want to retrieve them manually in SSMS (with Create Scripts) or Visual Studio (by searching the object name in my TFS repository).
Since laziness and automation are the basis of well-done engineering work, I wanted to create a list where I could basically click on the object that I needed and see the definition right away, without any tool or having to code something externally, of course.

Click through for the solution, which is short and sweet.
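
To show the escaping behavior the title refers to, here is a minimal sketch that assumes the general FOR XML PATH technique rather than Emanuele's exact script. The variable and its contents are invented.

```sql
-- Hypothetical object definition containing XML-reserved characters.
DECLARE @Definition NVARCHAR(MAX) = N'IF @a < @b AND @c > 0 PRINT N''x & y'';';

-- Serialized as XML text, the reserved characters <, > and & come back as entities.
SELECT @Definition AS [text()]
FOR XML PATH('');

-- With TYPE the result is an xml value; .value() returns the original, unescaped text.
SELECT
(
    SELECT @Definition AS [text()]
    FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)') AS OriginalText;
```

The point is that SQL Server does the entity encoding for you, so there is no need for a chain of REPLACE calls.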

Finding Missing Values with Tally Tables

Published 2019-03-27 by Kevin Feasel

David Fowler shows one way to find missing values using a tally table:

This is going to be a bit of a brainstorming post that comes from an interesting question that I was asked today…
“I’ve got a table with an ID code field. Some of the rows have a value in that field and some are NULL. How can I go about filling in those NULL values with a valid code but at the same time avoid introducing duplicates?”

Click through for David’s solution.
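
One way to approach the question, sketched here under the assumption that codes are positive integers (David's post may take a different route): list the unused numbers from a tally, number them, number the NULL rows the same way, and pair them up. All names and data are made up.

```sql
-- Hypothetical table: some rows have a code, some do not.
DECLARE @Items TABLE (ItemID INT IDENTITY(1, 1) PRIMARY KEY, Code INT NULL);

INSERT INTO @Items (Code)
VALUES (1), (NULL), (3), (NULL), (6);

WITH Numbers AS
(
    -- Small on-the-fly tally of 1..100.
    SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
    FROM sys.all_objects
),
FreeCodes AS
(
    -- Tally values not already used as a code, numbered in ascending order.
    SELECT n.N AS Code, ROW_NUMBER() OVER (ORDER BY n.N) AS Seq
    FROM Numbers n
    WHERE NOT EXISTS (SELECT 1 FROM @Items i WHERE i.Code = n.N)
),
NullRows AS
(
    -- Rows that need a code, numbered the same way.
    SELECT ItemID, ROW_NUMBER() OVER (ORDER BY ItemID) AS Seq
    FROM @Items
    WHERE Code IS NULL
)
UPDATE i
SET Code = f.Code
FROM @Items i
    INNER JOIN NullRows r ON r.ItemID = i.ItemID
    INNER JOIN FreeCodes f ON f.Seq = r.Seq;   -- nth NULL row gets the nth free code

SELECT ItemID, Code
FROM @Items
ORDER BY ItemID;
```

With this sample data the two NULL rows pick up codes 2 and 4, the lowest unused values, so no duplicates are introduced.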

Defending ANSI-89 Syntax

Published 2019-03-26 by Kevin Feasel

Allan Hirt prefers ANSI-89 join syntax:

Pedro Lopes, who is now on the SQL Server dev team, wrote the blog post “T-SQL Misconceptions – JOIN ON vs. WHERE” which does a good job of explaining the “it depends” around the newer style of joins. Why link this article? Microsoft deprecated the OUTER JOIN operators in SQL Server 2008. Two other sources of information on this are here and here (the first is Ward Pond’s old technet blog, and sadly will probably go away soon). If you’re keeping score at home, WHERE clauses are not deprecated (except if you’re using *= and =*). The changes people made were wholly unnecessary and, speaking as the author, the newer stuff is harder to decipher than what I originally did. They were putting their own biases onto things.

I personally do not like ANSI-89 syntax because it blurs the lines between filters and join criteria. Those are separate things serving different purposes and keeping them separate clarifies queries more than it obscures. Allan’s example doesn’t have any filters but in a more complex scenario with several filters and several join criteria, it can require extra care figuring out what’s going on, especially with multi-column join criteria and filters mixed in (meaning it’s not join criteria and then filters, but a mishmash of the two).
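
To make the contrast concrete, here is the same query written both ways. The tables, columns, and data are hypothetical and are not taken from Allan's post.

```sql
-- Hypothetical tables for illustration.
DECLARE @Customer TABLE (CustomerID INT NOT NULL, Region CHAR(2) NOT NULL);
DECLARE @Orders TABLE (OrderID INT NOT NULL, CustomerID INT NOT NULL, Total DECIMAL(10, 2) NOT NULL);

INSERT INTO @Customer (CustomerID, Region) VALUES (1, 'NC'), (2, 'OH');
INSERT INTO @Orders (OrderID, CustomerID, Total) VALUES (10, 1, 250.00), (11, 1, 40.00), (12, 2, 500.00);

-- ANSI-89 style: join criteria and filters share the WHERE clause.
SELECT c.CustomerID, o.OrderID, o.Total
FROM @Customer c, @Orders o
WHERE o.Total > 100
    AND c.CustomerID = o.CustomerID
    AND c.Region = 'NC';

-- ANSI-92 style: join criteria live in ON, filters live in WHERE.
SELECT c.CustomerID, o.OrderID, o.Total
FROM @Customer c
    INNER JOIN @Orders o
        ON c.CustomerID = o.CustomerID
WHERE o.Total > 100
    AND c.Region = 'NC';
```

Both queries return the same rows. In the first, you have to read every predicate to work out which one is the join; in the second, the ON clause tells you.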

Category: T-SQL