Finding Different Casings In Case-Insensitive Columns

Kevin Feasel

2018-08-09

Data

Louis Davidson walks through different permutations of case in a data set:

The other day, I had a problem with some data that I never dreamed I would ever see. In a case insensitive database, in a table’s column that was case insensitive, the customer was using the data as case sensitive. Firstly, let’s just go ahead and say it. “This was a sucky implementation.” But as is common, in my typical role as a data architect in the data warehousing team, I get to learn all sorts of interesting techniques for finding and dealing with “data” that has been used in “interesting” ways.

What is kind of interesting is actually figuring out what that duplicated data was. The case that I was dealing with wasn’t a kind of useful packed surrogate value, where you may use a base 62 number, with a-z, A-Z and 0-9 as characters. So 1, 2, … , 9, 0, a, b, c, … x, y, z, A, B.. etc. 1A1 is a different value in that sequence than 1a1, and is greater . Neat technique, and one that I have been threatening to develop using a SEQUENCE object, where you can pack in a lot of sequential data in a small number of bytes. No, this wasn’t a useful case such as this, in this case, one value was lower case, another had leading capitals. So perhaps “active customer” and “Active Customer”. Yeah, seriously, they meant different things.

Louis shows some of the nuance required in making this work.

Related Posts

Building Test Data Following A Normal Distribution In T-SQL

I (finally) have a technical blog post: In order to show you the solution, I want to build up a reasonable sized sample.  Any solution looks great when reading five records, but let’s kick that up a notch.  Or, more specifically, a million notches:  I’m going to use a CTE tally table and load 5 […]

Read More

Kaggle-Maintained Data

Kevin Feasel

2018-11-09

Data

Noah Daniels announces Maintained by Kaggle data sets: The “Maintained by Kaggle” badge means that Kaggle is now and will continue to actively maintain that dataset. This includes regular updates to descriptions and metadata, quicker response rates in discussion, and accurate current data from the source. Our goal is to create seamless workflows that allow […]

Read More

Categories

August 2018
MTWTFSS
« Jul Sep »
 12345
6789101112
13141516171819
20212223242526
2728293031