Binary Collation Case-Sensitivity

Solomon Rutzky explains that binary collations are not really case-sensitive:

Quite often people will use, or will recommend using, a binary Collation (one ending in “_BIN” or “_BIN2”) when wanting to do a case-sensitive operation. While in many cases it appears to behave as expected, it is best to not use a binary Collation for this purpose. The problem with using binary Collations to achieve case-sensitivity is that they have no concept of linguistic rules and cannot equate different versions of characters that should be considered equal. And the reason why using a binary Collation often appears to work correctly is simply the result of working with a set of characters that has no accents or other versions. One such character set (a common one, hence the confusion), is US English (i.e. “A” – “Z” and “a” – “z”; values 65 – 90 and 97 – 122, respectively). However, there are a few areas where binary collations don’t behave as many (most, perhaps?) people expect them to.

Solomon gives examples of false negatives (such as the same character represented by different code point combinations) and also explains how sort order can change.
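
To make the false-negative and sort-order points concrete, here is a minimal sketch in Python rather than T-SQL. It is only an analogy: Python's plain string comparison works on code points, much as a binary comparison does, and it is not a reproduction of Solomon's SQL Server examples. The helper names `binary_equal` and `normalized_equal` are made up for illustration.

```python
import unicodedata

# Two ways to write the same visible text:
precomposed = "Jos\u00e9"   # "é" as a single code point (U+00E9)
decomposed = "Jose\u0301"   # "e" followed by a combining acute accent (U+0301)


def binary_equal(a: str, b: str) -> bool:
    """Compare code point by code point, with no linguistic rules applied."""
    return a == b


def normalized_equal(a: str, b: str) -> bool:
    """Compare after NFC normalization so canonically equivalent forms match."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Sorting by code point places every uppercase ASCII letter before any
# lowercase one, which is rarely the order a reader expects.
words = ["apple", "Zebra", "banana"]
code_point_order = sorted(words)
case_folded_order = sorted(words, key=str.casefold)

print(binary_equal(precomposed, decomposed))
print(normalized_equal(precomposed, decomposed))
print(code_point_order)
print(case_folded_order)
```

The two spellings of "José" look identical on screen but fail the code point comparison, which is the kind of false negative the post describes.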

Making Text Search Faster

Rob Farley looks at binary collations and, specifically, how they affect hash-based aggregation:

If there’s an index on the column you’re grouping, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the column is like throwing away the index order (ORDER BY doesn’t get handled well by changing the collation), which means a Hash is required. But comparing two query plans that both use Hash Match (Aggregate), one on a case-insensitive collation and one on a binary collation, I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better. One would run in about 12 seconds, and one in about 10.

Be sure to check out his comments for more details.
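
As a rough illustration of why a binary grouping key can be cheaper to hash, and why it can also change the result, here is a small Python sketch. It is an analogy only and says nothing about SQL Server internals or Rob's actual timings; the sample data and the function names `group_binary` and `group_case_insensitive` are invented for the example.

```python
from collections import Counter

# Hypothetical sample data with mixed casing.
names = ["Smith", "smith", "SMITH", "Jones", "jones"]


def group_binary(values):
    """Group on the raw string: no per-value transformation before hashing."""
    return Counter(values)


def group_case_insensitive(values):
    """Group on a case-folded key: extra work per value, fewer groups."""
    return Counter(v.casefold() for v in values)


print(group_binary(names))
print(group_case_insensitive(names))
```

The binary version skips the per-value folding step, but it also treats "Smith" and "smith" as separate groups, so the swap is only safe when the data's casing is already consistent or when those distinct groups are acceptable.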

Anglicize Values

Dave Mattingly shows an easy way to anglicize values:

If your customer’s name is “José” but you search for “Jose”, you won’t (by default) find him.

Here’s a simple way to take care of that in your SQL database, without changing the data that you have.

If a particular system only needs to support one language (e.g., English), this can be helpful, at least until somebody throws in Chinese or Hebrew characters. That said, supporting Unicode is the best move when available.
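
For readers who want to see the general idea outside the database, here is a minimal Python sketch of accent stripping. This is not Dave's method, which works inside SQL Server; it swaps in Unicode decomposition as a stand-in technique, and the function name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Compare a stored value against a search term without changing stored data.
stored = "José"
search = "Jose"
print(strip_accents(stored) == search)

# Characters with no decomposition pass through unchanged, so this does
# nothing to "anglicize" a name written in Chinese.
print(strip_accents("李"))
```

The second call shows the limitation noted above: the approach only helps for accented Latin letters, not for other scripts.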
