Category: Collations

Resolving Collation Conflicts

Published 2021-10-26 by Kevin Feasel

One possible cause of the “Cannot resolve the collation conflict” error message is that your database collation doesn’t match the TempDB Collation.
[…]
Msg 468, Level 16, State 9, Line 11
Cannot resolve the collation conflict between “SQL_Latin1_General_CP437_CI_AS” and “Latin1_General_CI_AS” in the equal to operation.

Read on to see an example of how you can cause this problem and what you can do about it. I agree with Steve’s general advice to keep collations the same across an instance if you can.

Comments closed

Case-Insensitive Collations in Redshift

Published 2021-07-09 by Kevin Feasel

Mengchu Cai, et al, show us how to change collation with Redshift:

Amazon Redshift is a fast, fully managed, cloud-native data warehouse. Tens of thousands of customers have successfully migrated their workloads to Amazon Redshift. We hear from customers that they need case-insensitive collation for strings in Amazon Redshift in order to maintain the same functionality and meet their performance goals when they migrate their existing workloads from legacy, on-premises data warehouses like Teradata, Oracle, or IBM. With that goal in mind, AWS provides an option to create case-insensitive and case-sensitive collation.
In this post, we discuss how to use case-insensitive collation and how to override the default collation. Also, we specifically explain the process to migrate your existing Teradata database using the native Amazon Redshift collation capability.

Specifically, it appears that they have two collations exposed: one which is case-sensitive and the other which is case-insensitive.

Comments closed

Complexities with Binary Collations

Published 2019-03-13 by Kevin Feasel

Solomon Rutzky takes us through the nuances of binary collations:

Still, there are some complexities related to binary collations that you might not be aware of. To figure out what they are, we need to look at why there are so many binary collations in the first place. I mean, binary collations work on the underlying values of the characters, and comparing numbers doesn’t change between cultures or versions: 12 = 12, 12 > 11, and 12 <13, always. So, then what is the difference between:
– Latin1_General_100_BIN2 and Hebrew_100_BIN2 (only the culture is different), or
– Latin1_General_100_BIN2 and Latin1_General_BIN2 (only the version is different), or
–Latin1_General_100_BIN2 and Latin1_General_100_BIN (only the binary comparison type is different)

Read on to find out.

Comments closed

Changing Server Collations En Masse

Published 2019-01-24 by Kevin Feasel

Hugo Kornelis has a script to change server collations across a large number of objects:

Problem is: the new instance was set up with a default collation that was different from the default collation of the old instance. And hence different from the database default collation. And when that resulted in errors, they responded by changing the default collation. Now everything ran and they were happy.
Fast forward two years. The database now has lots of “old” tables that use the original default collation for all string columns. But also lots of “new” tables that have their string columns using the new default collation. And the upgrade failed because somewhere in the new code is a comparison between two character columns that now have incompatible collation.

Click through for the script, as well as the standard disclaimer never blindly to run things in production.

Comments closed

The Hidden Performance Costs Of Collation Mismatch

Published 2018-07-09 by Kevin Feasel

Nate Johnson explains why you want collation consistency when joining tables together on varchar/nvarchar columns:

There’s a subtle difference here, vs. those many community blog posts, which I’ll repeat. The columns are of the same type. Just different collations.

And when the collation on the join predicates is different, bad things happen. Let’s take CustomerNumber for example. On the ERP side, it’s a nvarchar(20) collate Latin1_General_100_CI_AS. On the internal & web apps side, it’s a varchar(20) collateSQL_Latin1_General_CP1_CI_AS. As you might imagine, this is a prime field for joining because it’s the main customer identified throughout all the systems.

Let’s be clear here. This is a numeric value only. Did it need to support Unicode? Absolutely not. Should it have been an int or bigint? Probably. But did The ERP designers choose to make it Unicode string anyway? Yep.

Read on to see how Nate tries to dig himself out of this hole.

Comments closed

More Collation Comparison Fun

Published 2017-12-12 by Kevin Feasel

Solomon Rutzky wraps up his collation comparison series:

Just in case you were wondering: No, the test directly above test does not prove the documentation (as quoted at the top of Part A) correct. Yes, the documentation did state that characters would be converted to the Code Page specified by the Collation of the Database or Column, which does appear to be what is going on here. But, the differences are:

The documentation states that the transformation happens if you reference a Unicode datatype, but what we just saw in the most recent test is the exact opposite:
- only the NVARCHAR columns of the “Subscript 2” row match because they are still the “Subscript 2” character, while the NVARCHAR columns of the other two rows do not match due to being either “2” or “?”.
- transformation did occur in the Latin1 and Hebrew VARCHAR columns, which is how it matched both rows with “2” in the Latin1 column and both rows with “?” in the Hebrew column.
In the scenario involving another column where it would help to prefix the string literal with a capital-“N” (i.e. a VARCHAR column using a Collation that specifies a different Code Page than the Database’s Code Page), the Collation of the Database is not used for the transformation; it is only the referenced column’s Collation. The assumption here is that the string literal without the capital-“N” is being used in a Database where the Collation specifies a Code Page that has all of the characters.
In the scenarios where the Database’s Collation, via its specified Code Page, did transform a string literal that was not prefixed with a capital-“N” (the two tests in the previous post), there would have already been unintended behavior no matter how the string literal was used.

I still think it’s a duck.

Comments closed

Collations Used In NVarchar To Varchar Conversions

Published 2017-12-11 by Kevin Feasel

Solomon Rutzky digs into collations:

So, the actual question is a bit more specific than would reasonably fit into a title, and it is:

In a WHERE condition (in Microsoft SQL Server, if that’s not obvious), when a string constant containing Unicode characters but not prefixed with a capital-N is compared to a column, which Collation (and hence Code Page) is used to do the implicit conversion of the Unicode characters into VARCHAR? Is it the Database’s default Collation, or the column’s Collation?

The answer has a lot more nuance than you might first think. And for the record, it’s a duck.

Comments closed

Supplementary Characters In SQL Server 2017

Published 2017-10-17 by Kevin Feasel

Solomon Rutzky tests SQL Server 2017 collations to determine whether they support supplementary characters:

As you scroll through the list, does anything appear to be missing? Go ahead, take another look. I’ll wait :wink:. Figure it out? Yep, that’s right: none of those Collations end in “_SC“. Collations ending in “_SC” were added in SQL Server 2012 and support the full UTF-16 character set. Without the “_SC” ending built-in functions and comparisons / sorting only fully support the base UCS-2 character set (i.e. the first 65,536 Unicode Code Points). You can, of course, store and view all Unicode Code Points, even Supplementary Characters, in non-“_SC” Collations, but they will be interpreted as being two separate “unknown” characters instead of as a single character.

To illustrate this point, the following query shows: the two surrogate Code Points (not actual characters by themselves, but when used in pairs of any of the assigned combinations, produce a single character), the resulting Supplementary Character, and how the LEN function interprets that sequence in both non-“_SC” and “_SC” Collations.

Read on for Solomon’s testing, which includes a cat face emoji. I know that I, for one, thought there was insufficient cat face emoji representation in SQL Server prior to 2017.

Comments closed

Binary Collation Case-Sensitivity

Published 2017-07-27 by Kevin Feasel

Solomon Rutzky explains that binary collations are not really case-sensitive:

Quite often people will use, or will recommend using, a binary Collation (one ending in “_BIN” or “_BIN2“) when wanting to do a case-sensitive operation. While in many cases it appears to behave as expected, it is best to not use a binary Collation for this purpose. The problem with using binary Collations to achieve case-sensitivity is that they have no concept of linguistic rules and cannot equate different versions of characters that should be considered equal. And the reason why using a binary Collation often appears to work correctly is simply the result of working with a set of characters that has no accents or other versions. One such character set (a common one, hence the confusion), is US English (i.e. “A” – “Z” and “a” – “z”; values 65 – 90 and 97 – 122, respectively). However, there are a few areas where binary collations don’t behave as many (most, perhaps?) people expect them to.

Solomon gives examples of false negatives (such as the same character represented by different code point combinations) and also explains how sort order can change.

Comments closed

Making Text Search Faster

Published 2016-03-08 by Kevin Feasel

Rob Farley looks at binary collations and specifically hash functions on aggregation:

If there’s an index on the column you’re grouping, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the column is like throwing it away the index order (ORDER BY doesn’t get handled well by changing the collation) means a Hash is required. But comparing two query plans that both use Hash Match (Aggregate), one on a case-insensitive collation and one on a binary collation, then I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better. One would run in about 12 seconds, and one in about 10.

Be sure to check out his comments for more details.

Comments closed