Press "Enter" to skip to content

SQL Server and Bytes Per Character

Solomon Rutzky explains that just because you’ve got a VARCHAR column, it’s not necessarily one byte per character:

For VARCHAR, some of you might be thinking that it was “1” until recently when SQL Server 2019 introduced the “_UTF8” collations. Nope. The last time “1” was correct for VARCHAR was back in SQL Server 7.0, before SQL Server 2000 introduced the Windows collations which offered some Double-Byte Character Sets.

For NVARCHAR, some of you might be thinking that it was “2” until SQL Server 2012 introduced the “_SC” collations that fully support Supplementary Characters (UTF-16). Sorry, still incorrect. “2” was never technically correct for NVARCHAR, it was only temporarily correct for the first few years (until Supplementary Characters were defined in Unicode 3.1, released in March, 2001). Ever since SQL Server 7.0 introduced the NCHARNVARCHAR, and NTEXT datatypes, it has been possible to store whatever UTF-16 byte sequences you want, even if they are currently undefined. The older collations do not recognize surrogate pairs / Supplementary Characters, but that’s not related to SQL Server’s ability to store and retrieve any 16-bit code point. As long as you are using a font that supports Supplementary Characters, they should display correctly.

Solomon is one of a handful of people I’ve met who has collations and characters down cold.