Solomon Rutzky explains that just because you’ve got a VARCHAR
column, it’s not necessarily one byte per character:
For VARCHAR
, some of you might be thinking that it was “1” until recently when SQL Server 2019 introduced the “_UTF8
” collations. Nope. The last time “1” was correct for VARCHAR
was back in SQL Server 7.0, before SQL Server 2000 introduced the Windows collations which offered some Double-Byte Character Sets.
For NVARCHAR
, some of you might be thinking that it was “2” until SQL Server 2012 introduced the “_SC
” collations that fully support Supplementary Characters (UTF-16). Sorry, still incorrect. “2” was never technically correct for NVARCHAR
, it was only temporarily correct for the first few years (until Supplementary Characters were defined in Unicode 3.1, released in March, 2001). Ever since SQL Server 7.0 introduced the NCHAR
, NVARCHAR
, and NTEXT
datatypes, it has been possible to store whatever UTF-16 byte sequences you want, even if they are currently undefined. The older collations do not recognize surrogate pairs / Supplementary Characters, but that’s not related to SQL Server’s ability to store and retrieve any 16-bit code point. As long as you are using a font that supports Supplementary Characters, they should display correctly.
Solomon is one of a handful of people I’ve met who has collations and characters down cold.