The tricky part is: data science application scale easily has hundreds of string valued variables, each having hundreds of thousands of tracked values. The possibility of a large number of variable values or level renders the CASE/WHEN solution undesirable- as the query size is proportional to the number variables and values. The JOIN solutions build a query size proportional to the number of variables (again undesirable, but tolerable). However, super deeply nested queries are just not what relational databases expect. And a sequence of updates isn’t easy to support as a single query or view.
As an example of remapping, John shows translating “a” in a column to 1, “b” to 2, “d” to 3, etc.—that is, perhaps mapping each unique string to a unique number.