Empty Strings vs NULL – Curated SQL

Empty Strings Aren’t Better Than NULLs
Prove me wrong.

Click through for commenters’ perspectives.

I’m not at the C.J. Date level of “NULLs are a mockery of the entire relational system and should be burned to the ground and that ground be salted and that salt be burned, just in case” but I do have my sympathies there.

An empty string has meaning: the value of a given attribute in this tuple is an empty string. NULL has no value and therefore violates first normal form (which, unlike my joking histrionics above, is an argument Date lays out in far too much detail for me to include here).

So what if you don’t know the value of a thing? Then the answer is, don’t store it! If there are certain attributes which may be missing at insertion time and are nonetheless relevant to maintain, use 6th Normal Form for each of those attributes. Then, the existence of a record indicates that it has a value and the lack of existence indicates that there is no value as of this time. This is quite different from NULL, about which we can only say “I don’t know if there is a value at this time, but if you’d like, I can do weird things with some of your queries if you happen to forget about this non-information.”

Kevin Feasel 2021-06-08

Hmm, the issue I see there is that “This has an unknown value” is itself not a fact. Which is why something like `WHERE Col1 = NULL` doesn’t work in SQL. There’s a whole epistemic rabbit hole I’m trying my hardest not to go down.

But to go a little bit further into my answer, my official stance is:
1. I understand and sympathize with people who are anti-NULL. If null references are the billion-dollar mistake (https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/), NULL in databases isn’t that far behind.
2. When designing a new data model, I try my hardest to avoid NULL when I can. But:
3. There are times when it is practical to include NULL, such as when working with systems which send data to you.
4. This is similar to the boundary model when working with a functional programming language versus other types: within the boundary of my F# (or Scala or whatever), I can happily write code without worrying about such things. But once I get to the boundaries of code, where there is interaction with outside forces, I need to watch out for barbarians at the gate.

So to directly answer your question of “What if I need to store that I lack information on a thing?” I’d turn it around by asking, “What would you store if you had no information at all?” The answer there is that you wouldn’t store anything–that’s why we don’t (in healthy systems) have rows with all NULLs. So you’re definitely storing something but want to indicate that some element is missing–late-arriving data, perhaps.

Suppose we have a table dbo.Pet with attributes Name, Color, Type (dog, cat, goldfish, etc.), Breed, and FavoriteBedToSleepOn. All of our pets will have names and let’s say that names are a candidate key. All pets have colors and types and we know what they are. All pets have breeds, but sometimes we don’t know what the breed is. Not all pets can have favorite beds to sleep on: sometimes we don’t know which bed they sleep on, and for pets like goldfish, it’s impossible for them to sleep on a bed.

So how do we handle this in a world without NULLs? In that case, we go to 6NF for potentially-missing data. This gives us dbo.Pet with Name, Color, and Type.

Then, we create a dbo.PetFavoriteBedtoSleepOn with Name (the candidate key) and FavoriteBedToSleepOn. If the pet has a favorite bed upon which to sleep, we insert a row; otherwise, we don’t. This gives you the information you need without violating 1NF (which requires that each tuple be made up of *values* and NULL is not a value).

So what do we do with Breed? We *might* put Breed in dbo.Pet if our “Unknown” breed is an actual string (or a key to another table with an “Unknown Breed” value). That can make sense because we do have factual information: we do not know what the breed of this pet is. In that case, we’re saying that’s different from NULL, in that NULL represents the lack of data. With NULL, we’re saying “We might actually know what the breed is, but our system is not aware of it.” The “true” value of that information is not NULL; it’s something else. But the true value of “Unknown breed” is that we are the arbiters of breed and cannot determine with sufficient likelihood the correct answer. It may feel like splitting hairs, and if so, that’s fine: just create dbo.PetBreed with Name and Breed; then, if you don’t know the breed, you don’t have a row in the table.

When retrieving data, you can include the idea of NULL on return if your app really wants it, or have different queries based on whether you need information on favorite beds. At that point, we aren’t storing NULLs in the database, and that’s where the pragmatist in me ends up convincing the anti-NULL zealot to relax a bit.

3 Comments

George Walkey 2021-06-08

“So what if you don’t know the value of a thing? Then the answer is, don’t store it!”
Kevin, but what IF we need to know that fact itself?
Null to the rescue
- Kevin Feasel 2021-06-08
  
  Hmm, the issue I see there is that “This has an unknown value” is itself not a fact. Which is why something like `WHERE Col1 = NULL` doesn’t work in SQL. There’s a whole epistemic rabbit hole I’m trying my hardest not to go down.
  
  But to go a little bit further into my answer, my official stance is:
  1. I understand and sympathize with people who are anti-NULL. If null references are the billion-dollar mistake (https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/), NULL in databases isn’t that far behind.
  2. When designing a new data model, I try my hardest to avoid NULL when I can. But:
  3. There are times when it is practical to include NULL, such as when working with systems which send data to you.
  4. This is similar to the boundary model when working with a functional programming language versus other types: within the boundary of my F# (or Scala or whatever), I can happily write code without worrying about such things. But once I get to the boundaries of code, where there is interaction with outside forces, I need to watch out for barbarians at the gate.
  
  So to directly answer your question of “What if I need to store that I lack information on a thing?” I’d turn it around by asking, “What would you store if you had no information at all?” The answer there is that you wouldn’t store anything–that’s why we don’t (in healthy systems) have rows with all NULLs. So you’re definitely storing something but want to indicate that some element is missing–late-arriving data, perhaps.
  
  Suppose we have a table dbo.Pet with attributes Name, Color, Type (dog, cat, goldfish, etc.), Breed, and FavoriteBedToSleepOn. All of our pets will have names and let’s say that names are a candidate key. All pets have colors and types and we know what they are. All pets have breeds, but sometimes we don’t know what the breed is. Not all pets can have favorite beds to sleep on: sometimes we don’t know which bed they sleep on, and for pets like goldfish, it’s impossible for them to sleep on a bed.
  
  So how do we handle this in a world without NULLs? In that case, we go to 6NF for potentially-missing data. This gives us dbo.Pet with Name, Color, and Type.
  
  Then, we create a dbo.PetFavoriteBedtoSleepOn with Name (the candidate key) and FavoriteBedToSleepOn. If the pet has a favorite bed upon which to sleep, we insert a row; otherwise, we don’t. This gives you the information you need without violating 1NF (which requires that each tuple be made up of *values* and NULL is not a value).
  
  So what do we do with Breed? We *might* put Breed in dbo.Pet if our “Unknown” breed is an actual string (or a key to another table with an “Unknown Breed” value). That can make sense because we do have factual information: we do not know what the breed of this pet is. In that case, we’re saying that’s different from NULL, in that NULL represents the lack of data. With NULL, we’re saying “We might actually know what the breed is, but our system is not aware of it.” The “true” value of that information is not NULL; it’s something else. But the true value of “Unknown breed” is that we are the arbiters of breed and cannot determine with sufficient likelihood the correct answer. It may feel like splitting hairs, and if so, that’s fine: just create dbo.PetBreed with Name and Breed; then, if you don’t know the breed, you don’t have a row in the table.
  
  When retrieving data, you can include the idea of NULL on return if your app really wants it, or have different queries based on whether you need information on favorite beds. At that point, we aren’t storing NULLs in the database, and that’s where the pragmatist in me ends up convincing the anti-NULL zealot to relax a bit.
George Walkey 2021-06-08

Totally agree, relaxing is good
Hard to pick “unknown” from a PBI Slicer
but remember the Unknown member from SSAS Cube Days

Comments are closed.

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30