Next, we want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “how often is my data NULL”? In many cases, some level of incomplete data is acceptable, but if a 10% null rate turns into 90%, we’ll want to know.
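To make the null-rate idea concrete, here is a minimal sketch in Python. It is not the approach from the linked article, just an illustration: the `day` and `distance` field names, the helper functions, and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict


def null_rate_by_day(rows, field):
    """Return {day: fraction of rows on that day where `field` is None}."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        day = row["day"]  # hypothetical date column
        totals[day] += 1
        if row.get(field) is None:
            nulls[day] += 1
    return {day: nulls[day] / totals[day] for day in totals}


def flag_null_spikes(rates, threshold=0.5):
    """Return the sorted days whose null rate exceeds `threshold` (an assumed cutoff)."""
    return sorted(day for day, rate in rates.items() if rate > threshold)


# Made-up sample: day one has 1 null in 10 rows, day two has 9 nulls in 10 rows.
rows = (
    [{"day": "2020-01-01", "distance": 1.0}] * 9
    + [{"day": "2020-01-01", "distance": None}]
    + [{"day": "2020-01-02", "distance": 1.0}]
    + [{"day": "2020-01-02", "distance": None}] * 9
)

rates = null_rate_by_day(rows, "distance")
spikes = flag_null_spikes(rates)
```

A fixed threshold is the simplest choice here. In practice you would probably compare each day against its own recent history, since an acceptable null rate varies by field.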
This covers a couple of examples around data freshness and completeness, and I appreciate the level of detail. Nothing here is earth-shattering, but it’s important to have a catalog of the sorts of issues which can pop up. H/T Mark Hutchinson.