Press "Enter" to skip to content

Inferring Data from Its Absence

John Cook lays out an important insight:

One of the Safe Harbor provisions under HIPAA is that data may not contain sparsely populated three-digit zip codes. Sometimes databases will replace sparse zip codes with nulls. But if the same database reports a person’s state, and the state only has one sparse zip code, then the data effectively lists all zip codes. Here the suppressed zip code is conspicuous by its absence. The null value itself didn’t reveal the zip code, nor did the state, but the combination did.

Read the whole thing. This also leads to a swath of security attacks based around unions of information in which each query may data only when X number of people are in it (to prevent us narrowing down to one person) but based on some information I know about the person, I can write a combination of queries to elicit more info about that person. As an example, if I know that a person is left-handed (1/9 of the population), has red hair (around 2% of people), etc., I can find ways to combine these traits to make sure no individual query returns fewer than X results but I can have reasonably high confidence that I can get the individual with enough queries.