Phil Factor explains that your technique for pseudonymizing data doesn’t necessarily anonymize the data:
It is possible to mine data for hidden gems of information by looking at significant patterns of data. Unfortunately, this sometimes means that published datasets can reveal sensitive data when the publisher didn’t intend it, or even when they tried to prevent it by suppressing any part of the data that could enable individuals to be identified
Using creative querying, linking tables in ways that weren’t originally envisaged, as well as using well-known and documented analytical techniques, it’s often possible to infer the values of ‘suppressed’ data from the values provided in other, non-suppressed data. One man’s data mining is another man’s data inference attack.
Read the whole thing. One big problem with trying to anonymize data is that you don’t know how much the attacker knows. Especially with outliers or smaller samples, you might be able to glean interesting information with a series of queries. Even if the application only returns aggregated results for some N, you can often put together a set of queries where you slice the population different ways until you get hidden details on individual. Phil covers these types of inference attacks.