Randomization of Personally Identifiable Information

Rich Benner tries out a couple of techniques:

The main issue we see in dev environments is that people take a nice little version of their database, a few hundred rows of data per table, and develop on that. This is great for checking that your logic is correct, but not good when it comes to actually deploying the code to production. Suddenly, your nice, pretty code has to deal with millions of rows of data and grinds to a halt because you didn’t write it for big data sets. No, you wrote it for your development system, and it was fast on your machine. We see this a lot.

Rich shows a couple of techniques for data randomization. My biggest challenge to this is that if you need a proper distribution of data, you lose it. Using the telephone number example, if you have lookups or data analysis by area code, randomly generating across every area code would be bad. Also, if your application is smart enough to deal with valid or invalid area codes and exchanges (the middle three digits of the three-three-four phone number style in the US), you generating arbitrary area codes or exchanges might prevent app developers from using the application in the proper way, perhaps requiring them to fix phone numbers after viewing a data entry screen.

In short, there are easier and harder ways to do this, and several factors may push you into the harder way.

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31