Brent Ozar looks at survey results:
No matter which way you slice it, about half are letting developers work with data straight outta production. We’re not masking personally identifiable data before the developers get access to it.
It was the same story about 5 years ago when I asked the same question, and back then, about 2/3 of the time, developers were using production data as-is:
Brent covers some of the challenges involved, and I can add one more: the idea of environments gets really squishy when talking about data science. My development model still needs production data (unless the dev data has the same structural attributes and data distributions as prod), and I don’t really want to train different models in dev/test/prod because, even with the same data, many algorithms are stochastic in nature: if I run one multiple times, I can end up with different results. And even if I can get the same results by re-running with a consistent seed, that introduces a structural instability of its own, because my results now depend on that specific seed.
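To make the seed problem concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: `make_data` and `train_sgd` are hypothetical helpers, the data set is synthetic, and the "model" is a one-parameter linear fit trained with stochastic gradient descent. It is a toy stand-in for a real training job, not anything from Brent's post.

```python
import random


def make_data(n=200, seed=0):
    """Hypothetical stand-in for one fixed data set: y = 3x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0, 0.5)) for x in xs]


def train_sgd(data, seed, epochs=5, lr=0.05):
    """Fit y ~ w * x with stochastic gradient descent.

    The starting weight and the row order both come from the seed,
    so the seed is the only source of run-to-run variation here.
    """
    rng = random.Random(seed)
    w = rng.uniform(-1, 1)      # random initialization
    rows = list(data)           # copy so the caller's data is untouched
    for _ in range(epochs):
        rng.shuffle(rows)       # random visiting order each epoch
        for x, y in rows:
            w -= lr * 2 * (w * x - y) * x   # gradient of squared error
    return w


data = make_data()                    # same data for every run
w_a = train_sgd(data, seed=1)         # "dev" run
w_b = train_sgd(data, seed=2)         # "prod" run with a different seed
w_a_again = train_sgd(data, seed=1)   # re-run with the dev seed

print(w_a, w_b, w_a_again)
```

Both seeds land close to the true slope of 3, but not on the same number, and the only way to reproduce the first run exactly is to reuse its seed. That is the trade-off: pinning the seed buys repeatability, but the model I ship is then one particular draw rather than something that holds up across draws.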
In short, I agree with Brent: this is a tough nut to crack.