Advice For A Budding Data Scientist

Charles Parker riffs off of an Edsger Dijkstra note:

It’s still early days for machine learning. The bounds and guidelines about what is possible or likely are still unknown in a lot of places, and bigger projects that test more of those limitations are more likely to fail. As a fledgling data engineer, especially in the industry, it’s almost certainly the more prudent course to go for the “low-hanging fruit” — easy-to-find optimizations that have real world impact for your organization. This is the way to build trust among skeptical colleagues and also the way to figure out where those boundaries are, both for the field and for yourself.

As a personal example, I was once on a project where we worked with failure data from large machines with many components. The obvious and difficult problem was to use regression analysis to predict the time to failure for a given part. I had some success with this, but nothing that ever made it to production. However, a simple clustering analysis that grouped machines by the frequency of replacement for all parts had some lasting impact; this enabled the organization to “red flag” machines that fell into “high replacement” group where the users may have been misusing the machines and bring these users in for training.

There’s some good advice.  Also read the linked Dijkstra note; even in bullet point form, he was a brilliant guy.

Related Posts

Multi-Shot Games

Dan Goldstein explains a counter-intuitive probability exercise: Peter Ayton is giving a talk today at the London Judgement and Decision Making Seminar Imagine being obliged to play Russian roulette – twice (if you are lucky enough to survive the first game). Each time you must spin the chambers of a six-chambered revolver before pulling the trigger. […]

Read More

Visualizing Emergency Room Visits

Eugene Joh has a great blog post showing how to parse ICD-9 codes using regular expressions and then visualize the results as a treemap: It looks like there is a header/title at [1], numeric grouping  at [2] “1.\tINFECTIOUS AND PARASITIC DISEASES”,  subgrouping by ICD-9 code ranges, at [3] “Intestinal infectious diseases (001-009)” and then 3-digit ICD-9 […]

Read More


May 2017
« Apr