Press "Enter" to skip to content

Data Cleansing With R

I continue my series on launching a data science project:

Now that we’ve performed some basic analysis, we will clean up the data set. I’m doing most of the cleanup in a single operation, but I do have some comment notes here, particularly around the oddities with SalaryUSD. The SalaryUSD column has a few problems:

  • Some people put in pennies, which aren’t really that important at the level we’re discussing. I want to strip them out.
  • Some people put in delimiters like commas or decimal points (which act as commas in countries like Germany). I want to strip them out, particularly because the decimal point might interfere with my analysis, turning 100.000 to $100 instead of $100K.
  • Some people included the dollar sign, so remove that, as well as any spaces.

It’s not a perfect regex, but it did seem to fix the problems in this data set at least.

Something I’ve liked about the data professionals survey is that there are a few places with room for data cleansing, but not everything is awful.  It’s neither artificially clean nor beyond repair, so it’s good for use as an example.