Data Cleansing With R

I continue my series on launching a data science project:

Now that we’ve performed some basic analysis, we will clean up the data set. I’m doing most of the cleanup in a single operation, but I do have some notes here, particularly around the oddities with SalaryUSD. The SalaryUSD column has a few problems:

  • Some people put in pennies, which aren’t really that important at the level we’re discussing. I want to strip them out.
  • Some people put in delimiters like commas or decimal points (which act as thousands separators in countries like Germany). I want to strip them out, particularly because the decimal point might interfere with my analysis, turning 100.000 into $100 instead of $100K.
  • Some people included the dollar sign, so remove that, as well as any spaces.

It’s not a perfect regex, but it did seem to fix the problems in this data set at least.
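
As a rough illustration of the three fixes listed above, here is a minimal sketch in base R. It is not the regex from the original post: the function name clean_salary and the sample values are hypothetical, and it assumes pennies always appear as exactly two digits after a final period or comma.

```r
# Hypothetical helper for illustration; not the author's actual cleanup code.
clean_salary <- function(x) {
  # 1. Drop dollar signs and any whitespace.
  x <- gsub("[$[:space:]]", "", x)
  # 2. Drop trailing pennies: a period or comma followed by exactly two digits
  #    at the end of the string (assumption: cents are always two digits).
  x <- sub("[.,][0-9]{2}$", "", x)
  # 3. Drop any remaining periods or commas, treating them as thousands delimiters.
  x <- gsub("[.,]", "", x)
  as.numeric(x)
}

# Made-up sample inputs covering each problem case.
raw <- c("$100,000.00", "100.000", "85 000", "$ 72,500", "65000.50")
cleaned <- clean_salary(raw)
print(cleaned)
```

The order matters in this sketch: pennies are removed before the remaining delimiters, so "100.000" stays at 100000 rather than being read as 100. Like the original, it is not perfect. A value with a single decimal digit, such as "1.5", would come out as 15.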

Something I’ve liked about the data professionals survey is that there are a few places with room for data cleansing, but not everything is awful. It’s neither artificially clean nor beyond repair, so it’s good for use as an example.
