Shredding Excel With R

Kevin Feasel



John MacKintosh shows how to use R for wrangling + ETL:

I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.

But this relies on the data being in a tabular format (like a dataframe or database table).

A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead it was laid out in a format suited to viewing the data on screen – with the required data scattered in different ranges throughout each sheet ( over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment).

The other challenge was that over time, changes to design meant that data moved location e.g. dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.

This is a good look at how R can be about more than “just” statistical analysis.

Related Posts


John Mount explains the vtreat package that he and Nina Zumel have put together: When attempting predictive modeling with real-world data you quicklyrun into difficulties beyond what is typically emphasized in machine learning coursework: Missing, invalid, or out of range values. Categorical variables with large sets of possible levels. Novel categorical levels discovered during test, cross-validation, or […]

Read More

R 3.4.4 Now Available

David Smith notes that R 3.4.4 is now generally available: R 3.4.4 has been released, and binaries for Windows, Mac, Linux and now available for download on CRAN. This update (codenamed “Someone to Lean On” — likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release, and shouldn’t cause […]

Read More


January 2017
« Dec Feb »