Scraping The PASS Budget

Kevin Feasel

2018-01-02

R

Steph Locke shows us how to scrape a PDF, specifically, the PASS operating budget:

With tabulizer, if the data is relatively well formatted in a PDF you can use tabulizer::extract_tables(). This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky!

We need to fall back to tabulizer::extract_text() and do a lot of wrangling to reconstruct the tables.

Steph shows her work, so click through to see the scripts.

Related Posts

Packages For Testing R Packages

Maelle Salmon shows us how to test our R packages within R: If you’re brand-new to unit testing your R package, I’d recommend reading this chapter from Hadley Wickham’s book about R packages. There’s an R package called RUnit for unit testing, but in the whole post we’ll mention resources around the testthat package since it’s the one we use in […]

Read More

Reshaping Data Frames With tidyr

Anisa Dhana shows off some of the data reshaping functionality available in the tidyr package: As it is shown above, the variable agegp has 6 groups (i.e., 25-34, 35-44) which has different alcohol intake and smoking use combinations. I think it would be interesting to transform this dataset from long to wide and to create a column for each […]

Read More

Categories

January 2018
MTWTFSS
« Dec Feb »
1234567
891011121314
15161718192021
22232425262728
293031