Press "Enter" to skip to content

Scraping The PASS Budget

Steph Locke shows us how to scrape a PDF, specifically, the PASS operating budget:

With tabulizer, if the data is relatively well formatted in a PDF you can use tabulizer::extract_tables(). This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky!

We need to fall back to tabulizer::extract_text() and do a lot of wrangling to reconstruct the tables.

Steph shows her work, so click through to see the scripts.