Steph Locke shows us how to scrape a PDF, specifically, the PASS operating budget:
With tabulizer, if the data is relatively well formatted in a PDF you can use tabulizer::extract_tables(). This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky! We need to fall back to tabulizer::extract_text() and do a lot of wrangling to reconstruct the tables.
Steph shows her work, so click through to see the scripts.
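
To give a feel for the kind of wrangling involved, here is a minimal base R sketch. It is not Steph's script: the sample text, the line items, the column names, and the assumption that each row is a label followed by two numeric columns are all invented for illustration. The raw_text string stands in for what tabulizer::extract_text() might return for one page.

```r
# Hypothetical stand-in for the string returned by
# tabulizer::extract_text() on one page; the line items are invented.
raw_text <- paste(
  "Revenue 2016 2017",
  "Sponsorship 1,200 1,500",
  "Registration fees 800 950",
  sep = "\n"
)

# Split the page text into lines and drop the header row.
lines <- strsplit(raw_text, "\n", fixed = TRUE)[[1]]
body  <- lines[-1]

# Assumed layout: a free-text label followed by two numeric columns.
pattern <- "^(.*?)\\s+([0-9,]+)\\s+([0-9,]+)$"
parts   <- regmatches(body, regexec(pattern, body, perl = TRUE))

# Strip thousands separators before converting to numeric.
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Element 1 of each match is the full line; 2 to 4 are the capture groups.
budget <- data.frame(
  item  = vapply(parts, `[`, character(1), 2),
  y2016 = to_number(vapply(parts, `[`, character(1), 3)),
  y2017 = to_number(vapply(parts, `[`, character(1), 4)),
  stringsAsFactors = FALSE
)

print(budget)
```

A real budget PDF will need more than one pattern, plus logic for stitching together tables that continue across pages, which is where most of the effort goes.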