Kevin Feasel



Troy Walters uses the Tabulizer package to extract tables from a PDF and turn them into an R matrices or data frames:

Next we will use the extract_tables() function from tabulizer. First, I specify the url of the pdf file from which I want to extract a table. This pdf link includes the most recent data, covering the period from July 1, 2016 to November 25, 2016. I am using the default parameters for extract_tables. These are guess and method. I’ll leave guess set to TRUE, which tells tabulizer that we want it to figure out the locations of the tables on its own. We could set this to FALSE if we want to have more granular control, but for this application we don’t need to. We leave the method argument set to “matrix”, which will return a list of matrices (one for each pdf page). This could also be set to return data frames instead.

This is nice.  I have to imagine it only works for text-based PDFs and not ones which are generated from a series of images.

Related Posts

Python versus R (Again)

Alex Woodie looks at whether Python is dominating R in the data science space: There is some evidence that Python’s popularity is hurting R usage. According to the TIOBE Index, Python is currently the third most popular language in the world, behind perennial heavyweights Java and C. From August 2018 to August 2019, Python usage surged […]

Read More

Local Randomness and R

Evgeni Chasnovski has a problem around generating random data: Let’s say we have a deterministic (non-random) problem for which one of the solutions involves randomness. One very common example of such problem is a function minimization on certain interval: it can be solved non-randomly (like in most methods of optim()), or randomly (the simplest approach being […]

Read More


December 2016
« Nov Jan »