OCR and Character Extraction with R

Benjamin Smith analyzes a text:

Since the text that I’m using has two columns per page, each page will need to be cropped by column before OCR is applied. Before that, the .pdf files will need to be converted to .png format.

Read on to see the code for the entire process, using the tidyverse, magick, and tesseract packages.
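
As a rough illustration of the steps described above (render a PDF page as an image, crop it into columns, then OCR each column), here is a minimal sketch using magick and tesseract. It is not the author's code. The helper names `column_geometries` and `ocr_two_column_page` are hypothetical, and the sketch assumes the two columns split at the horizontal midpoint of the page, which may not hold for a given document. `magick::image_read_pdf()` also needs the pdftools package installed.

```r
# Hypothetical helper: build ImageMagick crop geometries ("WxH+X+Y")
# for the left and right halves of a page. Assumes the column gutter
# sits at the horizontal midpoint, which is an illustrative guess.
column_geometries <- function(width, height) {
  width  <- as.integer(width)
  height <- as.integer(height)
  half   <- as.integer(floor(width / 2))
  c(
    left  = sprintf("%dx%d+0+0", half, height),
    right = sprintf("%dx%d+%d+0", width - half, height, half)
  )
}

# Hypothetical helper: OCR one page of a two-column PDF,
# reading the left column first and then the right column.
ocr_two_column_page <- function(pdf_path, page = 1, density = 300,
                                language = "eng") {
  # Render the requested PDF page as a raster image.
  img  <- magick::image_read_pdf(pdf_path, pages = page, density = density)
  info <- magick::image_info(img)
  geoms  <- column_geometries(info$width[1], info$height[1])
  engine <- tesseract::tesseract(language)

  texts <- vapply(geoms, function(geometry) {
    # Write each cropped column to a temporary .png before running OCR.
    png_path <- tempfile(fileext = ".png")
    on.exit(unlink(png_path))
    column <- magick::image_crop(img, geometry)
    magick::image_write(column, path = png_path, format = "png")
    tesseract::ocr(png_path, engine = engine)
  }, character(1))

  paste(texts, collapse = "\n")
}
```

Calling `ocr_two_column_page("my-scan.pdf")` (the file name is a placeholder) would return the page text as a single string. In practice the crop offsets usually need adjusting for margins and the gutter between columns, so checking the cropped images by eye before trusting the OCR output is a sensible step.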