PDF Search With Page Numbers

Kevin Feasel

2016-06-28

Search

Jon Morisi has a solution for how to get page numbers for results back from PDFs when using Full-Text Search:

In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things set up. If you tried this, you may have noticed that although the searches worked, what you got back was a file name. This isn’t so helpful if your document is an all-encompassing 538 pages. So, how do we get a page number back? The best I’ve come up with so far is to split the 538 pages into 538 documents and load / search on those.

My first Google search on how to split a PDF into pages came back with http://www.splitpdf.com/, so I went ahead and used that. I’m sure there is a way to do this through Acrobat or even roll your own split functionality via the API.

It’s not a particularly pretty solution, but it does work, and that’s important.
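
If you would rather script the split than upload a document to a website, something like the following Python sketch might do the job. It assumes the third-party pypdf library is installed; the `_p007`-style naming scheme and the helper names are hypothetical and not part of Jon's post. The idea is that each one-page file carries its page number in its name, so a file name returned by Full-Text Search can be turned back into a page number.

```python
import re
from pathlib import Path


def page_file_name(stem: str, page: int, total: int) -> str:
    """Name for a one-page file, zero-padded so files sort in page order."""
    width = len(str(total))
    return f"{stem}_p{page:0{width}d}.pdf"


def page_number_from_name(file_name: str) -> int:
    """Recover the page number from a name produced by page_file_name."""
    match = re.search(r"_p(\d+)\.pdf$", file_name)
    if match is None:
        raise ValueError(f"no page suffix in {file_name!r}")
    return int(match.group(1))


def split_pdf(src: Path, out_dir: Path) -> list[Path]:
    """Write each page of src to its own single-page PDF in out_dir."""
    # Assumption: pypdf is installed. It is imported here so the naming
    # helpers above can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(src))
    total = len(reader.pages)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        target = out_dir / page_file_name(src.stem, number, total)
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling `split_pdf(Path("manual.pdf"), Path("pages"))` on a 538-page file would produce names from `manual_p001.pdf` through `manual_p538.pdf`, and `page_number_from_name` turns a search hit on one of those names back into its page number. Loading the resulting files for indexing works the same way as for any other PDF.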

