Press "Enter" to skip to content

Category: Search

PDF Search With Page Numbers

Jon Morisi has a solution for how to get page numbers for results back from PDFs when using Full-Text Search:

In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things setup.  If you tried this you may have noticed that although the searches worked, what you got back was a file name.  This isn’t so helpful if your document is an all encompassing 538 pages.  So, how do we get a page number back?  The best I’ve come up with so far is to split the 538 pages into 538 documents and load / search on those.

My first google search on how to split a pdf into pages came back with, http://www.splitpdf.com/, so I went ahead and used that.  I’m sure there is a way to do this through acrobat or even roll your own split functionality via the API.

It’s not a particularly pretty solution, but it does work, and that’s important.

Comments closed

Implementing SoundEx

Dror Helper shows how to implement SoundEx in C#:

It’s fairly easy to follow the steps of the algorithm (as defined by Wikipedia):

  1. Retain the first letter of the name and drop all other occurrences of a, e, I, o, u, y, h, w.

  2. Replace consonants with digits as follows (after the first letter):

    • b, f, p, v → 1

    • c, g, j, k, q, s, x, z → 2

    • d, t → 3

    • l → 4

    • m, n → 5

    • r → 6

  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.

  4. If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.

SQL Server also supports SOUNDEX as a built-in function.

Comments closed