Embedded Solr has the same interface as Solr without requiring an HTTP connection. When we “embed” Solr into a Java an application, it provides the exact same API that you would use if you were connecting to a remote Solr instance. We can use embedded Solr for in-memory testing because when we implement test cases, it should not depend on any external resources.
Read on for the code sample.
Metaphone algorithms are designed to produce an approximate phonetic representation, in ASCII, of regular “dictionary” words and names in English and some Latin-based languages. It is intended for indexing words by their English pronunciation. It is one of the more popular of the phonetic algorithms and was published by Lawrence Philips in 1990. A Metaphone is up to ten characters in length.
It is used for fuzzy searches for records where each string to be searched has an index with a Metaphone key. You search for all records with the same or similar metaphone key and then refine the search by some ranking algorithm such as Damerau–Levenshtein distance. Metaphone searches are particularly popular with ‘ancestor’ sites that search on surnames where spellings vary considerably for the same surname. The current version, Metaphone 3, is actively maintained by Lawrence Philips, developed to account for all spelling variations commonly found in English words, first and last names found in the United States and Europe, and non-English words whose native pronunciations are familiar to English-speakers. The source of Metaphone 3 is proprietary, and Lawrence charges a fee to supply the source.
Read on for the script.
The “dirty little secret” about full-text search indexes is that they don’t help with ‘%blabla%’ predicates.
Well, it’s not a secret, it’s right there in the documentation.
A lot of us get the impression that full-text search is designed to handle “full wildcard” searches, probably just because of the name. “Full-Text Searches” sounds like it means “All The Searches”. But that’s not actually what it means.
Kendra’s take is a bit more optimistic than mine; I’m definitely more inclined to dump text out to a Lucene-based indexing system (like Solr or ElasticSearch), as they’ll typically perform faster and solve problems that full-text cannot. Some of that may just be that I was never very good at full-text indexing, though.
Based on this testing, lock contention, which usually results in a performance bottleneck and underutilized resources, was our first “suspect.” We knew that using a commercial Java profiler, such as Yourkit, JProfiler and Java Flight Recorder, would help easily identify locks and determine how much time threads spend waiting on them. Meanwhile, the team had built custom infrastructure that allows one to run experiments with a profiler attached via a single command-line parameter.
In my own testing, the profiler data indeed revealed some contention particularly related to
HdfsUpdateLoglocks, leading to long thread wait time. Although promisingly, this result corresponded somewhat to the description in SOLR-6820, nothing actionable resulted from the experiment.
I like these sorts of case studies because example is the school of mankind. In this particular case, I really like the methodical approach, using available information to search for a root cause. Some of the things Michael calls “false starts” I would consider to be initial steps: checking OS, filesystem, and garbage collection metrics are important even in a case like this in which they did not lead to the culprit, as they help you eliminate suspects.
In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things setup. If you tried this you may have noticed that although the searches worked, what you got back was a file name. This isn’t so helpful if your document is an all encompassing 538 pages. So, how do we get a page number back? The best I’ve come up with so far is to split the 538 pages into 538 documents and load / search on those.
My first google search on how to split a pdf into pages came back with, http://www.splitpdf.com/, so I went ahead and used that. I’m sure there is a way to do this through acrobat or even roll your own split functionality via the API.
It’s not a particularly pretty solution, but it does work, and that’s important.
It’s fairly easy to follow the steps of the algorithm (as defined by Wikipedia):
Retain the first letter of the name and drop all other occurrences of a, e, I, o, u, y, h, w.
Replace consonants with digits as follows (after the first letter):
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6
If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.