What has happened over the last few years is that many websites are now getting most of their traffic from sources other than Google. Google is no longer the main source of traffic for most websites, because webmasters pursue other avenues to generate relevant traffic, in particular social networks and newsletter – as it is easier to attract the right people and promote the right content through these channels. Think about this: How did you discover Data Science Central? For most recent members, the answer is not Google anymore. In that sense, Google has lost its monopoly when it comes to finding interesting information on the Internet. The reason is that Google pushes more and more search results from partners, their own products, possibly content that fits with its political agenda, big advertisers, old websites, big websites, and web spammers who find a way to get listed at the top. In the meanwhile, websites such as ours promote more and more articles from little high quality publishers and great bloggers that have a hard time getting decent traffic from Google. For them, we are a much bigger and better source of traffic, than Google.
I think this is a fairly optimistic view of the situation, as there’s a difference between “I want to learn about a topic” versus “I want to learn this specific thing.” I think Vincent’s argument is much stronger on the former, but when it comes to the latter, the first thing I hear people say is that they’re googling it.
The basic idea of a trigram search is quite simple:
- Persist three-character substrings (trigrams) of the target data.
- Split the search term(s) into trigrams.
- Match search trigrams against the stored trigrams (equality search)
- Intersect the qualified rows to find strings that match all trigrams
- Apply the original search filter to the much-reduced intersection
We will work through an example to see exactly how this all works, and what the trade-offs are.
A must-read. N-grams in SQL Server is an example of a non-obvious data architecture which performs much better than the obvious alternative, at least when the conditions are right.
For time series applications, it’s very common to have queries in the following pattern
q=*:*&fq=[NOW-3DAYS TO NOW]
However, this is not a good practice from memory perspective. Under the hood, Solr converts ‘NOW’ to a specific timestamp, which is the time when the query hits Solr. Therefore, two consecutive queries with the same field query fq=[NOW-3DAYS TO NOW] are considered different queries once ‘NOW’ is replaced by the two different timestamp. As a result, both of these queries would hit disk and can’t take advantage of caches.
In most of use cases, missing data of last minute is acceptable. Therefore, try to query in the following way if your business logic allows.
q=*:*&fq=[NOW/MIN-3DAYS TO NOW/MIN]
If you’re using Solr for full text search, this is rather useful information.
Embedded Solr has the same interface as Solr without requiring an HTTP connection. When we “embed” Solr into a Java an application, it provides the exact same API that you would use if you were connecting to a remote Solr instance. We can use embedded Solr for in-memory testing because when we implement test cases, it should not depend on any external resources.
Read on for the code sample.
Metaphone algorithms are designed to produce an approximate phonetic representation, in ASCII, of regular “dictionary” words and names in English and some Latin-based languages. It is intended for indexing words by their English pronunciation. It is one of the more popular of the phonetic algorithms and was published by Lawrence Philips in 1990. A Metaphone is up to ten characters in length.
It is used for fuzzy searches for records where each string to be searched has an index with a Metaphone key. You search for all records with the same or similar metaphone key and then refine the search by some ranking algorithm such as Damerau–Levenshtein distance. Metaphone searches are particularly popular with ‘ancestor’ sites that search on surnames where spellings vary considerably for the same surname. The current version, Metaphone 3, is actively maintained by Lawrence Philips, developed to account for all spelling variations commonly found in English words, first and last names found in the United States and Europe, and non-English words whose native pronunciations are familiar to English-speakers. The source of Metaphone 3 is proprietary, and Lawrence charges a fee to supply the source.
Read on for the script.
The “dirty little secret” about full-text search indexes is that they don’t help with ‘%blabla%’ predicates.
Well, it’s not a secret, it’s right there in the documentation.
A lot of us get the impression that full-text search is designed to handle “full wildcard” searches, probably just because of the name. “Full-Text Searches” sounds like it means “All The Searches”. But that’s not actually what it means.
Kendra’s take is a bit more optimistic than mine; I’m definitely more inclined to dump text out to a Lucene-based indexing system (like Solr or ElasticSearch), as they’ll typically perform faster and solve problems that full-text cannot. Some of that may just be that I was never very good at full-text indexing, though.
Based on this testing, lock contention, which usually results in a performance bottleneck and underutilized resources, was our first “suspect.” We knew that using a commercial Java profiler, such as Yourkit, JProfiler and Java Flight Recorder, would help easily identify locks and determine how much time threads spend waiting on them. Meanwhile, the team had built custom infrastructure that allows one to run experiments with a profiler attached via a single command-line parameter.
In my own testing, the profiler data indeed revealed some contention particularly related to
HdfsUpdateLoglocks, leading to long thread wait time. Although promisingly, this result corresponded somewhat to the description in SOLR-6820, nothing actionable resulted from the experiment.
I like these sorts of case studies because example is the school of mankind. In this particular case, I really like the methodical approach, using available information to search for a root cause. Some of the things Michael calls “false starts” I would consider to be initial steps: checking OS, filesystem, and garbage collection metrics are important even in a case like this in which they did not lead to the culprit, as they help you eliminate suspects.
In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things setup. If you tried this you may have noticed that although the searches worked, what you got back was a file name. This isn’t so helpful if your document is an all encompassing 538 pages. So, how do we get a page number back? The best I’ve come up with so far is to split the 538 pages into 538 documents and load / search on those.
My first google search on how to split a pdf into pages came back with, http://www.splitpdf.com/, so I went ahead and used that. I’m sure there is a way to do this through acrobat or even roll your own split functionality via the API.
It’s not a particularly pretty solution, but it does work, and that’s important.
It’s fairly easy to follow the steps of the algorithm (as defined by Wikipedia):
Retain the first letter of the name and drop all other occurrences of a, e, I, o, u, y, h, w.
Replace consonants with digits as follows (after the first letter):
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6
If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
If you have too few letters in your word that you can’t assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.