Search – Page 3 – Curated SQL

Trigram Search In SQL Server

Published 2017-09-12 by Kevin Feasel

Paul White shows how to implement trigram wildcard searches in SQL Server:

The basic idea of a trigram search is quite simple:

1. Persist three-character substrings (trigrams) of the target data.

2. Split the search term(s) into trigrams.

3. Match search trigrams against the stored trigrams (equality search).

4. Intersect the qualified rows to find strings that match all trigrams.

5. Apply the original search filter to the much-reduced intersection.

We will work through an example to see exactly how this all works, and what the trade-offs are.

A must-read. Using n-grams in SQL Server is an example of a non-obvious data architecture that performs much better than the obvious alternative, at least when the conditions are right.
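
The trigram steps quoted above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Paul White's T-SQL implementation: the names `trigrams`, `build_index`, `candidate_ids`, and `search`, and the sample rows, are assumptions made for the example.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of three-character substrings of text (lowercased)."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(rows):
    """Persist trigrams: map each trigram to the ids of rows containing it."""
    index = defaultdict(set)
    for row_id, value in rows.items():
        for gram in trigrams(value):
            index[gram].add(row_id)
    return index

def candidate_ids(rows, index, term):
    """Equality-match each search trigram, then intersect the row ids."""
    grams = trigrams(term)
    if not grams:
        # Terms shorter than three characters have no trigrams,
        # so every row remains a candidate (a full scan).
        return set(rows)
    return set.intersection(*(index.get(g, set()) for g in grams))

def search(rows, index, term):
    """Apply the original substring filter to the reduced candidate set."""
    needle = term.lower()
    return sorted(r for r in candidate_ids(rows, index, term)
                  if needle in rows[r].lower())

# Hypothetical sample data for illustration only.
rows = {1: "banana split", 2: "nana ban", 3: "Bandana"}
index = build_index(rows)
matches = search(rows, index, "banana")
```

With this sample data, row 2 ("nana ban") survives the intersection because it contains all of "ban", "ana", and "nan", but the final filter removes it since it does not contain "banana". That is why the last step re-applies the original predicate.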

Improving Solr Performance

Published 2017-06-19 by Kevin Feasel

Michael Sun has some tips to improve performance of Solr operations, focusing on memory tuning but including a few other tips as well:

For time series applications, it’s very common to have queries in the following pattern:

q=*:*&fq=[NOW-3DAYS TO NOW]

However, this is not a good practice from a memory perspective. Under the hood, Solr converts ‘NOW’ to a specific timestamp, which is the time when the query hits Solr. Therefore, two consecutive queries with the same filter query fq=[NOW-3DAYS TO NOW] are considered different queries once ‘NOW’ is replaced by the two different timestamps. As a result, both of these queries would hit disk and can’t take advantage of caches.

In most use cases, missing the last minute of data is acceptable. Therefore, try to query in the following way if your business logic allows:

q=*:*&fq=[NOW/MIN-3DAYS TO NOW/MIN]

If you’re using Solr for full-text search, this is rather useful information.
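
To see why rounding helps, here is a small Python simulation of the substitution described in the quote. It is not Solr code: `resolve_filter` is a hypothetical helper that replaces NOW with a concrete timestamp, and a plain set stands in for a cache keyed by the resolved filter text.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"

def resolve_filter(now, round_to_minute=False):
    """Return the range filter with NOW replaced by a concrete timestamp."""
    if round_to_minute:
        # Analogous to NOW/MIN: truncate to the start of the current minute.
        now = now.replace(second=0, microsecond=0)
    start = now - timedelta(days=3)
    return f"[{start.strftime(FMT)} TO {now.strftime(FMT)}]"

def count_cache_misses(arrival_times, round_to_minute):
    """Count queries whose resolved filter has not been seen before."""
    cache = set()  # stand-in for a cache keyed by the resolved filter
    misses = 0
    for t in arrival_times:
        key = resolve_filter(t, round_to_minute)
        if key not in cache:
            misses += 1
            cache.add(key)
    return misses

# Five queries arriving one second apart within the same minute.
first = datetime(2017, 6, 19, 12, 0, 10, tzinfo=timezone.utc)
arrivals = [first + timedelta(seconds=i) for i in range(5)]
unrounded_misses = count_cache_misses(arrivals, round_to_minute=False)
rounded_misses = count_cache_misses(arrivals, round_to_minute=True)
```

In this simulation every unrounded query resolves to a different filter string and misses the cache, while the rounded queries all resolve to the same string, so only the first one misses.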

Embedded Solr With Scala

Published 2017-04-13 by Kevin Feasel

Anurag Srivastava shows how to use Embedded Solr using an example written in Scala:

Embedded Solr has the same interface as Solr without requiring an HTTP connection. When we “embed” Solr into a Java application, it provides the exact same API that you would use if you were connecting to a remote Solr instance. We can use embedded Solr for in-memory testing because when we implement test cases, they should not depend on any external resources.

Read on for the code sample.

Metaphones In SQL

Published 2017-02-15 by Kevin Feasel

Phil Factor builds a function to generate metaphones in SQL Server:

Metaphone algorithms are designed to produce an approximate phonetic representation, in ASCII, of regular “dictionary” words and names in English and some Latin-based languages. It is intended for indexing words by their English pronunciation. It is one of the more popular of the phonetic algorithms and was published by Lawrence Philips in 1990. A Metaphone is up to ten characters in length.

It is used for fuzzy searches for records where each string to be searched has an index with a Metaphone key. You search for all records with the same or similar metaphone key and then refine the search by some ranking algorithm such as Damerau–Levenshtein distance. Metaphone searches are particularly popular with ‘ancestor’ sites that search on surnames where spellings vary considerably for the same surname. The current version, Metaphone 3, is actively maintained by Lawrence Philips, developed to account for all spelling variations commonly found in English words, first and last names found in the United States and Europe, and non-English words whose native pronunciations are familiar to English-speakers. The source of Metaphone 3 is proprietary, and Lawrence charges a fee to supply the source.

Read on for the script.
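
The two-stage search described in the quote (look up candidates by phonetic key, then rank them by edit distance) can be sketched as follows. This is an illustration under loud assumptions: `crude_key` is a made-up placeholder and is NOT Metaphone, and the ranking uses the optimal string alignment variant of Damerau–Levenshtein distance. All names and sample surnames are hypothetical.

```python
from collections import defaultdict

def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def crude_key(word):
    """Placeholder phonetic key (NOT Metaphone): keep the first letter,
    drop later vowels plus h, w, y, and collapse repeated letters."""
    w = word.upper()
    out = w[:1]
    for ch in w[1:]:
        if ch not in "AEIOUYHW" and ch != out[-1]:
            out += ch
    return out

def build_index(names, key=crude_key):
    """Index each name under its phonetic key."""
    index = defaultdict(list)
    for name in names:
        index[key(name)].append(name)
    return index

def fuzzy_search(index, term, key=crude_key):
    """Fetch names sharing the term's key, ranked by edit distance."""
    candidates = index.get(key(term), [])
    return sorted(candidates,
                  key=lambda n: (osa_distance(term.lower(), n.lower()), n))

surnames = ["Smith", "Smyth", "Smythe", "Schmidt", "Jones"]
index = build_index(surnames)
ranked = fuzzy_search(index, "Smithe")
```

Swapping in a real Metaphone implementation would only change the `key` argument; the lookup and ranking stages stay the same.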

Full-Text Search

Published 2017-02-10 by Kevin Feasel

Kendra Little gives the scoop on full-text indexing:

The “dirty little secret” about full-text search indexes is that they don’t help with ‘%blabla%’ predicates.

Well, it’s not a secret, it’s right there in the documentation.

A lot of us get the impression that full-text search is designed to handle “full wildcard” searches, probably just because of the name. “Full-Text Searches” sounds like it means “All The Searches”. But that’s not actually what it means.

Kendra’s take is a bit more optimistic than mine; I’m definitely more inclined to dump text out to a Lucene-based indexing system (like Solr or Elasticsearch), as they’ll typically perform faster and solve problems that full-text cannot. Some of that may just be that I was never very good at full-text indexing, though.
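
A rough way to picture why ‘%blabla%’ predicates get no help: a full-text index is organized around whole words. The toy inverted index below is plain Python, not SQL Server's actual structure, and every name in it is illustrative. It can answer word and prefix lookups from its keys, but a fragment from the middle of a word is not a key, so the only option left is to scan the text.

```python
import re
from collections import defaultdict

def build_word_index(docs):
    """Map each whole word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def word_lookup(index, word):
    """Exact word match: a single key lookup."""
    return sorted(index.get(word.lower(), set()))

def prefix_lookup(index, prefix):
    """Prefix match over the indexed words. This toy version loops over
    the keys; a sorted term list would allow a range seek instead."""
    p = prefix.lower()
    return sorted({d for w, ids in index.items() if w.startswith(p) for d in ids})

def infix_scan(docs, fragment):
    """Rough equivalent of LIKE '%fragment%': must read every document."""
    f = fragment.lower()
    return sorted(d for d, text in docs.items() if f in text.lower())

# Hypothetical sample documents.
docs = {1: "Full-text indexing basics",
        2: "Reindexing large tables",
        3: "Index maintenance"}
index = build_word_index(docs)
```

Here `prefix_lookup(index, "index")` finds documents 1 and 3 but misses document 2, because "reindexing" does not start with "index"; only `infix_scan` finds all three.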

Solr Lock Contention

Published 2016-08-17 by Kevin Feasel

Michael Sun shows how the Apache Solr team found and fixed a performance issue in their code:

Based on this testing, lock contention, which usually results in a performance bottleneck and underutilized resources, was our first “suspect.” We knew that using a commercial Java profiler, such as YourKit, JProfiler, or Java Flight Recorder, would help easily identify locks and determine how much time threads spend waiting on them. Meanwhile, the team had built custom infrastructure that allows one to run experiments with a profiler attached via a single command-line parameter.

In my own testing, the profiler data indeed revealed some contention, particularly related to VersionBucket and HdfsUpdateLog locks, leading to long thread wait times. Although this result, promisingly, corresponded somewhat to the description in SOLR-6820, nothing actionable resulted from the experiment.

I like these sorts of case studies because example is the school of mankind. In this particular case, I really like the methodical approach, using available information to search for a root cause. Some of the things Michael calls “false starts” I would consider to be initial steps: checking OS, filesystem, and garbage collection metrics are important even in a case like this in which they did not lead to the culprit, as they help you eliminate suspects.

PDF Search With Page Numbers

Published 2016-06-28 by Kevin Feasel

Jon Morisi has a solution for how to get page numbers for results back from PDFs when using Full-Text Search:

In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things set up. If you tried this you may have noticed that although the searches worked, what you got back was a file name. This isn’t so helpful if your document is an all-encompassing 538 pages. So, how do we get a page number back? The best I’ve come up with so far is to split the 538 pages into 538 documents and load / search on those.

My first Google search on how to split a PDF into pages came back with http://www.splitpdf.com/, so I went ahead and used that. I’m sure there is a way to do this through Acrobat or even roll your own split functionality via the API.

It’s not a particularly pretty solution, but it does work, and that’s important.

Implementing SoundEx

Published 2016-06-15 by Kevin Feasel

Dror Helper shows how to implement SoundEx in C#:

It’s fairly easy to follow the steps of the algorithm (as defined by Wikipedia):

1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2. Replace consonants with digits as follows (after the first letter):
- b, f, p, v → 1
- c, g, j, k, q, s, x, z → 2
- d, t → 3
- l → 4
- m, n → 5
- r → 6
3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
4. If the word has too few letters to assign three numbers, append zeros until there are three numbers. If there are more than three numbers, retain only the first three.

SQL Server also supports SOUNDEX as a built-in function.
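
The steps quoted above translate fairly directly into code. The following is a minimal Python sketch, not Dror Helper's C# implementation, and it is not guaranteed to match SQL Server's SOUNDEX in every edge case. The function name `soundex` and the `CODES` table are choices made for this example.

```python
import string

# Digit assigned to each consonant; letters not listed carry no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex code (letter plus three digits)."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Start from the first letter's code so the adjacency rule
    # also applies to the first letter.
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are dropped and do not separate equal codes.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        # Vowels (and y) reset prev to "", so a repeated code that
        # follows a vowel is emitted again.
        prev = code
    # Pad with zeros, then keep the letter and the first three digits.
    return (first.upper() + "".join(digits) + "000")[:4]
```

For example, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, while `soundex("Ashcraft")` gives `A261` because the "s" and "c" are separated only by an "h".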

Category: Search