Press "Enter" to skip to content

Category: Search

Semantic Search in SQL Server

Haroon Ashraf wraps up a series:

Being the final part of the article, it is going to take you to the next level of analyzing word documents stored in Windows folders, managed by File Table, and consumed by Semantic Search.

Additionally, the readers are going to gain more understanding of Semantic Search and how to make it work with MS Word documents for analysis.

This article provides a name-based analysis of the documents with equal attention to both theory and practice.

Click through for the culmination of all of this filestream work.

Comments closed

Semantic Search with FileTable

Haroon Ashraf continues a series on semantic search with Windows and SQL Server:

The focus of the article is on comparing documents that can be stored on Windows File System in one respect and in the other respect their comparative analysis that can be performed with Semantic Search in SQL Server.

Additionally, the readers will learn how to store unstructured data by exploring File Table and creating MS Word documents on the fly (instantly) to be consumed by Semantic Search.

This part of the article is related to the use of Semantic Search on unstructured data for the extraction of basic level business-crucial information provided standard naming is in place.

Click through for the article.

Comments closed

The Decline(?) Of Google Search

Vincent Granville argues that Google search is on a slow decline:

What has happened over the last few years is that many websites are now getting most of their traffic from sources other than Google. Google is no longer the main source of traffic for most websites, because webmasters pursue other avenues to generate relevant traffic, in particular social networks and newsletter – as it is easier to attract the right people and promote the right content through these channels. Think about this: How did you discover Data Science Central? For most recent members, the answer is not Google anymore. In that sense, Google has lost its monopoly when it comes to finding interesting information on the Internet. The reason is that Google pushes more and more search results from partners, their own products, possibly content that fits with its political agenda, big advertisers, old websites, big websites, and web spammers who find a way to get listed at the top. In the meanwhile, websites such as ours promote more and more articles from little high quality publishers and great bloggers that have a hard time getting decent traffic from Google. For them, we are a much bigger and better source of traffic, than Google.

I think this is a fairly optimistic view of the situation, as there’s a difference between “I want to learn about a topic” versus “I want to learn this specific thing.”  I think Vincent’s argument is much stronger on the former, but when it comes to the latter, the first thing I hear people say is that they’re googling it.

Comments closed

Trigram Search In SQL Server

Paul White shows how to implement trigram wildcard searches in SQL Server:

The basic idea of a trigram search is quite simple:

  1. Persist three-character substrings (trigrams) of the target data.
  2. Split the search term(s) into trigrams.
  3. Match search trigrams against the stored trigrams (equality search)
  4. Intersect the qualified rows to find strings that match all trigrams
  5. Apply the original search filter to the much-reduced intersection

We will work through an example to see exactly how this all works, and what the trade-offs are.

A must-read.  N-grams in SQL Server is an example of a non-obvious data architecture which performs much better than the obvious alternative, at least when the conditions are right.

Comments closed

Improving Solr Performance

Michael Sun has some tips to improve performance of Solr operations, focusing on memory tuning but including a few other tips as well:

For time series applications, it’s very common to have queries in the following pattern

q=*:*&fq=[NOW-3DAYS TO NOW]

However, this is not a good practice from memory perspective. Under the hood, Solr converts ‘NOW’ to a specific timestamp, which is the time when the query hits Solr. Therefore, two consecutive queries with the same field query fq=[NOW-3DAYS TO NOW] are considered different queries once ‘NOW’ is replaced by the two different timestamp. As a result, both of these queries would hit disk and can’t take advantage of caches.

In most of use cases, missing data of last minute is acceptable. Therefore, try to query in the following way if your business logic allows.

q=*:*&fq=[NOW/MIN-3DAYS TO NOW/MIN]

If you’re using Solr for full text search, this is rather useful information.

Comments closed

Embedded Solr With Scala

Anurag Srivastava shows how to use Embedded Solr using an example written in Scala:

Embedded Solr has the same interface as Solr without requiring an HTTP connection. When we “embed” Solr into a Java an application, it provides the exact same API that you would use if you were connecting to a remote Solr instance. We can use embedded Solr for in-memory testing because when we implement test cases, it should not depend on any external resources.

Read on for the code sample.

Comments closed

Metaphones In SQL

Phil Factor builds a function to generate metaphones in SQL Server:

Metaphone algorithms are designed to produce an approximate phonetic representation, in ASCII, of regular “dictionary” words and names in English and some Latin-based languages. It is intended for indexing words by their English pronunciation. It is one of the more popular of the phonetic algorithms and was published by Lawrence Philips in 1990. A Metaphone is up to ten characters in length.

It is used for fuzzy searches for records where each string to be searched has an index with a Metaphone key. You search for all records with the same or similar metaphone key and then refine the search by some ranking algorithm such as Damerau–Levenshtein distance. Metaphone searches are particularly popular with ‘ancestor’ sites that search on surnames where spellings vary considerably for the same surname. The current version, Metaphone 3, is actively maintained by Lawrence Philips, developed to account for all spelling variations commonly found in English words, first and last names found in the United States and Europe, and non-English words whose native pronunciations are familiar to English-speakers. The source of Metaphone 3 is proprietary, and Lawrence charges a fee to supply the source.

Read on for the script.

Comments closed

Full-Text Search

Kendra Little gives the scoop on full-text indexing:

The “dirty little secret” about full-text search indexes is that they don’t help with ‘%blabla%’ predicates.

Well, it’s not a secret, it’s right there in the documentation.

A lot of us get the impression that full-text search is designed to handle “full wildcard” searches, probably just because of the name. “Full-Text Searches” sounds like it means “All The Searches”. But that’s not actually what it means.

Kendra’s take is a bit more optimistic than mine; I’m definitely more inclined to dump text out to a Lucene-based indexing system (like Solr or ElasticSearch), as they’ll typically perform faster and solve problems that full-text cannot.  Some of that may just be that I was never very good at full-text indexing, though.

Comments closed

Solr Lock Contention

Michael Sun shows how the Apache Solr team found and fixed a performance issue in their code:

Based on this testing, lock contention, which usually results in a performance bottleneck and underutilized resources, was our first “suspect.” We knew that using a commercial Java profiler, such as Yourkit, JProfiler and Java Flight Recorder, would help easily identify locks and determine how much time threads spend waiting on them. Meanwhile, the team had built custom infrastructure that allows one to run experiments with a profiler attached via a single command-line parameter.

In my own testing, the profiler data indeed revealed some contention particularly related to VersionBucket andHdfsUpdateLog locks, leading to long thread wait time. Although promisingly, this result corresponded somewhat to the description in SOLR-6820, nothing actionable resulted from the experiment.

I like these sorts of case studies because example is the school of mankind.  In this particular case, I really like the methodical approach, using available information to search for a root cause.  Some of the things Michael calls “false starts” I would consider to be initial steps:  checking OS, filesystem, and garbage collection metrics are important even in a case like this in which they did not lead to the culprit, as they help you eliminate suspects.

Comments closed

PDF Search With Page Numbers

Jon Morisi has a solution for how to get page numbers for results back from PDFs when using Full-Text Search:

In my last blog post, Setting up Full-Text Search for PDF files, I detailed how to get things setup.  If you tried this you may have noticed that although the searches worked, what you got back was a file name.  This isn’t so helpful if your document is an all encompassing 538 pages.  So, how do we get a page number back?  The best I’ve come up with so far is to split the 538 pages into 538 documents and load / search on those.

My first google search on how to split a pdf into pages came back with, http://www.splitpdf.com/, so I went ahead and used that.  I’m sure there is a way to do this through acrobat or even roll your own split functionality via the API.

It’s not a particularly pretty solution, but it does work, and that’s important.

Comments closed