
Category: Search

Creating an Elasticsearch Pipeline

The Big Data in Real World team builds a pipeline:

A pipeline is a definition of a series of processors that are to be executed in the same order as they are declared. 

Think of a processor as a series of instructions that will be executed.

In this post we are going to create a pipeline to add a field named doc_timestamp to all the documents that are added to the index.

Click through for the process. In Elasticsearch, ingest pipelines aren’t for moving data; they transform or enrich documents before those documents are indexed.
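
As a rough illustration of the idea, the sketch below builds a pipeline definition with a single `set` processor that stamps `doc_timestamp` with the ingest time, then imitates the in-order execution locally. The pipeline name `add_doc_timestamp`, the helper `apply_pipeline`, and the choice of the `set` processor are assumptions for illustration and may not match the linked post. The local runner is only a toy stand-in for what Elasticsearch does on the server.

```python
import json
from datetime import datetime, timezone

# Hypothetical pipeline body; in Elasticsearch this JSON would be sent to
# PUT _ingest/pipeline/add_doc_timestamp (the name is an assumption).
PIPELINE = {
    "description": "Add doc_timestamp to every incoming document",
    "processors": [
        {"set": {"field": "doc_timestamp", "value": "{{_ingest.timestamp}}"}}
    ],
}


def apply_pipeline(pipeline, doc, now=None):
    """Toy local runner: applies `set` processors in declaration order."""
    now = now or datetime.now(timezone.utc)
    result = dict(doc)  # leave the caller's document untouched
    for processor in pipeline["processors"]:
        for kind, options in processor.items():
            if kind != "set":
                raise ValueError(f"unsupported processor: {kind}")
            value = options["value"]
            # Stand-in for Elasticsearch's ingest-time template substitution.
            if value == "{{_ingest.timestamp}}":
                value = now.isoformat()
            result[options["field"]] = value
    return result


if __name__ == "__main__":
    print(json.dumps(PIPELINE, indent=2))
    print(apply_pipeline(PIPELINE, {"title": "hello"}))
```

On a real cluster, a pipeline like this is typically applied either by passing the `pipeline` query parameter on an index request or by setting `index.default_pipeline` on the index.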


Role-Based Access Controls in Amazon OpenSearch

Scott Chang and Muthu Pitchaimani show how to assign rights in Amazon OpenSearch to IAM groups:

Amazon OpenSearch Service is a managed service that makes it simple to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. AWS IAM Identity Center (successor to AWS Single Sign-On) helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. To build a strong least-privilege security posture, customers also wanted fine-grained access control to manage dashboard permission by user role. In this post, we demonstrate a step-by-step procedure to implement IAM Identity Center to OpenSearch Service via native SAML integration, and configure role-based access control in OpenSearch Dashboards by using group attributes in IAM Identity Center. You can follow the steps in this post to achieve both authentication and authorization for OpenSearch Service based on the groups configured in IAM Identity Center.

Click through for the process.


Understanding Azure Cognitive Search Costs

Matt Eland doesn’t want to break the bank:

Let’s continue my recent trend in exploring pricing tips for the various parts of AI and Machine Learning on Azure with a dive into Azure Cognitive Search.

Sometimes confused with the AI offerings of Azure Cognitive Services, the entirely different Azure Cognitive Search is a rich service that allows you to index a variety of files and documents, extract meaning from those documents, and provide rich search results to users.

In this article we’ll explore the pricing structure of Azure Cognitive Search and highlight some things you should be aware of as you plan and develop your Cognitive Search resources.

Read the whole thing if you’re thinking of using Azure Cognitive Search. It’s a good service and I think the pricing model is fairly straightforward, though there are always nuances to these things.


Semantic Search in Azure Cognitive Search

Rangan Majumder et al. have an article on how semantic search works in Azure Cognitive Search:

As part of our AI at Scale effort, we lean heavily on recent developments in large Transformer-based language models to improve the relevance quality of Microsoft Bing. These improvements allow a search engine to go beyond keyword matching to searching using the semantic meaning behind words and content. We call this transformational ability semantic search—a major showcase of what AI at Scale can deliver for customers.

Semantic search has significantly advanced the quality of Bing search results, and it has been a companywide effort: top applied scientists and engineers from Bing leverage the latest technology from Microsoft Research and Microsoft Azure. Maximizing the power of AI at Scale requires a lot of sophistication. One needs to pretrain large Transformer-based models, perform multi-task fine-tuning across various tasks, and distill big models to a servable form with very minimal loss of quality. We recognize that it takes a large group of specialized talent to integrate and deploy AI at Scale products for customers, and many companies can’t afford these types of teams. To empower every person and every organization on the planet, we need to significantly lower the bar for everyone to use AI at Scale technology.

Click through to learn more about the technology.


The Unbearable Slowness of Full Text Queries

Brent Ozar explains why full-text search in SQL Server can be so slow:

SQL Server’s full text search is amazing. Well, it amazes me at least – it has so many cool capabilities: looking for prefixes, words near each other, different verb tenses, and even thesaurus searches. However, that’s not how I see most people using it: I’ve seen so many shops using it for matching specific strings, thinking it’s going to be faster than LIKE '%mysearch%'. That works at small scale, but as your data grows, you run into a query plan performance problem.

When your query uses CONTAINS, SQL Server has a nasty habit of doing a full text search across all of the rows in the table rather than using the rest of your WHERE clause to reduce the result set first.

Read on for the full impact as well as some alternatives. I agree that those alternatives come with costs (whether that be monetary or conceptual), but I’ve used both n-grams and Elasticsearch with some success.


Using Stopwords and Stoplists with Full-Text Search

Haroon Ashraf walks us through stoplists and stopwords in SQL Server Full-Text Search:

First, let’s clarify the essence of Stopwords and Stoplist. Then we’ll proceed to use them to improve Full-Text Search.

A Stoplist

A stoplist, as the name implies, is a list of stopwords. When associated with Full-Text Search, the Stoplist can filter out meaningless words or terms, thus improving search results.

A Stopword

A stopword is a word that has a minor role in Full-Text Search, despite being important grammatically. Therefore, a stopword is not essential from the Full-Text Search perspective.

According to Microsoft documentation, a stopword can be a word with some meaning in a specific language, or it may be some token with no linguistic value. In both cases, it is useless for the Full-Text Search.

Read on to see examples and how to build your own stoplists.
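
To make the stoplist idea concrete, here is a minimal sketch of stopword filtering at indexing time. It is not how SQL Server implements Full-Text Search; the `STOPLIST` contents and the `tokenize` and `index_terms` helpers are hypothetical and exist only to show the effect of discarding stopwords before terms are stored.

```python
import re

# Hypothetical stoplist; SQL Server ships its own system stoplist,
# and these entries are illustrative only.
STOPLIST = {"a", "an", "and", "the", "is", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stoplist=STOPLIST):
    """Return the tokens that remain after stopword removal."""
    return [tok for tok in tokenize(text) if tok not in stoplist]


if __name__ == "__main__":
    print(index_terms("The essence of a Stoplist is simple"))
```

Passing a different `stoplist` argument mimics associating a custom stoplist with an index: the same text yields a different set of searchable terms.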


Semantic Search in SQL Server

Haroon Ashraf wraps up a series:

Being the final part of the article, it is going to take you to the next level of analyzing word documents stored in Windows folders, managed by File Table, and consumed by Semantic Search.

Additionally, the readers are going to gain more understanding of Semantic Search and how to make it work with MS Word documents for analysis.

This article provides a name-based analysis of the documents with equal attention to both theory and practice.

Click through for the culmination of all of this FILESTREAM work.


Semantic Search with FileTable

Haroon Ashraf continues a series on semantic search with Windows and SQL Server:

The focus of the article is on comparing documents that can be stored on Windows File System in one respect and in the other respect their comparative analysis that can be performed with Semantic Search in SQL Server.

Additionally, the readers will learn how to store unstructured data by exploring File Table and creating MS Word documents on the fly (instantly) to be consumed by Semantic Search.

This part of the article is related to the use of Semantic Search on unstructured data for the extraction of basic level business-crucial information provided standard naming is in place.

Click through for the article.


The Decline(?) Of Google Search

Vincent Granville argues that Google search is on a slow decline:

What has happened over the last few years is that many websites are now getting most of their traffic from sources other than Google. Google is no longer the main source of traffic for most websites, because webmasters pursue other avenues to generate relevant traffic, in particular social networks and newsletter – as it is easier to attract the right people and promote the right content through these channels. Think about this: How did you discover Data Science Central? For most recent members, the answer is not Google anymore. In that sense, Google has lost its monopoly when it comes to finding interesting information on the Internet. The reason is that Google pushes more and more search results from partners, their own products, possibly content that fits with its political agenda, big advertisers, old websites, big websites, and web spammers who find a way to get listed at the top. In the meanwhile, websites such as ours promote more and more articles from little high quality publishers and great bloggers that have a hard time getting decent traffic from Google. For them, we are a much bigger and better source of traffic, than Google.

I think this is a fairly optimistic view of the situation, as there’s a difference between “I want to learn about a topic” and “I want to learn this specific thing.” I think Vincent’s argument is much stronger on the former, but when it comes to the latter, the first thing I hear people say is that they’re googling it.


Trigram Search In SQL Server

Paul White shows how to implement trigram wildcard searches in SQL Server:

The basic idea of a trigram search is quite simple:

  1. Persist three-character substrings (trigrams) of the target data.
  2. Split the search term(s) into trigrams.
  3. Match search trigrams against the stored trigrams (equality search)
  4. Intersect the qualified rows to find strings that match all trigrams
  5. Apply the original search filter to the much-reduced intersection

We will work through an example to see exactly how this all works, and what the trade-offs are.

A must-read. Using n-grams in SQL Server is an example of a non-obvious data architecture that performs much better than the obvious alternative, at least when the conditions are right.
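
The five steps quoted above can be sketched in a few lines of Python. This is an in-memory illustration of the technique, not Paul White's T-SQL implementation; the `trigrams`, `build_index`, and `search` names are hypothetical, and a dictionary of sets stands in for the persisted trigram table.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(rows):
    """Step 1: map each trigram to the ids of the rows containing it."""
    index = defaultdict(set)
    for row_id, value in rows.items():
        for gram in trigrams(value.lower()):
            index[gram].add(row_id)
    return index


def search(rows, index, term):
    """Steps 2-5: find rows whose value contains term (case-insensitive)."""
    term = term.lower()
    grams = trigrams(term)  # step 2: split the search term into trigrams
    if not grams:
        # Terms shorter than three characters yield no trigrams,
        # so fall back to checking every row.
        candidates = set(rows)
    else:
        # Steps 3 and 4: equality lookups, then intersect the id sets.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Step 5: apply the original filter to the reduced candidate set.
    return sorted(r for r in candidates if term in rows[r].lower())


if __name__ == "__main__":
    rows = {1: "SQL Server", 2: "Elasticsearch", 3: "Full-text search"}
    print(search(rows, build_index(rows), "search"))
```

Step 5 matters because the intersection can contain false positives: a row may hold every trigram of the search term without holding the term itself as one contiguous string. The short-term fallback also hints at one of the trade-offs, since searches under three characters get no help from the index.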
