
Commit 376d9dc

Effective Search: Add article about Qualtrics Text iQ and CrateDB
Add article "Indexing Text for Both Effective Search and Accurate Analysis" by David Norton to "Explanation" section. Original source: https://web.archive.org/web/20250210021928/https://www.qualtrics.com/eng/indexing-text-for-both-effective-search-and-accurate-analysis/
1 parent 34f3b88 commit 376d9dc

File tree

3 files changed: +155 -23 lines changed

docs/explain/index.md

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,11 @@ about applications and use cases of CrateDB, trying to put things into a
 bigger picture and joining things together to help answer the question _why_?


+:::{rubric} 2018
+:::
+
+- {ref}`effective-fulltext-search`
+

 :::{note}
 You can expect the more recent documents to be more up-to-date than previous
docs/feature/search/fts/effective-search.md

Lines changed: 135 additions & 0 deletions

@@ -0,0 +1,135 @@

(effective-fulltext-search)=

# Indexing Text for Both Effective Search and Accurate Analysis

```{article-info}
:avatar: https://images.squarespace-cdn.com/content/v1/613289e53dda745d13a61433/1634312742942-5UOECGMLC277DP2NC7V9/dave.png
:avatar-outline: muted
:author: David Norton, Qualtrics Text iQ Backend
:date: June 29, 2018
:read-time: 15 min read
:class-container: sd-p-2 sd-outline-muted sd-rounded-1
```

## Introduction

At Qualtrics, Text iQ is the tool that allows our users to find insights in their free-response questions.

Powering Text iQ are various microservices that analyze the text and build models for everything from sentiment analysis to identifying key topics. In addition, Text iQ provides a framework for users to explore their text data through search. As the name suggests, we want all aspects of Text iQ, including search, to be intuitive and feel smart. All of this requires a storage solution that allows for effective text indexing as well as accurate and complete data aggregation and retrieval.

In this article, we examine CrateDB, the technology we use for storage in Text iQ, as well as the text processing pipeline that gives us the indexing capabilities we need.

## Why CrateDB?

Elasticsearch is one of the most popular technologies for effective indexing of text-based data. So why did we choose CrateDB instead?

First, CrateDB actually uses Elasticsearch technology under the hood to manage cluster creation and communication. It also exposes an Elasticsearch API that gives us access to all of the indexing capabilities in Elasticsearch that we need.

Second, we need to be able to retrieve _exact_ aggregate information efficiently, which CrateDB provides. We need exact aggregates because our text storage solution feeds data reporting tools that are frequently under heavy scrutiny.

Finally, CrateDB's SQL interface provides a favorable protocol for retrieving the extensive amounts of data that we use in various reporting tools and to train machine learning models.

## An Introduction to CrateDB Analyzers

Most major text search engines are built on Lucene. CrateDB and Elasticsearch are no exception. In Lucene, an analyzer is the processing pipeline used to create an index from raw text. It consists of a single tokenizer and zero or more token/char filters.

The tokenizer is required, as it separates a raw text field into individual terms (or tokens) that can be indexed. The simplest analyzers consist of nothing more than a tokenizer. The most basic tokenizer is the _whitespace_ tokenizer, which separates terms on whitespace and ignores punctuation and symbols.

More sophisticated analyzers include any number of token or character filters. Token filters further process individual terms, enabling more generic searches. For example, the token filter _lowercase_ is used in nearly every analyzer and, as you might expect, transforms all characters into lowercase. This enables a user to perform case-insensitive searches within a corpus. Char filters are similar to token filters, except that they process the raw text before it is tokenized, so they cannot manipulate the term sequences that come out of the tokenizer.

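To make the pipeline concrete, here is a minimal Lucene analyzer written directly against the Java API, pairing the whitespace tokenizer with the lowercase filter described above. This is an illustrative sketch, not code from the article; the package location of `LowerCaseFilter` varies slightly between Lucene versions.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// A minimal analyzer: a single tokenizer plus one token filter.
// Splits on whitespace, then lowercases every term.
public class SimpleLowercaseAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        TokenStream filtered = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filtered);
    }
}
```

Feeding the text "Walk to Work" through this analyzer yields the terms `walk`, `to`, and `work`, so a query for "walk" matches regardless of case.
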
The default CrateDB analyzer uses a tokenizer based on [UAX #29: Unicode Text Segmentation], lowercases all tokens, and allows for optional stopword removal. Furthermore, CrateDB allows users to build custom analyzers from the same tokenizers and filters that are available to Elasticsearch. While this is a large selection, it is not sufficient to reach the high bar we have set for Text iQ. Fortunately, it was relatively straightforward to add our own custom analyzer components to CrateDB via a Java plugin.

## Processing Text for Intuitive and Smart Indexing

If a client were to search for "wlking to work", they would probably hope to get back responses like "I walked to work", "I enjoy walkng to work", and "I walk to work every day". A human would have an easy time associating these responses; however, there is no combination of analyzer tokenizers or filters built into Elasticsearch that will give you these results without other negative consequences. First of all, "walking" is misspelled in the query. Second, different forms of the base word "walk" are desired but unlikely to be returned. We will elaborate on these problems and others momentarily, but suffice it to say that the off-the-shelf solutions for each are limited.

There is a fine line to walk, however. A savvy analyst who is experienced in querying text and knows their data may be looking for misspelled words or particular word forms for a legitimate reason. This is not a line we wanted to walk, so we designed our indexing solution with both a _similar_ analyzer and an _exact_ analyzer in mind. Our similar analyzer is the default method for straightforward queries; but, if desired, double quotes can be used to dictate an exact match on keywords. The exact analyzer also ensures that we can extract tokens in a more raw form for building models.

Much like CrateDB's standard analyzer, our _similar_ analyzer separates words according to the [UAX #29: Unicode Text Segmentation] specification and ignores punctuation. However, it then performs conservative spell correction on the terms and identifies their base forms (a process known as lemmatization). The exact analyzer separates words according to the same specification, but it keeps all punctuation as unique tokens, giving even more querying power to skilled users. Both analyzers are case insensitive, and both use a custom character-folding filter to enhance performance in non-English languages.

For the rest of this article, we will elaborate on the character folding, lemmatization, and spell correction techniques we have developed to power the indexing behind Text iQ.

## Character Folding

Did you know that there are at least 10 different apostrophe characters? When a user presses the apostrophe key (next to the semicolon key) on a standard American keyboard, the Unicode value that is produced changes depending on the program used and the symbol's context.

For example, if I press the apostrophe key in a typical IDE, it's going to be Unicode U+0027. But in the word processor I'm using right now, it's U+2019 when used as an actual apostrophe, and when used as single quotes, as in ‘ ’, there are two different Unicode characters again. This makes matching an index on an apostrophe potentially very difficult. What's more, even if the UI you present to the user is very strict in how it receives inputs from the keyboard, the user can copy and paste text from anywhere. It's the wild west when it comes to Unicode representation, and not just for apostrophes. How should you handle symbols like ä, ǣ, and ç, for example?

This is one of the reasons you should always use character folding when indexing text. Character folding is the process of mapping multiple types of characters onto a single set of characters, usually one. Elasticsearch's built-in ASCIIFolding filter is particularly useful, as it converts all Unicode characters into the first 127 ASCII characters. So now there is only one representation of the apostrophe: U+0027.

The ASCIIFolding filter works great for English, but it makes some significant mistakes in other languages. For example, the filter will convert ä into "a", but in German it should be converted to "ae". In French, ç should not be changed at all. And one more example: most of those nasty apostrophe characters mentioned above should actually be treated as double quotes in German (I've come to really hate apostrophes).

We've developed our own custom folding filter for Text iQ that wraps the ASCIIFolding filter. This custom filter allows us to specify exceptional rules for any combination of language and analyzer type (i.e., _similar_ vs. _exact_).

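A sketch of how such a wrapper can look in Lucene terms, assuming the exception rules arrive as a per-language character map (our illustration of the idea; the actual Text iQ filter is not published). Mapping a character to itself, as with the French ç, shields it from folding:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical folding filter: characters with an exception rule are
// rewritten (or preserved) explicitly; everything else falls through
// to Lucene's standard ASCII folding.
public final class ExceptionFoldingFilter extends TokenFilter {
    private final Map<Character, String> exceptions; // e.g. 'ä' -> "ae" (German), 'ç' -> "ç" (French)
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public ExceptionFoldingFilter(TokenStream input, Map<Character, String> exceptions) {
        super(input);
        this.exceptions = exceptions;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        StringBuilder folded = new StringBuilder(termAtt.length());
        char[] scratch = new char[4]; // foldToASCII emits up to 4 chars per input char
        for (int i = 0; i < termAtt.length(); i++) {
            char c = termAtt.charAt(i);
            String mapped = exceptions.get(c);
            if (mapped != null) {
                folded.append(mapped); // language-specific rule wins
            } else {
                int outLen = ASCIIFoldingFilter.foldToASCII(
                        new char[] {c}, 0, scratch, 0, 1);
                folded.append(scratch, 0, outLen); // default ASCII folding
            }
        }
        termAtt.setEmpty().append(folded);
        return true;
    }
}
```
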
## Lemmatization

Lemmatization is the process of identifying the base or dictionary form of a word. It differs from stemming, which is the process of identifying the root part of a word and does not always yield an actual word. For example, the stem of the word "destabilized" is "stabil", whereas the lemmatized form is "destabilize".

Stemming is the only option available with built-in CrateDB filters. The motivation for using stemmers over lemmatizers is that stemming is generally easier to compute, and that the actual content of the index does not matter as long as the search results are accurate: it doesn't matter that the index is populated with non-words.

However, we are using CrateDB as a working database in addition to a search engine. Our web app can return vocabulary frequencies for visualizations like word clouds, and we don't want non-words populating these. Furthermore, we have found all of the default stemming options to be inaccurate compared to a good lemmatizer. Returning to the "destabilize" example, would you want the search terms "stabilize" and "destabilize" to return the same results? Or how would you feel about the search term "organize" returning "organ"?

For our _similar_ analyzer we developed our own lemmatization filter using WordNet's lemmatization library, Morphy. Compare the results of the Morphy lemmatizer with three available CrateDB stemmers:

| Word            | porter   | kstem       | hunspell      | WN Lemma    |
|-----------------|----------|-------------|---------------|-------------|
| **run**         | run      | run         | run           | run         |
| **ran**         | ran      | ran         | ran           | run         |
| **runs**        | run      | runs        | run           | run         |
| **running**     | run      | running     | running       | run         |
| **is**          | is       | is          | i             | be          |
| **are**         | ar       | are         | are           | be          |
| **was**         | wa       | was         | was           | be          |
| **race**        | race     | race        | race          | race        |
| **racing**      | race     | racing      | race, racing* | race        |
| **races**       | race     | races       | race          | race        |
| **disorganize** | disorgan | disorganize | organize      | disorganize |

\* Stemmers can sometimes produce multiple tokens at the same position.

Using Morphy, our lemmatizer is more comprehensive and accurate, and it guarantees real words.

Morphy enables our English solution for lemmatization; however, there are far fewer lemmatization libraries for non-English languages. In order to support lemmatization in any language, we create large dictionaries that map numerous words to their most common lemmatized form. The process for generating these dictionaries varies language by language. And while this approach isn't quite as powerful as our approach for lemmatizing in English, it provides functionality beyond the stemming that exists with the built-in filters for Elasticsearch and CrateDB.

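A dictionary-driven lemma filter of this kind reduces to one lookup per token. The following minimal filter is hypothetical (the dictionary contents and how they are loaded are our assumptions):

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical dictionary-based lemmatization filter: replaces each token
// with its most common lemma if the dictionary knows one, else keeps it.
public final class DictionaryLemmaFilter extends TokenFilter {
    private final Map<String, String> lemmas; // e.g. "ran" -> "run", "was" -> "be"
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DictionaryLemmaFilter(TokenStream input, Map<String, String> lemmas) {
        super(input);
        this.lemmas = lemmas;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String lemma = lemmas.get(termAtt.toString());
        if (lemma != null) {
            termAtt.setEmpty().append(lemma); // unknown words pass through unchanged
        }
        return true;
    }
}
```
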
## Spelling Correction

Elasticsearch doesn't have any spelling correction filters. There is the hunspell stemmer, but it is just that, a stemmer. It uses hunspell dictionaries to perform stemming, but doesn't do any spell correction.

The _fuzziness_ option under the _match_ predicate can be used for a form of spell correction. Tokens aren't modified, but misspellings can be identified in a search. This can be problematic for short terms: for example, "car" matches "can" with a fuzziness of 1. The _prefix_length_ option sets a minimum number of characters at the beginning of a term that must match exactly. This can mitigate the above problem, but still isn't a perfect solution: "mileage" matches "village" with a fuzziness of 2 and a _prefix_length_ of 3.

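At the Lucene level, this behavior corresponds to `FuzzyQuery`, which is what fuzzy matching ultimately compiles down to. A small sketch, with an illustrative field name:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

// Fuzziness at the Lucene level: maxEdits is the allowed Levenshtein
// distance, prefixLength the number of leading characters that must
// match exactly. The field name "text" is illustrative.
public class FuzzyExamples {
    public static void main(String[] args) {
        // One edit allowed, no fixed prefix: "car" also matches "can".
        FuzzyQuery loose = new FuzzyQuery(new Term("text", "car"), 1);

        // Same edit budget, but the first 3 characters must match exactly,
        // which rules out "can" for the query term "car".
        FuzzyQuery strict = new FuzzyQuery(new Term("text", "car"), 1, 3);

        System.out.println(loose);
        System.out.println(strict);
    }
}
```
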
As a poor man's spell corrector, Elasticsearch recommends a fuzziness of 1 and a _prefix_length_ of 3. We wanted to do better than that.

We implemented our own spell correction filter that makes use of Lucene's built-in SpellChecker, but uses a unique heuristic that combines three distinct dictionaries, each with a different purpose. We initialize Lucene's SpellChecker with one dictionary for checking misspellings and a different one for offering suggestions based on Levenshtein distance. The checking dictionary (CHECKING_DICT) is the larger of the two and contains over 300,000 terms, including proper nouns.

The suggesting dictionary (SUGGESTING_DICT) uses a selection of just over 100,000 words and no proper nouns. By using a more selective suggesting dictionary, we mitigate unlikely corrections. The third dictionary (COMMON_DICT) contains a mapping of about 5,000 common spelling corrections, such as _teh_ -> _the_. This third dictionary handles common mistakes that Lucene's SpellChecker otherwise misses. In particular, it catches errors in short words (fewer than 4 characters), which Lucene ignores completely.

Our spell correction heuristic is as follows:

```text
For each token:
    If not in CHECKING_DICT:
        If not in COMMON_DICT:
            If token length > 3:
                Get top 10 suggestions from SpellChecker(SUGGESTING_DICT)
                Return first suggestion with the same first char as token,
                otherwise return token
            Else return token
        Else return COMMON_DICT(token)
    Else return token
```

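Translated into Java against Lucene's `SpellChecker`, the heuristic looks roughly like the sketch below. We model CHECKING_DICT and COMMON_DICT as an in-memory set and map, and assume the `SpellChecker` has already been indexed from SUGGESTING_DICT; the real implementation is Qualtrics' own.

```java
import java.io.IOException;
import java.util.Map;
import java.util.Set;
import org.apache.lucene.search.spell.SpellChecker;

// Sketch of the three-dictionary heuristic. CHECKING_DICT and COMMON_DICT
// are modeled as an in-memory set and map; the SpellChecker is assumed
// to have been built from SUGGESTING_DICT beforehand.
public class ThreeDictCorrector {
    private final Set<String> checkingDict;       // ~300k terms, incl. proper nouns
    private final Map<String, String> commonDict; // ~5k fixes, e.g. "teh" -> "the"
    private final SpellChecker suggester;         // indexed from SUGGESTING_DICT

    public ThreeDictCorrector(Set<String> checkingDict,
                              Map<String, String> commonDict,
                              SpellChecker suggester) {
        this.checkingDict = checkingDict;
        this.commonDict = commonDict;
        this.suggester = suggester;
    }

    public String correct(String token) throws IOException {
        if (checkingDict.contains(token)) {
            return token;                         // already a known word
        }
        String common = commonDict.get(token);
        if (common != null) {
            return common;                        // known common mistake
        }
        if (token.length() <= 3) {
            return token;                         // too short for the suggester
        }
        // First of the top 10 suggestions that keeps the token's first char.
        for (String suggestion : suggester.suggestSimilar(token, 10)) {
            if (suggestion.charAt(0) == token.charAt(0)) {
                return suggestion;
            }
        }
        return token;                             // no acceptable suggestion
    }
}
```
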
## Summary

We have been able to develop the custom filters and analyzers described in this article as a single Java plugin for CrateDB, making deployment and upgrades straightforward. This has made all the difference in continuing to provide an intuitive experience for the users of Text iQ.

Users can get back the types of responses they would expect without ever realizing the processes occurring under the hood. And thanks to the level of control CrateDB gives us as developers, we can access our data in a way that supports all of the powerful models that make Text iQ a signature experience.


:::{note}
The [original version] of this article was published on the former
Qualtrics engineering blog.
:::


[original version]: https://web.archive.org/web/20250210021928/https://www.qualtrics.com/eng/indexing-text-for-both-effective-search-and-accurate-analysis/
[UAX #29: Unicode Text Segmentation]: https://www.unicode.org/reports/tr29/

docs/feature/search/fts/index.md

Lines changed: 15 additions & 23 deletions
@@ -274,7 +274,7 @@ files, and corresponding technical backgrounds about their implementations.
 ::::


-:::{rubric} Guides
+:::{rubric} Tutorials
 :::

 ::::{info-card}
@@ -298,7 +298,7 @@ by exploring how to manage a dataset of Netflix titles.
 ::::


-:::{rubric} Articles
+:::{rubric} Explanations
 :::

 ::::{info-card}
@@ -338,34 +338,26 @@ Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values.
 ::::


-::::{info-card}
-:::{grid-item}
-:columns: auto 9 9 9
-**Indexing Text for Both Effective Search and Accurate Analysis**
-
-This article explores how Qualtrics uses CrateDB in Text iQ to provide text
-analysis services for everything from sentiment analysis to
-identifying key topics, and powerful search-based data exploration.
+:::{card} Indexing Text for Both Effective Search and Accurate Analysis
+:link: effective-fulltext-search
+:link-type: ref

-{hyper-navigate}`Indexing Text for Both Effective Search and Accurate Analysis
-<[Indexing Text for Both Effective Search and Accurate Analysis]>`
+This article explores how Qualtrics uses CrateDB in their Text iQ product
+to provide text analysis services for everything from sentiment analysis
+to identifying key topics, and powerful search-based data exploration.

+It explains integral parts of an FTS text processing pipeline,
+including analyzers, optionally using tokenizers or character filters,
+and how they can be customized to specific needs, using plugins for CrateDB.
++++
 CrateDB uses Elasticsearch technology under the hood to manage cluster
 creation and communication, and also exposes an Elasticsearch API that provides
 access to all the indexing capabilities in Elasticsearch that Qualtrics needed.

-The articles explains integral parts of an FTS text processing pipeline,
-including analyzers, optionally using tokenizers or character filters,
-and how they can be customized to specific needs, using plugins for CrateDB.
-
-:::
-:::{grid-item}
-:columns: auto 3 3 3
-{tags-primary}`Introduction` \
-{tags-secondary}`Analyzer, Tokenizer` \
+{tags-primary}`Introduction`
+{tags-secondary}`Analyzer, Tokenizer`
 {tags-secondary}`Plugin`
 :::
-::::


 :::{toctree}
@@ -375,14 +367,14 @@
 options
 analyzer
 Tutorial <tutorial>
+effective-search
 :::


 [BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
 [BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
 [full-text search]: https://en.wikipedia.org/wiki/Full_text_search
 [Indexing and Storage in CrateDB]: https://cratedb.com/blog/indexing-and-storage-in-cratedb
-[Indexing Text for Both Effective Search and Accurate Analysis]: https://web.archive.org/web/20250210021928/https://www.qualtrics.com/eng/indexing-text-for-both-effective-search-and-accurate-analysis/
 [MATCH predicate]: inv:crate-reference#predicates_match
 [Okapi BM25]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/okapi_trec3.pdf
 [search engine]: https://en.wikipedia.org/wiki/Search_engine
