# Full-Text Search

:::{include} /_include/links.md
:::
:::{include} /_include/styles.html
:::

**BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.**

:::::{grid}
:padding: 0

::::{grid-item}
:class: rubric-slim
:columns: auto 9 9 9

:::{rubric} Overview
:::
CrateDB can be used as a database to conduct full-text search operations,
building upon the power of Apache Lucene.

CrateDB is an exceptional choice for handling complex queries and large-scale
data sets. One of its standout features is its full-text search capability,
built on top of the powerful Lucene library. This makes it a great fit for
organizing, searching, and analyzing extensive datasets.

:::{rubric} About
:::
[Full-text search] leverages the [BM25] search ranking algorithm, effectively
implementing the storage and retrieval parts of a [search engine].

In information retrieval, Okapi BM25 (BM is an abbreviation of "best
matching") is a ranking function used by search engines to estimate the
relevance of documents to a given search query.
::::


::::{grid-item}
:class: rubric-slim
:columns: auto 3 3 3

```{rubric} Reference Manual
```
- [](inv:crate-reference#sql_dql_fulltext_search)
- [](inv:crate-reference#fulltext-indices)
- [](inv:crate-reference#ref-create-analyzer)

```{rubric} Related
```
- {ref}`sql`
- {ref}`vector`
- {ref}`machine-learning`
- {ref}`query`

{tags-primary}`SQL`
{tags-primary}`Full-Text Search`
{tags-primary}`Okapi BM25`
::::

:::::


:::{rubric} Details
:::
CrateDB uses Lucene as a storage layer, so it inherits the implementation
and concepts of Lucene, in the same spirit as Elasticsearch.
The now-popular BM25 method has become the default scoring formula in Lucene,
and it is the scoring formula used by CrateDB.

BM25 stands for "Best Match 25", the 25th iteration of this scoring algorithm.
The article [BM25: The Next Generation of Lucene Relevance] compares
classic TF/IDF to [Okapi BM25], including illustrative graphs.
For more detail, refer to [Similarity in Elasticsearch]
and [BM25 vs. Lucene Default Similarity].

:::{div}
While Elasticsearch uses a [query DSL based on JSON], in CrateDB you can work
with text search using SQL.
:::
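To make the ranking function concrete, here is a minimal Python sketch of the
Okapi BM25 formula, using Lucene's smoothed IDF variant and the common defaults
`k1 = 1.2` and `b = 0.75`. It is an illustration only, not CrateDB's actual
implementation, and it skips analysis steps such as stemming.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query using Okapi BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        # Document frequency: in how many documents does the term occur?
        df = sum(1 for d in corpus if term in d)
        # Lucene-style smoothed IDF (never negative).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        # Term-frequency saturation, normalized by document length.
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "franz jagt im komplett verwahrlosten taxi quer durch bayern".split(),
]
print(bm25_score(["fox"], corpus[0], corpus))  # positive score, term present
print(bm25_score(["fox"], corpus[1], corpus))  # 0.0, term absent
```

Note how the `tf` term saturates: repeating a word in a document increases its
score with diminishing returns, unlike plain term-frequency counting.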


## Synopsis

Create a table with full-text indexes, insert documents, and rank matching
rows with the `MATCH` predicate and the `_score` column.

::::{grid}
:padding: 0
:class-row: title-slim

:::{grid-item} **DDL**
:columns: auto 6 6 6

```sql
CREATE TABLE documents (
    name STRING PRIMARY KEY,
    description TEXT,
    INDEX ft_english
        USING FULLTEXT(description) WITH (
            analyzer = 'english'
        ),
    INDEX ft_german
        USING FULLTEXT(description) WITH (
            analyzer = 'german'
        )
);
```
:::

:::{grid-item} **DQL**
:columns: auto 6 6 6

```sql
SELECT name, _score
FROM documents
WHERE
    MATCH(
        (ft_english, ft_german),
        'jump OR verwahrlost'
    )
ORDER BY _score DESC;
```
:::

::::


::::{grid}
:padding: 0
:class-row: title-slim

:::{grid-item} **DML**
:columns: auto 6 6 6

```sql
INSERT INTO documents (name, description)
VALUES
    ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
    ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
;
```
:::

:::{grid-item} **Result**
:columns: auto 6 6 6

```text
+------------+------------+
| name       | _score     |
+------------+------------+
| Franz jagt | 0.13076457 |
| Quick fox  | 0.13076457 |
+------------+------------+
SELECT 2 rows in set (0.034 sec)
```
:::

::::


## Usage

Using full-text search in CrateDB.

:::{rubric} `MATCH` predicate
:::
CrateDB's [MATCH predicate] performs a full-text search on one or more indexed
columns or indices, and supports different matching techniques.

To use full-text search on a column, a [full-text index with an
analyzer](inv:crate-reference#sql_ddl_index_fulltext) must be created for
that column.
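As a sketch of those matching techniques, a fuzzy, boosted multi-column query
against the `documents` table from the synopsis might look roughly like this
(the boost value and fuzziness setting are illustrative; consult the `MATCH`
predicate reference for the supported methods and options):

```sql
SELECT name, _score
FROM documents
WHERE MATCH(
        (ft_english 1.5, ft_german),  -- boost English matches by 1.5
        'quick fix'
      ) USING best_fields WITH (
        fuzziness = 1                 -- tolerate one edit per term
      )
ORDER BY _score DESC;
```

`best_fields` is the default method; alternatives such as `phrase` change how
matches across multiple columns are combined into a score.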

:::{rubric} Analyzer
:::
Analyzers consist of two kinds of components: tokenizers and filters. Each
analyzer contains exactly one tokenizer.

Tokenizers decide how to divide the given text into tokens. Filters transform
the text by passing it through a series of operations. They are divided into
character filters and token filters, which are applied before and after the
tokenization step, respectively.

Popular filters are stopword lists, lowercase transformations, and word
stemmers.
The article [Improve Your Text Search with Lucene Analyzers] explains this
topic in more detail, using Elasticsearch as an example.
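For illustration, a custom analyzer combining one tokenizer with character and
token filters could be declared roughly like this (a sketch: the analyzer name
is made up, and the available building blocks are listed in the
`CREATE ANALYZER` reference):

```sql
CREATE ANALYZER english_html (
    TOKENIZER standard,       -- exactly one tokenizer per analyzer
    TOKEN_FILTERS (
        lowercase,
        kstem                 -- a light English stemmer
    ),
    CHAR_FILTERS (
        html_strip            -- strip markup before tokenization
    )
);
```

The analyzer can then be referenced by name in an index definition, e.g.
`INDEX ft USING FULLTEXT(description) WITH (analyzer = 'english_html')`.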


## Learn

Learn how to set up your database for full-text search, how to create the
relevant indices, and how to query your text data efficiently. A must-read
for anyone looking to make sense of large volumes of unstructured text data.

:::{rubric} Tutorials
:::

::::{info-card}
:::{grid-item} **Full-Text: Exploring the Netflix Catalog**
:columns: auto 9 9 9

This tutorial illustrates the BM25 ranking algorithm for information retrieval
by exploring how to manage a dataset of Netflix titles.

{{ '{}(inv:cloud#full-text)'.format(tutorial) }}
:::
:::{grid-item}
:columns: auto 3 3 3
{tags-primary}`Introduction` \
{tags-secondary}`Full-Text Search` \
{tags-secondary}`BM25` \
{tags-secondary}`SQL`
:::
::::


::::{info-card}
:::{grid-item} **Custom analyzers for fuzzy text matching**
:columns: auto 9 9 9

This community discussion illustrates how to define custom analyzers to enable
fuzzy searching, how to use synonym files, and the corresponding technical
background of their implementation.

{{ '{}[custom-analyzers-fuzzy]'.format(tutorial) }}
:::
:::{grid-item}
:columns: auto 3 3 3
{tags-primary}`Introduction` \
{tags-secondary}`Full-Text Search` \
{tags-secondary}`Lucene Analyzer` \
{tags-secondary}`SQL`
:::
::::


[BM25]: https://en.wikipedia.org/wiki/Okapi_BM25
[BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
[BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
[custom-analyzers-fuzzy]: https://community.cratedb.com/t/fuzzy-search-synonyms/889
[full-text search]: https://en.wikipedia.org/wiki/Full_text_search
[Improve Your Text Search with Lucene Analyzers]: https://medium.com/@dagliberkay/elastic-text-search-6b778de9b753
[MATCH predicate]: inv:crate-reference#predicates_match
[Okapi BM25]: https://trec.nist.gov/pubs/trec3/papers/city.ps.gz
[search engine]: https://en.wikipedia.org/wiki/Search_engine
[Similarity in Elasticsearch]: https://www.elastic.co/blog/found-similarity-in-elasticsearch
[TREC-3 proceedings]: https://trec.nist.gov/pubs/trec3/t3_proceedings.html