Commit 9aa4922

Feature / Search: Implement page
1 parent 6afd38d commit 9aa4922

1 file changed

+241
-5
lines changed

docs/feature/search/index.md

Lines changed: 241 additions & 5 deletions
@@ -4,16 +4,252 @@
44

55
# Full-Text Search
66

7-
Learn how to set up your database for full-text search, how to create the
8-
relevant indices, and how to query your text data efficiently. A must-read
9-
for anyone looking to make sense of large volumes of unstructured text data.
7+
:::{include} /_include/links.md
8+
:::
9+
:::{include} /_include/styles.html
10+
:::
1011

11-
- [](inv:cloud#full-text)
12+
**BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.**
1213

14+
:::::{grid}
15+
:padding: 0
16+
17+
::::{grid-item}
18+
:class: rubric-slim
19+
:columns: auto 9 9 9
20+
21+
22+
:::{rubric} Overview
23+
:::
24+
CrateDB can be used as a database to conduct full-text search operations
25+
building upon the power of Apache Lucene.
1326

14-
:::{note}
1527
CrateDB is an exceptional choice for handling complex queries and large-scale
1628
data sets. One of its standout features is its full-text search capabilities,
1729
built on top of the powerful Lucene library. This makes it a great fit for
1830
organizing, searching, and analyzing extensive datasets.
31+
32+
:::{rubric} About
33+
:::
34+
[Full-text search] leverages the [BM25] search ranking algorithm, effectively
35+
implementing the storage and retrieval parts of a [search engine].
36+
37+
In information retrieval, Okapi BM25 (BM is an abbreviation of best matching)
38+
is a ranking function used by search engines to estimate the relevance of
39+
documents to a given search query.
40+
::::
41+
42+
43+
::::{grid-item}
44+
:class: rubric-slim
45+
:columns: auto 3 3 3
46+
47+
```{rubric} Reference Manual
48+
```
49+
- [](inv:crate-reference#sql_dql_fulltext_search)
50+
- [](inv:crate-reference#fulltext-indices)
51+
- [](inv:crate-reference#ref-create-analyzer)
52+
53+
```{rubric} Related
54+
```
55+
- {ref}`sql`
56+
- {ref}`vector`
57+
- {ref}`machine-learning`
58+
- {ref}`query`
59+
60+
{tags-primary}`SQL`
61+
{tags-primary}`Full-Text Search`
62+
{tags-primary}`Okapi BM25`
63+
::::
64+
65+
:::::
66+
67+
68+
:::{rubric} Details
1969
:::
70+
CrateDB uses Lucene as a storage layer, so it inherits the implementation
71+
and concepts of Lucene, in the same spirit as Elasticsearch.
72+
The now popular BM25 method has become the default scoring formula in Lucene
73+
and is the scoring formula used by CrateDB.
74+
75+
BM25 stands for "Best Match 25", the 25th iteration of this scoring algorithm.
76+
The excellent article [BM25: The Next Generation of Lucene Relevance] compares
77+
classic TF/IDF to [Okapi BM25], including illustrative graphs.
78+
To learn more details about what's inside, please also refer to [Similarity in
79+
Elasticsearch] and [BM25 vs. Lucene Default Similarity].
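
To make the ranking function more concrete, the following is a minimal sketch
of the textbook Okapi BM25 formula in Python. It is not the implementation used
by Lucene or CrateDB, which differs in details. The function name
`bm25_scores`, the sample documents, and the parameter values `k1=1.2` and
`b=0.75` (commonly used defaults) are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score tokenized documents against query terms (textbook Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed inverse document frequency: rare terms weigh more.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 saturates term frequency, b normalizes by document length.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample corpus, already split into lowercase tokens.
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the fox".split(),
    "a lazy afternoon".split(),
]
scores = bm25_scores(["fox"], docs)
```

In this sketch, the two-word document ranks above the longer one although both
contain `fox` exactly once, because BM25 normalizes term frequency by document
length. The third document does not contain the term and scores zero.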
80+
81+
:::{div}
82+
While Elasticsearch uses a [query DSL based on JSON], in CrateDB, you can work
83+
with text search using SQL.
84+
:::
85+
86+
87+
## Synopsis
88+
89+
Create full-text indices with language-specific analyzers, and query them
using the `MATCH` predicate, ranked by BM25 relevance score.
91+
92+
::::{grid}
93+
:padding: 0
94+
:class-row: title-slim
95+
96+
:::{grid-item} **DDL**
97+
:columns: auto 6 6 6
98+
99+
```sql
CREATE TABLE documents (
  name STRING PRIMARY KEY,
  description TEXT,
  INDEX ft_english
    USING FULLTEXT(description) WITH (
      analyzer = 'english'
    ),
  INDEX ft_german
    USING FULLTEXT(description) WITH (
      analyzer = 'german'
    )
);
```
113+
:::
114+
115+
:::{grid-item} **DQL**
116+
:columns: auto 6 6 6
117+
118+
```sql
SELECT name, _score
FROM documents
WHERE
  MATCH(
    (ft_english, ft_german),
    'jump OR verwahrlost'
  )
ORDER BY _score DESC;
```
128+
:::
129+
130+
::::
131+
132+
133+
::::{grid}
134+
:padding: 0
135+
:class-row: title-slim
136+
137+
:::{grid-item} **DML**
138+
:columns: auto 6 6 6
139+
140+
```sql
INSERT INTO documents (name, description)
VALUES
  ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
  ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
;
```
147+
:::
148+
149+
:::{grid-item} **Result**
150+
:columns: auto 6 6 6
151+
152+
```text
+------------+------------+
| name       | _score     |
+------------+------------+
| Franz jagt | 0.13076457 |
| Quick fox  | 0.13076457 |
+------------+------------+
SELECT 2 rows in set (0.034 sec)
```
161+
:::
162+
163+
::::
164+
165+
166+
## Usage
167+
168+
Using full-text search in CrateDB.
169+
170+
:::{rubric} `MATCH` predicate
171+
:::
172+
CrateDB's [MATCH predicate] performs a full-text search on one or more indexed
columns or indices and supports different matching techniques.

To use full-text search on a column, a [fulltext index with an
analyzer](inv:crate-reference#sql_ddl_index_fulltext) must be created for
this column.
178+
179+
:::{rubric} Analyzer
180+
:::
181+
Analyzers consist of two kinds of components: tokenizers and filters. Each
analyzer must contain exactly one tokenizer.

Tokenizers decide how to divide the given text into parts. Filters transform
the text by passing it through a series of operations. They are divided into
character filters, which are applied before the tokenization step, and token
filters, which are applied after it.
189+
190+
Popular filters are stopword lists, lowercase transformations, or word
191+
stemmers.
192+
The excellent article [Improve Your Text Search with Lucene Analyzers]
193+
illustrates more details about this topic, using Elasticsearch as an example.
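
The order of the analysis steps can be sketched in a few lines of Python. This
is only an illustration of the concept, not how Lucene analyzers are
implemented. All function names, the tiny stopword list, and the crude
suffix-stripping "stemmer" are hypothetical stand-ins for the real components.

```python
import re

# Tiny illustrative stopword list (an assumption, not a real analyzer's list).
STOPWORDS = {"the", "a", "an", "and", "over"}


def strip_html(text):
    """Character filter: runs on the raw text, before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: divides the text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalizes case."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drops very common words."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(tokens):
    """Token filter: crude plural stripping, a stand-in for a real stemmer."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def analyze(text):
    """Apply character filter, then the single tokenizer, then token filters."""
    text = strip_html(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stopwords, naive_stem):
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<p>The quick brown fox jumps over the lazy dog.</p>")
print(tokens)  # ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```

Because the same analysis is applied to both the indexed text and the search
terms, a query for `jump` can match a document containing `jumps`, which is
consistent with the result shown in the synopsis.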
194+
195+
196+
197+
## Learn
198+
199+
Learn how to set up your database for full-text search, how to create the
200+
relevant indices, and how to query your text data efficiently. A must-read
201+
for anyone looking to make sense of large volumes of unstructured text data.
202+
203+
:::{rubric} Tutorials
204+
:::
205+
206+
::::{info-card}
207+
:::{grid-item} **Full-Text: Exploring the Netflix Catalog**
208+
:columns: auto 9 9 9
209+
210+
The tutorial illustrates the BM25 ranking algorithm for information retrieval,
211+
by exploring how to manage a dataset of Netflix titles.
212+
213+
{{ '{}(inv:cloud#full-text)'.format(tutorial) }}
214+
:::
215+
:::{grid-item}
216+
:columns: auto 3 3 3
217+
{tags-primary}`Introduction` \
218+
{tags-secondary}`Full-Text Search` \
219+
{tags-secondary}`BM25` \
220+
{tags-secondary}`SQL`
221+
:::
222+
::::
223+
224+
225+
::::{info-card}
226+
:::{grid-item} **Custom analyzers for fuzzy text matching**
227+
:columns: auto 9 9 9
228+
229+
The community discussion illustrates how to define custom analyzers that
enable fuzzy searching, how to use synonym files, and the technical
background of their implementation.
232+
233+
{{ '{}[custom-analyzers-fuzzy]'.format(tutorial) }}
234+
:::
235+
:::{grid-item}
236+
:columns: auto 3 3 3
237+
{tags-primary}`Introduction` \
238+
{tags-secondary}`Full-Text Search` \
239+
{tags-secondary}`Lucene Analyzer` \
240+
{tags-secondary}`SQL`
241+
:::
242+
::::
243+
244+
245+
[BM25]: https://en.wikipedia.org/wiki/Okapi_BM25
246+
[BM25: The Next Generation of Lucene Relevance]: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
247+
[BM25 vs. Lucene Default Similarity]: https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
248+
[custom-analyzers-fuzzy]: https://community.cratedb.com/t/fuzzy-search-synonyms/889
249+
[full-text search]: https://en.wikipedia.org/wiki/Full_text_search
250+
[Improve Your Text Search with Lucene Analyzers]: https://medium.com/@dagliberkay/elastic-text-search-6b778de9b753
251+
[MATCH predicate]: inv:crate-reference#predicates_match
252+
[Okapi BM25]: https://trec.nist.gov/pubs/trec3/papers/city.ps.gz
253+
[search engine]: https://en.wikipedia.org/wiki/Search_engine
254+
[Similarity in Elasticsearch]: https://www.elastic.co/blog/found-similarity-in-elasticsearch
255+
[TREC-3 proceedings]: https://trec.nist.gov/pubs/trec3/t3_proceedings.html
