Commit 274f983

Indexing and storage: Improve "doc values" section
1 parent 75407da commit 274f983

1 file changed: +24 -1 lines changed

docs/feature/storage/indexing-and-storage.md

Lines changed: 24 additions & 1 deletion
@@ -224,7 +224,10 @@ and y in [9,11], the engine does the following:
 
 ## Doc values
 
-Until Lucene 4.0 columns were indexed using an inverted index data structure
+:::{rubric} Data storage prior to Lucene 4.0
+:::
+
+Until Lucene 4.0, columns were indexed using an inverted index data structure
 that maps terms to document ids. For searching documents by terms, this approach
 is effective and well-suited.
 However, if we have to find field values given document id, this solution
@@ -234,12 +237,29 @@ retrieval of data, it was necessary to traverse and extract all fields that
 appear in the collection of documents. This can cause memory and performance
 issues if we need to extract a large amount of data.
 
+:::{rubric} What are doc values?
+:::
+
 To improve the performance of aggregations and sorting, a new data structure was
 introduced, namely doc values. Doc values is a column-based data storage built
 at document index time. They store all field values that are not analyzed as
 strings in a compact column, making it more effective for sorting and
 aggregations.
 
+> Doc values are Lucene's column-stride field value storage, letting you
+store numerics (single- or multivalued), sorted keywords (single or
+multivalued) and binary data blobs per document.
+These values are quite fast to access at search time, since they are
+stored column-stride such that only the value for that one field needs
+to be decoded per hit. This is in contrast to Lucene's stored document
+fields, which store all field values for one document together in a
+row-stride fashion, and are therefore relatively slow to access.
+>
+> -- [Document values with Apache Lucene]
+
+:::{rubric} CrateDB's column store
+:::
+
 CrateDB implements Column Store based on doc values in Lucene. The Column Store
 is created for each field in a document and generated as the following
 structures for fields in the Product table:
@@ -275,3 +295,6 @@ the following:
 
 The use of Column Store results in a small disk footprint, thanks to specialized
 compression algorithms such as delta encoding, bit packing, and GCD.
+
+
+[Document values with Apache Lucene]: https://www.elastic.co/blog/sparse-versus-dense-document-values-with-apache-lucene
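
The contrast the changed section draws between an inverted index (term to document ids) and doc values (document id to field value, stored per column) can be illustrated with a small Python sketch. This is not Lucene's or CrateDB's implementation: the sample documents and the helper names `build_inverted_index` and `build_doc_values` are hypothetical, and document ids are assumed to be dense integers starting at 0.

```python
from collections import defaultdict

# Hypothetical sample documents; ids and field names are illustrative only.
docs = {
    0: {"name": "chair", "price": 40},
    1: {"name": "table", "price": 120},
    2: {"name": "chair", "price": 55},
}


def build_inverted_index(docs, field):
    """Map each term of `field` to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        index[docs[doc_id][field]].append(doc_id)
    return dict(index)


def build_doc_values(docs, field):
    """Store the values of `field` in one column, positioned by document id."""
    return [docs[doc_id][field] for doc_id in sorted(docs)]


name_index = build_inverted_index(docs, "name")
price_column = build_doc_values(docs, "price")

# Term -> documents: a single lookup in the inverted index.
chair_ids = name_index["chair"]

# Document -> value: a single positional read in the column.
prices_of_chairs = [price_column[doc_id] for doc_id in chair_ids]

# Sorting and aggregating only touch the one column they need.
ids_by_price = sorted(range(len(price_column)), key=price_column.__getitem__)
total_price = sum(price_column)
```

Answering "which documents contain this term" is cheap with the first structure, while sorting and aggregating by a field only need the second, which is the motivation the section gives for doc values.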
