Commit 352f953

Metadata filtering (asg017#124)
* initial pass at PARTITION KEY support.
* Initial pass, allow auxiliary columns on vec0 virtual tables
* update TODO
* Initial pass at metadata filtering
* unit tests
* gha this PR branch
* fixup tests
* doc internal
* fix tests, KNN/rowids in
* define SQLITE_INDEX_CONSTRAINT_OFFSET
* whoops
* update tests, syrupy, use uv
* un ignore pyproject.toml
* dot
* tests/
* type error?
* win: .exe, update error name
* try fix macos python, paren around expr?
* win bash?
* dbg :(
* explicit error
* op
* dbg win
* win ./tests/.venv/Scripts/python.exe
* block UPDATEs on partition key values for now
* test this branch
* accidentally removved "partition key type mistmatch" block during merge
* typo ugh
* bruv
* start aux snapshots
* drop aux shadow table on destroy
* enforce column types
* block WHERE constraints on auxiliary columns in KNN queries
* support delete
* support UPDATE on auxiliary columns
* test this PR
* dont inline that
* test-metadata.py
* memzero text buffer
* stress test
* more snpashot tests
* rm double/int32, just float/int64
* finish type checking
* long text support
* DELETE support
* UPDATE support
* fix snapshot names
* drop not-used in eqp
* small fixes
* boolean comparison handling
* ensure error is raised when long string constraint
* new version string for beta builds
* typo whoops
* ann-filtering-benchmark directory
* test-case
* updates
* fix aux column error when using non-default rowid values, needs test
* refactor some text knn filtering
* rowids blob read only on text metadata filters
* refactor
* add failing test causes for non eq text knn
* text knn NE
* test cases diff
* GT
* text knn GT/GE fixes
* text knn LT/LE
* clean
* vtab_in handling
* unblock aux failures for now
* guard sqlite3_vtab_in
* else in guard?
* fixes and tests
* add broken shadow table test
* rename _metadata_chunksNN shadown table to _metadatachunksNN, for proper shadowName detection
* _metadata_text_NN shadow tables to _metadatatextNN
* SQLITE_VEC_VERSION_MAJOR SQLITE_VEC_VERSION_MINOR and SQLITE_VEC_VERSION_PATCH in sqlite-vec.h
* _info shadow table
* forgot to update aux snapshot?
* fix aux tests
1 parent 9bfeaa7 commit 352f953

21 files changed, +7366 -110 lines changed

.github/workflows/test.yaml

+1
@@ -5,6 +5,7 @@ on:
55
- main
66
- partition-by
77
- auxiliary
8+
- metadata-filtering
89
permissions:
910
contents: read
1011
jobs:

.gitignore

+5
@@ -26,3 +26,8 @@ sqlite-vec.h
2626
tmp/
2727

2828
poetry.lock
29+
30+
*.jsonl
31+
32+
memstat.c
33+
memstat.*

ARCHITECTURE.md

+75-7
@@ -1,5 +1,51 @@
1+
# `sqlite-vec` Architecture
2+
3+
Internal documentation for how `sqlite-vec` works under the hood. Not meant for
4+
users of the `sqlite-vec` project, who should consult
5+
[the official `sqlite-vec` documentation](https://alexgarcia.xyz/sqlite-vec) for
6+
how-to guides. Rather, this is for people interested in how `sqlite-vec` works
7+
and offers some guidelines for future contributors.
8+
9+
Very much a WIP.
10+
111
## `vec0`
212

13+
### Shadow Tables
14+
15+
#### `xyz_chunks`
16+
17+
- `chunk_id INTEGER`
18+
- `size INTEGER`
19+
- `validity BLOB`
20+
- `rowids BLOB`
21+
22+
#### `xyz_rowids`
23+
24+
- `rowid INTEGER`
25+
- `id`
26+
- `chunk_id INTEGER`
27+
- `chunk_offset INTEGER`
28+
29+
#### `xyz_vector_chunksNN`
30+
31+
- `rowid INTEGER`
32+
- `vector BLOB`
33+
34+
#### `xyz_auxiliary`
35+
36+
- `rowid INTEGER`
37+
- `valueNN [type]`
38+
39+
#### `xyz_metadatachunksNN`
40+
41+
- `rowid INTEGER`
42+
- `data BLOB`
43+
44+
#### `xyz_metadatatextNN`
45+
46+
- `rowid INTEGER`
47+
- `data TEXT`
48+
349
### idxStr
450

551
The `vec0` idxStr is a string composed of a single "header" character and 0 or
@@ -14,8 +60,11 @@ The "header" character denotes the type of query plan, as determined by the
1460
| `VEC0_QUERY_PLAN_POINT` | `'2'` | Perform a single-lookup point query for the provided rowid |
1561
| `VEC0_QUERY_PLAN_KNN` | `'3'` | Perform a KNN-style query on the provided query vector and parameters. |
1662

17-
Each 4-character "block" is associated with a corresponding value in `argv[]`. For example, the 1st block at byte offset `1-4` (inclusive) is the 1st block and is associated with `argv[1]`. The 2nd block at byte offset `5-8` (inclusive) is associated with `argv[2]` and so on. Each block describes what kind of value or filter the given `argv[i]` value is.
18-
63+
Each 4-character "block" is associated with a corresponding value in `argv[]`.
64+
For example, the 1st block at byte offset `1-4` (inclusive)
65+
is associated with `argv[1]`. The 2nd block at byte offset `5-8` (inclusive) is
66+
associated with `argv[2]` and so on. Each block describes what kind of value or
67+
filter the given `argv[i]` value is.
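The block layout described above can be sketched in C as follows. This is a minimal illustration, not code from `sqlite-vec`: the helper names `idxstr_block_offset`, `idxstr_block_count`, and `idxstr_block_kind` and the `IDXSTR_BLOCK_SIZE` macro are hypothetical, and blocks are numbered from 1 to match the text (byte 0 holds the header character, block 1 occupies bytes 1-4).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Width of one idxStr block, per the description above. */
#define IDXSTR_BLOCK_SIZE 4

/* Byte offset of the n-th block, numbered from 1 as in the text.
 * Byte 0 is the "header" character, so block 1 starts at byte 1. */
static size_t idxstr_block_offset(size_t block_number) {
  assert(block_number >= 1);
  return 1 + (block_number - 1) * IDXSTR_BLOCK_SIZE;
}

/* Number of complete 4-character blocks that follow the header. */
static size_t idxstr_block_count(const char *idxStr) {
  size_t len = strlen(idxStr);
  if (len == 0) {
    return 0;
  }
  return (len - 1) / IDXSTR_BLOCK_SIZE;
}

/* First character of the n-th block, which identifies the block kind
 * (for example '{' or '['). */
static char idxstr_block_kind(const char *idxStr, size_t block_number) {
  assert(block_number >= 1 && block_number <= idxstr_block_count(idxStr));
  return idxStr[idxstr_block_offset(block_number)];
}
```

For an idxStr such as `"3{___[___"`, the header is `'3'`, the first block is `"{___"` at bytes 1-4, and the second block is `"[___"` at bytes 5-8.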
1968

2069
#### `VEC0_IDXSTR_KIND_KNN_MATCH` (`'{'`)
2170

@@ -31,24 +80,43 @@ The remaining 3 characters of the block are `_` fillers.
3180

3281
#### `VEC0_IDXSTR_KIND_KNN_ROWID_IN` (`'['`)
3382

34-
`argv[i]` is the optional `rowid in (...)` value, and must be handled with [`sqlite3_vtab_in_first()` /
35-
`sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).
83+
`argv[i]` is the optional `rowid in (...)` value, and must be handled with
84+
[`sqlite3_vtab_in_first()` / `sqlite3_vtab_in_next()`](https://www.sqlite.org/c3ref/vtab_in_first.html).
3685

3786
The remaining 3 characters of the block are `_` fillers.
3887

3988
#### `VEC0_IDXSTR_KIND_KNN_PARTITON_CONSTRAINT` (`']'`)
4089

4190
`argv[i]` is a "constraint" on a specific partition key.
4291

43-
The second character of the block denotes which partition key to filter on, using `A` to denote the first partition key column, `B` for the second, etc. It is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.
92+
The second character of the block denotes which partition key to filter on,
93+
using `A` to denote the first partition key column, `B` for the second, etc. It
94+
is encoded with `'A' + partition_idx` and can be decoded with `c - 'A'`.
4495

45-
The third character of the block denotes which operator is used in the constraint. It will be one of the values of `enum vec0_partition_operator`, as only a subset of operations are supported on partition keys.
96+
The third character of the block denotes which operator is used in the
97+
constraint. It will be one of the values of `enum vec0_partition_operator`, as
98+
only a subset of operations are supported on partition keys.
4699

47100
The fourth character of the block is a `_` filler.
48101

49-
50102
#### `VEC0_IDXSTR_KIND_POINT_ID` (`'!'`)
51103

52104
`argv[i]` is the value of the rowid or id to match against for the point query.
53105

54106
The remaining 3 characters of the block are `_` fillers.
107+
108+
#### `VEC0_IDXSTR_KIND_METADATA_CONSTRAINT` (`'&'`)
109+
110+
`argv[i]` is the value of the `WHERE` constraint for a metadata column in a KNN
111+
query.
112+
113+
The second character of the block denotes which metadata column the constraint
114+
belongs to, using `A` to denote the first metadata column, `B` for the
115+
second, etc. It is encoded with `'A' + metadata_idx` and can be decoded with
116+
`c - 'A'`.
117+
118+
The third character of the block is the constraint operator. It will be one of
119+
`enum vec0_metadata_operator`, as only a subset of operators are supported on
120+
metadata column KNN filters.
121+
122+
The fourth character of the block is a `_` filler.
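
As a rough sketch of the metadata constraint block layout, the following C helpers build and parse one 4-character block. The function names and the `HYPOTHETICAL_OP` character are assumptions made for illustration; the real operator characters come from `enum vec0_metadata_operator`, whose values are not listed in this document. Only the `'&'` kind character, the `'A' + metadata_idx` encoding, and the `_` filler are taken from the text.

```c
#include <assert.h>

/* Kind character for metadata constraint blocks, from the heading above. */
#define IDXSTR_KIND_METADATA_CONSTRAINT '&'

/* Placeholder operator character for illustration only. The real values
 * are defined by enum vec0_metadata_operator and are not shown here. */
#define HYPOTHETICAL_OP 'a'

/* Write a 4-character metadata constraint block into `out`.
 * The column index is encoded as 'A' + metadata_idx. */
static void metadata_block_encode(char out[4], int metadata_idx, char op) {
  assert(metadata_idx >= 0);
  out[0] = IDXSTR_KIND_METADATA_CONSTRAINT;
  out[1] = (char)('A' + metadata_idx);
  out[2] = op;
  out[3] = '_';
}

/* Parse a 4-character block. Returns 1 and fills the outputs if the block
 * is a metadata constraint block, otherwise returns 0. */
static int metadata_block_decode(const char block[4], int *metadata_idx,
                                 char *op) {
  if (block[0] != IDXSTR_KIND_METADATA_CONSTRAINT || block[3] != '_') {
    return 0;
  }
  *metadata_idx = block[1] - 'A'; /* decode with c - 'A' */
  *op = block[2];
  return 1;
}
```

Encoding metadata column index 1 with the placeholder operator yields the block `&Ba_`, and decoding that block recovers index 1 and the same operator character.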

Makefile

+3
@@ -153,6 +153,9 @@ sqlite-vec.h: sqlite-vec.h.tmpl VERSION
153153
VERSION=$(shell cat VERSION) \
154154
DATE=$(shell date -r VERSION +'%FT%TZ%z') \
155155
SOURCE=$(shell git log -n 1 --pretty=format:%H -- VERSION) \
156+
VERSION_MAJOR=$$(echo $$VERSION | cut -d. -f1) \
157+
VERSION_MINOR=$$(echo $$VERSION | cut -d. -f2) \
158+
VERSION_PATCH=$$(echo $$VERSION | cut -d. -f3 | cut -d- -f1) \
156159
envsubst < $< > $@
157160

158161
clean:

TODO

+16-12
@@ -1,13 +1,17 @@
1-
# partition
1+
- [ ] add `xyz_info` shadow table with version etc.
22

3-
- [ ] UPDATE on partition key values
4-
- remove previous row from chunk, insert into new one?
5-
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling
6-
7-
# auxiliary columns
8-
9-
- later:
10-
- NOT NULL?
11-
- perf: INSERT stmt should be cached on vec0_vtab
12-
- perf: LEFT JOIN aux table to rowids query in vec0_cursor for rowid/point
13-
stmts, to avoid N lookup queries
3+
- later
4+
- [ ] partition: UPDATE support
5+
- [ ] skip invalid validity entries in knn filter?
6+
- [ ] nulls in metadata
7+
- [ ] partition `x in (...)` handling
8+
- [ ] blobs/date/datetime
9+
- [ ] uuid/ulid perf
10+
- [ ] Aux columns: `NOT NULL` constraint
11+
- [ ] Metadata columns: `NOT NULL` constraint
12+
- [ ] Partition key: `NOT NULL` constraint
13+
- [ ] dictionary encoding?
14+
- [ ] properly sqlite3_vtab_nochange / sqlite3_value_nochange handling
15+
- [ ] perf
16+
- [ ] aux: cache INSERT
17+
- [ ] aux: LEFT JOIN on `_rowids` queries to avoid N lookup queries
