Allow the part files to be skipped when training FTS

FTS index training runs in two phases.

First, workers scan the column and tokenize the input.  If the tokenized data grows too large, it is spilled to disk as part files.  The spill threshold is controlled by `LANCE_FTS_PARTITION_SIZE`.

Second, the part files are scanned one at a time and merged into one large index.  This large index is sharded: whenever the in-progress index grows too large, it is written out as a shard.  The shard size is controlled by `LANCE_FTS_TARGET_SIZE`.  At search time these shards are searched in parallel.
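To make the two phases concrete, here is a minimal toy sketch of the flow described above. It is not Lance's implementation: the function names, the `Postings` type, and measuring size in posting entries (rather than bytes) are all assumptions made for illustration, and the parts are kept in memory where the real code writes part files.

```rust
use std::collections::BTreeMap;

/// Hypothetical toy posting list: token -> ids of rows containing it.
type Postings = BTreeMap<String, Vec<u64>>;

/// Phase 1 (sketch): tokenize rows and cut a new "part" whenever the
/// in-memory postings reach `partition_size` entries. Real code would
/// spill each part to disk; here the parts are simply returned.
fn tokenize_into_parts(rows: &[&str], partition_size: usize) -> Vec<Postings> {
    let mut parts = Vec::new();
    let mut current = Postings::new();
    let mut entries = 0;
    for (row_id, text) in rows.iter().enumerate() {
        for token in text.split_whitespace() {
            current
                .entry(token.to_lowercase())
                .or_default()
                .push(row_id as u64);
            entries += 1;
        }
        // Threshold is checked after each row, so a part can overshoot slightly.
        if entries >= partition_size {
            parts.push(std::mem::take(&mut current));
            entries = 0;
        }
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

/// Phase 2 (sketch): merge parts one at a time into a builder and write
/// out a shard whenever the builder reaches `target_size` entries.
fn merge_into_shards(parts: Vec<Postings>, target_size: usize) -> Vec<Postings> {
    let mut shards = Vec::new();
    let mut builder = Postings::new();
    let mut entries = 0;
    for part in parts {
        for (token, mut rows) in part {
            entries += rows.len();
            builder.entry(token).or_default().append(&mut rows);
        }
        if entries >= target_size {
            shards.push(std::mem::take(&mut builder));
            entries = 0;
        }
    }
    if !builder.is_empty() {
        shards.push(builder);
    }
    shards
}
```

In this sketch `partition_size` plays the role of `LANCE_FTS_PARTITION_SIZE` and `target_size` the role of `LANCE_FTS_TARGET_SIZE`; every posting is written once into a part and then again into a shard, which is the intermediate write the proposal aims to remove.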

We could skip the intermediate write if we interleave tokenizing with the construction of the final index.  When a part is big enough to spill, instead of spilling, it could be sent on a shared channel to a writer thread.  The writer thread would immediately flush the part into the index builder.  The index builder would then write out shards as normal (we would still have `LANCE_FTS_TARGET_SIZE` but no longer use `LANCE_FTS_PARTITION_SIZE`).  From my experiments this would cut the index building time roughly in half (perhaps with an even larger gain on systems with many cores).

Perhaps more significantly, it would reduce the temporary disk space required to build an FTS index.

The downside is that it would possibly result in higher RAM consumption, although we could presumably limit the size of the channel between the tokenizer threads and the writer thread, which should still bound the total RAM usage.
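As a rough sketch of the proposed pipeline, assuming nothing about Lance's internals: tokenizer threads send finished parts over a bounded `std::sync::mpsc::sync_channel` to a single writer thread, which merges them into a builder and cuts shards at a target size. The names `tokenize_chunk`, `build_interleaved`, `channel_capacity`, and the `Postings` type are hypothetical, sizes are counted in posting entries rather than bytes, and one thread per chunk is used only to keep the example short.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical toy posting list: token -> ids of rows containing it.
type Postings = BTreeMap<String, Vec<u64>>;

/// Tokenize one chunk of (row_id, text) pairs into an in-memory part.
fn tokenize_chunk(chunk: &[(u64, String)]) -> Postings {
    let mut part = Postings::new();
    for (row_id, text) in chunk {
        for token in text.split_whitespace() {
            part.entry(token.to_lowercase()).or_default().push(*row_id);
        }
    }
    part
}

/// Build shards without intermediate part files (illustrative only).
fn build_interleaved(
    chunks: Vec<Vec<(u64, String)>>,
    channel_capacity: usize,
    target_size: usize,
) -> Vec<Postings> {
    // Bounded channel: `send` blocks once `channel_capacity` parts are
    // queued, which applies backpressure to the tokenizer threads.
    let (tx, rx) = sync_channel::<Postings>(channel_capacity);

    // Single writer thread: merges parts and cuts a shard at `target_size`.
    let writer = thread::spawn(move || {
        let mut shards = Vec::new();
        let mut builder = Postings::new();
        let mut entries = 0;
        for part in rx {
            for (token, mut rows) in part {
                entries += rows.len();
                builder.entry(token).or_default().append(&mut rows);
            }
            if entries >= target_size {
                shards.push(std::mem::take(&mut builder));
                entries = 0;
            }
        }
        if !builder.is_empty() {
            shards.push(builder);
        }
        shards
    });

    // Tokenizer threads: send each part instead of spilling it to disk.
    let workers: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let tx = tx.clone();
            thread::spawn(move || {
                tx.send(tokenize_chunk(&chunk))
                    .expect("writer thread exited early");
            })
        })
        .collect();
    // Drop the original sender so the writer loop ends when workers finish.
    drop(tx);

    for worker in workers {
        worker.join().expect("tokenizer thread panicked");
    }
    writer.join().expect("writer thread panicked")
}
```

Under these assumptions, peak memory is roughly the builder plus at most `channel_capacity` queued parts plus one in-flight part per tokenizer thread. Parts arrive in whatever order the threads finish, so row ids within a posting list are not sorted here; a real builder would have to account for that.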


Allow the part files to be skipped when training FTS #5970
