Allow the part files to be skipped when training FTS #5970

@westonpace

FTS index training runs in two phases.

First, workers scan the column and tokenize the input. If a worker's tokenized input grows too large, it is spilled to disk as a part file. The spill threshold is controlled by LANCE_FTS_PARTITION_SIZE.

Second, the part files are scanned one at a time and merged into the final index. The final index is sharded: whenever the in-progress index grows too large, it is written out as a shard. The shard size is controlled by LANCE_FTS_TARGET_SIZE. At search time these shards are searched in parallel.

We could skip the intermediate write if we interleave tokenizing with the construction of the final index. When a part is big enough to spill, it could instead be sent over a shared channel to a writer thread. The writer thread would immediately flush the part into the index builder. The index builder would then write out shards as normal (we would still have LANCE_FTS_TARGET_SIZE but would no longer use LANCE_FTS_PARTITION_SIZE). Based on my experiments, this would cut the index building time roughly in half (and perhaps give an even bigger performance boost on systems with many cores).
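To make the proposed flow concrete, here is a minimal Python sketch of the idea, not Lance's actual implementation. It models the tokenizer workers, the shared channel, and the single writer thread with the standard library only. The names `PART_SIZE`, `TARGET_SIZE`, `CHANNEL_CAPACITY`, `tokenize_worker`, `writer`, and `build_index` are all hypothetical, sizes are counted in tokens rather than bytes, and "writing a shard" is modeled as appending a dict to a list.

```python
import queue
import threading
from collections import defaultdict

# Hypothetical knobs for this sketch (counted in tokens, not bytes).
PART_SIZE = 4         # tokens a worker buffers before handing off a part
TARGET_SIZE = 10      # tokens the builder holds before writing out a shard
CHANNEL_CAPACITY = 2  # max parts waiting between tokenizers and the writer

_DONE = object()  # sentinel each worker sends when it has no more input


def tokenize_worker(docs, channel):
    """Tokenize documents; hand a full part to the writer instead of spilling."""
    part = []  # list of (token, doc_id) pairs
    for doc_id, text in docs:
        part.extend((token, doc_id) for token in text.lower().split())
        if len(part) >= PART_SIZE:
            channel.put(part)  # blocks while the channel is full
            part = []
    if part:
        channel.put(part)
    channel.put(_DONE)


def writer(channel, num_workers, shards):
    """Flush incoming parts into the builder; emit a shard when it is big enough."""
    builder, size, finished = defaultdict(set), 0, 0
    while finished < num_workers:
        part = channel.get()
        if part is _DONE:
            finished += 1
            continue
        for token, doc_id in part:
            builder[token].add(doc_id)
        size += len(part)
        if size >= TARGET_SIZE:
            shards.append(dict(builder))  # stands in for writing a shard
            builder, size = defaultdict(set), 0
    if builder:
        shards.append(dict(builder))


def build_index(doc_batches):
    """Run one tokenizer thread per batch plus one writer thread."""
    channel = queue.Queue(maxsize=CHANNEL_CAPACITY)
    shards = []
    writer_thread = threading.Thread(
        target=writer, args=(channel, len(doc_batches), shards)
    )
    workers = [
        threading.Thread(target=tokenize_worker, args=(batch, channel))
        for batch in doc_batches
    ]
    writer_thread.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    writer_thread.join()
    return shards
```

Calling `build_index([[(0, "a b c"), (1, "b c d")], [(2, "c d e")]])` returns a list of shard dicts mapping each token to the set of document ids containing it. No part file is ever written; the only intermediate storage is whatever sits in the channel.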

Perhaps more significantly, it would reduce the temporary disk space required to build an FTS index.

The downside is that it could result in higher RAM consumption. However, we could presumably limit the size of the channel between the tokenizer threads and the writer thread, which should still bound the total RAM usage.
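As a rough illustration of why a bounded channel bounds memory, the following back-of-the-envelope estimate may help. It rests on assumptions that are mine rather than anything confirmed about Lance: each tokenizer holds at most one part while filling it or waiting to send it, the channel holds at most `channel_capacity` parts, the writer holds one part while flushing, and the builder holds up to one shard's worth of data. The function name and all parameter values are hypothetical.

```python
def peak_buffered_mib(num_tokenizers, channel_capacity, part_mib, target_mib):
    """Estimate worst-case buffered data (MiB) under the assumptions above."""
    in_tokenizers = num_tokenizers * part_mib  # one part per tokenizer
    in_channel = channel_capacity * part_mib   # parts queued for the writer
    in_writer = part_mib                       # part currently being flushed
    in_builder = target_mib                    # in-progress shard
    return in_tokenizers + in_channel + in_writer + in_builder


# Illustrative numbers only, not Lance defaults.
estimate = peak_buffered_mib(
    num_tokenizers=8, channel_capacity=4, part_mib=256, target_mib=4096
)
print(estimate)  # 7424
```

Under this model the total grows linearly with the channel capacity, so capping the capacity caps the extra RAM the interleaved design would need.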
