Building the FTS index currently runs in two phases.
First, workers scan the column and tokenize the input. If the tokenized data grows too large, it is spilled to disk as part files. The spill threshold is controlled by LANCE_FTS_PARTITION_SIZE.
Second, the part files are scanned one at a time and written into one large index. This index is sharded: whenever the in-progress index grows too large, it is written out as a shard. The shard size is controlled by LANCE_FTS_TARGET_SIZE. At search time these shards are searched in parallel.
We could skip the intermediate write if we interleave tokenizing with the construction of the final index. When a part is big enough to spill, instead of spilling, it could be sent on a shared channel to a writer thread. The writer thread would immediately flush the part into the index builder. The index builder would then write out shards as it does today (we would still have LANCE_FTS_TARGET_SIZE but no longer use LANCE_FTS_PARTITION_SIZE). In my experiments this cut the index build time roughly in half, and the gain may be even larger on systems with many cores.
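To make the proposed flow concrete, here is a minimal Python sketch of it. This is not Lance's implementation: `IndexBuilder`, `tokenizer_worker`, `writer`, `build_index`, and the toy sizes `PART_SIZE` and `TARGET_SIZE` (counted in postings rather than bytes) are all hypothetical names invented for illustration, and shards are kept in a list instead of being written to disk.

```python
import queue
import threading
from collections import defaultdict

PART_SIZE = 4     # hypothetical: postings a worker buffers before handing off a part
TARGET_SIZE = 10  # hypothetical stand-in for LANCE_FTS_TARGET_SIZE, in postings
_DONE = object()  # sentinel a worker sends when it has no more parts


class IndexBuilder:
    """Toy in-memory index that emits a shard whenever it reaches target_size."""

    def __init__(self, target_size):
        self.target_size = target_size
        self.postings = defaultdict(list)
        self.size = 0
        self.shards = []  # a real builder would write these to disk

    def add_part(self, part):
        for token, doc_id in part:
            self.postings[token].append(doc_id)
            self.size += 1
            if self.size >= self.target_size:
                self.flush()

    def flush(self):
        if self.size:
            self.shards.append(dict(self.postings))
            self.postings = defaultdict(list)
            self.size = 0


def tokenizer_worker(docs, channel, part_size):
    """Tokenize docs; send each full part to the writer instead of spilling it."""
    part = []
    for doc_id, text in docs:
        part.extend((token, doc_id) for token in text.lower().split())
        if len(part) >= part_size:
            channel.put(part)
            part = []
    if part:
        channel.put(part)
    channel.put(_DONE)


def writer(channel, builder, num_workers):
    """Single writer thread: drain parts straight into the index builder."""
    finished = 0
    while finished < num_workers:
        item = channel.get()
        if item is _DONE:
            finished += 1
        else:
            builder.add_part(item)
    builder.flush()  # emit the final, possibly undersized, shard


def build_index(docs, num_workers=2):
    channel = queue.Queue()  # unbounded here, for simplicity
    builder = IndexBuilder(TARGET_SIZE)
    chunks = [docs[i::num_workers] for i in range(num_workers)]
    workers = [
        threading.Thread(target=tokenizer_worker, args=(chunk, channel, PART_SIZE))
        for chunk in chunks
    ]
    writer_thread = threading.Thread(target=writer, args=(channel, builder, num_workers))
    writer_thread.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    writer_thread.join()
    return builder.shards
```

The point of the sketch is that no part file is ever produced: the only size knob left is the shard threshold in the builder, which matches the idea of keeping LANCE_FTS_TARGET_SIZE and dropping LANCE_FTS_PARTITION_SIZE.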
Perhaps more significantly, it would reduce the temporary disk space required to build an FTS index.
The downside is that it could result in higher RAM consumption. However, we could presumably cap the capacity of the channel between the tokenizer threads and the writer thread, which should still bound the total RAM usage.
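As a rough illustration of why a bounded channel bounds memory, the sketch below estimates the worst-case amount of buffered data. The accounting model is an assumption, not a measurement: it supposes each tokenizer thread holds at most one part (a thread blocked on a full channel cannot start another), the channel holds at most `channel_capacity` parts, the writer holds one part, and the builder holds at most one shard's worth of data. The function name and the numbers in the example are hypothetical.

```python
import queue


def peak_buffered(num_workers, part_size, channel_capacity, target_size):
    """Worst-case buffered data under the assumed model (any consistent unit)."""
    in_workers = num_workers * part_size       # one part per tokenizer thread
    in_channel = channel_capacity * part_size  # parts waiting for the writer
    in_writer = part_size                      # part currently being flushed
    in_builder = target_size                   # index data not yet written as a shard
    return in_workers + in_channel + in_writer + in_builder


# In Python, a bounded channel is queue.Queue(maxsize=n): put() blocks once
# n items are waiting, which is what applies backpressure to the tokenizers.
channel = queue.Queue(maxsize=4)

# Hypothetical numbers in MiB: 8 workers, 256 MiB parts, 4 slots, 1024 MiB shards.
estimate_mib = peak_buffered(8, 256, channel.maxsize, 1024)  # 4352
```

Under this model the channel capacity is the tunable term: lowering it reduces peak RAM at the cost of tokenizer threads stalling more often when the writer falls behind.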