perf: Remove sort from add and update columns flow#259
perf: Remove sort from add and update columns flow#259hamersaw wants to merge 5 commits intolance-format:mainfrom
Conversation
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
|
Did a small Spark test locally - setup a 1 master / 3 worker test cluster locally with 2 CPU / 8G RAM for each. Wrote a lance dataset with 10m rows over 1k fragments (# of fragments / size of additional column(s) should be memory scaling for this approach). Then added a small column ( For the current sorting approach execution time took 27.482s and memory maxed around 1450M on each of the 3 workers: For the new (buffered) approach, runtime was 26.112s and memory use maxed at around 880M on each of the 3 workers: This is a little surprising, either (1) the sort is much more expensive then we thought or (2) this benchmark is very inaccurate. Regardless, It feels reasonable to open this PR for review. |
|
@jiaoew1991 interested in your thoughts here since this is riffing on your initial approach. |
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Currently, we are sorting on
_rowaddrso that in the BatchWriter we can partition by fragment ID, where we have a single fragment writer open at a time when we sequentially iterate over the (sorted) input data. For very large datasets this sort operation can be expensive (ex. time + memory utilization). This PR removes the sort and instead creates per fragment buffers on the writer that are packed during record ingestion and then flushed oncommit. This should be faster on large datasets and the initial benchmark below shows that the fragment buffer packing actually has lower memory utilization than the currently required sort operation.Closes #255.