perf: Remove sort from add and update columns flow#259

Open
hamersaw wants to merge 5 commits intolance-format:mainfrom
hamersaw:feature/buffer-writer

Conversation

@hamersaw
Collaborator

@hamersaw hamersaw commented Feb 20, 2026

Currently, we sort on _rowaddr so that the BatchWriter can partition by fragment ID: iterating sequentially over the sorted input means only a single fragment writer is open at a time. For very large datasets this sort can be expensive in both time and memory. This PR removes the sort and instead keeps per-fragment buffers on the writer, which are packed during record ingestion and flushed on commit. This should be faster on large datasets, and the initial benchmark below shows that the fragment buffer packing actually uses less memory than the currently required sort.
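To make the idea concrete, here is a minimal sketch of per-fragment buffering. It is not the code in this PR: `FragmentBufferedWriter`, `make_row_addr`, and the in-memory dict returned by `commit` are hypothetical stand-ins for the real writer. It assumes the Lance row address layout (fragment ID in the upper 32 bits, row offset in the lower 32 bits) and assumes each buffer is ordered by row offset at flush time, which the description above does not spell out.

```python
from collections import defaultdict

OFFSET_BITS = 32
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_row_addr(fragment_id, row_offset):
    """Pack a fragment ID and row offset into a 64-bit row address."""
    return (fragment_id << OFFSET_BITS) | row_offset


class FragmentBufferedWriter:
    """Hypothetical writer that buffers rows per fragment instead of
    requiring the input to be globally sorted by row address."""

    def __init__(self):
        # fragment ID -> list of (row offset, value) pairs
        self._buffers = defaultdict(list)

    def write(self, row_addr, value):
        # Route each record to its fragment buffer as it arrives,
        # in whatever order the input delivers it.
        fragment_id = row_addr >> OFFSET_BITS
        row_offset = row_addr & OFFSET_MASK
        self._buffers[fragment_id].append((row_offset, value))

    def commit(self):
        # Flush every buffer. Ordering within a fragment is an assumption
        # of this sketch; it sorts only one fragment's rows at a time.
        flushed = {}
        for fragment_id in sorted(self._buffers):
            rows = sorted(self._buffers[fragment_id], key=lambda r: r[0])
            flushed[fragment_id] = [value for _, value in rows]
        self._buffers.clear()
        return flushed
```

The trade-off in this sketch is that one buffer per fragment stays open until commit, rather than a single fragment writer at a time, in exchange for dropping the global sort.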

Closes #255.

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@hamersaw hamersaw changed the title from "Remove sort from ADD / UPDATE COLUMNS flow" to "feat: Remove sort from ADD / UPDATE COLUMNS flow" Feb 20, 2026
@github-actions github-actions bot added the enhancement New feature or request label Feb 20, 2026
@hamersaw hamersaw changed the title from "feat: Remove sort from ADD / UPDATE COLUMNS flow" to "perf: Remove sort from ADD / UPDATE COLUMNS flow" Feb 20, 2026
@github-actions github-actions bot added the performance Features that improves performance label Feb 20, 2026
@hamersaw hamersaw changed the title from "perf: Remove sort from ADD / UPDATE COLUMNS flow" to "perf: Remove sort from add and update columns flow" Feb 20, 2026
@hamersaw
Collaborator Author

hamersaw commented Feb 22, 2026

Did a small Spark test locally: set up a test cluster with 1 master and 3 workers, each with 2 CPUs and 8G RAM. Wrote a Lance dataset with 10M rows over 1k fragments (memory use for this approach should scale with the number of fragments and the size of the additional column(s)). Then added a small column (hash(int)) and compared the existing (sorted) approach to this one (fragment buffered).
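For a sense of scale, a back-of-the-envelope estimate of the raw buffered payload in this test. This is only a sketch: it assumes the added value is 4 bytes wide (Spark's `hash` returns an int), that rows spread evenly across the 3 workers, and it ignores all per-buffer, Arrow, and JVM overhead, so treat the result as a lower bound rather than a measurement.

```python
ROWS = 10_000_000
FRAGMENTS = 1_000
WORKERS = 3
BYTES_PER_VALUE = 4  # assumption: one 4-byte int per row for hash(int)

# Rows and raw payload bytes held by a single fragment buffer.
rows_per_fragment = ROWS // FRAGMENTS
bytes_per_fragment_buffer = rows_per_fragment * BYTES_PER_VALUE

# Raw payload per worker if fragments are spread evenly (assumption).
total_payload_bytes = ROWS * BYTES_PER_VALUE
payload_mib_per_worker = total_payload_bytes / WORKERS / 2**20

print(f"{rows_per_fragment} rows per fragment")
print(f"{bytes_per_fragment_buffer} payload bytes per fragment buffer")
print(f"{payload_mib_per_worker:.1f} MiB of payload per worker")
```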

For the current sorting approach, execution took 27.482s and memory maxed out at around 1450M on each of the 3 workers:

== Memory Used (MiB) ==
 1466 ┤                                            ╭──────────────
 1381 ┤                                           ╭╯╭─────────────
 1296 ┤                                          ╭╯─╯
 1210 ┤                                         ╭╯│
 1125 ┤                                      ╭──╯─╯
 1040 ┤                                      ││╭╯
  955 ┤                                     ╭│╭╯
  869 ┤                                ╭───╮╭╯╯
  784 ┤                         ╭─────╭─────╯╯
  699 ┤             ╭──╭──────────────╯─╯
  613 ┤     ╭─────╭────╯──────────────────────────────────────────
  528 ┼─────╯   ╭╭╯
  443 ┼──────────╯
  358 ┤
  272 ┤
  187 ┼───────────────────────────────────────────────────────────
                            Memory Used (MiB)

       ■ spark-worker-run-99611c5c6c51   ■ spark-worker-2   ■ spark-worker-1   ■ spark-worker-3   ■ spark-master

For the new (buffered) approach, runtime was 26.112s and memory use maxed at around 880M on each of the 3 workers:

== Memory Used (MiB) ==
 884 ┤                                        ╭╮       ╭────╮
 837 ┤                                   ╭────╯│  ╭────╯────╰────
 789 ┤                               ╭───╯ ╭──────╯─╯
 742 ┤                     ╭─────────╯─╭───╯──╯
 695 ┤                    ╭────────────╯
 648 ┤               ╭╭───╯
 600 ┤            ╭───╯─╯
 553 ┤      ╭─────│╭╯────────────────────────────────────────────
 506 ┼──────╯    ╭╯╯
 458 ┤         ╭╭╯╯
 411 ┼───────╭──╯╯
 364 ┼───────╯
 316 ┤
 269 ┤
 222 ┤
 175 ┼───────────────────────────────────────────────────────────
 
      ■ spark-worker-run-0798675b8a41   ■ spark-worker-1   ■ spark-worker-2   ■ spark-worker-3   ■ spark-master

This is a little surprising: either (1) the sort is much more expensive than we thought or (2) this benchmark is very inaccurate. Regardless, it feels reasonable to open this PR for review.
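For reference, the relative differences between the two runs can be computed from the figures quoted above (27.482s and roughly 1450M for the sorted run, 26.112s and roughly 880M for the buffered run). The peak values are approximate readings, and this is a single run of each approach, so the percentages are indicative only. The snippet works out to about 5% less runtime and about 39% lower peak memory per worker.

```python
# Figures as reported in this comment; peak memory values are approximate.
sorted_run = {"seconds": 27.482, "peak_mib": 1450}
buffered_run = {"seconds": 26.112, "peak_mib": 880}


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


runtime_reduction = relative_reduction(
    sorted_run["seconds"], buffered_run["seconds"]
)
memory_reduction = relative_reduction(
    sorted_run["peak_mib"], buffered_run["peak_mib"]
)

print(f"runtime reduction: {runtime_reduction:.1%}")
print(f"peak memory reduction: {memory_reduction:.1%}")
```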

@hamersaw hamersaw marked this pull request as ready for review February 22, 2026 03:23
@hamersaw hamersaw requested a review from jackye1995 February 22, 2026 03:24
@hamersaw
Collaborator Author

@jiaoew1991 interested in your thoughts here since this is riffing on your initial approach.

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Development

Successfully merging this pull request may close these issues.

Unnecessarily ordering by row address on column backfill writers
