perf: Remove sort from add and update columns flow by hamersaw · Pull Request #259 · lance-format/lance-spark

hamersaw · 2026-02-20T15:16:40Z

Currently, we sort on _rowaddr so that the BatchWriter can partition by fragment ID, keeping a single fragment writer open at a time while it iterates sequentially over the sorted input data. For very large datasets this sort can be expensive in both time and memory. This PR removes the sort and instead creates per-fragment buffers on the writer, which are packed during record ingestion and flushed on commit. This should be faster on large datasets, and the initial benchmark below shows that fragment buffer packing actually uses less memory than the sort it replaces.
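To make the idea concrete, here is a minimal Python sketch of per-fragment buffering. It is not the code in this PR: the class `FragmentBufferedWriter`, its methods, and the per-fragment ordering at flush time are all hypothetical, and it assumes a row address packs the fragment ID into the upper 32 bits and the row offset into the lower 32 bits.

```python
from collections import defaultdict

# Assumption: fragment ID lives in the high 32 bits of a 64-bit row address.
ROW_OFFSET_BITS = 32
ROW_OFFSET_MASK = (1 << ROW_OFFSET_BITS) - 1


def split_row_addr(row_addr):
    """Split a 64-bit row address into (fragment_id, row_offset)."""
    return row_addr >> ROW_OFFSET_BITS, row_addr & ROW_OFFSET_MASK


class FragmentBufferedWriter:
    """Hypothetical writer: buffers rows per fragment, flushes on commit."""

    def __init__(self):
        # fragment_id -> list of (row_offset, value), in arrival order
        self._buffers = defaultdict(list)

    def write(self, row_addr, value):
        """Route one record to its fragment buffer; input may be unsorted."""
        fragment_id, row_offset = split_row_addr(row_addr)
        self._buffers[fragment_id].append((row_offset, value))

    def commit(self):
        """Flush every buffer and return fragment_id -> values by row offset."""
        flushed = {}
        for fragment_id, rows in self._buffers.items():
            # Assumption: ordering is restored per fragment at flush time,
            # which is a small local sort rather than a global one.
            rows.sort(key=lambda pair: pair[0])
            flushed[fragment_id] = [value for _, value in rows]
        self._buffers.clear()
        return flushed


# Usage: records arrive interleaved across two fragments.
writer = FragmentBufferedWriter()
for addr, val in [((1 << 32) | 1, "b1"), (0, "a0"), ((1 << 32) | 0, "b0"), (1, "a1")]:
    writer.write(addr, val)
result = writer.commit()
print(result)
```

The trade-off in this sketch is that every fragment seen by a task keeps a buffer open until commit, in exchange for avoiding a global sort of the input.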

Closes #255.

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

github-actions · 2026-02-20T15:16:57Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

hamersaw · 2026-02-22T03:22:09Z

Ran a small Spark test locally: set up a test cluster with 1 master and 3 workers, each with 2 CPUs and 8G of RAM. Wrote a Lance dataset with 10M rows over 1k fragments (the number of fragments and the size of the additional column(s) should determine how memory scales for this approach). Then added a small column (hash(int)) and compared the existing (sorted) approach to this (fragment-buffered) one.

For the current sorting approach, execution took 27.482s and memory peaked at around 1450M on each of the 3 workers:
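As a rough back-of-the-envelope check on the scaling claim, the sketch below estimates the raw payload of the added column for this test. The helper names are hypothetical, the 4-byte value width is an assumption (a 32-bit hash result), rows are assumed to be spread evenly over fragments, and JVM and Arrow overhead is ignored entirely.

```python
def rows_per_fragment(total_rows: int, num_fragments: int) -> int:
    """Average rows per fragment, assuming rows are spread evenly."""
    return total_rows // num_fragments


def raw_column_bytes(total_rows: int, bytes_per_value: int) -> int:
    """Raw payload of the added column, ignoring JVM/Arrow overhead."""
    return total_rows * bytes_per_value


TOTAL_ROWS = 10_000_000  # from the test setup
NUM_FRAGMENTS = 1_000    # from the test setup
BYTES_PER_VALUE = 4      # assumption: 32-bit hash output

avg_rows = rows_per_fragment(TOTAL_ROWS, NUM_FRAGMENTS)
total_mb = raw_column_bytes(TOTAL_ROWS, BYTES_PER_VALUE) / 1_000_000
print(f"{avg_rows} rows per fragment, {total_mb} MB of raw column data")
```

Under these assumptions, wider columns or more rows raise the buffered payload proportionally, which is why a test with a larger added column would be a useful follow-up.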

== Memory Used (MiB) ==
 1466 ┤                                            ╭──────────────
 1381 ┤                                           ╭╯╭─────────────
 1296 ┤                                          ╭╯─╯
 1210 ┤                                         ╭╯│
 1125 ┤                                      ╭──╯─╯
 1040 ┤                                      ││╭╯
  955 ┤                                     ╭│╭╯
  869 ┤                                ╭───╮╭╯╯
  784 ┤                         ╭─────╭─────╯╯
  699 ┤             ╭──╭──────────────╯─╯
  613 ┤     ╭─────╭────╯──────────────────────────────────────────
  528 ┼─────╯   ╭╭╯
  443 ┼──────────╯
  358 ┤
  272 ┤
  187 ┼───────────────────────────────────────────────────────────
                            Memory Used (MiB)

       ■ spark-worker-run-99611c5c6c51   ■ spark-worker-2   ■ spark-worker-1   ■ spark-worker-3   ■ spark-master

For the new (buffered) approach, runtime was 26.112s and memory use maxed at around 880M on each of the 3 workers:

== Memory Used (MiB) ==
 884 ┤                                        ╭╮       ╭────╮
 837 ┤                                   ╭────╯│  ╭────╯────╰────
 789 ┤                               ╭───╯ ╭──────╯─╯
 742 ┤                     ╭─────────╯─╭───╯──╯
 695 ┤                    ╭────────────╯
 648 ┤               ╭╭───╯
 600 ┤            ╭───╯─╯
 553 ┤      ╭─────│╭╯────────────────────────────────────────────
 506 ┼──────╯    ╭╯╯
 458 ┤         ╭╭╯╯
 411 ┼───────╭──╯╯
 364 ┼───────╯
 316 ┤
 269 ┤
 222 ┤
 175 ┼───────────────────────────────────────────────────────────
 
      ■ spark-worker-run-0798675b8a41   ■ spark-worker-1   ■ spark-worker-2   ■ spark-worker-3   ■ spark-master

This is a little surprising: either (1) the sort is much more expensive than we thought or (2) this benchmark is very inaccurate. Regardless, it feels reasonable to open this PR for review.
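For reference, the relative differences between the two runs can be computed from the figures quoted above. This is plain arithmetic on a single local run, so it says nothing about variance; the helper name `pct_reduction` is just for illustration.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Figures quoted in this comment (single run each).
runtime_pct = pct_reduction(27.482, 26.112)  # seconds
memory_pct = pct_reduction(1450, 880)        # approximate peak MiB per worker

print(f"runtime: {runtime_pct:.1f}% lower, peak memory: {memory_pct:.1f}% lower")
```

That works out to roughly 5% less runtime and roughly 39% less peak memory per worker for the buffered approach in this test.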

hamersaw · 2026-02-23T15:56:31Z

@jiaoew1991 interested in your thoughts here since this is riffing on your initial approach.

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hamersaw added 3 commits February 20, 2026 01:57

remove shuffle on write

78395bc

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

ensuring all buffers are closed correctly

a4598ea

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

refactored to AbstractBackfillWriter

3581de9

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hamersaw changed the title ~~Remove sort from ADD / UPDATE COLUMNS flow~~ feat: Remove sort from ADD / UPDATE COLUMNS flow Feb 20, 2026

github-actions bot added the enhancement New feature or request label Feb 20, 2026

hamersaw changed the title ~~feat: Remove sort from ADD / UPDATE COLUMNS flow~~ perf: Remove sort from ADD / UPDATE COLUMNS flow Feb 20, 2026

github-actions bot added the performance Features that improves performance label Feb 20, 2026

hamersaw changed the title ~~perf: Remove sort from ADD / UPDATE COLUMNS flow~~ perf: Remove sort from add and update columns flow Feb 20, 2026

hamersaw marked this pull request as ready for review February 22, 2026 03:23

hamersaw requested a review from jackye1995 February 22, 2026 03:24

hamersaw added 2 commits February 23, 2026 10:10

Merge remote-tracking branch 'upstream/main' into feature/buffer-writer

622dcf9

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hopefully last time

a4c0074

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hamersaw wants to merge 5 commits into lance-format:main from hamersaw:feature/buffer-writer
