Currently, when parallelizing compaction tasks (e.g. over large datasets), each task individually calls reserve_fragment_ids. This operation requires a transactional manifest update. When many of these calls occur in parallel, one succeeds and the others must be retried; each retry incurs a backoff delay before re-reading the manifest and reattempting the commit.
In my test, parallelized compaction (24 threads) over a 100k-fragment dataset (16 rows per fragment), targeting 144 fragments in the result, required 400+ manifest write attempts (144 expected). This shows up as a 1h53m runtime, significantly slower than simply rewriting the dataset into 144 fragments (23m55s).
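To illustrate why per-task reservations inflate the write count, here is a minimal sketch (not Lance's actual implementation) of the worst case under optimistic concurrency: every in-flight task attempts a commit, exactly one wins, and all losers re-read and retry. The function name and model are assumptions for illustration only.

```python
def worst_case_commit_attempts(n_tasks: int) -> int:
    """Worst-case total manifest write attempts when n_tasks each
    need one transactional commit and every conflicting task retries.

    Round 1: all n tasks attempt, 1 succeeds.
    Round 2: n-1 tasks retry, 1 succeeds. ... and so on.
    Total = n + (n-1) + ... + 1 = n * (n + 1) / 2.
    """
    attempts = 0
    remaining = n_tasks
    while remaining > 0:
        attempts += remaining  # every live task writes; one commit wins
        remaining -= 1
    return attempts


print(worst_case_commit_attempts(144))  # 10440
```

The best case is 144 attempts (no conflicts) and the worst case is 144 * 145 / 2 = 10,440; the ~400 attempts observed sits between the two, consistent with partial overlap among the 24 threads. Reserving all fragment IDs in a single up-front call would reduce this to one manifest update regardless of parallelism.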