The `AddColumnsBackfillWrite` and `UpdateColumnsBackfillWrite` operations require that incoming rows be ordered by the `_rowaddr` field. This can result in a relatively expensive shuffle for large amounts of data. The only reason for this ordering is so that the `BatchWriter` can iterate over the rows with a single fragment writer open at a time, processing the writes sequentially.
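To make the ordering requirement concrete, here is a minimal sketch. It assumes the common Lance row-address layout where the upper 32 bits of `_rowaddr` hold the fragment id and the lower 32 bits hold the row offset within the fragment (the helper names are illustrative, not the actual API):

```python
# Assumed _rowaddr layout: (fragment_id << 32) | row_offset.
# Sorting by _rowaddr therefore groups all rows of a fragment together,
# which is what lets the writer keep only one fragment writer open.
FRAGMENT_SHIFT = 32

def fragment_id(rowaddr: int) -> int:
    """Fragment id encoded in the upper 32 bits (assumed layout)."""
    return rowaddr >> FRAGMENT_SHIFT

def row_offset(rowaddr: int) -> int:
    """Row offset within the fragment, from the lower 32 bits."""
    return rowaddr & ((1 << FRAGMENT_SHIFT) - 1)

# Unsorted addresses interleave fragments 0 and 2:
rowaddrs = [(2 << 32) | 0, (0 << 32) | 5, (0 << 32) | 1]
# After sorting, all of fragment 0 precedes fragment 2, so the writer can
# finish fragment 0 before opening a writer for fragment 2.
ordered = sorted(rowaddrs)
```

The expensive part is producing `ordered` when the data does not fit on one machine: that sort becomes a distributed shuffle.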
To bypass this shuffling phase, we could instead update the `BatchWriter` to accept an unordered list and, during processing, bucket the data into separate per-fragment buffers. On commit, we iterate over these buffers and flush each one to storage as a fragment.
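The bucketing approach can be sketched as follows. This is a minimal illustration, not the real implementation: `BucketingBatchWriter` is a hypothetical name, and it assumes the `_rowaddr` layout with the fragment id in the upper 32 bits.

```python
from collections import defaultdict

FRAGMENT_SHIFT = 32  # assumed: upper 32 bits of _rowaddr = fragment id

class BucketingBatchWriter:
    """Hypothetical writer that buckets unordered rows by fragment."""

    def __init__(self):
        # One in-memory buffer per fragment id.
        self._buffers: dict[int, list] = defaultdict(list)

    def write(self, rowaddr: int, row) -> None:
        # No global ordering requirement: each row is routed to the
        # buffer of the fragment its address belongs to.
        self._buffers[rowaddr >> FRAGMENT_SHIFT].append((rowaddr, row))

    def commit(self) -> dict[int, list]:
        # Flush one fragment at a time. Within a fragment, restore row
        # order by sorting the (small) per-fragment buffer locally.
        flushed = {}
        for frag_id in sorted(self._buffers):
            rows = sorted(self._buffers[frag_id])
            flushed[frag_id] = [row for _, row in rows]
        self._buffers.clear()
        return flushed
```

The key trade: the global shuffle is replaced by a cheap local sort of each fragment's buffer at commit time, since rows within a single fragment buffer are few relative to the whole write.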
We will need to make sure memory requirements are not significantly impacted by this approach. Since all the data is already in memory, maintaining separate buffers for each fragment within the writer should add only minimal overhead. This is expected to be negligible, but should be validated through testing.