feat: Implement PositionDeleteWriter for position delete files #582

Open
shangxinli wants to merge 3 commits into apache:main from shangxinli:implement-position-delete-writer

Conversation

@shangxinli
Contributor

Implement the PositionDeleteWriter following the same PIMPL pattern as DataWriter. The writer supports both buffered WriteDelete(file_path, pos) calls and direct Write(ArrowArray*) for pre-formed batches. Metadata reports content=kPositionDeletes with sort_order_id=nullopt per spec, and tracks referenced_data_file when all deletes target a single file.


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@evindj left a comment


Thanks for working on implementing this feature. The change looks good; my only comment is about the threshold for flushing data.

buffered_positions_.push_back(pos);
referenced_paths_.emplace(file_path);

if (static_cast<int64_t>(buffered_paths_.size()) >= kFlushThreshold) {
Contributor


Do we want to make kFlushThreshold configurable via options?

Contributor Author


Good idea. Made flush_threshold configurable via PositionDeleteWriterOptions with a default of 1000.

const auto& data_file = metadata_result.value().data_files[0];
EXPECT_EQ(data_file->content, DataFile::Content::kPositionDeletes);
EXPECT_GT(data_file->file_size_in_bytes, 0);
}
Contributor


It would be nice if you could test the automatic flush logic.

Contributor Author


Added AutoFlushOnThreshold test that sets threshold to 5, writes 12 deletes, and verifies data was flushed automatically before Close().

Make kFlushThreshold configurable via PositionDeleteWriterOptions with
a default of 1000. Add AutoFlushOnThreshold test that uses a small
threshold to verify the automatic flush logic.
Member

@wgtmac left a comment


Review Summary

I have reviewed the changes against the Java implementation for strict parity, C++ styling, and logic issues. There are a few logic and parity concerns that need to be addressed, mainly regarding RAII memory safety in FlushBuffer, metrics filtering for delete columns, and tracking referenced_paths_ during batch writes.

Note: This review was generated by Gemini.

ArrowArray array;
ArrowError error;
ICEBERG_NANOARROW_RETURN_UNEXPECTED_WITH_ERROR(
ArrowArrayInitFromSchema(&array, &arrow_schema, &error), error);
Member


Logic Issue (Memory Leak):
arrow_schema and array are not protected by RAII wrappers. If any ICEBERG_NANOARROW_RETURN_UNEXPECTED macro fails before arrow_schema.release(&arrow_schema) or before writer_->Write(&array) transfers ownership, the memory will leak.

Fix: Use internal::ArrowSchemaGuard schema_guard(&arrow_schema); and internal::ArrowArrayGuard array_guard(&array); to safely manage their lifecycles.

Contributor Author


Done. Added ArrowSchemaGuard and ArrowArrayGuard in FlushBuffer to ensure proper cleanup on early returns. Also fixed the guard destructors to check release != nullptr before calling ArrowArrayRelease/ArrowSchemaRelease, since Write() transfers ownership of the array via ImportRecordBatch (which sets release = nullptr).

auto split_offsets = writer_->split_offsets();

// Serialize literal bounds to binary format
std::map<int32_t, std::vector<uint8_t>> lower_bounds_map;
Member


Parity Issue (Metrics Bloat):
Java's PositionDeleteWriter explicitly drops field counts (and bounds if referencing multiple files) for the file_path and pos columns to avoid bloating the manifest. C++ currently copies all metrics blindly.

Fix: Add a // TODO or implement logic to filter out lower_bounds, upper_bounds, and counts for MetadataColumns::kDeleteFilePathColumnId and MetadataColumns::kDeleteFilePosColumnId.

Contributor Author


Done. Added metrics filtering in Metadata() to match Java's behavior: always erase value_counts, null_value_counts, and nan_value_counts for kDeleteFilePathColumnId and kDeleteFilePosColumnId. When referenced_paths_.size() > 1, also erase lower_bounds and upper_bounds for those columns. This mirrors Java's MetricsUtil.copyWithoutFieldCounts / copyWithoutFieldCountsAndBounds logic.

new Impl(std::move(options), std::move(delete_schema), std::move(writer)));
}

Status Write(ArrowArray* data) {
Member


Parity / Logic Issue:
If users write batches directly using Write(ArrowArray* data), the referenced_paths_ set is never updated. As a result, Metadata() will fail to correctly populate referenced_data_file when the batch contains deletes for a single data file.

Fix: Add a // TODO: Extract paths from ArrowArray to update referenced_paths_ or return NotImplemented if batch writing should not be directly used this way.

Contributor Author


Added a TODO comment. Extracting paths from the raw ArrowArray would require reading the string column from the C Data Interface, which adds complexity. Leaving this for a follow-up since the primary API is WriteDelete() which correctly tracks paths.

class PositionDeleteWriter::Impl {
public:
static Result<std::unique_ptr<Impl>> Make(PositionDeleteWriterOptions options) {
// Build the position delete schema with file_path and pos columns
Member


Parity Issue (Missing Implementation):
PositionDeleteWriterOptions accepts a row_schema, but Impl::Make completely ignores it, creating a schema with only file_path and pos. The V2 spec allows position deletes to optionally include the deleted row.

Fix: Add a // TODO: Support writing row data if options.row_schema is provided.

Contributor Author


Added a TODO comment. Note that Java has also deprecated the row field in PositionDelete as of v1.11.0, so this is low priority. Will add support in a follow-up if needed.

- Use ArrowSchemaGuard/ArrowArrayGuard in FlushBuffer for memory safety
  on early returns, fixing potential leaks when nanoarrow macros fail
- Fix guards to handle already-consumed arrays (null release check)
- Filter out value_counts/null_value_counts/nan_value_counts for
  delete metadata columns (file_path, pos) to match Java parity;
  also drop bounds when referencing multiple data files
- Add TODO for extracting paths from ArrowArray in Write() to update
  referenced_paths_ for batch writes
- Add TODO for row_schema support in position deletes (V2 spec)