feat: Implement PositionDeleteWriter for position delete files #582
shangxinli wants to merge 3 commits into apache:main
Conversation
Implement the PositionDeleteWriter following the same PIMPL pattern as DataWriter. The writer supports both buffered WriteDelete(file_path, pos) calls and direct Write(ArrowArray*) for pre-formed batches. Metadata reports content=kPositionDeletes with sort_order_id=nullopt per spec, and tracks referenced_data_file when all deletes target a single file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
evindj
left a comment
Thanks for working on implementing this feature. The change looks good, my only comment is around the threshold for flushing data.
```cpp
buffered_positions_.push_back(pos);
referenced_paths_.emplace(file_path);

if (static_cast<int64_t>(buffered_paths_.size()) >= kFlushThreshold) {
```
Do we want to make kFlushThreshold configurable via options?
Good idea. Made flush_threshold configurable via PositionDeleteWriterOptions with a default of 1000.
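The buffering behavior discussed above can be sketched as follows. This is a minimal illustrative reimplementation, not the PR's actual code: the `Options`/`BufferedDeleteWriter` names and the `flush_count()` accessor are assumptions, and the real writer would build an Arrow batch on flush instead of just clearing its buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical options struct mirroring the PR's configurable threshold.
struct Options {
  int64_t flush_threshold = 1000;  // default mentioned in the PR discussion
};

class BufferedDeleteWriter {
 public:
  explicit BufferedDeleteWriter(Options options) : options_(options) {}

  void WriteDelete(std::string file_path, int64_t pos) {
    buffered_paths_.push_back(std::move(file_path));
    buffered_positions_.push_back(pos);
    // Flush automatically once the buffer reaches the configured threshold.
    if (static_cast<int64_t>(buffered_paths_.size()) >= options_.flush_threshold) {
      Flush();
    }
  }

  int flush_count() const { return flush_count_; }
  std::size_t buffered_count() const { return buffered_paths_.size(); }

 private:
  void Flush() {
    // The real implementation would assemble an Arrow record batch here and
    // hand it to the underlying file writer; this sketch only counts flushes.
    buffered_paths_.clear();
    buffered_positions_.clear();
    ++flush_count_;
  }

  Options options_;
  std::vector<std::string> buffered_paths_;
  std::vector<int64_t> buffered_positions_;
  int flush_count_ = 0;
};
```

With a threshold of 5, writing 12 deletes triggers two automatic flushes (at the 5th and 10th delete) and leaves 2 entries buffered for Close() — the same shape as the AutoFlushOnThreshold test described below in the thread.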
```cpp
const auto& data_file = metadata_result.value().data_files[0];
EXPECT_EQ(data_file->content, DataFile::Content::kPositionDeletes);
EXPECT_GT(data_file->file_size_in_bytes, 0);
}
```
It would be nice if you could test the automatic flush logic.
Added AutoFlushOnThreshold test that sets threshold to 5, writes 12 deletes, and verifies data was flushed automatically before Close().
Make kFlushThreshold configurable via PositionDeleteWriterOptions with a default of 1000. Add AutoFlushOnThreshold test that uses a small threshold to verify the automatic flush logic.
wgtmac
left a comment
Review Summary
I have reviewed the changes against the Java implementation for strict parity, C++ styling, and logic issues. There are a few logic and parity concerns that need to be addressed, mainly regarding RAII memory safety in FlushBuffer, metrics filtering for delete columns, and tracking referenced_paths_ during batch writes.
Note: This review was generated by Gemini.
```cpp
ArrowArray array;
ArrowError error;
ICEBERG_NANOARROW_RETURN_UNEXPECTED_WITH_ERROR(
    ArrowArrayInitFromSchema(&array, &arrow_schema, &error), error);
```
Logic Issue (Memory Leak):
arrow_schema and array are not protected by RAII wrappers. If any ICEBERG_NANOARROW_RETURN_UNEXPECTED macro fails before arrow_schema.release(&arrow_schema) or before writer_->Write(&array) transfers ownership, the memory will leak.
Fix: Use internal::ArrowSchemaGuard schema_guard(&arrow_schema); and internal::ArrowArrayGuard array_guard(&array); to safely manage their lifecycles.
Done. Added ArrowSchemaGuard and ArrowArrayGuard in FlushBuffer to ensure proper cleanup on early returns. Also fixed the guard destructors to check release != nullptr before calling ArrowArrayRelease/ArrowSchemaRelease, since Write() transfers ownership of the array via ImportRecordBatch (which sets release = nullptr).
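The guard pattern agreed on above can be sketched like this. The `FakeArrowArray` struct is a deliberately minimal stand-in for the Arrow C Data Interface `ArrowArray` (only the `release` callback matters here), and this `ArrowArrayGuard` is an illustrative reimplementation, not the project's actual class.

```cpp
#include <cassert>

// Minimal stand-in for the Arrow C Data Interface struct. Per the C Data
// Interface contract, a consumer that takes ownership calls release and the
// release callback sets itself to nullptr, marking the struct as consumed.
struct FakeArrowArray {
  void (*release)(FakeArrowArray*) = nullptr;
};

inline void ReleaseArray(FakeArrowArray* array) {
  array->release = nullptr;  // mark as released/consumed
}

// RAII guard: releases the array on scope exit unless ownership was already
// transferred (e.g. a successful Write()/ImportRecordBatch consumed it).
class ArrowArrayGuard {
 public:
  explicit ArrowArrayGuard(FakeArrowArray* array) : array_(array) {}
  ~ArrowArrayGuard() {
    // The null check is the fix discussed above: without it, releasing an
    // already-consumed array would be a double free.
    if (array_ != nullptr && array_->release != nullptr) {
      array_->release(array_);
    }
  }
  ArrowArrayGuard(const ArrowArrayGuard&) = delete;
  ArrowArrayGuard& operator=(const ArrowArrayGuard&) = delete;

 private:
  FakeArrowArray* array_;
};
```

On an early-return error path the guard fires and frees the array; on the happy path the consumer has already set `release = nullptr`, so the guard's destructor is a no-op.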
```cpp
auto split_offsets = writer_->split_offsets();

// Serialize literal bounds to binary format
std::map<int32_t, std::vector<uint8_t>> lower_bounds_map;
```
Parity Issue (Metrics Bloat):
Java's PositionDeleteWriter explicitly drops field counts (and bounds if referencing multiple files) for the file_path and pos columns to avoid bloating the manifest. C++ currently copies all metrics blindly.
Fix: Add a // TODO or implement logic to filter out lower_bounds, upper_bounds, and counts for MetadataColumns::kDeleteFilePathColumnId and MetadataColumns::kDeleteFilePosColumnId.
Done. Added metrics filtering in Metadata() to match Java's behavior: always erase value_counts, null_value_counts, and nan_value_counts for kDeleteFilePathColumnId and kDeleteFilePosColumnId. When referenced_paths_.size() > 1, also erase lower_bounds and upper_bounds for those columns. This mirrors Java's MetricsUtil.copyWithoutFieldCounts / copyWithoutFieldCountsAndBounds logic.
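The filtering rule described above can be sketched as a small standalone function. The `Metrics` struct and `FilterDeleteColumnMetrics` name are illustrative; the constants follow the Iceberg table spec's reserved field ids for position delete files (2147483546 for file_path, 2147483545 for pos), though the exact ids used by the project's MetadataColumns should be treated as the source of truth.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Assumed reserved field ids for the position delete file columns.
constexpr int32_t kDeleteFilePathColumnId = 2147483546;  // file_path
constexpr int32_t kDeleteFilePosColumnId = 2147483545;   // pos

// Simplified per-column metrics container, keyed by field id.
struct Metrics {
  std::map<int32_t, int64_t> value_counts;
  std::map<int32_t, int64_t> null_value_counts;
  std::map<int32_t, int64_t> nan_value_counts;
  std::map<int32_t, std::vector<uint8_t>> lower_bounds;
  std::map<int32_t, std::vector<uint8_t>> upper_bounds;
};

// Mirrors the described Java behavior (copyWithoutFieldCounts /
// copyWithoutFieldCountsAndBounds): always drop field counts for the delete
// columns; additionally drop their bounds when multiple data files are
// referenced, since the path bounds are only useful for a single file.
void FilterDeleteColumnMetrics(Metrics& metrics, std::size_t referenced_path_count) {
  for (int32_t id : {kDeleteFilePathColumnId, kDeleteFilePosColumnId}) {
    metrics.value_counts.erase(id);
    metrics.null_value_counts.erase(id);
    metrics.nan_value_counts.erase(id);
    if (referenced_path_count > 1) {
      metrics.lower_bounds.erase(id);
      metrics.upper_bounds.erase(id);
    }
  }
}
```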
```cpp
      new Impl(std::move(options), std::move(delete_schema), std::move(writer)));
}

Status Write(ArrowArray* data) {
```
Parity / Logic Issue:
If users write batches directly using Write(ArrowArray* data), the referenced_paths_ set is never updated. As a result, Metadata() will fail to correctly populate referenced_data_file when the batch contains deletes for a single data file.
Fix: Add a // TODO: Extract paths from ArrowArray to update referenced_paths_ or return NotImplemented if batch writing should not be directly used this way.
Added a TODO comment. Extracting paths from the raw ArrowArray would require reading the string column from the C Data Interface, which adds complexity. Leaving this for a follow-up since the primary API is WriteDelete() which correctly tracks paths.
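For reference, the referenced_data_file rule that WriteDelete() supports (and that raw batch writes currently bypass) reduces to a single check at Metadata() time; this helper is an illustrative sketch, not the PR's code:

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// The field is populated only when every tracked delete targets exactly one
// data file; with zero or multiple referenced files it stays unset. Batches
// written via Write(ArrowArray*) never add to referenced_paths, which is why
// Metadata() cannot populate the field for batch-only writers.
std::optional<std::string> ReferencedDataFile(
    const std::set<std::string>& referenced_paths) {
  if (referenced_paths.size() == 1) {
    return *referenced_paths.begin();
  }
  return std::nullopt;
}
```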
```cpp
class PositionDeleteWriter::Impl {
 public:
  static Result<std::unique_ptr<Impl>> Make(PositionDeleteWriterOptions options) {
    // Build the position delete schema with file_path and pos columns
```
Parity Issue (Missing Implementation):
PositionDeleteWriterOptions accepts a row_schema, but Impl::Make completely ignores it, creating a schema with only file_path and pos. The V2 spec allows position deletes to optionally include the deleted row.
Fix: Add a // TODO: Support writing row data if options.row_schema is provided.
Added a TODO comment. Note that Java has also deprecated the row field in PositionDelete as of v1.11.0, so this is low priority. Will add support in a follow-up if needed.
- Use ArrowSchemaGuard/ArrowArrayGuard in FlushBuffer for memory safety on early returns, fixing potential leaks when nanoarrow macros fail
- Fix guards to handle already-consumed arrays (null release check)
- Filter out value_counts/null_value_counts/nan_value_counts for delete metadata columns (file_path, pos) to match Java parity; also drop bounds when referencing multiple data files
- Add TODO for extracting paths from ArrowArray in Write() to update referenced_paths_ for batch writes
- Add TODO for row_schema support in position deletes (V2 spec)