refactor: consolidate snapshot expiration into MaintenanceTable #2143


Open · wants to merge 47 commits into base: main

Conversation

@ForeverAngry (Contributor) commented Jun 23, 2025

Rationale for this change

  • Consolidates snapshot expiration functionality from the standalone ExpireSnapshots class into the MaintenanceTable class for a unified maintenance API.
  • Resolves planned follow-up work from the Added ExpireSnapshots Feature issue (#1880).
  • Achieves feature and API parity with the Java implementation for snapshot retention and table maintenance.

Features & Enhancements

  • Duplicate Data File Remediation (#2130)

    • Adds deduplicate_data_files to MaintenanceTable.
    • Detects and removes duplicate data files, improving table hygiene and storage efficiency.
  • Advanced Snapshot Retention (#2150)

    • Adds new snapshot retention methods for Java API parity:
      • retain_last_n_snapshots(n) — Retain only the latest N snapshots.
      • expire_snapshots_older_than_with_retention(timestamp_ms, retain_last_n=None, min_snapshots_to_keep=None) — Expire snapshots older than a given timestamp, with optional retention constraints.
      • expire_snapshots_with_retention_policy(timestamp_ms=None, retain_last_n=None, min_snapshots_to_keep=None) — Unified retention policy supporting both time-based and count-based constraints.
    • All retention logic respects protected snapshots (branches/tags) and includes guardrails to prevent excessive expiration.
    • Removes the obsolete expire_snapshots_older_than method.
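The combined time-based and count-based retention described above can be sketched in isolation. This is a minimal, self-contained illustration of the selection logic, not the PR's actual implementation; `SnapshotInfo` and `select_expirable` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class SnapshotInfo:
    """Simplified stand-in for a table snapshot (hypothetical, not a pyiceberg class)."""
    snapshot_id: int
    timestamp_ms: int


def select_expirable(
    snapshots: List[SnapshotInfo],
    protected_ids: Set[int],
    timestamp_ms: Optional[int] = None,
    retain_last_n: Optional[int] = None,
    min_snapshots_to_keep: Optional[int] = None,
) -> Set[int]:
    """Return snapshot IDs eligible for expiration under the combined policy.

    Protected snapshots (branch heads / tags) are never expired, and the
    retain_last_n / min_snapshots_to_keep guardrails always win over the
    timestamp cutoff.
    """
    # Newest first, so "last N" is a simple prefix.
    ordered = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)

    keep: Set[int] = set(protected_ids)
    if retain_last_n is not None:
        keep.update(s.snapshot_id for s in ordered[:retain_last_n])
    if min_snapshots_to_keep is not None:
        keep.update(s.snapshot_id for s in ordered[:min_snapshots_to_keep])

    expirable: Set[int] = set()
    for s in ordered:
        # With no timestamp cutoff, the policy is purely count-based.
        older = timestamp_ms is None or s.timestamp_ms < timestamp_ms
        if older and s.snapshot_id not in keep:
            expirable.add(s.snapshot_id)
    return expirable
```

Note how `min_snapshots_to_keep=4` caps expiration even when more snapshots fall below the timestamp cutoff, mirroring the guardrail behavior described above.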

Bug Fixes & Cleanups

  • ManageSnapshots Cleanup (#2151)
    • Removes an unrelated instance variable from the ManageSnapshots class, aligning with the Java reference implementation.

Testing & Documentation

  • Testing:
    • All snapshot expiration and retention tests consolidated into test_retention_strategies.py, including:
      • Expiration by ID and timestamp
      • Protection of branch/tag snapshots
      • Retention guardrails and combined policies
      • Deduplication of data files
  • Documentation:
    • Added and updated documentation to describe:
      • All new retention strategies
      • Deduplication logic
      • API parity and usage examples

Are these changes tested?

Yes. All changes are tested. This PR is predicated on the final changes from #1200 and builds on the framework introduced by @jayceslesar in #1200 for the MaintenanceTable.

Are there any user-facing changes?

Breaking Changes:

  • ✅ Move ExpireSnapshots functionality to MaintenanceTable
  • ✅ Replace fluent API with direct execution pattern
  • ✅ Remove ExpireSnapshots class entirely
  • ✅ Update all tests to use new table.maintenance.* API
  • ✅ Maintain all existing validation and protection logic

API Changes

Before:

table.expire_snapshots().expire_snapshot_by_id(snapshot_id).commit()

Now:

table.maintenance.expire_snapshot_by_id(snapshot_id)
# Or use new retention/maintenance methods as documented

Closes:

ForeverAngry and others added 25 commits March 28, 2025 20:23
…h a new Expired Snapshot class. updated tests.
 ValueError: Cannot expire snapshot IDs {3051729675574597004} as they are currently referenced by table refs.
Moved expiration-related methods from `ExpireSnapshots` to `ManageSnapshots` for improved organization and clarity.

Updated corresponding pytest tests to reflect these changes.
Re-ran the `poetry run pre-commit run --all-files` command on the project.
Re-ran the `poetry run pre-commit run --all-files` command on the project.
Moved: the functions for expiring snapshots to their own class.
…ng it in a separate issue.

Fixed: unrelated changes caused by a fork/branch sync issue.
Implemented logic to protect the HEAD branches or Tagged branches from being expired by the `expire_snapshot_by_id` method.
@ForeverAngry (Contributor Author):

@Fokko @jayceslesar let me know if you prefer I stack this PR on #1200, or if you'd both rather I wait until #1200 is merged into main, then rebase on the updated upstream/main and create the PR against apache/iceberg-python:main!

@Fokko (Contributor) commented Jun 24, 2025

Great seeing this PR @ForeverAngry, thanks again for working on this! I'm okay with first merging #1200, but we could also merge this first, and adapt the remove orphan files routine to use .maintenance. Let me follow up on the remove orphan files, because there are some open questions there.

@Fokko Fokko added this to the PyIceberg 0.10.0 milestone Jun 24, 2025
@ForeverAngry (Contributor Author):

@Fokko did you decide whether you'd like me to stay stacked on the delete-orphans PR, or go ahead and prepare this PR against the main branch?

@ForeverAngry ForeverAngry force-pushed the refactor/consolidate-snapshot-expiration branch from a6c3b63 to 9937894 Compare July 5, 2025 01:10
@ForeverAngry ForeverAngry force-pushed the refactor/consolidate-snapshot-expiration branch from 9937894 to 27c3ece Compare July 5, 2025 01:13
(1)
apache#2130 with addition of the new `deduplicate_data_files` function to the `MaintenanceTable` class.

(2) apache#2151 with the removal of the errant member variable from the `ManageSnapshots` class.

(3) apache#2150 by adding the additional functions to be at parity with the Java API.
- **Duplicate File Remediation apache#2130**
  - Added `deduplicate_data_files` to the `MaintenanceTable` class.
  - Enables detection and removal of duplicate data files, improving table hygiene and storage efficiency.

- **Support `retainLast` and `setMinSnapshotsToKeep` Snapshot Retention Policies apache#2150**
  - Added new snapshot retention methods to `MaintenanceTable` for feature parity with the Java API:
    - `retain_last_n_snapshots(n)`: Retain only the last N snapshots.
    - `expire_snapshots_older_than_with_retention(timestamp_ms, retain_last_n=None, min_snapshots_to_keep=None)`: Expire snapshots older than a timestamp, with additional retention constraints.
    - `expire_snapshots_with_retention_policy(timestamp_ms=None, retain_last_n=None, min_snapshots_to_keep=None)`: Unified retention policy supporting time-based and count-based constraints.
  - All retention logic respects protected snapshots (branches/tags) and includes guardrails to prevent over-aggressive expiration.

### Bug Fixes & Cleanups

- **Remove unrelated instance variable from the `ManageSnapshots` class apache#2151**
  - Removed an errant member variable from the `ManageSnapshots` class, aligning the implementation with the intended design and the Java reference.

### Testing & Documentation

- Consolidated all snapshot expiration and retention tests into a single file (`test_retention_strategies.py`), covering:
  - Basic expiration by ID and timestamp.
  - Protection of branch/tag snapshots.
  - Retention guardrails and combined policies.
  - Deduplication of data files.
- Added and updated documentation to describe all new retention strategies, deduplication, and API parity improvements.
Comment on lines +275 to +293
def _get_protected_snapshot_ids(self, table_metadata: TableMetadata) -> Set[int]:
    """Get the IDs of protected snapshots.

    These are the HEAD snapshots of all branches and all tagged snapshots.
    These IDs are to be excluded from expiration.

    Args:
        table_metadata: The table metadata to check for protected snapshots.

    Returns:
        Set of protected snapshot IDs to exclude from expiration.
    """
    from pyiceberg.table.refs import SnapshotRefType

    protected_ids: Set[int] = set()
    for ref in table_metadata.refs.values():
        if ref.snapshot_ref_type in [SnapshotRefType.TAG, SnapshotRefType.BRANCH]:
            protected_ids.add(ref.snapshot_id)
    return protected_ids
Contributor:

I do not know the answer to this but is this different than just the refs?

Contributor Author:

I think that's part of it, but there is a bit more validation around what is eligible to be expired. That being said, I don't think your initial intuition is wrong :) — I think it all boils down to that.

Contributor Author:

@jayceslesar I went back and took a closer look at the refs, and wanted to give a slightly better response than my previous one. To me, the refs file seems like an object model and some enums. If I'm missing something, let me know! I really appreciate your responsiveness and input! 🙏 🚀

Contributor:

protected_ids is the same as set(table.inspect.refs()["snapshot_id"].to_pylist()) is what I was trying to say

Contributor:

also the same as {ref.snapshot_id for ref in tbl.metadata.refs.values()} I think
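The claimed equivalence is easy to check with stand-in ref objects. The stubs below are hypothetical (not pyiceberg's real classes); they only mirror the shape of `metadata.refs` to show that filtering by `TAG`/`BRANCH` is a no-op when every ref is one of the two:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Set


class SnapshotRefType(str, Enum):
    BRANCH = "branch"
    TAG = "tag"


@dataclass
class SnapshotRef:
    snapshot_id: int
    snapshot_ref_type: SnapshotRefType


def protected_ids_via_filter(refs: Dict[str, SnapshotRef]) -> Set[int]:
    """Mirrors _get_protected_snapshot_ids: filters refs by type explicitly."""
    out: Set[int] = set()
    for ref in refs.values():
        if ref.snapshot_ref_type in (SnapshotRefType.TAG, SnapshotRefType.BRANCH):
            out.add(ref.snapshot_id)
    return out


refs = {
    "main": SnapshotRef(1, SnapshotRefType.BRANCH),
    "v1": SnapshotRef(2, SnapshotRefType.TAG),
}

# Every ref is either a branch or a tag, so the filter keeps everything:
assert protected_ids_via_filter(refs) == {ref.snapshot_id for ref in refs.values()}
```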

Contributor Author:

Gotcha, I'm happy to make that change if you like! Let me know!

…Table

The deduplicate_data_files() method was not properly removing duplicate
data file references from Iceberg tables. After deduplication, multiple
references to the same data file remained instead of the expected single
reference.

Root causes:
1. _get_all_datafiles() was scanning ALL snapshots instead of current only
2. Incorrect transaction API usage that didn't leverage snapshot updates
3. Missing proper overwrite logic to create clean deduplicated snapshots

Key fixes:
- Modified _get_all_datafiles() to scan only current snapshot manifests
- Implemented proper transaction pattern using update_snapshot().overwrite()
- Added explicit delete_data_file() calls for duplicates + append_data_file() for unique files
- Removed unused helper methods _get_all_datafiles_with_context() and _detect_duplicates()

Technical details:
- Deduplication now operates on ManifestEntry objects from current snapshot only
- Files are grouped by basename and first occurrence is kept as canonical reference
- New snapshot created atomically replaces current snapshot with deduplicated file list
- Proper Iceberg transaction semantics ensure data consistency

Tests: All deduplication tests now pass including the previously failing
test_deduplicate_data_files_removes_duplicates_in_current_snapshot

Fixes: Table maintenance deduplication functionality
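The "group by basename, keep first occurrence" step described above can be sketched in isolation. This is an illustrative fragment with hypothetical paths and a hypothetical helper name, not the PR's actual code, which operates on `ManifestEntry` objects:

```python
import os
from typing import List


def deduplicate_by_basename(paths: List[str]) -> List[str]:
    """Keep the first occurrence of each file basename as the canonical reference."""
    seen = set()
    kept: List[str] = []
    for path in paths:
        name = os.path.basename(path)
        if name not in seen:
            seen.add(name)
            kept.append(path)  # first occurrence wins; later duplicates are dropped
    return kept
```

In the actual fix, the dropped duplicates become explicit `delete_data_file()` calls and the kept files are re-appended, so the new snapshot atomically replaces the current one with the deduplicated file list.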
@ForeverAngry ForeverAngry marked this pull request as ready for review July 5, 2025 23:53
@ForeverAngry (Contributor Author) commented Jul 5, 2025

@Fokko @jayceslesar, I wasn't sure when #1200 was going to be merged, and July tends to be pretty busy for me, so I thought I would use the framework for the MaintenanceTable that @jayceslesar created, so that things go smoothly, as you requested, @Fokko. In addition, I had some other features that fit right along with the MaintenanceTable class proposed and implemented by @jayceslesar in #1200. I also added a new entry to the api.md file for "Table Maintenance" that helps users understand the use cases and features added. Let me know what everyone thinks!

Comment on lines +295 to +317
def _get_all_datafiles(self) -> List[DataFile]:
    """Collect all DataFiles in the current snapshot only."""
    datafiles: List[DataFile] = []

    current_snapshot = self.tbl.current_snapshot()
    if not current_snapshot:
        return datafiles

    def process_manifest(manifest: ManifestFile) -> list[DataFile]:
        found: list[DataFile] = []
        for entry in manifest.fetch_manifest_entry(io=self.tbl.io, discard_deleted=True):
            if hasattr(entry, "data_file"):
                found.append(entry.data_file)
        return found

    # Scan only the current snapshot's manifests
    manifests = current_snapshot.manifests(io=self.tbl.io)
    with ThreadPoolExecutor() as executor:
        results = executor.map(process_manifest, manifests)
        for res in results:
            datafiles.extend(res)

    return datafiles
Contributor:

similar to above, why cant you use

def data_files(self, snapshot_id: Optional[int] = None) -> "pa.Table":

Contributor Author:

> similar to above, why cant you use
>
> def data_files(self, snapshot_id: Optional[int] = None) -> "pa.Table":

Hi @jayceslesar, yeah, I agree it's similar. I actually looked at using inspect originally, and tried to use `DataFile.from_args()` to go from the JSON object back to a `DataFile`, but I couldn't find a way to get it to work after trying a few different approaches. This was the easiest way I could think of. If you have an idea in mind, or know what I was missing, let me know!
