refactor: consolidate snapshot expiration into MaintenanceTable #2143

Open
wants to merge 48 commits into
base: main
0a94d96
Added initial units tests and Class for Removing a Snapshot
ForeverAngry Mar 29, 2025
5f0b62b
Added methods needed to expire snapshots by id, and optionally cleanu…
ForeverAngry Mar 31, 2025
f995daa
Update test_expire_snapshots.py
ForeverAngry Mar 31, 2025
65365e1
Added the builder method to __init__.py, updated the snapshot api wit…
ForeverAngry Apr 1, 2025
e28815f
Snapshots are not being transacted on, but need to re-assign refs
ForeverAngry Apr 1, 2025
4628ede
Fixed the test case.
ForeverAngry Apr 3, 2025
e80c41c
adding print statements to help with debugging
ForeverAngry Apr 3, 2025
cb9f0c9
Draft ready
ForeverAngry Apr 3, 2025
ebcff2b
Applied suggestions to Fix CICD
ForeverAngry Apr 3, 2025
97399bf
Merge branch 'main' into main
ForeverAngry Apr 3, 2025
95e5af2
Rebuild the poetry lock file.
ForeverAngry Apr 3, 2025
5ab5890
Merge branch 'main' into main
ForeverAngry Apr 4, 2025
5acd690
Refactor implementation of `ExpireSnapshots`
ForeverAngry Apr 13, 2025
d30a08c
Fixed format and linting issues
ForeverAngry Apr 13, 2025
e62ab58
Merge branch 'main' into main
ForeverAngry Apr 13, 2025
1af3258
Fixed format and linting issues
ForeverAngry Apr 13, 2025
352b48f
Merge branch 'main' of https://github.com/ForeverAngry/iceberg-python
ForeverAngry Apr 13, 2025
382e0ea
Merge branch 'main' into main
ForeverAngry Apr 18, 2025
549c183
rebased: from main
ForeverAngry Apr 19, 2025
386cb15
fixed: typo
ForeverAngry Apr 19, 2025
12729fa
removed errant files
ForeverAngry Apr 22, 2025
ce3515c
Added: public method signature to the init table file.
ForeverAngry Apr 22, 2025
28fce4b
Removed: `expire_snapshots_older_than` method, in favor of implementi…
ForeverAngry Apr 24, 2025
2c3153e
Update tests/table/test_expire_snapshots.py
ForeverAngry Apr 26, 2025
27c3ece
Removed: unrelated changes, Added: logic to expire snapshot method.
ForeverAngry Apr 26, 2025
fe73a34
feat: implement deduplication of data files in Iceberg table and remo…
ForeverAngry Jul 5, 2025
8dfa038
Closes:
ForeverAngry Jul 5, 2025
42e55c9
refactor: remove obsolete `expire_snapshots_older_than` method
ForeverAngry Jul 5, 2025
e1627c4
### Features & Enhancements
ForeverAngry Jul 5, 2025
0e6d45c
feat: enhance table maintenance with deduplication and snapshot reten…
ForeverAngry Jul 5, 2025
311c442
feat: update maintenance features with deduplication and retention st…
ForeverAngry Jul 5, 2025
fba592d
Update .gitignore
ForeverAngry Jul 5, 2025
b837f86
Update test_writes.py
ForeverAngry Jul 5, 2025
4605a04
Merge branch 'main' into refactor/consolidate-snapshot-expiration
ForeverAngry Jul 5, 2025
536528e
refactor: remove obsolete test file for snapshot expiration
ForeverAngry Jul 5, 2025
6036e12
wip: enhance deduplication logic and improve data file handling in ma…
ForeverAngry Jul 5, 2025
9dc9c82
wip - refactor: update deduplication tests to use file names instead …
ForeverAngry Jul 5, 2025
635a1d9
fix(table): correct deduplication logic for data files in Maintenance…
ForeverAngry Jul 5, 2025
73658e0
fix(tests): ensure commit_table is not called when no snapshots are e…
ForeverAngry Jul 5, 2025
a9a01ee
refactor: remove unused expire_snapshots method and clean up transact…
ForeverAngry Jul 5, 2025
8c906d2
refactor: streamline data file retrieval in MaintenanceTable and enha…
ForeverAngry Jul 6, 2025
0e72ccc
Reverted changes back to prior commit version for `_get_all_datafiles`
ForeverAngry Jul 6, 2025
cfb4061
refactor: simplify snapshot expiration logic and clean up unused imports
ForeverAngry Jul 6, 2025
9371bca
Merge branch 'main' into refactor/consolidate-snapshot-expiration
ForeverAngry Jul 6, 2025
881fab9
fix: add missing newline in API documentation for clarity
ForeverAngry Jul 6, 2025
acb70da
refactor: update license header in test_retention_strategies.py
ForeverAngry Jul 6, 2025
54c1f7f
feat: add license header to test_overwrite_files.py
ForeverAngry Jul 6, 2025
4c6f86c
Update test_literals.py
ForeverAngry Jul 7, 2025
734 changes: 89 additions & 645 deletions mkdocs/docs/api.md

Large diffs are not rendered by default.

16 changes: 11 additions & 5 deletions pyiceberg/table/__init__.py
@@ -80,6 +80,7 @@
 from pyiceberg.schema import Schema
 from pyiceberg.table.inspect import InspectTable
 from pyiceberg.table.locations import LocationProvider, load_location_provider
+from pyiceberg.table.maintenance import MaintenanceTable
 from pyiceberg.table.metadata import (
     INITIAL_SEQUENCE_NUMBER,
     TableMetadata,
@@ -115,7 +116,7 @@
     update_table_metadata,
 )
 from pyiceberg.table.update.schema import UpdateSchema
-from pyiceberg.table.update.snapshot import ExpireSnapshots, ManageSnapshots, UpdateSnapshot, _FastAppendFiles
+from pyiceberg.table.update.snapshot import ManageSnapshots, UpdateSnapshot, _FastAppendFiles
 from pyiceberg.table.update.spec import UpdateSpec
 from pyiceberg.table.update.statistics import UpdateStatistics
 from pyiceberg.transforms import IdentityTransform
@@ -1069,6 +1070,15 @@ def inspect(self) -> InspectTable:
         """
         return InspectTable(self)

+    @property
+    def maintenance(self) -> MaintenanceTable:
+        """Return the MaintenanceTable object for maintenance.
+
+        Returns:
+            MaintenanceTable object based on this Table.
+        """
+        return MaintenanceTable(self)
+
     def refresh(self) -> Table:
         """Refresh the current table metadata.

@@ -1241,10 +1251,6 @@ def manage_snapshots(self) -> ManageSnapshots:
         """
         return ManageSnapshots(transaction=Transaction(self, autocommit=True))

-    def expire_snapshots(self) -> ExpireSnapshots:
-        """Shorthand to run expire snapshots by id or by a timestamp."""
-        return ExpireSnapshots(transaction=Transaction(self, autocommit=True))
-
     def update_statistics(self) -> UpdateStatistics:
         """
         Shorthand to run statistics management operations like add statistics and remove statistics.
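
Note on the accessor added in this file: the new `maintenance` property follows the same shape as the existing `inspect` property, returning a helper object bound to the table. The sketch below is a minimal, self-contained illustration of that pattern only; the `Table` and `MaintenanceTable` classes here are simplified stand-ins written for this note (assumptions, not the pyiceberg classes), and no maintenance methods are shown because this diff does not include them.

```python
from __future__ import annotations


class MaintenanceTable:
    """Stand-in helper that keeps a reference to the table it maintains."""

    def __init__(self, tbl: Table) -> None:
        self.tbl = tbl


class Table:
    """Stand-in table exposing the helper through a read-only property."""

    def __init__(self, name: str) -> None:
        self.name = name

    @property
    def maintenance(self) -> MaintenanceTable:
        # A new helper is built on every access, mirroring `return MaintenanceTable(self)`.
        return MaintenanceTable(self)


table = Table("db.events")
helper = table.maintenance
```

Because the property constructs a fresh helper each time, callers that need several maintenance calls in a row may prefer to hold on to one `helper` reference, as above.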
54 changes: 44 additions & 10 deletions pyiceberg/table/inspect.py
@@ -17,7 +17,8 @@
 from __future__ import annotations

 from datetime import datetime, timezone
-from typing import TYPE_CHECKING, Any, Dict, Iterator, List, Optional, Set, Tuple
+from functools import reduce
+from typing import TYPE_CHECKING, Any, Dict, Iterator, List, Optional, Set, Tuple, Union

 from pyiceberg.conversions import from_bytes
 from pyiceberg.manifest import DataFile, DataFileContent, ManifestContent, ManifestFile, PartitionFieldSummary
@@ -650,14 +651,11 @@ def _files(self, snapshot_id: Optional[int] = None, data_file_filter: Optional[S

         snapshot = self._get_snapshot(snapshot_id)
         io = self.tbl.io
+        files_table: list[pa.Table] = []
+        for manifest_list in snapshot.manifests(io):
+            files_table.append(self._get_files_from_manifest(manifest_list, data_file_filter))

-        executor = ExecutorFactory.get_or_create()
-        results = list(
-            executor.map(
-                lambda manifest_list: self._get_files_from_manifest(manifest_list, data_file_filter), snapshot.manifests(io)
-            )
-        )
-        return pa.concat_tables(results)
+        return pa.concat_tables(files_table)
Comment on lines +654 to +658
Contributor

This is de-parallelized? Is that intentional?

Contributor Author

Nope, it was not :) Good catch!
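
For readers following this thread, here is a minimal sketch of the two shapes being compared: a sequential loop and an `executor.map` call. It uses the standard library's `ThreadPoolExecutor` as a stand-in for whatever `ExecutorFactory.get_or_create()` returns, and `read_manifest` plus the integer manifest ids are hypothetical placeholders rather than pyiceberg APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def read_manifest(manifest_id: int) -> list[str]:
    # Hypothetical stand-in for the per-manifest work (_get_files_from_manifest in the diff).
    return [f"manifest-{manifest_id}/file-{i}.parquet" for i in range(2)]


manifests = [1, 2, 3]

# Sequential form: one manifest is read at a time.
sequential = []
for manifest_id in manifests:
    sequential.append(read_manifest(manifest_id))

# Parallel form: reads run on worker threads; map() yields results in input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    parallel = list(executor.map(read_manifest, manifests))
```

Both forms produce the same list in the same order, so switching back to the parallel version should not change what is passed to `pa.concat_tables`.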


     def files(self, snapshot_id: Optional[int] = None) -> "pa.Table":
         return self._files(snapshot_id)
@@ -668,10 +666,20 @@ def data_files(self, snapshot_id: Optional[int] = None) -> "pa.Table":
     def delete_files(self, snapshot_id: Optional[int] = None) -> "pa.Table":
         return self._files(snapshot_id, {DataFileContent.POSITION_DELETES, DataFileContent.EQUALITY_DELETES})

-    def all_manifests(self) -> "pa.Table":
+    def all_manifests(self, snapshots: Optional[Union[list[Snapshot], list[int]]] = None) -> "pa.Table":
         import pyarrow as pa

-        snapshots = self.tbl.snapshots()
+        # coerce into snapshot objects if users passes in snapshot ids
+        if snapshots is not None:
+            if isinstance(snapshots[0], int):
+                snapshots = [
+                    snapshot
+                    for snapshot_id in snapshots
+                    if (snapshot := self.tbl.metadata.snapshot_by_id(snapshot_id)) is not None
+                ]
+        else:
+            snapshots = self.tbl.snapshots()
+
Comment on lines +669 to +682
Contributor

I might have written this and it got cherry-picked in, but I think it's simpler to only allow Snapshot objects until there is a need to allow either.

Contributor Author

tl;dr
Sounds good, I'll make the change!

I originally forked your branch so I could stack my PR on your “Delete Orphans” PR. As July began, my schedule looked pretty rough, so I converted my draft into a PR against the main pyiceberg branch, since I wasn’t sure how much time I’d have later in the month, but I forgot to rebase. As a result, I inadvertently removed the code for deleting orphans while keeping your MaintenanceTable implementation in a more… manual way 😟, so there may still be some remnants. They say the best form of flattery is imitation 😉.
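
A possible shape for the simplification agreed on above, sketched under assumptions: the `Snapshot` dataclass and the `select_snapshots` helper below are hypothetical stand-ins written for this note, not pyiceberg code. Accepting only `Snapshot` objects removes the `isinstance(snapshots[0], int)` check, which in the version under review would raise `IndexError` when an empty list is passed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in carrying only the fields this sketch needs."""

    snapshot_id: int
    manifest_list: str


def select_snapshots(
    table_snapshots: list[Snapshot], snapshots: Optional[list[Snapshot]] = None
) -> list[Snapshot]:
    # None means "use every snapshot in the table"; an explicit list is used as given.
    return table_snapshots if snapshots is None else snapshots


table_snapshots = [
    Snapshot(1, "s3://bucket/metadata/snap-1.avro"),
    Snapshot(2, "s3://bucket/metadata/snap-2.avro"),
]
everything = select_snapshots(table_snapshots)
subset = select_snapshots(table_snapshots, table_snapshots[:1])
nothing = select_snapshots(table_snapshots, [])
```

With this shape an explicit empty list simply falls through to the existing `if not snapshots:` early return instead of failing on the index lookup.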

         if not snapshots:
             return pa.Table.from_pylist([], schema=self._get_all_manifests_schema())

@@ -681,6 +689,32 @@ def all_manifests(self) -> "pa.Table":
         )
         return pa.concat_tables(manifests_by_snapshots)

+    def _all_known_files(self) -> dict[str, set[str]]:
+        """Get all the known files in the table.
+
+        Returns:
+            dict of {file_type: set of file paths} for each file type.
+        """
+        snapshots = self.tbl.snapshots()
+
+        _all_known_files = {}
+        _all_known_files["manifests"] = set(self.all_manifests(snapshots)["path"].to_pylist())
+        _all_known_files["manifest_lists"] = {snapshot.manifest_list for snapshot in snapshots}
+        _all_known_files["statistics"] = {statistic.statistics_path for statistic in self.tbl.metadata.statistics}
+
+        metadata_files = {entry.metadata_file for entry in self.tbl.metadata.metadata_log}
+        metadata_files.add(self.tbl.metadata_location)  # Include current metadata file
+        _all_known_files["metadata"] = metadata_files
+
+        executor = ExecutorFactory.get_or_create()
+        snapshot_ids = [snapshot.snapshot_id for snapshot in snapshots]
+        files_by_snapshots: Iterator[Set[str]] = executor.map(
+            lambda snapshot_id: set(self.files(snapshot_id)["file_path"].to_pylist()), snapshot_ids
+        )
+        _all_known_files["datafiles"] = reduce(set.union, files_by_snapshots, set())
+
+        return _all_known_files
+
     def _all_files(self, data_file_filter: Optional[Set[DataFileContent]] = None) -> "pa.Table":
         import pyarrow as pa
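
Note on the `reduce(set.union, files_by_snapshots, set())` line in `_all_known_files`: a small standalone sketch of that idiom follows. The file paths are made-up sample data; only `functools.reduce` and `set.union` from the standard library are used.

```python
from functools import reduce

# Made-up per-snapshot file sets; snapshots commonly share data files.
files_by_snapshot = [
    {"data/a.parquet", "data/b.parquet"},
    {"data/b.parquet", "data/c.parquet"},
]

# Fold the per-snapshot sets into one de-duplicated set of paths.
all_files = reduce(set.union, files_by_snapshot, set())

# The explicit initial value keeps the call safe when there are no snapshots.
no_files = reduce(set.union, [], set())
```

Without the `set()` initial value, `reduce` raises `TypeError` on an empty iterable, so the third argument is what makes a table with no snapshots return an empty set.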
