-
Notifications
You must be signed in to change notification settings - Fork 0
Dynamic Overwrite #24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: fd-add-ability-to-delete-full-data-files
Are you sure you want to change the base?
Dynamic Overwrite #24
Conversation
pyiceberg/manifest.py
Outdated
@@ -909,7 +909,7 @@ def __init__(self, output_file: OutputFile, snapshot_id: int, parent_snapshot_id | |||
self._sequence_number = sequence_number | |||
|
|||
def prepare_manifest(self, manifest_file: ManifestFile) -> ManifestFile: | |||
wrapped_manifest_file = ManifestFile(*manifest_file.record_fields()) | |||
wrapped_manifest_file = ManifestFile(*manifest_file.record_values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Record class should be extremely lightweight, and we should avoid adding additional methods to it. I've changed this into a copy
in apache#580
@@ -311,6 +311,10 @@ def _(self, _: TimeType) -> Literal[int]: | |||
def _(self, _: TimestampType) -> Literal[int]: | |||
return TimestampLiteral(self.value) | |||
|
|||
@to.register(TimestamptzType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
pyiceberg/table/__init__.py
Outdated
@@ -3011,9 +3088,11 @@ def __init__( | |||
io: FileIO, | |||
commit_uuid: Optional[uuid.UUID] = None, | |||
snapshot_properties: Dict[str, str] = EMPTY_DICT, | |||
only_delete_within_latest_spec: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only_delete_within_latest_spec: bool = False, | |
only_delete_current_spec: bool = False, |
The latest can be different from the current spec. If you add a field to the partition spec and drop it immediately, it will re-use the existing spec.
pyiceberg/typedef.py
Outdated
@@ -199,6 +199,10 @@ def __repr__(self) -> str: | |||
return f"{self.__class__.__name__}[{', '.join(f'{key}={repr(value)}' for key, value in self.__dict__.items() if not key.startswith('_'))}]" | |||
|
|||
def record_fields(self) -> List[str]: | |||
"""Return all the fields of the Record class except those specified in skip_fields.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point, the Record
should be just a list that represents a struct. The record itself should not carry a schema. See apache#579
_check_schema_compatible(self._table.schema(), other_schema=df.schema) | ||
|
||
# cast if the two schemas are compatible but not equal | ||
table_arrow_schema = self._table.schema().as_arrow() | ||
if table_arrow_schema != df.schema: | ||
df = df.cast(table_arrow_schema) | ||
|
||
# If dataframe does not have data, there is no need to overwrite | ||
if df.shape[0] == 0: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we might want to consolidate these checks
pyiceberg/table/__init__.py
Outdated
for partition in delete_partitions: | ||
match_partition_expression: BooleanExpression = AlwaysTrue() | ||
partition_fields = partition.record_fields() | ||
for pos in range(len(partition_fields)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of relying on the .record_fields()
you should pass in the PartitionSpec
and loop over the partition-fields instead.
pyiceberg/table/__init__.py
Outdated
) | ||
) | ||
with self.update_snapshot(snapshot_properties=snapshot_properties).delete( | ||
only_delete_within_latest_spec=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having an internal debate with myself what the right thing to do here, or what the user expects. The delete method will also rewrite older specs into newer ones if there is a partial delete.
I think if I would evolve the spec from a monthly to a daily partitioning. Run a dynamic overwrite over certain days, that the data would be overwritten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Yes, originally I think it would be dangerous for a user to dynamic overwrite partitions not in the current spec (because unlike static overwrite, user does not provide a filter explicitly and he might be unconcious of this). But after discussion with Sung, we both think it is more natural to detect the filter in current spec but apply to all data files than requiring the user to have awareness of the timepoint where partiiton evolution happens so that he knows what data files will get touched and what will not.
Added dynamic overwrite support leveraging delete (without overwrite) + fast_append