Skip to content

Dynamic Overwrite #24

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: fd-add-ability-to-delete-full-data-files
Choose a base branch
from

Conversation

jqin61
Copy link

@jqin61 jqin61 commented May 29, 2024

Added dynamic overwrite support leveraging delete (without overwrite) + fast_append

@@ -909,7 +909,7 @@ def __init__(self, output_file: OutputFile, snapshot_id: int, parent_snapshot_id
self._sequence_number = sequence_number

def prepare_manifest(self, manifest_file: ManifestFile) -> ManifestFile:
wrapped_manifest_file = ManifestFile(*manifest_file.record_fields())
wrapped_manifest_file = ManifestFile(*manifest_file.record_values())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Record class should be extremely lightweight, and we should avoid adding additional methods to it. I've changed this into a copy in apache#580

@@ -311,6 +311,10 @@ def _(self, _: TimeType) -> Literal[int]:
def _(self, _: TimestampType) -> Literal[int]:
return TimestampLiteral(self.value)

@to.register(TimestamptzType)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -3011,9 +3088,11 @@ def __init__(
io: FileIO,
commit_uuid: Optional[uuid.UUID] = None,
snapshot_properties: Dict[str, str] = EMPTY_DICT,
only_delete_within_latest_spec: bool = False,
Copy link
Owner

@Fokko Fokko May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
only_delete_within_latest_spec: bool = False,
only_delete_current_spec: bool = False,

The latest can be different from the current spec. If you add a field to the partition spec and drop it immediately, it will re-use the existing spec.

@@ -199,6 +199,10 @@ def __repr__(self) -> str:
return f"{self.__class__.__name__}[{', '.join(f'{key}={repr(value)}' for key, value in self.__dict__.items() if not key.startswith('_'))}]"

def record_fields(self) -> List[str]:
"""Return all the fields of the Record class except those specified in skip_fields."""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, the Record should be just a list that represents a struct. The record itself should not carry a schema. See apache#579

Comment on lines +464 to +473
_check_schema_compatible(self._table.schema(), other_schema=df.schema)

# cast if the two schemas are compatible but not equal
table_arrow_schema = self._table.schema().as_arrow()
if table_arrow_schema != df.schema:
df = df.cast(table_arrow_schema)

# If dataframe does not have data, there is no need to overwrite
if df.shape[0] == 0:
return
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we might want to consolidate these checks

for partition in delete_partitions:
match_partition_expression: BooleanExpression = AlwaysTrue()
partition_fields = partition.record_fields()
for pos in range(len(partition_fields)):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of relying on the .record_fields() you should pass in the PartitionSpec and loop over the partition-fields instead.

)
)
with self.update_snapshot(snapshot_properties=snapshot_properties).delete(
only_delete_within_latest_spec=True
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having an internal debate with myself what the right thing to do here, or what the user expects. The delete method will also rewrite older specs into newer ones if there is a partial delete.

I think if I would evolve the spec from a monthly to a daily partitioning. Run a dynamic overwrite over certain days, that the data would be overwritten.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Yes, originally I think it would be dangerous for a user to dynamic overwrite partitions not in the current spec (because unlike static overwrite, user does not provide a filter explicitly and he might be unconcious of this). But after discussion with Sung, we both think it is more natural to detect the filter in current spec but apply to all data files than requiring the user to have awareness of the timepoint where partiiton evolution happens so that he knows what data files will get touched and what will not.

@jqin61 jqin61 requested a review from Fokko May 30, 2024 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants