
Compare Schema and StructType fields irrespective of ordering #700


Closed


kevinjqliu
Contributor

Fixes #674

The fields attribute of Schema and StructType is represented as a Tuple, which means that ordering matters when performing a comparison.

Two Schemas with the same fields in a different order should be considered the same.

fields: Tuple[NestedField, ...] = Field(default_factory=tuple)

fields: Tuple[NestedField, ...] = Field(default_factory=tuple)
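To make the proposal concrete, here is a minimal sketch of order-insensitive equality on a tuple of fields. It does not use the real Schema or StructType classes; SketchField and SketchStruct are hypothetical stand-ins built on plain dataclasses, and sorting by field ID is only one possible way to ignore ordering.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class SketchField:
    """Hypothetical stand-in for NestedField: just an ID, a name and a type name."""

    field_id: int
    name: str
    type_name: str


@dataclass(frozen=True, eq=False)
class SketchStruct:
    """Hypothetical stand-in for a struct whose fields are stored as a tuple."""

    fields: Tuple[SketchField, ...] = field(default_factory=tuple)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SketchStruct):
            return NotImplemented
        # Compare the fields after sorting by ID so that position is ignored.
        by_id = lambda f: f.field_id
        return sorted(self.fields, key=by_id) == sorted(other.fields, key=by_id)

    def __hash__(self) -> int:
        # Keep the hash consistent with the order-insensitive __eq__.
        return hash(frozenset(self.fields))


str_field = SketchField(1, "str", "string")
dbl_field = SketchField(2, "dbl", "double")

a = SketchStruct((str_field, dbl_field))
b = SketchStruct((dbl_field, str_field))
```

With this definition `a == b` holds even though `a.fields != b.fields`, which is the behavior change the description asks for. The review comments further down discuss whether that change is safe for callers that rely on position.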

@@ -372,7 +372,7 @@ def test_writer_ordering() -> None:
         ),
     )

-    expected = StructWriter(((1, DoubleWriter()), (0, StringWriter())))
+    expected = StructWriter(((0, DoubleWriter()), (1, StringWriter())))
Contributor Author

Not sure if this change is semantically correct. This test is affected because resolve_writer compares the two given schemas (record_schema and file_schema):

def resolve_writer(
    record_schema: Union[Schema, IcebergType],
    file_schema: Union[Schema, IcebergType],
) -> Writer:
    """Resolve the file and read schema to produce a reader.
    Args:
        record_schema (Schema | IcebergType): The schema of the record in memory.
        file_schema (Schema | IcebergType): The schema of the file that will be written
    Raises:
        NotImplementedError: If attempting to resolve an unrecognized object type.
    """
    if record_schema == file_schema:
        return construct_writer(file_schema)

Previously, the comparison returned False because the fields were in a different order.
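The effect on the test can be sketched without the real writer classes. The example below assumes that each entry in a struct writer pairs a position in the in-memory record with a writer, and that the test passes two schemas holding the same two fields in opposite order. Both are assumptions made for illustration. resolve_positions and construct_positions are hypothetical helpers, and type names stand in for writer objects.

```python
from typing import List, Tuple

# (field_id, type name): a hypothetical stand-in for a NestedField.
FieldSpec = Tuple[int, str]


def resolve_positions(
    record_fields: List[FieldSpec], file_fields: List[FieldSpec]
) -> Tuple[Tuple[int, str], ...]:
    """Walk the file schema and look up where each field sits in the record."""
    record_pos = {field_id: pos for pos, (field_id, _) in enumerate(record_fields)}
    return tuple((record_pos[field_id], type_name) for field_id, type_name in file_fields)


def construct_positions(file_fields: List[FieldSpec]) -> Tuple[Tuple[int, str], ...]:
    """Shortcut path: use the file schema alone, so positions are simply 0, 1, ..."""
    return tuple((pos, type_name) for pos, (_, type_name) in enumerate(file_fields))


record_fields = [(1, "string"), (2, "double")]
file_fields = [(2, "double"), (1, "string")]

resolved = resolve_positions(record_fields, file_fields)  # ordering-aware path
shortcut = construct_positions(file_fields)               # path taken once == ignores order
```

Here `resolved` pairs the double with record position 1 and the string with position 0, while `shortcut` pairs them with positions 0 and 1. Under the assumptions above, the shortcut would read the record's string slot when writing the double column, which is the semantic concern raised in this comment.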

@@ -1730,19 +1730,17 @@ def test_move_nested_field_after_first(catalog: Catalog) -> None:
     with tbl.update_schema() as schema_update:
         schema_update.move_before("struct.data", "struct.count")
Contributor Author

this makes me think that the Field ordering does matter...

@Fokko wdyt

Contributor

First of all, thanks for digging into this 🎉

Technically the ordering does not matter when you write the data, because when reading we're correcting the order using this one:

def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table:

Maybe we should also use that visitor when writing, instead of the PyArrow cast introduced in #523.
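The idea of correcting the order on read can be illustrated with a small sketch that matches columns by field ID rather than by position. This is not the to_requested_schema implementation; reorder_by_field_id is a hypothetical helper, and plain lists stand in for Arrow columns.

```python
from typing import Any, List, Tuple

# (field_id, name): a hypothetical stand-in for a schema field.
FieldSpec = Tuple[int, str]


def reorder_by_field_id(
    requested: List[FieldSpec], file_schema: List[FieldSpec], columns: List[List[Any]]
) -> List[List[Any]]:
    """Return the file's columns in the order given by the requested schema."""
    # Index each column by the ID of the file-schema field at the same position.
    by_id = {field_id: column for (field_id, _), column in zip(file_schema, columns)}
    # Emit columns in requested order; a missing ID raises KeyError in this sketch.
    return [by_id[field_id] for field_id, _ in requested]


file_schema = [(2, "dbl"), (1, "str")]
columns = [[1.5, 2.5], ["a", "b"]]      # stored in file order: dbl, then str
requested = [(1, "str"), (2, "dbl")]

reordered = reorder_by_field_id(requested, file_schema, columns)
```

Because the lookup is keyed on field ID, the order in which the file stored its columns does not affect what the reader sees.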

Contributor Author

Makes sense, thanks!

Contributor Author

> Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced

We're relying on the PyArrow cast to translate some PyArrow data types into the corresponding Iceberg-supported data types, such as large_string -> string. Since LargeString is not an Iceberg-supported data type, we cannot use to_requested_schema. Maybe it's possible to cast PyArrow LargeString into Iceberg String.

Contributor

@Fokko Fokko May 29, 2024

I don't think we want to do that. String and LargeString are an Arrow encoding detail (similar to the categorical type). Maybe we should have a different version of to_requested_schema where we don't cast and just keep the original types? If the types are incompatible (for example, the field-id points to a string in the schema and you try to write a boolean), it should fail.
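A rough sketch of that suggestion: validate each incoming column against the table schema by field ID, treat string and large_string as the same logical type, leave the Arrow types untouched, and fail on a real mismatch. Everything here is hypothetical. ARROW_TO_ICEBERG and check_write_compatible are illustrative names, and type names are plain strings rather than PyArrow or Iceberg type objects.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from Arrow type names to the logical Iceberg type they encode.
ARROW_TO_ICEBERG: Dict[str, str] = {
    "string": "string",
    "large_string": "string",  # encoding detail only: same logical type
    "bool": "boolean",
    "double": "double",
}


def check_write_compatible(
    table_schema: Dict[int, str], arrow_fields: List[Tuple[int, str]]
) -> List[Tuple[int, str]]:
    """Validate (field_id, arrow type) pairs against {field_id: iceberg type}.

    Returns the Arrow fields unchanged (no cast) or raises ValueError.
    """
    for field_id, arrow_type in arrow_fields:
        expected = table_schema[field_id]
        actual = ARROW_TO_ICEBERG.get(arrow_type)
        if actual != expected:
            raise ValueError(
                f"Field {field_id}: cannot write Arrow {arrow_type} into Iceberg {expected}"
            )
    return arrow_fields


table_schema = {1: "string", 2: "double"}

# large_string is accepted for a string field and is kept as large_string.
kept = check_write_compatible(table_schema, [(1, "large_string"), (2, "double")])

# Writing a boolean into the string field is rejected.
try:
    check_write_compatible(table_schema, [(1, "bool")])
    rejected = False
except ValueError:
    rejected = True
```

The point of the sketch is the separation of concerns: compatibility is decided on the logical type, while the physical Arrow encoding passes through unchanged.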

@kevinjqliu
Contributor Author

Closing in favor of #829

@kevinjqliu kevinjqliu closed this Jun 18, 2024
@kevinjqliu kevinjqliu deleted the kevinjqliu/schema-field-order branch June 19, 2024 01:16
Successfully merging this pull request may close these issues.

ValueError: Mismatch in fields: ?