-
Notifications
You must be signed in to change notification settings - Fork 289
Compare Schema
and StructType
fields irrespective of ordering
#700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -372,7 +372,7 @@ def test_writer_ordering() -> None: | |||
), | |||
) | |||
|
|||
expected = StructWriter(((1, DoubleWriter()), (0, StringWriter()))) | |||
expected = StructWriter(((0, DoubleWriter()), (1, StringWriter()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this change is semantically correct. This test is affected because resolve_writer
compares the two given schemas (record_schema
and file_schema
)
iceberg-python/pyiceberg/avro/resolver.py
Lines 200 to 214 in 7bd5d9e
def resolve_writer( | |
record_schema: Union[Schema, IcebergType], | |
file_schema: Union[Schema, IcebergType], | |
) -> Writer: | |
"""Resolve the file and read schema to produce a reader. | |
Args: | |
record_schema (Schema | IcebergType): The schema of the record in memory. | |
file_schema (Schema | IcebergType): The schema of the file that will be written | |
Raises: | |
NotImplementedError: If attempting to resolve an unrecognized object type. | |
""" | |
if record_schema == file_schema: | |
return construct_writer(file_schema) |
Previously, comparison returned False
due to different ordering
@@ -1730,19 +1730,17 @@ def test_move_nested_field_after_first(catalog: Catalog) -> None: | |||
with tbl.update_schema() as schema_update: | |||
schema_update.move_before("struct.data", "struct.count") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes me think that the Field ordering does matter...
@Fokko wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thanks for digging into this 🎉
Technically the ordering does not matter when you write the data, because when reading we're correcting the order using this one:
iceberg-python/pyiceberg/io/pyarrow.py
Line 1143 in d02d7a1
def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table: |
Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced in #523
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced
We're relying on pyarrow cast
to translate some pyarrow data types into corresponding Iceberg-supported data types. Such as large_string
-> string
. Since LargeString
is not an Iceberg-supported data type, we cannot use to_requested_schema
. Maybe it's possible to cast pyarrow LargeString
into Iceberg String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to do that. The String
and LargeString
are an Arrow encoding detail (similar to the categorial type). Maybe we should have a different version of to_requested_schema
where we don't cast, and just keep the original types? If the types are incompatible (for example, the field-id points to a string in the schema, and you try to write a boolean, it should fail).
Closing in favor of #829 |
Fixes #674
Schema
andStructType
fields
variable is represented byTuple
, which means that ordering matters when performing comparison.Two
Schema
s with the samefields
in different order should be consider the sameiceberg-python/pyiceberg/schema.py
Line 88 in 7bd5d9e
iceberg-python/pyiceberg/types.py
Line 346 in 7bd5d9e