Skip to content

Compare Schema and StructType fields irrespective of ordering #700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion pyiceberg/schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,10 @@ def __eq__(self, other: Any) -> bool:
return False

identifier_field_ids_is_equal = self.identifier_field_ids == other.identifier_field_ids
schema_is_equal = all(lhs == rhs for lhs, rhs in zip(self.columns, other.columns))

left = sorted(self.columns, key=lambda field: field.field_id)
right = sorted(other.columns, key=lambda field: field.field_id)
schema_is_equal = all(lhs == rhs for lhs, rhs in zip(left, right))

return identifier_field_ids_is_equal and schema_is_equal

Expand Down
10 changes: 9 additions & 1 deletion pyiceberg/types.py
Original file line number Diff line number Diff line change
Expand Up @@ -393,7 +393,15 @@ def __hash__(self) -> int:

def __eq__(self, other: Any) -> bool:
"""Compare the object if it is equal to another object."""
return self.fields == other.fields if isinstance(other, StructType) else False
if not isinstance(other, StructType):
return False

if len(self.fields) != len(other.fields):
return False

left = sorted(self.fields, key=lambda field: field.field_id)
right = sorted(other.fields, key=lambda field: field.field_id)
return all(lhs == rhs for lhs, rhs in zip(left, right))


class ListType(IcebergType):
Expand Down
2 changes: 1 addition & 1 deletion tests/avro/test_resolver.py
Original file line number Diff line number Diff line change
Expand Up @@ -372,7 +372,7 @@ def test_writer_ordering() -> None:
),
)

expected = StructWriter(((1, DoubleWriter()), (0, StringWriter())))
expected = StructWriter(((0, DoubleWriter()), (1, StringWriter())))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this change is semantically correct. This test is affected because resolve_writer compares the two given schemas (record_schema and file_schema)

def resolve_writer(
record_schema: Union[Schema, IcebergType],
file_schema: Union[Schema, IcebergType],
) -> Writer:
"""Resolve the file and read schema to produce a reader.
Args:
record_schema (Schema | IcebergType): The schema of the record in memory.
file_schema (Schema | IcebergType): The schema of the file that will be written
Raises:
NotImplementedError: If attempting to resolve an unrecognized object type.
"""
if record_schema == file_schema:
return construct_writer(file_schema)

Previously, comparison returned False due to different ordering


assert actual == expected

Expand Down
22 changes: 10 additions & 12 deletions tests/integration/test_rest_schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -1730,19 +1730,17 @@ def test_move_nested_field_after_first(catalog: Catalog) -> None:
with tbl.update_schema() as schema_update:
schema_update.move_before("struct.data", "struct.count")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes me think that the Field ordering does matter...

@Fokko wdyt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks for digging into this 🎉

Technically the ordering does not matter when you write the data, because when reading we're correcting the order using this one:

def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table:

Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced in #523

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced

We're relying on pyarrow cast to translate some pyarrow data types into corresponding Iceberg-supported data types. Such as large_string -> string. Since LargeString is not an Iceberg-supported data type, we cannot use to_requested_schema. Maybe it's possible to cast pyarrow LargeString into Iceberg String.

Copy link
Contributor

@Fokko Fokko May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to do that. The String and LargeString are an Arrow encoding detail (similar to the categorial type). Maybe we should have a different version of to_requested_schema where we don't cast, and just keep the original types? If the types are incompatible (for example, the field-id points to a string in the schema, and you try to write a boolean, it should fail).


assert str(tbl.schema()) == str(
Schema(
NestedField(field_id=1, name="id", field_type=LongType(), required=True),
NestedField(
field_id=2,
name="struct",
field_type=StructType(
NestedField(field_id=4, name="data", field_type=StringType(), required=True),
NestedField(field_id=3, name="count", field_type=LongType(), required=True),
),
required=True,
assert tbl.schema() == Schema(
NestedField(field_id=1, name="id", field_type=LongType(), required=True),
NestedField(
field_id=2,
name="struct",
field_type=StructType(
NestedField(field_id=4, name="data", field_type=StringType(), required=True),
NestedField(field_id=3, name="count", field_type=LongType(), required=True),
),
)
required=True,
),
)


Expand Down
9 changes: 9 additions & 0 deletions tests/test_schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,15 @@ def test_schema_raise_on_duplicate_names() -> None:
assert "Invalid schema, multiple fields for name baz: 3 and 4" in str(exc_info.value)


def test_schema_field_order_irrelevant() -> None:
foo = NestedField(field_id=1, name="foo", field_type=StringType())
bar = NestedField(field_id=2, name="bar", field_type=IntegerType(), required=False)
left = schema.Schema(foo, bar)
right = schema.Schema(bar, foo)
assert left == right
assert left.as_struct() == right.as_struct()


def test_schema_index_by_id_visitor(table_schema_nested: Schema) -> None:
"""Test index_by_id visitor function"""
index = schema.index_by_id(table_schema_nested)
Expand Down