Compare `Schema` and `StructType` fields irrespective of ordering #700

kevinjqliu · 2024-05-04T18:19:52Z

Fixes #674

The `fields` attribute of both `Schema` and `StructType` is a `Tuple`, so ordering matters when two instances are compared.

Two Schemas with the same fields in a different order should be considered the same.

iceberg-python/pyiceberg/schema.py

Line 88 in 7bd5d9e

fields: Tuple[NestedField, ...] = Field(default_factory=tuple)

iceberg-python/pyiceberg/types.py

Line 346 in 7bd5d9e

fields: Tuple[NestedField, ...] = Field(default_factory=tuple)
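To make the ordering issue concrete, here is a minimal sketch of tuple equality versus an order-insensitive comparison. `ToyField` and `same_fields_any_order` are hypothetical names used only for illustration; they are not the pyiceberg classes, and sorting by field ID is just one possible approach (the commits in this PR mention both a set-based and a sort-based comparison).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for NestedField; not the pyiceberg class."""

    field_id: int
    name: str
    type_name: str


def same_fields_any_order(left: Tuple[ToyField, ...], right: Tuple[ToyField, ...]) -> bool:
    """Compare two field tuples after sorting both by field ID."""
    return sorted(left, key=lambda f: f.field_id) == sorted(right, key=lambda f: f.field_id)


a = (ToyField(1, "name", "string"), ToyField(2, "price", "double"))
b = (ToyField(2, "price", "double"), ToyField(1, "name", "string"))

print(a == b)                       # False: tuple equality is positional
print(same_fields_any_order(a, b))  # True: same fields, different order
```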

kevinjqliu · 2024-05-04T18:50:25Z

tests/avro/test_resolver.py

@@ -372,7 +372,7 @@ def test_writer_ordering() -> None:
        ),
    )

-    expected = StructWriter(((1, DoubleWriter()), (0, StringWriter())))
+    expected = StructWriter(((0, DoubleWriter()), (1, StringWriter())))


I'm not sure this change is semantically correct. The test is affected because `resolve_writer` compares the two given schemas (`record_schema` and `file_schema`):

iceberg-python/pyiceberg/avro/resolver.py

Lines 200 to 214 in 7bd5d9e

    def resolve_writer(
        record_schema: Union[Schema, IcebergType],
        file_schema: Union[Schema, IcebergType],
    ) -> Writer:
        """Resolve the file and read schema to produce a reader.

        Args:
            record_schema (Schema | IcebergType): The schema of the record in memory.
            file_schema (Schema | IcebergType): The schema of the file that will be written

        Raises:
            NotImplementedError: If attempting to resolve an unrecognized object type.
        """
        if record_schema == file_schema:
            return construct_writer(file_schema)

Previously, the comparison returned False because the fields were in a different order.
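A toy model may help show why the expected value in the test flips. This is not pyiceberg's resolver: `toy_resolve` is a hypothetical function, fields are plain `(field_id, type_name)` tuples, and the example assumes the in-memory record lists the string field first while the file lists the double field first (inferred from the expected values in the diff, not confirmed by the source).

```python
from typing import List, Tuple

ToyField = Tuple[int, str]  # (field_id, type_name); hypothetical, not a pyiceberg type


def toy_resolve(
    record_fields: List[ToyField],
    file_fields: List[ToyField],
    ordered_eq: bool,
) -> List[Tuple[int, str]]:
    """Return (record_position, type_name) pairs in file order."""
    if ordered_eq:
        equal = record_fields == file_fields
    else:
        equal = sorted(record_fields) == sorted(file_fields)
    if equal:
        # Short-circuit: positions are simply the file's own positions.
        return [(pos, type_name) for pos, (_, type_name) in enumerate(file_fields)]
    # Otherwise look up where each file field sits in the in-memory record.
    record_pos = {field_id: pos for pos, (field_id, _) in enumerate(record_fields)}
    return [(record_pos[field_id], type_name) for field_id, type_name in file_fields]


record_schema = [(1, "string"), (2, "double")]
file_schema = [(2, "double"), (1, "string")]

print(toy_resolve(record_schema, file_schema, ordered_eq=True))   # [(1, 'double'), (0, 'string')]
print(toy_resolve(record_schema, file_schema, ordered_eq=False))  # [(0, 'double'), (1, 'string')]
```

In this model the order-insensitive comparison takes the short-circuit branch, so the double writer is paired with record position 0 even though the record holds a string there. That mismatch is the reason for doubting whether the updated expectation is semantically correct.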

kevinjqliu · 2024-05-04T18:59:53Z

tests/integration/test_rest_schema.py

@@ -1730,19 +1730,17 @@ def test_move_nested_field_after_first(catalog: Catalog) -> None:
    with tbl.update_schema() as schema_update:
        schema_update.move_before("struct.data", "struct.count")


This makes me think that the field ordering does matter...

@Fokko wdyt?

Fokko · 2024-05-09

First of all, thanks for digging into this 🎉

Technically the ordering does not matter when you write the data, because when reading we're correcting the order using this one:

iceberg-python/pyiceberg/io/pyarrow.py

Line 1143 in d02d7a1

def to_requested_schema(requested_schema: Schema, file_schema: Schema, table: pa.Table) -> pa.Table:

Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced in #523

kevinjqliu · 2024-05-09

Makes sense, thanks!

kevinjqliu · 2024-05-09

> Maybe we should also use that visitor when writing (instead of the PyArrow cast) introduced

We're relying on the PyArrow cast to translate some PyArrow data types into the corresponding Iceberg-supported data types, such as large_string -> string. Since LargeString is not an Iceberg-supported data type, we cannot use to_requested_schema. Maybe it's possible to cast PyArrow LargeString into Iceberg String.

Fokko · 2024-05-29 (edited)

I don't think we want to do that. String and LargeString are an Arrow encoding detail (similar to the categorical type). Maybe we should have a different version of to_requested_schema where we don't cast and just keep the original types? If the types are incompatible (for example, the field-id points to a string in the schema and you try to write a boolean), it should fail.
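The idea of aligning by field ID without casting could be sketched roughly as follows. Everything here is hypothetical and uses plain Python containers rather than PyArrow or pyiceberg objects: `align_without_cast`, the `(field_id, type_name)` schema shape, and the `SAME_LOGICAL_TYPE` table are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical shapes: a schema is a list of (field_id, type_name) pairs and a
# table maps field_id -> (type_name, values).
ToySchema = List[Tuple[int, str]]
ToyTable = Dict[int, Tuple[str, list]]

# Assumed pairs of (actual encoding, schema type) that share one logical type.
SAME_LOGICAL_TYPE = {("large_string", "string")}


def align_without_cast(schema: ToySchema, table: ToyTable) -> List[Tuple[int, str, list]]:
    """Reorder columns to schema order by field ID, keeping their original types."""
    aligned = []
    for field_id, expected in schema:
        actual, values = table[field_id]
        if actual != expected and (actual, expected) not in SAME_LOGICAL_TYPE:
            # Incompatible types fail loudly instead of being cast.
            raise TypeError(f"field {field_id}: cannot write {actual} as {expected}")
        aligned.append((field_id, actual, values))
    return aligned


schema = [(1, "string"), (2, "double")]
table = {2: ("double", [1.5]), 1: ("large_string", ["a"])}

# Columns come back in schema order and large_string is kept, not cast.
print(align_without_cast(schema, table))
```

Passing a column such as `("boolean", [True])` for field 1 would raise `TypeError` in this sketch, which mirrors the "it should fail" behaviour described in the comment.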

kevinjqliu · 2024-06-18T23:42:35Z

Closing in favor of #829

kevinjqliu added 4 commits May 4, 2024 13:14

add test

8c61199

compare fields with set

79c73eb

fix StructType eq

9b27631

sort and compare

eae6206

kevinjqliu added 2 commits May 4, 2024 14:50

inline function

00cc61a

fix integration test

f16f284

Fokko mentioned this pull request Jun 2, 2024

ValueError: Mismatch in fields: ? #674

Closed

kevinjqliu closed this Jun 18, 2024

kevinjqliu deleted the kevinjqliu/schema-field-order branch June 19, 2024 01:16
