Adding promotion for UnknownType per V3+ spec #2155

Open: wants to merge 2 commits into base: main

ldsantos0911

Rationale for this change

When attempting to write a PyArrow table in which a column has the null type (for example, because every value is None) to a table whose corresponding field is nullable, I see this error:

ValueError: Mismatch in fields:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Table field              ┃ Dataframe field           ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ ❌ │ 1: col1: optional string │ 1: col1: optional unknown │
│ ✅ │ 2: col2: optional string │ 2: col2: optional string  │
└────┴──────────────────────────┴───────────────────────────┘

Per the V3 spec, UnknownType (the Iceberg type that pa.null() is converted to) should be promotable to any type. This change implements that promotion, which unblocks writes where a DataFrame ends up with a null type for an optional field.
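
The promotion rule described above can be illustrated with a small standalone sketch. It does not use PyIceberg's actual classes or functions: `IcebergType`, `promote`, and the `_WIDENING` table are hypothetical names made up for this illustration, and only the int-to-long widening is modeled alongside the unknown-to-anything rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IcebergType:
    """Hypothetical stand-in for an Iceberg primitive type (not PyIceberg's class)."""
    name: str


UNKNOWN = IcebergType("unknown")
STRING = IcebergType("string")
INT = IcebergType("int")
LONG = IcebergType("long")

# Illustrative subset of widening promotions; the real spec defines more.
_WIDENING = {(INT, LONG)}


def promote(file_type: IcebergType, table_type: IcebergType) -> IcebergType:
    """Return the type to read/write as, or raise if promotion is not allowed."""
    if file_type == table_type:
        return table_type
    # The rule this PR is about: unknown may be promoted to any type.
    if file_type == UNKNOWN:
        return table_type
    if (file_type, table_type) in _WIDENING:
        return table_type
    raise ValueError(f"Cannot promote {file_type.name} to {table_type.name}")
```

With this sketch, `promote(UNKNOWN, STRING)` returns `STRING`, which mirrors the `col1` case in the error table above, while `promote(STRING, INT)` still raises `ValueError`.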

Note: This issue is related but was based on an older version. Nonetheless, the underlying situation is still blocked.

Are these changes tested?

Yes. I added a couple of relevant unit tests and also ran a manual sanity check:

Python 3.10.18 (main, Jun  4 2025, 08:06:20) [Clang 16.0.0 (clang-1600.0.26.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> from pyiceberg.catalog import load_catalog
>>> from pyiceberg.schema import Schema
>>> from pyiceberg.types import StringType, NestedField
>>> warehouse_path = '/tmp/warehouse'
>>> catalog = load_catalog('default', **{'type': 'sql', 'uri': f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", 'warehouse': f"file://{warehouse_path}"})
>>> table = catalog.load_table('default.test2')
>>> table.schema()
Schema(NestedField(field_id=1, name='col1', field_type=StringType(), required=False), NestedField(field_id=2, name='col2', field_type=StringType(), required=False), schema_id=0, identifier_field_ids=[])
>>> df = pa.Table.from_pylist([{'col1': None, 'col2': 'blah'}])
>>> df.schema
col1: null
col2: string
>>> table.append(df)
>>> x = table.scan().to_arrow()
>>> x
pyarrow.Table
col1: large_string
col2: string
----
col1: [[null]]
col2: [["blah"]]

Are there any user-facing changes?

Yes. UnknownType can now be promoted to the table field's type, so a write like the one shown above succeeds instead of raising ValueError.

@ldsantos0911 ldsantos0911 marked this pull request as ready for review June 26, 2025 21:56
@Fokko Fokko self-requested a review June 27, 2025 06:19