Inconsistent PyArrow Schema Field Metadata on `project_table`: Parquet Field ID #788

### Apache Iceberg version

None

### Please describe the bug 🐞

While refactoring `project_table` (https://github.com/apache/iceberg-python/pull/786), I ran into some issues with the tests because the existing behavior of the `project_table` function isn't consistent about whether it returns the Parquet field ID in its PyArrow schema field metadata.

There are cases where the Parquet field ID is attached to the field metadata, and cases where it isn't: https://github.com/apache/iceberg-python/blob/main/tests/io/test_pyarrow.py#L1062-L1080

I think this is because we use `schema_to_pyarrow` to build a fallback schema, and it attaches the Parquet field ID attribute to the field metadata: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1133

I think we should correct this behavior so that it is consistent for all table scans.

- Do we want to attach the Parquet field ID attribute to every PyArrow schema returned by `project_table`?
- Or should we remove the Parquet field ID from the field metadata of the PyArrow schema? The idea here is that `schema_to_pyarrow` would have two modes, with or without the Parquet field ID (write versus read use cases).

I think avoiding metadata that only serves one use case will be cleaner for users. The Parquet field ID was added to `schema_to_pyarrow` so that we could persist the field ID into the Parquet files on write, but we do not want it when reading the table. Hence, I am leaning towards the second option.

Looking for some thoughts and direction on this issue so we can complete the refactoring to support `Iterator[RecordBatch]` output scans! @Fokko @HonahX 
