Fix/sparse schema and allow column drops #15
Tests cover:
- `validate_schema_changes` incorrectly raises on sparse data
- `cast_table_safe` crashes when the source is missing target columns
- `apply_schema_evolution` never deletes columns for `allow_column_drops=True`
schema_evolution.py:
- Remove `SchemaEvolutionError` for dropped columns. `allow_column_drops=False` leaves columns in the schema with nulls at write time; `allow_column_drops=True` removes them via `apply_schema_evolution`.
- Add a `dropped_fields` parameter to `apply_schema_evolution`, calling `update.delete_column()` for each field.
- `evolve_schema_if_needed` logs sparse columns as a warning when `allow_column_drops=False` and returns early without schema changes (see the sketch below). Passes `dropped_fields` to `apply_schema_evolution` only when `allow_column_drops=True`.

schema_casting.py:
- `cast_table_safe` adds null columns (`pa.nulls`) for any field in the target schema missing from the source table before `table.select()`, completing the behavior `validate_cast` already documents.
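A minimal sketch of that `evolve_schema_if_needed` flow, assuming the behavior described above; the signatures, variable names, and the stubbed helper are illustrative, not the actual module code:

```python
import logging

logger = logging.getLogger(__name__)


def apply_schema_evolution(iceberg_table, dropped_fields=None):
    # Stub standing in for the real helper described above; a PyIceberg
    # delete_column() sketch appears later in this thread.
    ...


def evolve_schema_if_needed(iceberg_table, dropped_fields, allow_column_drops=False):
    # Sparse batch with drops disallowed: warn, keep the schema unchanged,
    # and let the missing columns be null-filled at write time.
    if dropped_fields and not allow_column_drops:
        logger.warning(
            "Columns %s are absent from incoming data; keeping them in the "
            "Iceberg schema, new rows will receive nulls",
            sorted(dropped_fields),
        )
        return
    # Drops allowed: forward the fields so they are actually deleted.
    apply_schema_evolution(iceberg_table, dropped_fields=dropped_fields)
```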
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d21214d771
```python
table = table.append_column(
    target_field,
    pa.nulls(len(table), type=target_field.type),
)
```
Reject omitted required columns instead of null-filling
When the existing Iceberg schema has a required column (for example from a dlt column with `nullable=False`) and a later sparse batch omits it, this path appends an all-null array for that non-nullable target field. `validate_cast` only treats missing target fields as warnings, so even `strict_casting=True` will proceed and produce rows that violate the required Iceberg schema, leading to append failures or invalid data. Please only fill nullable/defaulted fields, and raise a casting error for omitted required columns.
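One way to implement that suggestion, as a hedged sketch in plain PyArrow. The guard would live inside `cast_table_safe`; the function and exception names here are illustrative, not from the codebase:

```python
import pyarrow as pa


class CastError(Exception):
    """Illustrative error type; the real code may use its own exception."""


def fill_or_reject(table: pa.Table, target_schema: pa.Schema) -> pa.Table:
    for target_field in target_schema:
        if target_field.name in table.column_names:
            continue
        # Required (non-nullable) fields must not be silently null-filled.
        if not target_field.nullable:
            raise CastError(
                f"Required column '{target_field.name}' is missing from the source batch"
            )
        # Nullable fields get the existing all-null treatment.
        table = table.append_column(
            target_field,
            pa.nulls(len(table), type=target_field.type),
        )
    return table
```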
Hi Nico, side note, thanks for your work on the `_dlt_loads` table, I was looking for it, and all I had to do was install your latest release 😄

The 'sparse data' problem may sound similar to my previous PR: #10
But that PR addressed the scenario where dlt didn't know about the existing wide schema in a fresh container. Once the schema is known (either from `_dlt_version` or derivation), I discovered these downstream problems:

Problems
When ingesting sparse data (subsequent runs with fewer columns than the established schema), 3 problems surface:
1. False `SchemaEvolutionError` -- `validate_schema_changes` raises when columns are "dropped" (present in table but absent in incoming data), even though the incoming data is not requesting a schema change. The columns should remain in the Iceberg table and new rows should receive nulls.
2. `cast_table_safe` crashes on missing columns -- Even if the error above were bypassed, `cast_table_safe` calls `table.select(target_field_names)`, which raises a `KeyError` for columns in the target Iceberg schema that don't exist in the source Arrow table (see the minimal reproduction after this list). `validate_cast` already documents the correct intent -- "Field X exists in target but not in source (will be null)" -- but the cast logic never followed through.
3. `allow_column_drops=True` was a no-op -- `dropped_fields` was computed and validated but never passed to `apply_schema_evolution`, so columns were never actually removed from the Iceberg schema regardless of the flag value.
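For reference, problem 2 is easy to reproduce in plain PyArrow (the column names here are made up for illustration):

```python
import pyarrow as pa

source = pa.table({"id": [1, 2]})  # sparse batch: "email" column omitted

# Selecting a column that only exists in the target Iceberg schema raises,
# which is what cast_table_safe previously tripped over.
try:
    source.select(["id", "email"])
except KeyError as exc:
    print("select failed:", exc)
```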
Solution

- Sparse data (`allow_column_drops=False`, default): the Iceberg table schema stays unchanged and new rows receive `null` for columns they don't contain.
- Column drops (`allow_column_drops=True`): columns missing from incoming data are removed from the Iceberg schema via `update.delete_column()` (see the PyIceberg sketch after this list).
- No change to `destination.py` -- `allow_column_drops=False` remains the correct default.
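A minimal sketch of the drop path, assuming a PyIceberg table handle; `drop_sparse_columns` and the loop shape are assumptions based on this description, while `update_schema()` / `delete_column()` are standard PyIceberg APIs:

```python
from pyiceberg.table import Table


def drop_sparse_columns(iceberg_table: Table, dropped_fields: set[str]) -> None:
    # update_schema() returns a transaction-like context; delete_column()
    # removes a field by name, and the change commits on context exit.
    with iceberg_table.update_schema() as update:
        for field_name in sorted(dropped_fields):
            update.delete_column(field_name)
```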
Changes

schema_evolution.py
- `validate_schema_changes`: no longer raises `SchemaEvolutionError` for dropped columns. Neither case warrants an error -- `allow_column_drops=True` removes columns via `apply_schema_evolution`, and `allow_column_drops=False` leaves them in the schema with nulls filled at write time.
- `apply_schema_evolution`: new `dropped_fields` parameter. When provided, calls `update.delete_column()` for each field -- the actual implementation of `allow_column_drops=True` that was previously missing.
- `evolve_schema_if_needed`: logs sparse columns as a warning when `allow_column_drops=False`. Passes `dropped_fields` to `apply_schema_evolution` only when `allow_column_drops=True`. Returns early without schema changes for the sparse case.

schema_casting.py
- `cast_table_safe`: before `table.select(target_field_names)`, adds a null column (`pa.nulls`) for any field in the target schema missing from the source table (see the sketch after this list). Completes what `validate_cast` already documents as the intended behavior.
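The null-filling step in isolation, as a self-contained sketch; `fill_missing_with_nulls` is a hypothetical helper name, and in the PR the logic sits inline in `cast_table_safe`:

```python
import pyarrow as pa


def fill_missing_with_nulls(table: pa.Table, target_schema: pa.Schema) -> pa.Table:
    # Append an all-null column for every target field the source lacks,
    # then select in target order so the subsequent cast sees all columns.
    for target_field in target_schema:
        if target_field.name not in table.column_names:
            table = table.append_column(
                target_field,
                pa.nulls(len(table), type=target_field.type),
            )
    return table.select(target_schema.names)
```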
Tests

- `test_sparse_schema.py`: new, covering all three problems
- `test_schema_evolution.py`: updated to reflect corrected behavior
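As a rough shape for the sparse-data coverage, a pytest-style sketch exercising the hypothetical `fill_missing_with_nulls` helper from the previous snippet (pure PyArrow; names and assertions are assumptions, not the actual tests):

```python
import pyarrow as pa


def test_sparse_batch_is_null_filled():
    target_schema = pa.schema([
        pa.field("id", pa.int64()),
        pa.field("email", pa.string()),  # established column the batch omits
    ])
    sparse = pa.table({"id": [1, 2]})

    result = fill_missing_with_nulls(sparse, target_schema)

    # The omitted column is present, ordered per the target, and all null.
    assert result.schema.names == ["id", "email"]
    assert result.column("email").null_count == 2
```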