Schema evolution at the column level. #3171
CrispyCrafter
started this conversation in
General
Replies: 1 comment
-
There is no native way to do this currently. If you think you have a strong use case, you could try to contribute this feature to our schema evolution code.
-
We've been running DeltaLake with S3 backing in production for a few months now with excellent results.
Yesterday we came to realise that an opinionated decision to cast one column to int, as opposed to float, upstream of delta-lake had introduced marginal errors in an analytical workflow. To my surprise, DeltaLake does not seem to natively support updating a column's type in this situation. There are obviously certain data types that cannot be cast this way; however, int to float is a perfectly valid (widening) operation.
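To make the failure mode concrete, here is a minimal, hypothetical illustration (not our actual pipeline; the values and column are made up) of the kind of marginal error an upstream int cast introduces into a downstream aggregate:

```python
# Hypothetical sensor readings; in the real pipeline these landed in a Delta table.
readings = [2.4, 2.6, 2.4, 2.6]

# The upstream cast to int truncates each value before it reaches the table.
as_int = [int(x) for x in readings]  # [2, 2, 2, 2]

true_mean = sum(readings) / len(readings)  # 2.5
biased_mean = sum(as_int) / len(as_int)    # 2.0

# The analytical result is off by 0.5 -- "marginal", but real and systematic.
print(true_mean - biased_mean)  # 0.5
```

Fixing the cast upstream is easy; the problem is that the table's schema still says int for all the data already written.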
The only option I could find to resolve this was to rebuild the table entirely, i.e. using `overwrite` mode. This is expensive, wasteful, and technically not feasible given the volume of data present in S3.
Instead I opted to manually modify both `<>.checkpoint.parquet` and `_last_checkpoint`, which seems to have done the trick. Here is the pseudo workflow that I used:
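The core of such a workflow is rewriting the `schemaString` JSON that Delta stores in the checkpoint's `metaData` action. A minimal sketch of just that step in pure Python (the column and type names are illustrative, and reading/writing the actual checkpoint parquet around it, e.g. with pyarrow, is assumed rather than shown):

```python
import json

def widen_column(schema_string: str, column: str, new_type: str = "float") -> str:
    """Return a Delta schemaString with `column` retyped to `new_type`.

    `schema_string` is the JSON-encoded schema found in the metaData action
    of the checkpoint (and in the JSON commit files).
    """
    schema = json.loads(schema_string)
    for field in schema["fields"]:
        if field["name"] == column:
            field["type"] = new_type  # e.g. "integer" -> "float"
    return json.dumps(schema)

# Illustrative schemaString containing the offending int column:
before = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "value", "type": "integer", "nullable": True, "metadata": {}},
    ],
})
after = widen_column(before, "value")
print(json.loads(after)["fields"][0]["type"])  # float
```

The rewritten string then has to be written back into the checkpoint parquet, with `_last_checkpoint` kept consistent, and the new type must be a valid Delta primitive name (`"integer"`, `"long"`, `"float"`, `"double"`, ...).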
This workflow updated the schema for all downstream consumers, such that the type is now listed as float in the `DeltaTable` interface.
Side note: we use the `rust` writer, in `append` mode with schema mode `merge`.
Surely there has to be a better, native way to support this kind of operation?