Configurable column encoding for parquet checkpoint files to address Fabric limitation #3212

Open
dmunch opened this issue Feb 12, 2025 · 1 comment · May be fixed by #3214
Labels
enhancement New feature or request

Comments

dmunch commented Feb 12, 2025

Description

I'd like to be able to specify the column encoding that the parquet writer uses when creating checkpoint files.

Currently, the writer properties are hard-coded in checkpoints.rs.

Use Case

Microsoft Fabric currently does not support run-length-encoded parquet files for checkpoints. As a result, the SQL analytics endpoint errors when it tries to read a delta lake table created by delta-rs that includes a checkpoint.

Workaround

I currently use a post-processing step that rewrites the checkpoint parquet file with PLAIN encoding, like this:

def parquet_file_convert_encoding_to_plain(file_path: str):
    import pyarrow.parquet as pq
    import os

    table = pq.read_table(file_path)
    
    # Write the table to a temporary file using the new encoding properties
    # which force the use of PLAIN encoding
    tmp_file = file_path + ".tmp"
    pq.write_table(table, tmp_file, use_dictionary=False, column_encoding="PLAIN")

    # Replace the original file with the new one
    os.replace(tmp_file, file_path)

Fabric is happily reading the delta lake table once the checkpoint file has been post-processed this way.

Related Issue(s)

dmunch added the enhancement label Feb 12, 2025

ion-elgreco (Collaborator) commented

You can try putting up a PR to expose writerProperties on create_checkpoint.
