Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: configurable column encoding for parquet checkpoint files #3214

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dmunch
Copy link

@dmunch dmunch commented Feb 13, 2025

Description

This PR adds a table configuration option to enable or disable run length encoding for checkpoint files.

Note: I'm unsure if the table option is the right way to go - In the original issue it was propose to expose writerProperties on create_checkpoint, however, after evaluating this, I figured this has a few downsides:

  • The automatic checkpointing wouldn't be aware of those custom writerProperties
  • writerProperties would expose too much control, i.e. I don't really need that level of control over the checkpoint writing
  • bindings would need to be updated (I'm using python and C# bindings, so that would add additional work)

Instead I went down the route of table properties, unsure however if the Delta Lake spec allows implementation specific table properties, or if these should be a well-defined set of properties. Since in doubt, I've prefixed the table property with delta-rs instead of delta. Happy to discuss!

If the approach is validated I can update/add documentation as well.

Related Issue(s)

Documentation

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Feb 13, 2025
@dmunch dmunch force-pushed the feat/checkpoint-file-rle-encoding branch from 0a8c1ef to 4ddbb45 Compare February 13, 2025 14:13
Copy link

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Project coverage is 72.25%. Comparing base (b3efdfc) to head (433027b).

Files with missing lines Patch % Lines
crates/core/src/protocol/checkpoints.rs 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3214      +/-   ##
==========================================
+ Coverage   72.10%   72.25%   +0.15%     
==========================================
  Files         138      138              
  Lines       45320    45324       +4     
  Branches    45320    45324       +4     
==========================================
+ Hits        32678    32750      +72     
+ Misses      10567    10497      -70     
- Partials     2075     2077       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ion-elgreco
Copy link
Collaborator

Makes sense, and I think it's fine to use custom properties. Other connectors also have specific configurations, however the put them under delta, for example Trino: https://trino.io/docs/current/connector/delta-lake.html#delta-lake-general-configuration-properties

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Configurable column encoding for parquet checkpoint files to address Fabric limitation
2 participants