python-v0.19.0: complete CDF support, add column operation, faster MERGE
Breaking changes!
The default writer engine is now rust. Replace your partition_filters with a SQL predicate instead. The PyArrow engine is deprecated and will be removed in v1.0.
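As a migration aid, the move from partition_filters to a predicate can be sketched as a small conversion helper. This is a minimal sketch, not part of deltalake: `filters_to_predicate` and `_quote` are hypothetical names, and every value is emitted as a quoted string literal because partition_filters carries values as strings. A typed column may need an unquoted literal instead.

```python
from typing import Iterable, List, Sequence, Tuple, Union

FilterValue = Union[str, Sequence[str]]
PartitionFilter = Tuple[str, str, FilterValue]


def _quote(value: str) -> str:
    # Build a SQL string literal; embedded single quotes are doubled.
    return "'" + value.replace("'", "''") + "'"


def filters_to_predicate(filters: Iterable[PartitionFilter]) -> str:
    """Join (column, op, value) filters into one AND-ed SQL predicate string.

    Hypothetical helper for illustration only.
    """
    clauses: List[str] = []
    for column, op, value in filters:
        op = op.strip().lower()
        if op in ("=", "!=", "<", "<=", ">", ">="):
            clauses.append(f"{column} {op} {_quote(str(value))}")
        elif op in ("in", "not in"):
            items = ", ".join(_quote(str(v)) for v in value)
            clauses.append(f"{column} {op.upper()} ({items})")
        else:
            raise ValueError(f"unsupported operator: {op!r}")
    return " AND ".join(clauses)


print(filters_to_predicate([("year", "=", "2024"), ("country", "in", ["NL", "BE"])]))
# prints: year = '2024' AND country IN ('NL', 'BE')
```

The resulting string would then be passed as the predicate argument of write_deltalake in an overwrite, in place of the old partition_filters argument.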
Highlights
- CDF support in write_deltalake, delete, and merge operation
- Expired logs cleanup during post-commit. Can be disabled with delta.enableExpiredLogCleanup = false
- Improved MERGE performance by using predicate non-partition columns min/max for prefiltering
- ADD column operation
- Speed up log parsing
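The highlights above can be exercised together in a short script. This is a sketch under assumptions: the `demo` function and the table contents are made up for illustration, and the calls reflect my understanding of the deltalake Python API in this release, so check signatures against the documentation before relying on them.

```python
# Table properties are passed as strings.
TABLE_CONFIGURATION = {
    "delta.enableChangeDataFeed": "true",      # record change data on writes
    "delta.enableExpiredLogCleanup": "false",  # disable post-commit log cleanup
}


def demo(table_uri: str) -> None:
    """Hypothetical walk-through; needs the deltalake and pyarrow packages."""
    # Imported lazily so this module loads without deltalake installed.
    import pyarrow as pa
    from deltalake import DeltaTable, write_deltalake
    from deltalake.schema import Field, PrimitiveType

    data = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    write_deltalake(table_uri, data, configuration=TABLE_CONFIGURATION)
    print(DeltaTable.is_deltatable(table_uri))

    dt = DeltaTable(table_uri)
    # With the change data feed enabled, the delete is recorded as change data.
    dt.delete("id = 1")

    # Read the change feed, projecting a subset of columns.
    reader = dt.load_cdf(starting_version=0, columns=["id", "_change_type"])
    print(reader.read_all().to_pylist())

    # Add a nullable string column to the table schema.
    dt.alter.add_columns(Field("note", PrimitiveType("string")))
    print(dt.schema())
```

Calling `demo` with the path of an empty temporary directory should create the table, delete one row, print the change feed, and print the schema including the new column.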
Performance improvements
- perf: apply projection when reading checkpoint parquet by @alexwilcoxson-rel in #2717
- perf: grab file size in rust by @ion-elgreco in #2734
- feat: improve merge performance by using predicate non-partition columns min/max for prefiltering by @JonasDev1 in #2513
- perf: early stop if all values in arr are null by @ion-elgreco in #2764
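The idea behind min/max prefiltering for MERGE can be illustrated in plain Python. This is not the delta-rs implementation, only one plausible reading of it: `FileStats` and `prefilter_files` are hypothetical names, and the sketch assumes a single integer predicate column with per-file min/max statistics.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one predicate column."""
    path: str
    min_value: int
    max_value: int


def prefilter_files(files: Iterable[FileStats], source_keys: Iterable[int]) -> List[str]:
    """Keep only files whose [min, max] range overlaps the source key range."""
    keys = list(source_keys)
    if not keys:
        return []
    lo, hi = min(keys), max(keys)
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]
print(prefilter_files(files, [150, 160, 210]))
# prints: ['part-1.parquet', 'part-2.parquet']
```

Files whose value range cannot overlap the source keys are skipped before the join, which is where the saving comes from when the predicate column is not a partition column.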
New features
- feat(python, rust): cdc write-support for delete operation by @ion-elgreco in #2721
- feat(python, rust): cdc write-support for overwrite and replacewhere writes by @ion-elgreco in #2722
- feat: introduce CDC generation for merge operations by @rtyler in #2747
- feat: use logical plan in delete, delta planner refactoring by @ion-elgreco in #2725
- feat: use logical plan in update, refactor/simplify CDCTracker by @ion-elgreco in #2727
- feat(python, rust): arrow large/view types passthrough, rust default engine by @ion-elgreco in #2738
- feat(python, rust): cleanup expired logs post-commit hook by @ion-elgreco in #2459
- feat(python, rust): add column operation by @ion-elgreco in #2562
- feat(python): handle PyCapsule interface objects in write_deltalake by @kylebarron in #2534
- feat(rust): fix size_in_bytes in last_checkpoint_ to i64 by @sherlockbeard in #2649
- feat(rust,python): cast each parquet file to delta schema by @HawaiianSpork in #2615
- feat: support userMetadata in CommitInfo by @jkylling in #2670
- feat(python, rust): add projection in CDF reads by @ion-elgreco in #2704
- feat(python): add DeltaTable.is_deltatable static method (#2662) by @omkar-foss in #2715
- feat: improved test fixtures by @roeap in #2749
- feat: fail fast on forked process by @Tom-Newton in #2765
- feat: restore the TryFrom for DeltaTablePartition by @rtyler in #2767
- feat: more economic data skipping with datafusion by @roeap in #2772
Bug Fixes
- fix(rust): inconsistent order of partitioning columns (#2494) by @aditanase in #2614
- fix(rust,python): checkpoint with column nullable false by @sherlockbeard in #2680
- fix: update delta kernel version by @jeppe742 in #2685
- fix(python): empty dataset fix for "pyarrow" engine by @sherlockbeard in #2689
- fix: ensure DataFusion SessionState Parquet options are applied to DeltaScan by @alexwilcoxson-rel in #2702
- fix(python, rust): use url encoder when encoding partition values by @ion-elgreco in #2705
- fix(python, rust): use input schema to get correct schema in cdf reads by @ion-elgreco in #2723
- fix: change arrow map root name to follow with parquet root name by @sclmn in #2538
- fix: schema adapter doesn't map partial batches correctly by @alexwilcoxson-rel in #2735
- fix: optimize Spark written tables by @rtyler in #1650
- fix(python, rust): cdc in writer not creating inserts by @ion-elgreco in #2751
- fix(python, rust): don't flatten fields during cdf read by @ion-elgreco in #2763
- fix: column parsing to include nested columns and enclosing char by @gtrawinski in #2737
Other Changes
- chore: missed one macos runner reference in actions by @rtyler in #2645
- chore: add a reproduction case for merge failures with struct by @rtyler in #2644
- ci: update CODEOWNERS by @hntd187 in #2650
- chore: increase subcrate versions by @rtyler in #2648
- docs: fix bullets on hdfs docs by @Kimahriman in #2653
- docs: improve navigation fixes by @avriiil in #2660
- docs: add integration docs for s3 backend by @avriiil in #2658
- chore: bump ruff to 0.5.2 by @fpgmaas in #2673
- chore: enable RUF ruleset for ruff by @fpgmaas in #2677
- chore: pin ruff and mypy versions in the lint stage in the CI pipeline by @fpgmaas in #2679
- chore: update README.md by @veronewra in #2684
- chore: create separate action to setup python and rust in the cicd pipeline by @fpgmaas in #2687
- chore: add test coverage command to Makefile by @fpgmaas in #2688
- chore: improve contributing.md by @fpgmaas in #2672
- chore: remove stale code for conditional import of Literal by @fpgmaas in #2676
- chore: remove references to black from the project by @fpgmaas in #2674
- chore: refactor write_deltalake in writer.py by @fpgmaas in #2695
- chore: upgrade to datafusion 40 by @rtyler in #2661
- chore: prepare python release 0.18.3 by @ion-elgreco in #2707
- chore: enabling actions for merge groups by @rtyler in #2718
- chore(deps): update sqlparser requirement from 0.47 to 0.49 by @dependabot in #2714
- chore: try an alternative docker compose invocation syntax by @rtyler in #2724
- chore(deps): update which requirement from 4 to 6 by @dependabot in #2730
- chore: update changelog and versions for next release by @rtyler in #2740
- chore: add to code_owner crates by @ion-elgreco in #2741
- chore: update delta_kernel to 0.3.0 by @alexwilcoxson-rel in #2742
- docs: fix broken link in docs by @astrojuanlu in #2746
- chore: upgrade to datafusion 41 by @rtyler in #2761
- chore: prepare the next notable release of 0.19.0 by @rtyler in #2768
- chore: fix a bunch of clippy lints and re-enable tests by @rtyler in #2773
New Contributors
- @aditanase made their first contribution in #2614
- @fpgmaas made their first contribution in #2673
- @kylebarron made their first contribution in #2534
- @veronewra made their first contribution in #2684
- @jeppe742 made their first contribution in #2685
- @sclmn made their first contribution in #2538
- @astrojuanlu made their first contribution in #2746
- @gtrawinski made their first contribution in #2737
Full Changelog: python-v0.18.2...python-v0.19.0