---
title: "Release 1.0.0"
sidebar_position: 1
layout: releases
toc: true
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

## [Release 1.0.0](https://github.com/apache/hudi/releases/tag/release-1.0.0) ([docs](/docs/quick-start-guide))

Apache Hudi 1.0.0 is a major milestone release. It contains significant format changes and exciting new features,
as described below.

## Migration Guide

We encourage users to try the **1.0.0** features on new tables first. The 1.0 general availability (GA) release
supports automatic table upgrades from 0.x versions, while ensuring full backward compatibility when reading 0.x
Hudi tables with 1.0, making for a seamless migration experience.

This release comes with **backward compatible writes**, i.e., 1.0.0 can write in both table version 8 (latest) and the
older table version 6 (corresponding to 0.14 and above). Automatic upgrades for tables from 0.x versions are fully
supported, minimizing migration challenges. Until all readers are upgraded, users can deploy 1.0.0 binaries for the
writers and leverage backward compatible writes to continue writing tables in the older format. Once the readers are
fully upgraded, users can switch to the latest format through a config change. We recommend following the upgrade steps
in the [migration guide](/docs/deployment#upgrading-to-100) to ensure a smooth transition.
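
For illustration, here is a minimal sketch of a backward compatible write from a spark-shell with the Hudi 1.0.0
bundle. It assumes the `hoodie.write.table.version` write config described in RFC-78; verify the exact key against the
configuration reference for your build, and treat the table/column names as placeholders.

```scala
import org.apache.spark.sql.SaveMode

val basePath = "file:///tmp/hudi/trips"
val df = spark.sql("SELECT '1' AS uuid, 'sf' AS city, current_timestamp() AS ts")

df.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.write.table.version", "6"). // keep writing the 0.x-readable table version 6 format
  mode(SaveMode.Append).
  save(basePath)
```

Once every reader runs 1.0.0, the same config can be switched to `8` following the migration guide.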

:::caution
Most things are handled seamlessly by the auto upgrade process, but there are some limitations. Please read through the
limitations of the upgrade/downgrade process before proceeding to migrate. Check the [migration guide](/docs/deployment#upgrading-to-100)
and [RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers) for more details.
:::

## Bundle Updates

- All bundles supported in the [0.15.0 release](release-0.15.0#new-spark-bundles) continue to be supported.
- New Flink bundles add support for Flink 1.19 and Flink 1.20.
- Support for Spark 3.2 and lower versions is deprecated in this release.

## Highlights

### Format changes

The main epic covering all the format changes is [HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242), which is also
covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The following are the main highlights with respect to format changes:

#### Timeline

- The active and archived timeline dichotomy has been replaced with a more scalable LSM-tree-based timeline. The
  timeline layout is now more organized and efficient for time-range queries and scaling to infinite history.
- As a result, the timeline layout has changed and has moved to the `.hoodie/timeline` directory under the base path
  of the table.
- There are changes to the timeline instant files as well:
  - All commit metadata is serialized to Avro, allowing for future compatibility and uniformity in metadata across all
    actions.
  - Instant files for completed actions now include a completion time.
  - The action for pending clustering instants is renamed to `clustering` to make it distinct from other
    `replacecommit` actions.

#### Log File Format

- In addition to the keys in the log file header, we also store record positions. Refer to the
  latest [spec](/tech-specs-1point0#log-format) for more details. This allows us to do position-based merging (apart
  from key-based merging) and to skip pages based on positions.
- Log file names now contain the deltacommit instant time instead of the base commit instant time.
- The new log file format also enables fast partial updates with low storage overhead.

### Compatibility with Old Formats

- **Backward compatible writes:** Hudi 1.0 writes support both table version 8 (latest) and the older table version 6
  (corresponding to 0.14 and above) formats, ensuring seamless integration with existing setups.
- **Automatic upgrades:** Upgrades for tables from 0.x versions are fully supported, minimizing migration challenges.
  If you have advanced setups with multiple readers/writers/table services, we recommend first migrating to 0.14 or
  above.

### Concurrency Control

1.0.0 introduces **Non-Blocking Concurrency Control (NBCC)**, enabling multi-stream concurrent ingestion without
conflict. This is a general-purpose concurrency model aimed at stream processing and high-contention/frequent-write
scenarios. In contrast to Optimistic Concurrency Control, where writers abort the transaction if there is a hint of
contention, NBCC allows multiple streaming writes to the same Hudi table without any conflict-resolution overhead,
while keeping the event-time ordering semantics found in streaming systems, along with asynchronous table services
such as compaction, archiving and cleaning.
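
As a flavor of how this is wired up, below is a sketch of a Flink SQL sink definition with NBCC enabled, loosely
following the blog demo linked below. The table and column names are illustrative, and the config keys
(`hoodie.write.concurrency.mode`, the bucket index) should be verified against the 1.0 docs; NBCC targets
Merge-on-Read tables with the bucket index.

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

// Two jobs defining the same sink this way can write concurrently without
// lock conflicts; completion time decides how overlapping records are merged.
tEnv.executeSql(
  """
    |CREATE TABLE hudi_sink (
    |  uuid STRING PRIMARY KEY NOT ENFORCED,
    |  city STRING,
    |  ts   TIMESTAMP(3)
    |) WITH (
    |  'connector' = 'hudi',
    |  'path' = 'file:///tmp/hudi/hudi_sink',
    |  'table.type' = 'MERGE_ON_READ',
    |  'index.type' = 'BUCKET',
    |  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'
    |)
    |""".stripMargin)
```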

To learn more about NBCC, refer to [this blog](/blog/2024/12/06/non-blocking-concurrency-control), which also includes a demo with Flink writers.

### New Indices

1.0.0 introduces new indices to the multi-modal indexing subsystem of Apache Hudi. These indices are designed to improve
query performance through partition pruning and further data skipping.

#### Secondary Index

The **secondary index** allows users to create indexes on columns that are not part of the record key. It can be used
to speed up queries with predicates on such non-key columns.
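
As a quick illustration, a secondary index can be created and exercised through Spark SQL as sketched below. The DDL
follows the 1.0 SQL docs, the table and column names are illustrative, and the table is assumed to have the record
level index enabled.

```scala
// Create a secondary index on a non-key column of an existing Hudi table.
spark.sql("CREATE INDEX idx_city ON hudi_table (city)")

// Equality predicates on `city` can now prune files through the index.
spark.sql("SELECT uuid, fare FROM hudi_table WHERE city = 'san_francisco'").show()
```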

#### Partition Stats Index

The **partition stats index** aggregates statistics at the partition level for the columns for which it is enabled.
This helps in efficient partition pruning, even for predicates on non-partition fields.
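
A sketch of enabling it at write time follows. The keys mirror the metadata index configs and should be verified
against the 1.0 configuration reference; `df` and `basePath` are the placeholders from the earlier write sketch.

```scala
// df and basePath as in the earlier backward compatible write sketch.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.metadata.index.partition.stats.enable", "true").
  // partition stats are aggregated from column stats, so enable those too
  option("hoodie.metadata.index.column.stats.enable", "true").
  option("hoodie.metadata.index.column.stats.column.list", "city,fare").
  mode("append").
  save(basePath)
```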

#### Expression Index

The **expression index** enables efficient queries on columns derived from expressions. It can collect stats on columns
derived from expressions without materializing them, and can be used to speed up queries with filters containing such
expressions.
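
For example, an expression index over a date derived from a timestamp column can be created via Spark SQL as sketched
below, following the `CREATE INDEX ... USING column_stats` syntax in the 1.0 SQL docs; the names are illustrative.

```scala
// Index the date string derived from `ts` without materializing a new column.
spark.sql(
  """CREATE INDEX idx_datestr ON hudi_table
    |USING column_stats(ts)
    |OPTIONS(expr='from_unixtime', format='yyyy-MM-dd')""".stripMargin)

// Filters on the derived expression can now skip data using the index.
spark.sql(
  "SELECT * FROM hudi_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2024-12-06'").show()
```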

To learn more about these indices, refer to the [SQL queries](/docs/sql_queries#snapshot-query-with-index-acceleration) docs.

### Partial Updates

1.0.0 extends support for partial updates to Merge-on-Read tables, which allows users to update only a subset of columns
in a record. This is useful when users want to change a few columns in a record without rewriting the entire record.
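
A minimal sketch with Spark SQL's `MERGE INTO` (table and source names illustrative): only the `fare` column is set,
so on a Merge-on-Read table the delta log stores just the changed column rather than the full record.

```scala
spark.sql(
  """MERGE INTO hudi_table AS t
    |USING fare_adjustments AS s
    |ON t.uuid = s.uuid
    |WHEN MATCHED THEN UPDATE SET t.fare = s.fare""".stripMargin)
```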

To learn more about partial updates, refer to the [SQL DML](/docs/sql_dml#merge-into-partial-update) docs.

### Multiple Base File Formats in a single table

- Support for multiple base file formats (e.g., **Parquet**, **ORC**, **HFile**) within a single Hudi table, allowing
  tailored formats for specific use cases like indexing and ML applications.
- It is also useful when users want to switch from one file format to another, e.g., from ORC to Parquet, without
  rewriting the whole table.
- **Configuration:** Enable with `hoodie.table.multiple.base.file.formats.enable` (see the sketch below).
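
A sketch of enabling this when writing a table, reusing `df`/`basePath` from the earlier sketch: the enable flag is
from this release, while using the `hoodie.table.base.file.format` key to pick the format per write is an assumption
for illustration; verify both against the configuration reference.

```scala
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.table.multiple.base.file.formats.enable", "true").
  option("hoodie.table.base.file.format", "PARQUET"). // a later writer could pick ORC
  mode("append").
  save(basePath)
```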

To learn more about the format changes, refer to the [Hudi 1.0 tech specification](/tech-specs-1point0).

### API Changes

1.0.0 introduces several API changes, including:

#### Record Merger API

The `HoodieRecordPayload` interface is deprecated in favor of the new `HoodieRecordMerger` interface. The record merger
is a generic interface that allows users to define custom logic for merging base file and log file records. This release
comes with a few out-of-the-box merge modes, which define how the base and log files are ordered in a file slice and,
further, how different records with the same record key within that file slice are merged consistently to produce the
same deterministic results for snapshot queries, writers and table services. Specifically, three merge modes are
supported as a table-level configuration (see the sketch after the list below):

- `COMMIT_TIME_ORDERING`: Merging simply picks the record belonging to the latest write (commit time) as the merged
  result.
- `EVENT_TIME_ORDERING`: Merging picks the record with the highest value on a user-specified ordering or precombine
  field as the merged result.
- `CUSTOM`: Users can provide a custom merger implementation to have better control over the merge logic.
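
As a sketch, the merge mode can be pinned when the table is first written; the `hoodie.record.merge.mode` key follows
the 1.0 configs (verify it for your build), and `df`/`basePath` are the earlier placeholders.

```scala
df.write.format("hudi").
  option("hoodie.table.name", "hudi_table").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts"). // ordering field for EVENT_TIME_ORDERING
  option("hoodie.record.merge.mode", "EVENT_TIME_ORDERING").
  mode("overwrite").
  save(basePath)
```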

:::note
Going forward, we recommend users migrate to the record merger APIs rather than writing new payload implementations.
:::

#### Positional Merging with Filegroup Reader

- **Position-based merging:** Offers an alternative to key-based merging, allowing for page skipping based on record
  positions. Enabled by default for Spark and Hive.
- **Configuration:** Activate positional merging using `hoodie.merge.use.record.positions=true` (see the sketch below).
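
For instance, a snapshot read can opt into position-based merging as sketched below, using the flag from this section;
`basePath` is the earlier placeholder.

```scala
val snapshot = spark.read.format("hudi").
  option("hoodie.merge.use.record.positions", "true").
  load(basePath)

snapshot.createOrReplaceTempView("hudi_snapshot")
spark.sql("SELECT count(*) FROM hudi_snapshot").show()
```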

The new reader has shown impressive performance gains for **partial updates** with key-based merging. For a MOR table of
size 1TB with 100 partitions and 80% random updates in subsequent commits, the new reader is **5.7x faster** for
snapshot queries with **70x reduced write amplification**.

### Flink Enhancements

- **Lookup Joins:** Flink now supports lookup joins, enabling table enrichment with external data sources (see the
  sketch below).
- **Partition Stats Index Support:** As mentioned above, partition stats support is now available for Flink, bringing
  efficient partition pruning to streaming workloads.
- **Non-Blocking Concurrency Control:** NBCC is now available for Flink streaming writers, allowing for multi-stream
  concurrent ingestion without conflict.
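
As a flavor of the lookup join support, below is a sketch using standard Flink SQL `FOR SYSTEM_TIME AS OF` syntax
against a Hudi-backed dimension table. The `orders` stream, `customers` table, and column names are illustrative
(`orders` needs a processing-time attribute `proc_time`), and `tEnv` is the environment from the NBCC sketch above.

```scala
// Enrich an incoming orders stream with the latest customer attributes
// served from a Hudi dimension table.
tEnv.executeSql(
  """
    |SELECT o.order_id, o.amount, c.city
    |FROM orders AS o
    |JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    |ON o.customer_id = c.uuid
    |""".stripMargin).print()
```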

## Call to Action

The 1.0.0 GA release is the culmination of extensive development, testing, and feedback. We invite you to upgrade and
experience the new features and enhancements.