Commit 7529570

[DOCS] Add release highlights for 1.0.0 release (#12475)
* Add release highlights for 1.0.0 release
* Code Review comments for release-1.0.0.md
* Fix links and address review comments wrt upgrading
* Add limitations

Co-authored-by: vinoth chandar <[email protected]>
1 parent 88e527c commit 7529570

7 files changed (+230, -23 lines)


website/docs/concurrency_control.md

+5
@@ -214,6 +214,11 @@ currently available for preview in version 1.0.0-beta only with the caveat that
 between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink
 writers [here](sql_dml#non-blocking-concurrency-control-experimental).
 
+:::note
+`NON_BLOCKING_CONCURRENCY_CONTROL` between an ingestion writer and a table service writer is not yet supported for clustering.
+Please use `OPTIMISTIC_CONCURRENCY_CONTROL` for clustering.
+:::
+
 ## Early conflict Detection
 
 Multi-writing using OCC allows multiple writers to concurrently write and atomically commit to the Hudi table if there is no overlapping data file to be written, guaranteeing data consistency, integrity and correctness. Prior to the 0.13.0 release, as the OCC (optimistic concurrency control) name suggests, each writer optimistically proceeds with ingestion and, just before committing, runs a conflict resolution flow to detect overlapping writes and abort one if need be. This can result in a lot of wasted compute, since the aborted commit has to retry from the beginning. With 0.13.0, Hudi introduced early conflict detection, leveraging markers to detect conflicts eagerly and abort early in the write lifecycle instead of at the end. For large-scale deployments, this can avoid wasting a lot of compute resources when there are overlapping concurrent writers.
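
Since the note above directs clustering to `OPTIMISTIC_CONCURRENCY_CONTROL`, a clustering job running alongside NBCC ingestion needs OCC plus a lock provider. Below is a minimal, hedged Spark SQL sketch; the concurrency-mode and ZooKeeper lock-provider keys follow Hudi's concurrency control docs, while the host and paths are placeholders. Verify against your Hudi version and deployment.

```sql
-- Hedged sketch: run the clustering job under OPTIMISTIC_CONCURRENCY_CONTROL
-- (config keys/values assumed from the concurrency control docs; verify for your version)
SET hoodie.write.concurrency.mode=optimistic_concurrency_control;
SET hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider;
SET hoodie.write.lock.zookeeper.url=zk-host;        -- placeholder ZooKeeper host
SET hoodie.write.lock.zookeeper.port=2181;
SET hoodie.write.lock.zookeeper.lock_key=my_table_lock;
SET hoodie.write.lock.zookeeper.base_path=/hudi/locks;
```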

website/docs/deployment.md

+26
@@ -165,6 +165,32 @@ As general guidelines,
 
 Note that release notes can override this information with specific instructions, applicable on a case-by-case basis.
 
+### Upgrading to 1.0.0
+
+1.0.0 is a major release with significant format changes. To ensure a smooth migration experience, we recommend the
+following steps:
+
+1. Stop any async table services in 0.x completely.
+2. Upgrade writers to 1.x with table version (tv) 6, auto-upgrade and metadata disabled (this won't auto-upgrade anything);
+   0.x readers will continue to work; 1.x writers can also act as readers and will continue to read tv=6 tables.
+   a. Set `hoodie.write.auto.upgrade` to false.
+   b. Set `hoodie.metadata.enable` to false.
+3. Upgrade table services to 1.x with tv=6, and resume operations.
+4. Upgrade all remaining readers to 1.x, with tv=6.
+5. Redeploy writers with tv=8; table services and readers will pick up tv=8 on the fly.
+6. Once all readers and writers are on 1.x, you can enable any new features, including the metadata table, on 1.x tables.
+
+During the upgrade, the metadata table will not be updated and will lag behind the data table. Note that the metadata
+table is updated only once the writer is upgraded to tv=8, so readers should also keep metadata disabled during the
+rolling upgrade until all writers are on tv=8.
+
+:::caution
+Most things are handled seamlessly by the auto-upgrade process, but there are some limitations. Please read through the
+limitations of the upgrade/downgrade process before proceeding to migrate. Please
+check [RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers)
+for more details.
+:::
+
 ## Downgrading
 
 Upgrade is automatic whenever a new Hudi version is used whereas downgrade is a manual step. We need to use the Hudi
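
For step 2 of the rolling upgrade shown above, the two flags called out can be set as session-level Hudi configs before starting the 1.x writers. This is a minimal sketch assuming Spark SQL `SET` syntax; the two keys come straight from the steps above.

```sql
-- Hedged sketch: step-2 writer settings for a rolling upgrade (keys from the steps above)
SET hoodie.write.auto.upgrade=false;   -- keep the table at tv=6 for now
SET hoodie.metadata.enable=false;      -- keep metadata disabled until all writers are on tv=8
```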

website/docs/sql_ddl.md

+17-21
@@ -272,6 +272,7 @@ Both index and column on which the index is created can be qualified with some o
 Please note in order to create secondary index:
 1. The table must have a primary key and merge mode should be [COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
 2. Record index must be enabled. This can be done by setting `hoodie.metadata.record.index.enable=true` and then creating `record_index`. Please note the example below.
+3. Secondary index is not supported for [complex types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
 :::
 
 **Examples**
@@ -334,12 +335,18 @@ date based partitioning, provide same benefits to queries, even if the physical
 CREATE INDEX IF NOT EXISTS ts_datestr ON hudi_table
 USING column_stats(ts)
 OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');
--- Create a expression index on the column `ts` (timestamp in yyyy-MM-dd HH:mm:ss) of the table `hudi_table` using the function `hour`
+-- Create an expression index on the column `ts` (timestamp in yyyy-MM-dd HH:mm:ss) of the table `hudi_table` using the function `hour`
 CREATE INDEX ts_hour ON hudi_table
 USING column_stats(ts)
 options(expr='hour');
 ```
 
+:::note
+1. Expression index can only be created with the Spark engine using SQL. It is not yet supported with the Spark DataSource API.
+2. Expression index is not yet supported for [complex types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+3. Expression index is supported for unary and certain binary expressions. Please check the [SQL DDL docs](sql_ddl#create-expression-index) for more details.
+:::
+
 The `expr` option is required for creating expression index, and it should be a valid Spark SQL function. Please check the syntax
 for the above functions in the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-ref-functions.html) and provide the options accordingly. For example,
 the `format` option is required for `from_unixtime` function.
@@ -434,18 +441,21 @@ and execution.
 
 To enable partition stats index, simply set `hoodie.metadata.index.partition.stats.enable = 'true'` in create table options.
 
+:::note
+1. The `column_stats` index must be enabled for the `partition_stats` index; the two go hand in hand.
+2. The `partition_stats` index is not created automatically for all columns. Users must specify the list of columns for which they want to create the partition stats index.
+3. The `column_stats` and `partition_stats` indexes are not yet supported for [complex types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+:::
+
 ### Create Secondary Index
 
 Secondary indexes are record level indexes built on any column in the table. It supports multiple records having the same
 secondary column value efficiently and is built on top of the existing record level index built on the table's record key.
 Secondary indexes are hash based indexes that offer horizontally scalable write performance by splitting key space into shards
 by hashing, as well as fast lookups by employing row-based file formats.
 
-:::note
-Please note in order to create secondary index:
-1. The table must have a primary key and merge mode should be [COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
-2. Record index must be enabled. This can be done by setting `hoodie.metadata.record.index.enable=true` and then creating `record_index`. Please note the example below.
-:::
+Let us now look at an example of creating a table with multiple indexes and how queries leverage the indexes for both
+partition pruning and data skipping.
 
 ```sql
 DROP TABLE IF EXISTS hudi_table;
@@ -513,24 +523,10 @@ Bloom filter indexes store a bloom filter per file, on the column or column expr
 effective in skipping files that don't contain a high cardinality column value e.g. uuids.
 
 ```sql
-CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING bloom_filters(driver) OPTIONS(expr='identity');
+-- Create a bloom filter index on the column derived from expression `lower(rider)` of the table `hudi_table`
 CREATE INDEX idx_bloom_rider ON hudi_indexed_table USING bloom_filters(rider) OPTIONS(expr='lower');
 ```
 
-
-### Limitations
-
-- Unlike column stats, partition stats index is not created automatically for all columns. Users must specify list of
-columns for which they want to create partition stats index.
-- Predicate on internal meta fields such as `_hoodie_record_key` or `_hoodie_partition_path` cannot be used for data
-skipping. Queries with such predicates cannot leverage the indexes.
-- Secondary index is not supported for nested fields.
-- Secondary index can be created only if record index is available in the table
-- Secondary index can only be used for tables using OverwriteWithLatestAvroPayload payload or COMMIT_TIME_ORDERING merge mode
-- Column stats Expression Index can not be created using `identity` expression with SQL. Users can leverage column stat index using Datasource instead.
-- Index update can fail with schema evolution.
-- Only one index can be created at a time using [async indexer](metadata_indexing).
-
 ### Setting Hudi configs
 
 There are different ways you can pass the configs for a given hudi table.
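
To make the secondary-index prerequisites above concrete, here is a hedged sketch. The table and column names (`hudi_indexed_table`, `uuid`, `rider`) are illustrative, and the exact index-creation DDL should be checked against the Create Secondary Index section; only the `hoodie.metadata.record.index.enable` key is taken directly from the note above.

```sql
-- Hedged sketch of the secondary-index prerequisites described above
SET hoodie.metadata.record.index.enable=true;           -- enable record-level index in the metadata table
CREATE INDEX record_index ON hudi_indexed_table (uuid);  -- record index on the record key column (illustrative)
CREATE INDEX idx_rider ON hudi_indexed_table (rider);    -- secondary index on a non-key column (illustrative)
```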

website/docs/sql_dml.md

+8
@@ -212,6 +212,14 @@ SELECT id, name, price, _ts, description FROM tableName;
 
 Notice, instead of `UPDATE SET *`, we are updating only the `price` and `_ts` columns.
 
+:::note
+Partial update is not yet supported in the following cases:
+1. When the target table is a bootstrapped table.
+2. When virtual keys are enabled.
+3. When schema on read is enabled.
+4. When there is an enum field in the source data.
+:::
+
 ### Delete From
 
 You can remove data from a Hudi table using the `DELETE FROM` statement.
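
For reference alongside the partial-update note above, here is a minimal MERGE INTO sketch that updates only the `price` and `_ts` columns; the source relation `updates_source` is assumed purely for illustration.

```sql
-- Hedged sketch: partial update via MERGE INTO, touching only `price` and `_ts`
MERGE INTO tableName AS t
USING (SELECT id, price, _ts FROM updates_source) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.price = s.price, t._ts = s._ts;
```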

website/docs/sql_queries.md

+1-1
@@ -38,7 +38,7 @@ using path filters. We expect that native integration with Spark's optimized tab
 management will yield great performance benefits in those versions.
 :::
 
-### Snapshot Query without Index Acceleration
+### Snapshot Query with Index Acceleration
 
 In this section we will go over the various indexes and how they help in data skipping in Hudi. We will first create
 a hudi table without any index.

website/releases/release-1.0.0-beta2.md

+1-1
@@ -1,6 +1,6 @@
 ---
 title: "Release 1.0.0-beta2"
-sidebar_position: 1
+sidebar_position: 3
 layout: releases
 toc: true
 ---

website/releases/release-1.0.0.md

+172
@@ -0,0 +1,172 @@
---
title: "Release 1.0.0"
sidebar_position: 1
layout: releases
toc: true
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

## [Release 1.0.0](https://github.com/apache/hudi/releases/tag/release-1.0.0) ([docs](/docs/quick-start-guide))

Apache Hudi 1.0.0 is a major milestone release of Apache Hudi. This release contains significant format changes and
exciting new features, as described below.

## Migration Guide

We encourage users to try the **1.0.0** features on new tables first. The 1.0 general availability (GA) release
supports automatic table upgrades from 0.x versions while also ensuring full backward compatibility when reading 0.x
Hudi tables using 1.0, ensuring a seamless migration experience.

This release comes with **backward compatible writes**, i.e. 1.0.0 can write in both the table version 8 (latest) and older
table version 6 (corresponding to 0.14 & above) formats. Automatic upgrades for tables from 0.x versions are fully
supported, minimizing migration challenges. Until all the readers are upgraded, users can still deploy 1.0.0 binaries
for the writers and leverage backward compatible writes to continue writing the tables in the older format. Once the readers
are fully upgraded, users can switch to the latest format through a config change. We recommend users follow the upgrade
steps mentioned in the [migration guide](/docs/deployment#upgrading-to-100) to ensure a smooth transition.

:::caution
Most things are handled seamlessly by the auto-upgrade process, but there are some limitations. Please read through the
limitations of the upgrade/downgrade process before proceeding to migrate. Please check the [migration guide](/docs/deployment#upgrading-to-100)
and [RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers) for more details.
:::
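
As a concrete illustration of the backward compatible writes described above, the table version produced by a 1.0.0 writer can be pinned while readers are still being upgraded. This is a hedged sketch assuming Spark SQL and a `hoodie.write.table.version` write config; verify the exact key against the migration guide and RFC-78.

```sql
-- Hedged sketch: keep 1.0.0 writers producing the older table version 6 format
-- until all readers are upgraded, then switch to table version 8.
SET hoodie.write.table.version=6;
-- ... later, once every reader runs 1.x:
SET hoodie.write.table.version=8;
```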

## Bundle Updates

- The same bundles supported in the [0.15.0 release](release-0.15.0#new-spark-bundles) are still supported.
- New Flink bundles to support Flink 1.19 and Flink 1.20.
- In this release, we have deprecated support for Spark 3.2 and lower Spark 3 versions.

## Highlights

### Format changes

The main epic covering all the format changes is [HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242), which is also
covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The following are the main highlights with respect to format changes:

#### Timeline

- The active/archived timeline dichotomy has been replaced with a more scalable LSM-tree-based
  timeline. The timeline layout is now more organized and efficient for time-range queries and scaling to infinite history.
- As a result, the timeline layout has changed and has moved to the `.hoodie/timeline` directory under the base
  path of the table.
- There are changes to the timeline instant files as well:
  - All commit metadata is serialized to Avro, allowing for future compatibility and uniformity in metadata across all
    actions.
  - Instant files for completed actions now include a completion time.
  - The action for a pending clustering instant is now named `clustering` to make it distinct from other
    `replacecommit` actions.

#### Log File Format

- In addition to the keys in the log file header, we also store record positions. Refer to the
  latest [spec](/tech-specs-1point0#log-format) for more details. This allows us to do position-based merging (apart
  from key-based merging) and skip pages based on positions.
- Log file names now carry the deltacommit instant time instead of the base commit instant time.
- The new log file format also enables fast partial updates with low storage overhead.

### Compatibility with Old Formats

- **Backward compatible writes:** Hudi 1.0 writers support writing in both the table version 8 (latest) and older table version 6 (corresponding to 0.14 & above) formats, ensuring seamless
  integration with existing setups.
- **Automatic upgrades:** Tables from 0.x versions are fully supported for automatic upgrade, minimizing migration challenges. We also recommend first migrating to 0.14 or
  above if you have advanced setups with multiple readers/writers/table services.

### Concurrency Control

1.0.0 introduces **Non-Blocking Concurrency Control (NBCC)**, enabling multi-stream concurrent ingestion without
conflict. This is a general-purpose concurrency model aimed at stream processing and high-contention/frequent-writing
scenarios. In contrast to Optimistic Concurrency Control, where writers abort the transaction if there is a hint of
contention, this innovation allows multiple streaming writes to the same Hudi table without any overhead of conflict
resolution, while keeping the semantics of event-time ordering found in streaming systems, along with asynchronous table
services such as compaction, archiving and cleaning.

To learn more about NBCC, refer to [this blog](/blog/2024/12/06/non-blocking-concurrency-control), which also includes a demo with Flink writers.
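
To make the NBCC setup concrete, below is a minimal, hedged Flink SQL sketch of a MOR sink table that two streaming pipelines could write to concurrently. The table schema, path, and the option keys/values (bucket index, `hoodie.write.concurrency.mode`) are assumptions drawn from the NBCC blog and docs rather than the definitive configuration.

```sql
-- Hedged sketch: a Flink SQL sink table set up for non-blocking concurrency control.
-- Option keys/values are assumptions; see the NBCC blog/docs for the exact configuration.
CREATE TABLE hudi_sink (
  uuid STRING PRIMARY KEY NOT ENFORCED,
  name STRING,
  price DOUBLE,
  ts BIGINT
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_sink',
  'table.type' = 'MERGE_ON_READ',
  'index.type' = 'BUCKET',   -- bucket index assumed; check the NBCC docs for supported index types
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'
);
-- Each concurrent pipeline then simply inserts into the same table, e.g.:
-- INSERT INTO hudi_sink SELECT ... FROM source_a;
-- INSERT INTO hudi_sink SELECT ... FROM source_b;
```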

### New Indices

1.0.0 introduces new indices to the multi-modal indexing subsystem of Apache Hudi. These indices are designed to improve
query performance through partition pruning and further data skipping.

#### Secondary Index

The **secondary index** allows users to create indexes on columns that are not part of record key columns in Hudi
tables. It can be used to speed up queries with predicates on columns other than record key columns.

#### Partition Stats Index

The **partition stats index** aggregates statistics at the partition level for the columns for which it is enabled. This
helps in efficient partition pruning even for non-partition fields.

#### Expression Index

The **expression index** enables efficient queries on columns derived from expressions. It can collect stats on columns
derived from expressions without materializing them, and can be used to speed up queries with filters containing such
expressions.

To learn more about these indices, refer to the [SQL queries](/docs/sql_queries#snapshot-query-with-index-acceleration) docs.
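
A hedged sketch tying these indices together: the `hoodie.metadata.index.partition.stats.enable` key and the `from_unixtime` expression index mirror the SQL DDL changes in this release, while the table, columns, and query are purely illustrative.

```sql
-- Hedged sketch: enable partition/column stats at table creation, add an expression index,
-- and issue a query that can benefit from pruning and data skipping (names are illustrative).
CREATE TABLE hudi_table (
  uuid STRING,
  city STRING,
  ts BIGINT
) USING hudi
PARTITIONED BY (city)
TBLPROPERTIES (
  primaryKey = 'uuid',
  'hoodie.metadata.index.partition.stats.enable' = 'true',
  'hoodie.metadata.index.column.stats.enable' = 'true'
);

CREATE INDEX ts_datestr ON hudi_table
USING column_stats(ts)
OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');

-- Filters on the derived date and on `city` can now leverage the indexes for skipping.
SELECT uuid, city, ts FROM hudi_table
WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2024-12-01' AND city = 'san_francisco';
```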

### Partial Updates

1.0.0 extends support for partial updates to Merge-on-Read tables, which allows users to update only a subset of columns
in a record. This feature is useful when users want to update only a few columns in a record without rewriting the
entire record.

To learn more about partial updates, refer to the [SQL DML](/docs/sql_dml#merge-into-partial-update) docs.

### Multiple Base File Formats in a single table

- Support for multiple base file formats (e.g., **Parquet**, **ORC**, **HFile**) within a single Hudi table, allowing
  tailored formats for specific use cases like indexing and ML applications.
- It is also useful when users want to switch from one file
  format to another, e.g. from ORC to Parquet, without rewriting the whole table.
- **Configuration:** Enable with `hoodie.table.multiple.base.file.formats.enable`.

To learn more about the format changes, refer to the [Hudi 1.0 tech specification](/tech-specs-1point0).
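
The configuration above can be set when creating the table. Here is a minimal, hedged Spark SQL sketch; the table schema is illustrative, and the `hoodie.table.base.file.format` property shown for the default format is an assumption.

```sql
-- Hedged sketch: create a table with multiple base file formats enabled
-- (`hoodie.table.multiple.base.file.formats.enable` comes from the note above;
--  the base-file-format property name/value are assumptions).
CREATE TABLE hudi_multi_format_table (
  uuid STRING,
  name STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'uuid',
  'hoodie.table.multiple.base.file.formats.enable' = 'true',
  'hoodie.table.base.file.format' = 'PARQUET'   -- default format; later writes may target another format
);
```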

### API Changes

1.0.0 introduces several API changes, including:

#### Record Merger API

The `HoodieRecordPayload` interface is deprecated in favor of the new `HoodieRecordMerger` interface. Record merger is a
generic interface that allows users to define custom logic for merging base file and log file records. This release
comes with a few out-of-the-box merge modes, which define how the base and log files are ordered in a file slice and
further how different records with the same record key within that file slice are merged consistently to produce the
same deterministic results for snapshot queries, writers and table services. Specifically, there are three merge modes
supported as a table-level configuration:

- `COMMIT_TIME_ORDERING`: Merging simply picks the record belonging to the latest write (commit time) as the merged
  result.
- `EVENT_TIME_ORDERING`: Merging picks the record with the highest value on a user-specified ordering or precombine
  field as the merged result.
- `CUSTOM`: Users can provide a custom merger implementation to have better control over the merge logic.

:::note
Going forward, we recommend that users migrate to the record merger APIs rather than writing new payload implementations.
:::
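
As a rough illustration of choosing a merge mode as a table-level configuration, the sketch below assumes Spark SQL, a `hoodie.record.merge.mode` property, and a `preCombineField` ordering field; treat it as a sketch rather than the definitive DDL.

```sql
-- Hedged sketch: pick EVENT_TIME_ORDERING as the table-level merge mode
-- (the `hoodie.record.merge.mode` property name is an assumption; verify for your version).
CREATE TABLE hudi_mor_table (
  uuid STRING,
  name STRING,
  price DOUBLE,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'uuid',
  preCombineField = 'ts',
  'hoodie.record.merge.mode' = 'EVENT_TIME_ORDERING'
);
```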

#### Positional Merging with Filegroup Reader

- **Position-Based Merging:** Offers an alternative to key-based merging, allowing for page skipping based on record
  positions. Enabled by default for Spark and Hive.
- **Configuration:** Activate positional merging using `hoodie.merge.use.record.positions=true`.

The new reader has shown impressive performance gains for **partial updates** with key-based merging. For a MOR table of
size 1TB with 100 partitions and 80% random updates in subsequent commits, the new reader is **5.7x faster** for
snapshot queries with **70x reduced write amplification**.

### Flink Enhancements

- **Lookup Joins:** Flink now supports lookup joins, enabling table enrichment with external data sources.
- **Partition Stats Index Support:** As mentioned above, partition stats support is now available for Flink, bringing
  efficient partition pruning to streaming workloads.
- **Non-Blocking Concurrency Control:** NBCC is now available for Flink streaming writers, allowing for multi-stream
  concurrent ingestion without conflict.
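
For the lookup-join enhancement, a hedged Flink SQL sketch is shown below. It assumes a Hudi table `dim_users` on the lookup (dimension) side and a processing-time attribute `proc_time` on the fact stream; the standard Flink `FOR SYSTEM_TIME AS OF` temporal-join syntax is used, and the Hudi connector options are omitted for brevity.

```sql
-- Hedged sketch: enrich a fact stream with a Hudi table via a Flink lookup join
-- (table names and the `proc_time` attribute are illustrative).
SELECT
  o.order_id,
  o.user_id,
  u.name,
  u.city
FROM orders AS o
JOIN dim_users FOR SYSTEM_TIME AS OF o.proc_time AS u
ON o.user_id = u.user_id;
```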

## Call to Action

The 1.0.0 GA release is the culmination of extensive development, testing, and feedback. We invite you to upgrade and
experience the new features and enhancements.
