Commit 7529570

[DOCS] Add release highlights for 1.0.0 release (#12475)
* Add release highlights for 1.0.0 release
* Code Review comments for release-1.0.0.md
* Fix links and address review comments wrt upgrading
* Add limitations

Co-authored-by: vinoth chandar <[email protected]>
1 parent 88e527c commit 7529570

7 files changed (+230, -23 lines)


website/docs/concurrency_control.md

+5
@@ -214,6 +214,11 @@ currently available for preview in version 1.0.0-beta only with the caveat that
 between clustering and ingestion. It works for compaction and ingestion, and we can see an example of that with Flink
 writers [here](sql_dml#non-blocking-concurrency-control-experimental).
 
+:::note
+`NON_BLOCKING_CONCURRENCY_CONTROL` between an ingestion writer and a table service writer is not yet supported for clustering.
+Please use `OPTIMISTIC_CONCURRENCY_CONTROL` for clustering.
+:::
+
 ## Early conflict Detection
 
 Multi-writing using OCC allows multiple writers to concurrently write and atomically commit to the Hudi table if there is no overlapping data file to be written, guaranteeing data consistency, integrity and correctness. Prior to the 0.13.0 release, as the OCC (optimistic concurrency control) name suggests, each writer optimistically proceeds with ingestion and, just before committing, runs a conflict resolution flow to detect overlapping writes and abort one if need be. This can result in a lot of wasted compute, since the aborted commit has to retry from the beginning. With 0.13.0, Hudi introduced early conflict detection, leveraging markers to detect conflicts eagerly and abort early in the write lifecycle instead of at the end. For large-scale deployments, this can avoid wasting a lot of compute resources when there are overlapping concurrent writers.
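
Since the note above directs clustering to `OPTIMISTIC_CONCURRENCY_CONTROL`, a clustering job running alongside NBCC ingestion needs OCC plus a lock provider. Below is a minimal, hedged Spark SQL sketch; the concurrency-mode and ZooKeeper lock-provider keys follow Hudi's concurrency control docs, while the host and paths are placeholders. Verify against your Hudi version and deployment.

```sql
-- Hedged sketch: run the clustering job under OPTIMISTIC_CONCURRENCY_CONTROL
-- (config keys/values assumed from the concurrency control docs; verify for your version)
SET hoodie.write.concurrency.mode=optimistic_concurrency_control;
SET hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider;
SET hoodie.write.lock.zookeeper.url=zk-host;        -- placeholder ZooKeeper host
SET hoodie.write.lock.zookeeper.port=2181;
SET hoodie.write.lock.zookeeper.lock_key=my_table_lock;
SET hoodie.write.lock.zookeeper.base_path=/hudi/locks;
```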

website/docs/deployment.md

+26
@@ -165,6 +165,32 @@ As general guidelines,
 
 Note that release notes can override this information with specific instructions, applicable on a case-by-case basis.
 
+### Upgrading to 1.0.0
+
+1.0.0 is a major release with significant format changes. To ensure a smooth migration experience, we recommend the
+following steps:
+
+1. Stop any async table services in 0.x completely.
+2. Upgrade writers to 1.x with table version (tv) 6, auto-upgrade and metadata disabled (this won't auto-upgrade anything);
+   0.x readers will continue to work; 1.x writers can also act as readers and will continue to read tv=6 tables.
+   a. Set `hoodie.write.auto.upgrade` to false.
+   b. Set `hoodie.metadata.enable` to false.
+3. Upgrade table services to 1.x with tv=6, and resume operations.
+4. Upgrade all remaining readers to 1.x, with tv=6.
+5. Redeploy writers with tv=8; table services and readers will pick up tv=8 on the fly.
+6. Once all readers and writers are on 1.x, you can enable any new features, including the metadata table, on 1.x tables.
+
+During the upgrade, the metadata table will not be updated and will lag behind the data table. Note that the metadata
+table is updated only once the writer is upgraded to tv=8, so readers should also keep metadata disabled during the
+rolling upgrade until all writers are on tv=8.
+
+:::caution
+Most things are handled seamlessly by the auto-upgrade process, but there are some limitations. Please read through the
+limitations of the upgrade/downgrade process before proceeding to migrate. Please
+check [RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers)
+for more details.
+:::
+
 ## Downgrading
 
 Upgrade is automatic whenever a new Hudi version is used whereas downgrade is a manual step. We need to use the Hudi
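
For step 2 of the rolling upgrade shown above, the two flags called out can be set as session-level Hudi configs before starting the 1.x writers. This is a minimal sketch assuming Spark SQL `SET` syntax; the two keys come straight from the steps above.

```sql
-- Hedged sketch: step-2 writer settings for a rolling upgrade (keys from the steps above)
SET hoodie.write.auto.upgrade=false;   -- keep the table at tv=6 for now
SET hoodie.metadata.enable=false;      -- keep metadata disabled until all writers are on tv=8
```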

website/docs/sql_ddl.md

+17-21
@@ -272,6 +272,7 @@ Both index and column on which the index is created can be qualified with some o
 Please note in order to create secondary index:
 1. The table must have a primary key and merge mode should be [COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
 2. Record index must be enabled. This can be done by setting `hoodie.metadata.record.index.enable=true` and then creating `record_index`. Please note the example below.
+3. Secondary index is not supported for [complex types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
 :::
 
 **Examples**
@@ -334,12 +335,18 @@ date based partitioning, provide same benefits to queries, even if the physical
 CREATE INDEX IF NOT EXISTS ts_datestr ON hudi_table
 USING column_stats(ts)
 OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');
--- Create a expression index on the column `ts` (timestamp in yyyy-MM-dd HH:mm:ss) of the table `hudi_table` using the function `hour`
+-- Create an expression index on the column `ts` (timestamp in yyyy-MM-dd HH:mm:ss) of the table `hudi_table` using the function `hour`
 CREATE INDEX ts_hour ON hudi_table
 USING column_stats(ts)
 options(expr='hour');
 ```
 
+:::note
+1. Expression index can only be created with the Spark engine using SQL. It is not yet supported with the Spark DataSource API.
+2. Expression index is not yet supported for [complex types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+3. Expression index is supported for unary and certain binary expressions. Please check the [SQL DDL docs](sql_ddl#create-expression-index) for more details.
+:::
+
 The `expr` option is required for creating expression index, and it should be a valid Spark SQL function. Please check the syntax
 for the above functions in the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-ref-functions.html) and provide the options accordingly. For example,
 the `format` option is required for `from_unixtime` function.
@@ -434,18 +441,21 @@ and execution.
 
 To enable partition stats index, simply set `hoodie.metadata.index.partition.stats.enable = 'true'` in create table options.
 
+:::note
+1. The `column_stats` index must be enabled for the `partition_stats` index; the two go hand in hand.
+2. The `partition_stats` index is not created automatically for all columns. Users must specify the list of columns for which they want to create the partition stats index.
+3. The `column_stats` and `partition_stats` indexes are not yet supported for [complex types](https://avro.apache.org/docs/1.11.1/specification/#complex-types).
+:::
+
 ### Create Secondary Index
 
 Secondary indexes are record level indexes built on any column in the table. It supports multiple records having the same
 secondary column value efficiently and is built on top of the existing record level index built on the table's record key.
 Secondary indexes are hash based indexes that offer horizontally scalable write performance by splitting key space into shards
 by hashing, as well as fast lookups by employing row-based file formats.
 
-:::note
-Please note in order to create secondary index:
-1. The table must have a primary key and merge mode should be [COMMIT_TIME_ORDERING](/docs/next/record_merger#commit_time_ordering).
-2. Record index must be enabled. This can be done by setting `hoodie.metadata.record.index.enable=true` and then creating `record_index`. Please note the example below.
-:::
+Let us now look at an example of creating a table with multiple indexes and how queries leverage the indexes for both
+partition pruning and data skipping.
 
 ```sql
 DROP TABLE IF EXISTS hudi_table;
@@ -513,24 +523,10 @@ Bloom filter indexes store a bloom filter per file, on the column or column expr
 effective in skipping files that don't contain a high cardinality column value e.g. uuids.
 
 ```sql
-CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING bloom_filters(driver) OPTIONS(expr='identity');
+-- Create a bloom filter index on the column derived from expression `lower(rider)` of the table `hudi_table`
 CREATE INDEX idx_bloom_rider ON hudi_indexed_table USING bloom_filters(rider) OPTIONS(expr='lower');
 ```
 
-
-### Limitations
-
-- Unlike column stats, partition stats index is not created automatically for all columns. Users must specify list of
-columns for which they want to create partition stats index.
-- Predicate on internal meta fields such as `_hoodie_record_key` or `_hoodie_partition_path` cannot be used for data
-skipping. Queries with such predicates cannot leverage the indexes.
-- Secondary index is not supported for nested fields.
-- Secondary index can be created only if record index is available in the table
-- Secondary index can only be used for tables using OverwriteWithLatestAvroPayload payload or COMMIT_TIME_ORDERING merge mode
-- Column stats Expression Index can not be created using `identity` expression with SQL. Users can leverage column stat index using Datasource instead.
-- Index update can fail with schema evolution.
-- Only one index can be created at a time using [async indexer](metadata_indexing).
-
 ### Setting Hudi configs
 
 There are different ways you can pass the configs for a given hudi table.
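
To make the secondary-index prerequisites above concrete, here is a hedged sketch. The table and column names (`hudi_indexed_table`, `uuid`, `rider`) are illustrative, and the exact index-creation DDL should be checked against the Create Secondary Index section; only the `hoodie.metadata.record.index.enable` key is taken directly from the note above.

```sql
-- Hedged sketch of the secondary-index prerequisites described above
SET hoodie.metadata.record.index.enable=true;           -- enable record-level index in the metadata table
CREATE INDEX record_index ON hudi_indexed_table (uuid);  -- record index on the record key column (illustrative)
CREATE INDEX idx_rider ON hudi_indexed_table (rider);    -- secondary index on a non-key column (illustrative)
```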

website/docs/sql_dml.md

+8
@@ -212,6 +212,14 @@ SELECT id, name, price, _ts, description FROM tableName;
 
 Notice, instead of `UPDATE SET *`, we are updating only the `price` and `_ts` columns.
 
+:::note
+Partial update is not yet supported in the following cases:
+1. When the target table is a bootstrapped table.
+2. When virtual keys are enabled.
+3. When schema on read is enabled.
+4. When there is an enum field in the source data.
+:::
+
 ### Delete From
 
 You can remove data from a Hudi table using the `DELETE FROM` statement.
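
For reference alongside the partial-update note above, here is a minimal MERGE INTO sketch that updates only the `price` and `_ts` columns; the source relation `updates_source` is assumed purely for illustration.

```sql
-- Hedged sketch: partial update via MERGE INTO, touching only `price` and `_ts`
MERGE INTO tableName AS t
USING (SELECT id, price, _ts FROM updates_source) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.price = s.price, t._ts = s._ts;
```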

website/docs/sql_queries.md

+1-1
@@ -38,7 +38,7 @@ using path filters. We expect that native integration with Spark's optimized tab
 management will yield great performance benefits in those versions.
 :::
 
-### Snapshot Query without Index Acceleration
+### Snapshot Query with Index Acceleration
 
 In this section we will go over the various indexes and how they help in data skipping in Hudi. We will first create
 a hudi table without any index.

website/releases/release-1.0.0-beta2.md

+1-1
@@ -1,6 +1,6 @@
 ---
 title: "Release 1.0.0-beta2"
-sidebar_position: 1
+sidebar_position: 3
 layout: releases
 toc: true
 ---

website/releases/release-1.0.0.md

+172
@@ -0,0 +1,172 @@
---
title: "Release 1.0.0"
sidebar_position: 1
layout: releases
toc: true
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

## [Release 1.0.0](https://github.com/apache/hudi/releases/tag/release-1.0.0) ([docs](/docs/quick-start-guide))

Apache Hudi 1.0.0 is a major milestone release of Apache Hudi. This release contains significant format changes and
exciting new features, as described below.

## Migration Guide

We encourage users to try the **1.0.0** features on new tables first. The 1.0 general availability (GA) release
supports automatic table upgrades from 0.x versions while also ensuring full backward compatibility when reading 0.x
Hudi tables using 1.0, ensuring a seamless migration experience.

This release comes with **backward compatible writes**, i.e. 1.0.0 can write in both the table version 8 (latest) and older
table version 6 (corresponding to 0.14 & above) formats. Automatic upgrades for tables from 0.x versions are fully
supported, minimizing migration challenges. Until all the readers are upgraded, users can still deploy 1.0.0 binaries
for the writers and leverage backward compatible writes to continue writing the tables in the older format. Once the readers
are fully upgraded, users can switch to the latest format through a config change. We recommend users follow the upgrade
steps mentioned in the [migration guide](/docs/deployment#upgrading-to-100) to ensure a smooth transition.

:::caution
Most things are handled seamlessly by the auto-upgrade process, but there are some limitations. Please read through the
limitations of the upgrade/downgrade process before proceeding to migrate. Please check the [migration guide](/docs/deployment#upgrading-to-100)
and [RFC-78](https://github.com/apache/hudi/blob/master/rfc/rfc-78/rfc-78.md#support-matrix-for-different-readers-and-writers) for more details.
:::
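
As a concrete illustration of the backward compatible writes described above, the table version produced by a 1.0.0 writer can be pinned while readers are still being upgraded. This is a hedged sketch assuming Spark SQL and a `hoodie.write.table.version` write config; verify the exact key against the migration guide and RFC-78.

```sql
-- Hedged sketch: keep 1.0.0 writers producing the older table version 6 format
-- until all readers are upgraded, then switch to table version 8.
SET hoodie.write.table.version=6;
-- ... later, once every reader runs 1.x:
SET hoodie.write.table.version=8;
```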

## Bundle Updates

- The same bundles supported in the [0.15.0 release](release-0.15.0#new-spark-bundles) are still supported.
- New Flink bundles to support Flink 1.19 and Flink 1.20.
- In this release, we have deprecated support for Spark 3.2 and lower Spark 3 versions.

## Highlights

### Format changes

The main epic covering all the format changes is [HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242), which is also
covered in the [Hudi 1.0 tech specification](/tech-specs-1point0). The following are the main highlights with respect to format changes:

#### Timeline

- The active/archived timeline dichotomy has been replaced with a more scalable LSM-tree-based
  timeline. The timeline layout is now more organized and efficient for time-range queries and scaling to infinite history.
- As a result, the timeline layout has changed and has moved to the `.hoodie/timeline` directory under the base
  path of the table.
- There are changes to the timeline instant files as well:
  - All commit metadata is serialized to Avro, allowing for future compatibility and uniformity in metadata across all
    actions.
  - Instant files for completed actions now include a completion time.
  - The action for a pending clustering instant is now named `clustering` to make it distinct from other
    `replacecommit` actions.

#### Log File Format

- In addition to the keys in the log file header, we also store record positions. Refer to the
  latest [spec](/tech-specs-1point0#log-format) for more details. This allows us to do position-based merging (apart
  from key-based merging) and skip pages based on positions.
- Log file names now carry the deltacommit instant time instead of the base commit instant time.
- The new log file format also enables fast partial updates with low storage overhead.

### Compatibility with Old Formats

- **Backward compatible writes:** Hudi 1.0 writers support writing in both the table version 8 (latest) and older table version 6 (corresponding to 0.14 & above) formats, ensuring seamless
  integration with existing setups.
- **Automatic upgrades:** Tables from 0.x versions are fully supported for automatic upgrade, minimizing migration challenges. We also recommend first migrating to 0.14 or
  above if you have advanced setups with multiple readers/writers/table services.

### Concurrency Control

1.0.0 introduces **Non-Blocking Concurrency Control (NBCC)**, enabling multi-stream concurrent ingestion without
conflict. This is a general-purpose concurrency model aimed at stream processing and high-contention/frequent-writing
scenarios. In contrast to Optimistic Concurrency Control, where writers abort the transaction if there is a hint of
contention, this innovation allows multiple streaming writes to the same Hudi table without any overhead of conflict
resolution, while keeping the semantics of event-time ordering found in streaming systems, along with asynchronous table
services such as compaction, archiving and cleaning.

To learn more about NBCC, refer to [this blog](/blog/2024/12/06/non-blocking-concurrency-control), which also includes a demo with Flink writers.
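
To make the NBCC setup concrete, below is a minimal, hedged Flink SQL sketch of a MOR sink table that two streaming pipelines could write to concurrently. The table schema, path, and the option keys/values (bucket index, `hoodie.write.concurrency.mode`) are assumptions drawn from the NBCC blog and docs rather than the definitive configuration.

```sql
-- Hedged sketch: a Flink SQL sink table set up for non-blocking concurrency control.
-- Option keys/values are assumptions; see the NBCC blog/docs for the exact configuration.
CREATE TABLE hudi_sink (
  uuid STRING PRIMARY KEY NOT ENFORCED,
  name STRING,
  price DOUBLE,
  ts BIGINT
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_sink',
  'table.type' = 'MERGE_ON_READ',
  'index.type' = 'BUCKET',   -- bucket index assumed; check the NBCC docs for supported index types
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'
);
-- Each concurrent pipeline then simply inserts into the same table, e.g.:
-- INSERT INTO hudi_sink SELECT ... FROM source_a;
-- INSERT INTO hudi_sink SELECT ... FROM source_b;
```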

### New Indices

1.0.0 introduces new indices to the multi-modal indexing subsystem of Apache Hudi. These indices are designed to improve
query performance through partition pruning and further data skipping.

#### Secondary Index

The **secondary index** allows users to create indexes on columns that are not part of record key columns in Hudi
tables. It can be used to speed up queries with predicates on columns other than record key columns.

#### Partition Stats Index

The **partition stats index** aggregates statistics at the partition level for the columns for which it is enabled. This
helps in efficient partition pruning even for non-partition fields.

#### Expression Index

The **expression index** enables efficient queries on columns derived from expressions. It can collect stats on columns
derived from expressions without materializing them, and can be used to speed up queries with filters containing such
expressions.

To learn more about these indices, refer to the [SQL queries](/docs/sql_queries#snapshot-query-with-index-acceleration) docs.
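
A hedged sketch tying these indices together: the `hoodie.metadata.index.partition.stats.enable` key and the `from_unixtime` expression index mirror the SQL DDL changes in this release, while the table, columns, and query are purely illustrative.

```sql
-- Hedged sketch: enable partition/column stats at table creation, add an expression index,
-- and issue a query that can benefit from pruning and data skipping (names are illustrative).
CREATE TABLE hudi_table (
  uuid STRING,
  city STRING,
  ts BIGINT
) USING hudi
PARTITIONED BY (city)
TBLPROPERTIES (
  primaryKey = 'uuid',
  'hoodie.metadata.index.partition.stats.enable' = 'true',
  'hoodie.metadata.index.column.stats.enable' = 'true'
);

CREATE INDEX ts_datestr ON hudi_table
USING column_stats(ts)
OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');

-- Filters on the derived date and on `city` can now leverage the indexes for skipping.
SELECT uuid, city, ts FROM hudi_table
WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2024-12-01' AND city = 'san_francisco';
```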

### Partial Updates

1.0.0 extends support for partial updates to Merge-on-Read tables, which allows users to update only a subset of columns
in a record. This feature is useful when users want to update only a few columns in a record without rewriting the
entire record.

To learn more about partial updates, refer to the [SQL DML](/docs/sql_dml#merge-into-partial-update) docs.

### Multiple Base File Formats in a single table

- Support for multiple base file formats (e.g., **Parquet**, **ORC**, **HFile**) within a single Hudi table, allowing
  tailored formats for specific use cases like indexing and ML applications.
- It is also useful when users want to switch from one file
  format to another, e.g. from ORC to Parquet, without rewriting the whole table.
- **Configuration:** Enable with `hoodie.table.multiple.base.file.formats.enable`.

To learn more about the format changes, refer to the [Hudi 1.0 tech specification](/tech-specs-1point0).
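
The configuration above can be set when creating the table. Here is a minimal, hedged Spark SQL sketch; the table schema is illustrative, and the `hoodie.table.base.file.format` property shown for the default format is an assumption.

```sql
-- Hedged sketch: create a table with multiple base file formats enabled
-- (`hoodie.table.multiple.base.file.formats.enable` comes from the note above;
--  the base-file-format property name/value are assumptions).
CREATE TABLE hudi_multi_format_table (
  uuid STRING,
  name STRING,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'uuid',
  'hoodie.table.multiple.base.file.formats.enable' = 'true',
  'hoodie.table.base.file.format' = 'PARQUET'   -- default format; later writes may target another format
);
```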

### API Changes

1.0.0 introduces several API changes, including:

#### Record Merger API

The `HoodieRecordPayload` interface is deprecated in favor of the new `HoodieRecordMerger` interface. Record merger is a
generic interface that allows users to define custom logic for merging base file and log file records. This release
comes with a few out-of-the-box merge modes, which define how the base and log files are ordered in a file slice and
further how different records with the same record key within that file slice are merged consistently to produce the
same deterministic results for snapshot queries, writers and table services. Specifically, there are three merge modes
supported as a table-level configuration:

- `COMMIT_TIME_ORDERING`: Merging simply picks the record belonging to the latest write (commit time) as the merged
  result.
- `EVENT_TIME_ORDERING`: Merging picks the record with the highest value on a user-specified ordering or precombine
  field as the merged result.
- `CUSTOM`: Users can provide a custom merger implementation to have better control over the merge logic.

:::note
Going forward, we recommend that users migrate to the record merger APIs rather than writing new payload implementations.
:::
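
As a rough illustration of choosing a merge mode as a table-level configuration, the sketch below assumes Spark SQL, a `hoodie.record.merge.mode` property, and a `preCombineField` ordering field; treat it as a sketch rather than the definitive DDL.

```sql
-- Hedged sketch: pick EVENT_TIME_ORDERING as the table-level merge mode
-- (the `hoodie.record.merge.mode` property name is an assumption; verify for your version).
CREATE TABLE hudi_mor_table (
  uuid STRING,
  name STRING,
  price DOUBLE,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'uuid',
  preCombineField = 'ts',
  'hoodie.record.merge.mode' = 'EVENT_TIME_ORDERING'
);
```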

#### Positional Merging with Filegroup Reader

- **Position-Based Merging:** Offers an alternative to key-based merging, allowing for page skipping based on record
  positions. Enabled by default for Spark and Hive.
- **Configuration:** Activate positional merging using `hoodie.merge.use.record.positions=true`.

The new reader has shown impressive performance gains for **partial updates** with key-based merging. For a MOR table of
size 1TB with 100 partitions and 80% random updates in subsequent commits, the new reader is **5.7x faster** for
snapshot queries with **70x reduced write amplification**.

### Flink Enhancements

- **Lookup Joins:** Flink now supports lookup joins, enabling table enrichment with external data sources.
- **Partition Stats Index Support:** As mentioned above, partition stats support is now available for Flink, bringing
  efficient partition pruning to streaming workloads.
- **Non-Blocking Concurrency Control:** NBCC is now available for Flink streaming writers, allowing for multi-stream
  concurrent ingestion without conflict.
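
For the lookup-join enhancement, a hedged Flink SQL sketch is shown below. It assumes a Hudi table `dim_users` on the lookup (dimension) side and a processing-time attribute `proc_time` on the fact stream; the standard Flink `FOR SYSTEM_TIME AS OF` temporal-join syntax is used, and the Hudi connector options are omitted for brevity.

```sql
-- Hedged sketch: enrich a fact stream with a Hudi table via a Flink lookup join
-- (table names and the `proc_time` attribute are illustrative).
SELECT
  o.order_id,
  o.user_id,
  u.name,
  u.city
FROM orders AS o
JOIN dim_users FOR SYSTEM_TIME AS OF o.proc_time AS u
ON o.user_id = u.user_id;
```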

## Call to Action

The 1.0.0 GA release is the culmination of extensive development, testing, and feedback. We invite you to upgrade and
experience the new features and enhancements.
