[docs] add docs to integrate Fluss + Iceberg via Flink with AWS Glue and Hive #3424
qzyu999 wants to merge 2 commits
…Data Lake Catalogs section which includes Lakekeeper
Pull request overview
Adds new documentation pages describing how to integrate Fluss tiering to Apache Iceberg when using AWS Glue or Hive Metastore as the Iceberg catalog, and cross-links these new guides from the main Iceberg integration doc.
Changes:
- Add an AWS Glue catalog integration guide (IAM policy, required JARs, `server.yaml` config, tiering job launch, Athena verification).
- Add a Hive Metastore catalog integration guide (HMS connection, required JARs/Hadoop classpath, `server.yaml` config, tiering job launch, Spark verification).
- Update the Iceberg format guide to link to the new catalog-specific pages for `hive` and `glue` (and to Lakekeeper for `rest`).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md | Adds catalog-specific cross-links for supported Iceberg catalog types. |
| website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md | New end-to-end guide for using AWS Glue Data Catalog with Iceberg tiering. |
| website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/hive.md | New end-to-end guide for using Hive Metastore with Iceberg tiering. |
"arn:aws:glue:<region>:<account-id>:catalog",
"arn:aws:glue:<region>:<account-id>:database/*",
"arn:aws:glue:<region>:<account-id>:table/*"
]
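The placeholder ARNs in the policy snippet above can be filled in programmatically. The following is a minimal sketch, not part of the PR: the helper name `glue_resource_arns` and the `partition` parameter are hypothetical, and only the three ARN patterns themselves come from the reviewed policy.

```python
def glue_resource_arns(region: str, account_id: str, partition: str = "aws") -> list[str]:
    """Build the three Glue resource ARNs shown in the policy template.

    Hypothetical helper for illustration; the ARN suffixes (catalog,
    database/*, table/*) mirror the reviewed snippet.
    """
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    base = f"arn:{partition}:glue:{region}:{account_id}"
    return [
        f"{base}:catalog",
        f"{base}:database/*",
        f"{base}:table/*",
    ]
```

Calling `glue_resource_arns("us-east-1", "123456789012")` yields a list that can be dropped into the `Resource` array of the policy document.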
:::note
If your Hive warehouse is located on cloud object storage (like Amazon S3 or Aliyun OSS), set `datalake.iceberg.warehouse` to the corresponding cloud URI (e.g., `s3://<your-bucket>/warehouse`) and configure the required filesystem integration. See [AWS Glue](glue.md) for AWS credentials setup.
:::
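As a rough illustration of the note above, a deployment script could inspect the `datalake.iceberg.warehouse` value and flag when a cloud filesystem integration is likely needed. This is a sketch under stated assumptions: the function name and the set of schemes treated as "cloud" (`s3`, `s3a`, `oss`) are hypothetical choices for the example, not something Fluss defines.

```python
from urllib.parse import urlparse

# Assumption: URI schemes treated as cloud object storage for this example.
CLOUD_SCHEMES = {"s3", "s3a", "oss"}


def warehouse_needs_cloud_fs(warehouse: str) -> bool:
    """Return True if the warehouse URI points at cloud object storage.

    Plain paths (no scheme) and schemes outside CLOUD_SCHEMES, such as
    hdfs, return False.
    """
    scheme = urlparse(warehouse).scheme.lower()
    return scheme in CLOUD_SCHEMES
```

For instance, `warehouse_needs_cloud_fs("s3://my-bucket/warehouse")` is true, while an `hdfs://` URI or a local path is not flagged.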
1. Fluss manages Iceberg databases and tables via the HMS Thrift API.
2. The [tiering service](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service) writes Parquet data files to HDFS (or S3/OSS) and commits table snapshots via the Hive Metastore client.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Hi @luoyuxia, as mentioned in the PR description, no, it isn't currently:
I spoke with @leekeiabstraction about this, and he mentioned that I should get it tested E2E, as I wasn't sure whether other developers had done these steps themselves. I am currently working on the AWS Glue steps and will update accordingly. Thank you for your patience!
Purpose
Linked issue: close #2616
This pull request introduces integration guides for using AWS Glue and Hive Metastore catalogs when tiering Fluss streaming data to Apache Iceberg. This completes the Iceberg Data Lake Catalogs documentation suite under `docs/streaming-lakehouse/integrate-data-lakes/catalogs/`.
Brief change log
- `website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md`: Documents AWS IAM policy template, required catalog runtime JAR dependencies, `server.yaml` cluster configurations, Flink tiering service commands, and Amazon Athena query verification.
- `website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/hive.md`: Documents Hive Metastore Thrift connection options, required Hadoop client and Hive runtime classpath dependencies, `HADOOP_CLASSPATH` configuration, Flink tiering commands, and Spark SQL query verification.
- `website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md`: Added catalog-specific cross-links for `hive` (linking to Hive Metastore), `glue` (linking to AWS Glue), and `rest` (linking to Lakekeeper).

Note: Changes are based on the existing `lakekeeper.md` as a template, and references were based on existing code and online/offline documentation. The actual AWS Glue/HMS implementations have not yet been tested by the developer.
Tests
- `npm run build` to verify the page output and ensure there are no broken links (meeting Docusaurus build validation).
- `http://localhost:3000/docs/next/streaming-lakehouse/integrate-data-lakes/formats/iceberg/`.
API and Format
This is a documentation-only change. It does not affect any public API or storage formats.
Documentation
This pull request introduces new documentation guides under the Docusaurus `website` subfolder. No changes were made to code-level Javadocs.