[docs] add docs to integrate Fluss + Iceberg via Flink with AWS Glue and Hive #3424

Open
qzyu999 wants to merge 2 commits into apache:main from qzyu999:issue-2616

@qzyu999 qzyu999 commented Jun 3, 2026

Purpose

Linked issue: close #2616

This pull request introduces comprehensive integration guides for using AWS Glue and Hive Metastore catalogs when tiering Fluss streaming data to Apache Iceberg. This completes the Iceberg Data Lake Catalogs documentation suite under docs/streaming-lakehouse/integrate-data-lakes/catalogs/.

Brief change log

  • Added AWS Glue Catalog integration guide (website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md): Documents AWS IAM policy template, required catalog runtime JAR dependencies, server.yaml cluster configurations, Flink tiering service commands, and Amazon Athena query verification.
  • Added Hive Metastore integration guide (website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/hive.md): Documents Hive Metastore Thrift connection options, required Hadoop client and Hive runtime classpath dependencies, HADOOP_CLASSPATH configuration, Flink tiering commands, and Spark SQL query verification.
  • Updated main Iceberg integration guide (website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md): Added catalog-specific cross-links for hive (linking to Hive Metastore), glue (linking to AWS Glue), and rest (linking to Lakekeeper).

Note: Changes are based on the existing lakekeeper.md as a template, and references were based on existing code and online/offline documentation. The actual AWS Glue/HMS implementations have not yet been tested by the developer.

Tests

  • Built the entire documentation site locally using npm run build to verify the page output and ensure there are no broken links (meeting Docusaurus build validation).
  • Verified the rendering, layouts, and link references using the local development server at http://localhost:3000/docs/next/streaming-lakehouse/integrate-data-lakes/formats/iceberg/.

API and Format

This is a documentation-only change. It does not affect any public API or storage formats.

Documentation

This pull request introduces new documentation guides under the Docusaurus website subfolder. No changes were made to code-level Javadocs.

  • Generative AI disclosure:
    • Yes (Antigravity AI Assistant, reviewed by human developer)

…Data Lake Catalogs section which includes Lakekeeper

Copilot AI left a comment

Contributor

Pull request overview

Adds new documentation pages describing how to integrate Fluss tiering to Apache Iceberg when using AWS Glue or Hive Metastore as the Iceberg catalog, and cross-links these new guides from the main Iceberg integration doc.

Changes:

  • Add an AWS Glue catalog integration guide (IAM policy, required JARs, server.yaml config, tiering job launch, Athena verification).
  • Add a Hive Metastore catalog integration guide (HMS connection, required JARs/Hadoop classpath, server.yaml config, tiering job launch, Spark verification).
  • Update the Iceberg format guide to link to the new catalog-specific pages for hive and glue (and to Lakekeeper for rest).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md | Adds catalog-specific cross-links for supported Iceberg catalog types. |
| website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md | New end-to-end guide for using AWS Glue Data Catalog with Iceberg tiering. |
| website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/hive.md | New end-to-end guide for using Hive Metastore with Iceberg tiering. |

Comment on lines +47 to +50
"arn:aws:glue:<region>:<account-id>:catalog",
"arn:aws:glue:<region>:<account-id>:database/*",
"arn:aws:glue:<region>:<account-id>:table/*"
]
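
For reviewers who want to see what the placeholders in the hunk above expand to, here is a minimal Python sketch that fills in `<region>` and `<account-id>`. The function name `glue_resource_arns` and the sample region and account ID are hypothetical and used only for illustration; they do not come from the guide under review.

```python
def glue_resource_arns(region: str, account_id: str) -> list[str]:
    """Expand the three Glue resource ARN patterns quoted in the hunk above."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return [
        f"{prefix}:catalog",
        f"{prefix}:database/*",
        f"{prefix}:table/*",
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute your own.
    for arn in glue_resource_arns("us-east-1", "123456789012"):
        print(arn)
```

The resulting list is what would go in the `Resource` array of the IAM policy statement; narrowing the wildcards to specific databases or tables is left to the reader.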
Comment on lines +72 to +74
:::note
If your Hive warehouse is located on cloud object storage (like Amazon S3 or Aliyun OSS), set `datalake.iceberg.warehouse` to the corresponding cloud URI (e.g., `s3://<your-bucket>/warehouse`) and configure the required filesystem integration. See [AWS Glue](glue.md) for AWS credentials setup.
:::
Comment on lines +18 to +19
1. Fluss manages Iceberg databases and tables via HMS thrift API.
2. The [tiering service](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service) writes parquet data files to HDFS (or S3/OSS) and commits table snapshots via the Hive Metastore client.
Comment thread website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md Outdated

@luoyuxia luoyuxia left a comment

Contributor

@qzyu999 Thanks for the PR. LGTM overall. Just one question: have you verified these catalogs?

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

qzyu999 commented Jun 10, 2026

Author

@qzyu999 Thanks for the PR. LGTM overall. Just one question: have you verified these catalogs?

Hi @luoyuxia, as mentioned in the PR description, no, these catalogs have not been verified yet:

Note: Changes are based on the existing lakekeeper.md as a template, and references were based on existing code and online/offline documentation. The actual AWS Glue/HMS implementations have not yet been tested by the developer.

I asked @leekeiabstraction about this, since I wasn't sure whether other developers had already gone through these steps themselves, and he said I should test it end to end (E2E). I am currently working through the AWS Glue steps and will update the PR accordingly. Thank you for your patience!

Development

Successfully merging this pull request may close these issues.

[Lake/Iceberg] create doc to integrate fluss + Iceberg via flink with aws glue

3 participants