Initial integration for hudi tables within Polaris #1862

Open · wants to merge 3 commits into main from rahil-c/polaris-hudi

Conversation

@rahil-c commented Jun 11, 2025

Motivation

Issue: #1896

The Polaris Spark client currently supports Iceberg and Delta tables. This PR adds support for Apache Hudi tables as Generic Tables.

Current behavior

Currently, the Polaris Spark client routes Iceberg table requests to the Iceberg REST endpoints and Delta table requests to the Generic Table REST endpoints. This PR lets Hudi tables follow the same path that was added for Delta in Polaris.

Desired Behavior

Enable basic Hudi table operations through the Polaris Spark catalog by:

  • Adding Hudi table detection and routing logic (a hedged sketch of the idea follows below)
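
For illustration, here is a minimal, hypothetical sketch of that provider-based detection and routing; the method name useHudi mirrors the PR description, while the routing helper and string constants are assumptions rather than the PR's exact code.

```java
// Hypothetical sketch only: the real logic lives in PolarisCatalogUtils and
// SparkCatalog and works with Spark's TableCatalog types, not plain strings.
public final class HudiRoutingSketch {

  // Provider string Spark passes for tables created with `USING hudi`.
  private static final String HUDI_PROVIDER = "hudi";

  /** True when a table's provider property identifies it as a Hudi table. */
  static boolean useHudi(String provider) {
    return HUDI_PROVIDER.equalsIgnoreCase(provider);
  }

  /** Picks the endpoint family a create/load request should be routed to. */
  static String route(String provider) {
    if ("iceberg".equalsIgnoreCase(provider)) {
      return "Iceberg REST endpoints";
    } else if (useHudi(provider) || "delta".equalsIgnoreCase(provider)) {
      return "Generic Table REST endpoints";
    }
    throw new UnsupportedOperationException("Unsupported provider: " + provider);
  }

  public static void main(String[] args) {
    System.out.println(route("hudi"));    // Generic Table REST endpoints
    System.out.println(route("iceberg")); // Iceberg REST endpoints
  }
}
```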

Changes Included

  • Core Implementation: Added a HudiHelper utility class and enhanced PolarisCatalogUtils with Hudi-specific table loading
    logic (see the hedged sketch after this list)
  • Catalog Integration: Modified SparkCatalog to detect and route Hudi table operations appropriately
  • Testing: Added unit tests for the Hudi integration
  • Documentation: Updated README with Hudi development support
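
As a rough illustration of the HudiHelper idea, here is a sketch modeled on the existing DeltaHelper pattern (reflectively instantiating the format's Spark catalog and delegating to it). The HoodieCatalog class name comes from Hudi's codebase; the surrounding structure is an assumption, not the PR's exact code.

```java
import org.apache.spark.sql.connector.catalog.TableCatalog;

// Sketch of a DeltaHelper-style loader for Hudi's Spark catalog.
public class HudiHelperSketch {
  private static final String HUDI_CATALOG_IMPL =
      "org.apache.spark.sql.hudi.catalog.HoodieCatalog";

  private TableCatalog hudiCatalog = null;

  public TableCatalog loadHudiCatalog() {
    if (hudiCatalog != null) {
      return hudiCatalog;
    }
    try {
      // Load reflectively so the Polaris client has no compile-time Hudi dependency.
      Class<?> catalogClass = Class.forName(HUDI_CATALOG_IMPL);
      hudiCatalog = (TableCatalog) catalogClass.getDeclaredConstructor().newInstance();
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(
          "Hudi catalog implementation not found on the classpath", e);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Failed to instantiate the Hudi catalog", e);
    } catch (ClassCastException e) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot initialize Hudi Catalog, %s does not implement TableCatalog.",
              HUDI_CATALOG_IMPL),
          e);
    }
    return hudiCatalog;
  }
}
```

Loading the catalog reflectively keeps Hudi classes off the client's compile-time classpath, which matters later in this thread when the HoodieInternalV2Table dependency is discussed.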

Special note

Integration and regression testing will follow in another PR, since those tests need to consume the latest Hudi point-release artifact once the corresponding changes land in Hudi.

@dimas-b (Contributor) commented Jun 11, 2025

Thanks for your contribution, @rahil-c! Would you mind opening a discussion for this feature on [email protected]?

@rahil-c force-pushed the rahil-c/polaris-hudi branch from 37af09a to 98908b3 on June 13, 2025
@rahil-c (Author) commented Jun 13, 2025

Thanks @dimas-b, will do! I have raised a thread on the dev list here: https://lists.apache.org/thread/66d39oqkc412kk262gy80bm723r9xmpm

@rahil-c force-pushed the rahil-c/polaris-hudi branch from d0011d5 to 5445c48 on June 16, 2025
@rahil-c force-pushed the rahil-c/polaris-hudi branch from 5b136d6 to 2bb83cd on July 1, 2025
@rahil-c changed the title from "[DRAFT] Initial integration for hudi tables within Polaris" to "Initial integration for hudi tables within Polaris" on Jul 1, 2025
@rahil-c marked this pull request as ready for review on July 1, 2025
@rahil-c force-pushed the rahil-c/polaris-hudi branch from 2bb83cd to 6185ea6 on July 1, 2025
@rahil-c (Author) commented Jul 1, 2025

cc @flyrain @gh-yzou @singhpk234

@flyrain requested a review from Copilot on July 1, 2025
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces initial support for Hudi tables in the Polaris Spark catalog, enabling Hudi create/load operations alongside existing formats.

  • Extended parameterized tests to cover the new “hudi” format.
  • Added HudiHelper and HudiCatalogUtils for Hudi-specific catalog loading and namespace synchronization.
  • Updated SparkCatalog, PolarisCatalogUtils, and build configurations to wire in Hudi dependencies and behavior.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

File Description
plugins/spark/v3.5/spark/src/test/java/.../DeserializationTest.java Updated parameterized tests to accept and assert on format
plugins/spark/v3.5/spark/src/test/java/.../SparkCatalogTest.java Added static mocks and new Hudi namespace/table tests
plugins/spark/v3.5/spark/src/test/java/.../NoopHudiCatalog.java Created a no-op Hudi catalog stub for tests
plugins/spark/v3.5/spark/src/main/java/.../PolarisCatalogUtils.java Introduced useHudi, isHudiExtensionEnabled, Hudi load support, SQL builders
plugins/spark/v3.5/spark/src/main/java/.../HudiHelper.java New helper for instantiating and delegating to Hudi Catalog
plugins/spark/v3.5/spark/src/main/java/.../HudiCatalogUtils.java New utility for syncing namespace operations via SQL
plugins/spark/v3.5/spark/src/main/java/.../SparkCatalog.java Routed create/alter/drop to Hudi catalog when appropriate
plugins/spark/v3.5/spark/src/main/java/.../PolarisSparkCatalog.java Adjusted calls to pass Identifier through Hudi load API
plugins/spark/v3.5/spark/build.gradle.kts Added Hudi dependencies and exclusions
plugins/spark/v3.5/integration/.../logback.xml Enabled Hudi loggers for integration tests
plugins/spark/v3.5/integration/.../SparkHudiIT.java New integration tests for basic and unsupported Hudi ops
plugins/spark/v3.5/integration/build.gradle.kts Added Hive and Hudi bundles to integration dependencies

@flyrain requested a review from gh-yzou on July 1, 2025
@gh-yzou (Contributor) commented Jul 8, 2025

@rahil-c sorry, I made my comments yesterday but forgot to push them. I have pushed now and added some more; please let me know if you have more questions!
As we have discussed, there are two main concerns for this PR:

  1. The Hudi dependency introduced for the Spark client, caused by the usage of HoodieInternalV2Table. This can be resolved by loading a V1Table and letting HudiCatalog's loadTable handle the final table result (a hedged sketch of this idea follows after this list): https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala#L123
  2. The extra namespace creation for HudiCatalog. The Polaris Spark client reuses the whole Iceberg namespace; ideally we do not want to maintain extra namespace creation just for a specific table format. The extra namespace creation is needed because HudiCatalog only works with the SparkSession catalog and HiveCatalog today (https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala#L198); since Polaris is a REST catalog, this no longer works. We want to see if we can push the Hudi community to improve the catalog implementation with respect to third-party catalog plugins, similar to how Delta special-cased Unity Catalog here: https://github.com/delta-io/delta/blob/2d89954008b6c53e49744f09435136c5c63b9f2c/spark/src/main/scala/org/apache/spark/sql/delta/catalog/DeltaCatalog.scala#L218
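
For context, here is a minimal sketch of the V1Table approach in point 1. V1Table and CatalogTable are real Spark types; the wrapper method below is purely illustrative.

```java
import org.apache.spark.sql.catalyst.catalog.CatalogTable;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.V1Table;

// Sketch: the Polaris client hands back a plain V1Table built from catalog
// metadata, so no Hudi classes (e.g. HoodieInternalV2Table) are needed at
// load time; HoodieCatalog.loadTable later resolves the final Hudi table.
public final class V1TableSketch {
  static Table asV1Table(CatalogTable metadata) {
    return new V1Table(metadata);
  }
}
```

The point of this shape is that the Spark client stays format-agnostic: the Hudi-specific resolution happens inside Hudi's own catalog, not in Polaris.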

@rahil-c (Author) commented Jul 23, 2025

(Quoting @gh-yzou's two concerns above.)

Thanks @gh-yzou, I have followed the recommendations above and updated the PR. Let me know if the approach looks good to you; if so, I can try to break this down into smaller PRs.

@rahil-c force-pushed the rahil-c/polaris-hudi branch from 7777796 to 087b408 on July 24, 2025
@rahil-c requested a review from gh-yzou on July 24, 2025
@rahil-c force-pushed the rahil-c/polaris-hudi branch from d6f3175 to 2b113c4 on July 26, 2025
@gh-yzou previously approved these changes on Jul 28, 2025

### Hudi Support
Currently support for Hudi tables within the Polaris catalog is still under development.
The Hudi community has made a change to integrate with Polaris, and is planning on doing a minor release.
Contributor commented:

-> hudi-spark-xxx is required for Hudi table support to work end to end, and it is still in the process of being released.

Author replied:

will add this line

  return DataSourceV2Utils.getTableFromProvider(
      provider, new CaseInsensitiveStringMap(tableProperties), scala.Option.empty());
}

/** Return a Spark V1Table for Hudi tables. */
public static Table loadV1SparkHudiTable(
Contributor commented:

Actually, this function doesn't seem very specific to Hudi; maybe we can just call it loadV1SparkTable and mention in the comment that it is currently only used by Hudi.

Author replied:

will do so

@rahil-c (Author) commented Jul 28, 2025

@flyrain @gh-yzou @eric-maynard: #1862 (comment) has now been resolved based on the recent changes.

Wondering if we can land this, as all comments should now be addressed.

@flyrain (Contributor) left a comment:

LGTM. Thanks @rahil-c !

@gh-yzou (Contributor) commented Jul 29, 2025

@eric-maynard I think @rahil-c addressed all the comments. I am going to dismiss the requested change on this PR so that we can move forward. Once you are back, maybe we can follow up with a post-merge review.

@eric-maynard (Contributor) commented:
I’ll take a look soon — it looks like the PR was just updated yesterday to address the comments


@@ -124,3 +124,9 @@ Following describes the current functionality limitations of the Polaris Spark c
3) Rename a Delta table is not supported.
4) ALTER TABLE ... SET LOCATION is not supported for DELTA table.
5) For other non-Iceberg tables like csv, it is not supported today.

### Hudi Support
Currently support for Hudi tables within the Polaris catalog is still under development.
Contributor commented:

I'm confused as to why we would need this. If the integration needs changes on the Hudi side to work, why would we merge anything into Polaris now?

@rahil-c (Author) commented Jul 29, 2025:

First off, we have already landed the initial changes on the Hudi side (apache/hudi#13558) to integrate Hudi with Polaris, based on discussion between members of the Polaris and Hudi communities.

Both communities have agreed that Polaris will need the latest Hudi release artifact, and for that we will need a Hudi point release, for which I have already started a thread here: https://lists.apache.org/thread/4ztwgclljojg7r08mzm2dkynwfrvjlqb.

Since a release for any open-source project can be time-consuming (with a lot of back and forth), we aligned that before even starting the Hudi point release we should first land this initial Polaris-side change: it does not depend on any Hudi release artifact, and landing it gives us the confidence to start the point release.

This was already aligned between both communities, hence this discussion thread for the Hudi code-freeze process was started yesterday: https://lists.apache.org/thread/k524b5xq7l75tzz6sdzth15wjxdgp3gf, as we had obtained approval from a PMC member and a committer for this PR.

cc @flyrain @gh-yzou @singhpk234

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with working in parallel with the Hudi community. This PR serves as the first step of the integration, and we can file follow-up PRs once Hudi 1.0.3 is out. With that, I guess this doc section itself is not quite necessary; it's normal for a feature to be split into multiple PRs.

} catch (ClassCastException e) {
  throw new IllegalArgumentException(
      String.format(
          "Cannot initialize Hudi Catalog, %s does not implement Table Catalog.",
Contributor commented:

Is this supposed to say TableCatalog?

Author replied:

Let me fix this to say TableCatalog; I believe I had gotten it from the setup of DeltaHelper: https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/utils/DeltaHelper.java#L66


public class PolarisCatalogUtils {
  private static final Logger LOG = LoggerFactory.getLogger(PolarisCatalogUtils.class);
Contributor commented:

We use LOGGER elsewhere

Contributor commented:

Oh, it looks like two classes (both in the client) use LOG, hm. I wouldn't fix that here, but maybe just stick with LOGGER.

Author replied:

Yes, we have usages of LOG in the following code paths:

https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/SparkCatalog.java#L69

https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/utils/DeltaHelper.java#L32

Personally, I'm not sure the choice between LOG and LOGGER makes a difference from either a functional or an aesthetic perspective, but I can make the change so we can move forward.


// Currently Polaris generic table does not contain any schema information, partition columns,
// stats, etc
// for now we will just use fill the parameters we have from catalog, and let underlying client
Contributor commented:

use fill?

Author replied:

Will fix

Option.apply(genericTable.getFormat()),
emptyStringSeq,
scala.Option.empty(),
genericTable.getProperties().get("owner"),
Contributor commented:

This property currently isn't defined anywhere. Are you somehow setting it on writes?

If so, this should be a constant somewhere. If not, you should remove this.

Author replied:

On my side, I am not explicitly setting this property in the Hudi-side changes or in the Polaris changes. It seems to come from the Spark engine itself, which sets this value in the properties map.

For example, this property gets propagated during Polaris SparkCatalog#createTable, which overrides Spark's TableCatalog interface; you can see the same thing if you test a Delta create table and examine the properties map.

[screenshot: properties map during createTable]

You can see the owner for the table is already set before we even make a createGenericTable request.

[screenshot: owner property present in the table properties]

The createGenericTable request will then take those properties and ensure they get persisted in the GenericTable object on the catalog side.

If the ask is just to have this "owner" be a constant, e.g. public static final String OWNER = "owner";, I can do that.

@@ -418,7 +449,6 @@ void testCreateAndLoadGenericTable(String format) throws Exception {
() -> catalog.createTable(identifier, defaultSchema, new Transform[0], newProperties))
.isInstanceOf(TableAlreadyExistsException.class);

// drop the iceberg table
Contributor commented:

Spurious change?

Author replied:

Yeah, I will add this line back.
