Initial integration for hudi tables within Polaris #1862

Open · wants to merge 3 commits into main from rahil-c/polaris-hudi

Conversation

@rahil-c commented Jun 11, 2025

Motivation

Issue: #1896

The Polaris Spark client currently supports Iceberg and Delta tables. This PR adds support for Apache Hudi tables as Generic Tables.

Current behavior

Currently, the Polaris Spark client routes Iceberg table requests to the Iceberg REST endpoints and Delta table requests to the Generic Table REST endpoints. This PR lets Hudi tables follow the same path that was added for Delta in Polaris.

Desired Behavior

Enable basic Hudi table operations through the Polaris Spark catalog by:

  • Adding Hudi table detection and routing logic (a hedged sketch of the idea follows below)
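
For illustration, here is a minimal, hypothetical sketch of that provider-based detection and routing; the method name useHudi mirrors the PR description, while the routing helper and string constants are assumptions rather than the PR's exact code.

```java
// Hypothetical sketch only: the real logic lives in PolarisCatalogUtils and
// SparkCatalog and works with Spark's TableCatalog types, not plain strings.
public final class HudiRoutingSketch {

  // Provider string Spark passes for tables created with `USING hudi`.
  private static final String HUDI_PROVIDER = "hudi";

  /** True when a table's provider property identifies it as a Hudi table. */
  static boolean useHudi(String provider) {
    return HUDI_PROVIDER.equalsIgnoreCase(provider);
  }

  /** Picks the endpoint family a create/load request should be routed to. */
  static String route(String provider) {
    if ("iceberg".equalsIgnoreCase(provider)) {
      return "Iceberg REST endpoints";
    } else if (useHudi(provider) || "delta".equalsIgnoreCase(provider)) {
      return "Generic Table REST endpoints";
    }
    throw new UnsupportedOperationException("Unsupported provider: " + provider);
  }

  public static void main(String[] args) {
    System.out.println(route("hudi"));    // Generic Table REST endpoints
    System.out.println(route("iceberg")); // Iceberg REST endpoints
  }
}
```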

Changes Included

  • Core Implementation: Added a HudiHelper utility class and enhanced PolarisCatalogUtils with Hudi-specific table loading
    logic (see the hedged sketch after this list)
  • Catalog Integration: Modified SparkCatalog to detect and route Hudi table operations appropriately
  • Testing: Added unit tests for the Hudi integration
  • Documentation: Updated README with Hudi development support
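
As a rough illustration of the HudiHelper idea, here is a sketch modeled on the existing DeltaHelper pattern (reflectively instantiating the format's Spark catalog and delegating to it). The HoodieCatalog class name comes from Hudi's codebase; the surrounding structure is an assumption, not the PR's exact code.

```java
import org.apache.spark.sql.connector.catalog.TableCatalog;

// Sketch of a DeltaHelper-style loader for Hudi's Spark catalog.
public class HudiHelperSketch {
  private static final String HUDI_CATALOG_IMPL =
      "org.apache.spark.sql.hudi.catalog.HoodieCatalog";

  private TableCatalog hudiCatalog = null;

  public TableCatalog loadHudiCatalog() {
    if (hudiCatalog != null) {
      return hudiCatalog;
    }
    try {
      // Load reflectively so the Polaris client has no compile-time Hudi dependency.
      Class<?> catalogClass = Class.forName(HUDI_CATALOG_IMPL);
      hudiCatalog = (TableCatalog) catalogClass.getDeclaredConstructor().newInstance();
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(
          "Hudi catalog implementation not found on the classpath", e);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Failed to instantiate the Hudi catalog", e);
    } catch (ClassCastException e) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot initialize Hudi Catalog, %s does not implement TableCatalog.",
              HUDI_CATALOG_IMPL),
          e);
    }
    return hudiCatalog;
  }
}
```

Loading the catalog reflectively keeps Hudi classes off the client's compile-time classpath, which matters later in this thread when the HoodieInternalV2Table dependency is discussed.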

Special note

Integration and regression testing will follow in another PR, since those tests need to consume the latest Hudi point-release artifact once the corresponding changes land in Hudi.

@dimas-b (Contributor) commented Jun 11, 2025

Thanks for your contribution, @rahil-c! Would you mind opening a discussion for this feature on [email protected]?

@rahil-c force-pushed the rahil-c/polaris-hudi branch from 37af09a to 98908b3 on June 13, 2025
@rahil-c (Author) commented Jun 13, 2025

Thanks @dimas-b, will do! I have raised a thread on the dev list here: https://lists.apache.org/thread/66d39oqkc412kk262gy80bm723r9xmpm

@rahil-c force-pushed the rahil-c/polaris-hudi branch from d0011d5 to 5445c48 on June 16, 2025
@rahil-c force-pushed the rahil-c/polaris-hudi branch from 5b136d6 to 2bb83cd on July 1, 2025
@rahil-c changed the title from "[DRAFT] Initial integration for hudi tables within Polaris" to "Initial integration for hudi tables within Polaris" on Jul 1, 2025
@rahil-c marked this pull request as ready for review on July 1, 2025
@rahil-c force-pushed the rahil-c/polaris-hudi branch from 2bb83cd to 6185ea6 on July 1, 2025
@rahil-c (Author) commented Jul 1, 2025

cc @flyrain @gh-yzou @singhpk234

@flyrain requested a review from Copilot on July 1, 2025
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces initial support for Hudi tables in the Polaris Spark catalog, enabling Hudi create/load operations alongside existing formats.

  • Extended parameterized tests to cover the new “hudi” format.
  • Added HudiHelper and HudiCatalogUtils for Hudi-specific catalog loading and namespace synchronization.
  • Updated SparkCatalog, PolarisCatalogUtils, and build configurations to wire in Hudi dependencies and behavior.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

File Description
plugins/spark/v3.5/spark/src/test/java/.../DeserializationTest.java Updated parameterized tests to accept and assert on format
plugins/spark/v3.5/spark/src/test/java/.../SparkCatalogTest.java Added static mocks and new Hudi namespace/table tests
plugins/spark/v3.5/spark/src/test/java/.../NoopHudiCatalog.java Created a no-op Hudi catalog stub for tests
plugins/spark/v3.5/spark/src/main/java/.../PolarisCatalogUtils.java Introduced useHudi, isHudiExtensionEnabled, Hudi load support, SQL builders
plugins/spark/v3.5/spark/src/main/java/.../HudiHelper.java New helper for instantiating and delegating to Hudi Catalog
plugins/spark/v3.5/spark/src/main/java/.../HudiCatalogUtils.java New utility for syncing namespace operations via SQL
plugins/spark/v3.5/spark/src/main/java/.../SparkCatalog.java Routed create/alter/drop to Hudi catalog when appropriate
plugins/spark/v3.5/spark/src/main/java/.../PolarisSparkCatalog.java Adjusted calls to pass Identifier through Hudi load API
plugins/spark/v3.5/spark/build.gradle.kts Added Hudi dependencies and exclusions
plugins/spark/v3.5/integration/.../logback.xml Enabled Hudi loggers for integration tests
plugins/spark/v3.5/integration/.../SparkHudiIT.java New integration tests for basic and unsupported Hudi ops
plugins/spark/v3.5/integration/build.gradle.kts Added Hive and Hudi bundles to integration dependencies

@flyrain requested a review from gh-yzou on July 1, 2025
@gh-yzou (Contributor) commented Jul 8, 2025

@rahil-c sorry, I made my comments yesterday but forgot to push them. I have pushed now and added some more; please let me know if you have more questions!
As we have discussed, there are two main concerns for this PR:

  1. The Hudi dependency introduced for the Spark client, caused by the usage of HoodieInternalV2Table. This can be resolved by loading a V1Table and letting HudiCatalog's loadTable handle the final table result (a hedged sketch of this idea follows after this list): https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala#L123
  2. The extra namespace creation for HudiCatalog. The Polaris Spark client reuses the whole Iceberg namespace; ideally we do not want to maintain extra namespace creation just for a specific table format. The extra namespace creation is needed because HudiCatalog only works with the SparkSession catalog and HiveCatalog today (https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala#L198); since Polaris is a REST catalog, this no longer works. We want to see if we can push the Hudi community to improve the catalog implementation with respect to third-party catalog plugins, similar to how Delta special-cased Unity Catalog here: https://github.com/delta-io/delta/blob/2d89954008b6c53e49744f09435136c5c63b9f2c/spark/src/main/scala/org/apache/spark/sql/delta/catalog/DeltaCatalog.scala#L218
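
For context, here is a minimal sketch of the V1Table approach in point 1. V1Table and CatalogTable are real Spark types; the wrapper method below is purely illustrative.

```java
import org.apache.spark.sql.catalyst.catalog.CatalogTable;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.V1Table;

// Sketch: the Polaris client hands back a plain V1Table built from catalog
// metadata, so no Hudi classes (e.g. HoodieInternalV2Table) are needed at
// load time; HoodieCatalog.loadTable later resolves the final Hudi table.
public final class V1TableSketch {
  static Table asV1Table(CatalogTable metadata) {
    return new V1Table(metadata);
  }
}
```

The point of this shape is that the Spark client stays format-agnostic: the Hudi-specific resolution happens inside Hudi's own catalog, not in Polaris.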

@rahil-c (Author) commented Jul 23, 2025

(Quoting @gh-yzou's two concerns above.)

Thanks @gh-yzou, I have followed the recommendations above and updated the PR. Let me know if the approach looks good to you; if so, I can try to break this down into smaller PRs.

@rahil-c force-pushed the rahil-c/polaris-hudi branch from 7777796 to 087b408 on July 24, 2025
@rahil-c requested a review from gh-yzou on July 24, 2025
@rahil-c force-pushed the rahil-c/polaris-hudi branch from d6f3175 to 2b113c4 on July 26, 2025
@gh-yzou previously approved these changes on Jul 28, 2025

### Hudi Support
Currently support for Hudi tables within the Polaris catalog is still under development.
The Hudi community has made a change to integrate with Polaris, and is planning on doing a minor release.
Contributor commented:

-> hudi-spark-xxx is required for Hudi table support to work end to end, and it is still in the process of being released.

Author replied:

will add this line

  return DataSourceV2Utils.getTableFromProvider(
      provider, new CaseInsensitiveStringMap(tableProperties), scala.Option.empty());
}

/** Return a Spark V1Table for Hudi tables. */
public static Table loadV1SparkHudiTable(
Contributor commented:

Actually, this function doesn't seem very specific to Hudi; maybe we can just call it loadV1SparkTable and mention in the comment that it is currently only used by Hudi.

Author replied:

will do so

@rahil-c (Author) commented Jul 28, 2025

@flyrain @gh-yzou @eric-maynard: #1862 (comment) has now been resolved based on the recent changes.

Wondering if we can land this, as all comments should now be addressed.

@flyrain (Contributor) left a comment:

LGTM. Thanks @rahil-c !

@gh-yzou (Contributor) commented Jul 29, 2025

@eric-maynard I think @rahil-c addressed all the comments. I am going to dismiss the requested change on this PR so that we can move forward. Once you are back, maybe we can follow up with a post-merge review.

@eric-maynard (Contributor) commented:
I’ll take a look soon — it looks like the PR was just updated yesterday to address the comments


@@ -124,3 +124,9 @@ Following describes the current functionality limitations of the Polaris Spark c
3) Rename a Delta table is not supported.
4) ALTER TABLE ... SET LOCATION is not supported for DELTA table.
5) For other non-Iceberg tables like csv, it is not supported today.

### Hudi Support
Currently support for Hudi tables within the Polaris catalog is still under development.
Contributor commented:

I'm confused as to why we would need this. If the integration needs changes on the Hudi side to work, why would we merge anything into Polaris now?

@rahil-c (Author) commented Jul 29, 2025:

First off, we have already landed the initial changes on the Hudi side (apache/hudi#13558) to integrate Hudi with Polaris, based on discussion between members of the Polaris and Hudi communities.

Both communities have agreed that Polaris will need the latest Hudi release artifact, and for that we will need a Hudi point release, for which I have already started a thread here: https://lists.apache.org/thread/4ztwgclljojg7r08mzm2dkynwfrvjlqb.

Since a release for any open-source project can be time-consuming (with a lot of back and forth), we aligned that before even starting the Hudi point release we should first land this initial Polaris-side change: it does not depend on any Hudi release artifact, and landing it gives us the confidence to start the point release.

This was already aligned between both communities, hence this discussion thread for the Hudi code-freeze process was started yesterday: https://lists.apache.org/thread/k524b5xq7l75tzz6sdzth15wjxdgp3gf, as we had obtained approval from a PMC member and a committer for this PR.

cc @flyrain @gh-yzou @singhpk234

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with working in parallel with the Hudi community. This PR serves as the first step of the integration, and we can file follow-up PRs once Hudi 1.0.3 is out. With that, I guess this doc section itself is not quite necessary; it's normal for a feature to be split into multiple PRs.

} catch (ClassCastException e) {
  throw new IllegalArgumentException(
      String.format(
          "Cannot initialize Hudi Catalog, %s does not implement Table Catalog.",
Contributor commented:

Is this supposed to say TableCatalog?

Author replied:

Let me fix this to say TableCatalog; I believe I had gotten it from the setup of DeltaHelper: https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/utils/DeltaHelper.java#L66


public class PolarisCatalogUtils {
  private static final Logger LOG = LoggerFactory.getLogger(PolarisCatalogUtils.class);
Contributor commented:

We use LOGGER elsewhere

Contributor commented:

Oh, it looks like two classes (both in the client) use LOG, hm. I wouldn't fix that here, but maybe just stick with LOGGER.

Author replied:

Yes, we have usages of LOG in the following code paths:

https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/SparkCatalog.java#L69

https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/utils/DeltaHelper.java#L32

Personally, I'm not sure the choice between LOG and LOGGER makes a difference from either a functional or an aesthetic perspective, but I can make the change so we can move forward.


// Currently Polaris generic table does not contain any schema information, partition columns,
// stats, etc
// for now we will just use fill the parameters we have from catalog, and let underlying client
Contributor commented:

use fill?

Author replied:

Will fix

Option.apply(genericTable.getFormat()),
emptyStringSeq,
scala.Option.empty(),
genericTable.getProperties().get("owner"),
Contributor commented:

This property currently isn't defined anywhere. Are you somehow setting it on writes?

If so, this should be a constant somewhere. If not, you should remove this.

Author replied:

On my side, I am not explicitly setting this property in the Hudi-side changes or in the Polaris changes. It seems to come from the Spark engine itself, which sets this value in the properties map.

For example, this property gets propagated during Polaris SparkCatalog#createTable, which overrides Spark's TableCatalog interface; you can see the same thing if you test a Delta create table and examine the properties map.

[screenshot: properties map during createTable]

You can see the owner for the table is already set before we even make a createGenericTable request.

[screenshot: owner property present in the table properties]

The createGenericTable request will then take those properties and ensure they get persisted in the GenericTable object on the catalog side.

If the ask is just to have this "owner" be a constant, e.g. public static final String OWNER = "owner";, I can do that.

@@ -418,7 +449,6 @@ void testCreateAndLoadGenericTable(String format) throws Exception {
() -> catalog.createTable(identifier, defaultSchema, new Transform[0], newProperties))
.isInstanceOf(TableAlreadyExistsException.class);

// drop the iceberg table
Contributor commented:

Spurious change?

Author replied:

Yeah, I will add this line back.
