Initial integration for hudi tables within Polaris #1862
Conversation
Thanks for your contribution, @rahil-c! Would you mind opening a discussion for this feature on [email protected]?
Thanks @dimas-b, will do so! I have raised an email on the dev list here: https://lists.apache.org/thread/66d39oqkc412kk262gy80bm723r9xmpm
Pull Request Overview
This PR introduces initial support for Hudi tables in the Polaris Spark catalog, enabling Hudi create/load operations alongside existing formats.
- Extended parameterized tests to cover the new "hudi" format.
- Added `HudiHelper` and `HudiCatalogUtils` for Hudi-specific catalog loading and namespace synchronization.
- Updated `SparkCatalog`, `PolarisCatalogUtils`, and build configurations to wire in Hudi dependencies and behavior.
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
File | Description |
---|---|
plugins/spark/v3.5/spark/src/test/java/.../DeserializationTest.java | Updated parameterized tests to accept and assert on format |
plugins/spark/v3.5/spark/src/test/java/.../SparkCatalogTest.java | Added static mocks and new Hudi namespace/table tests |
plugins/spark/v3.5/spark/src/test/java/.../NoopHudiCatalog.java | Created a no-op Hudi catalog stub for tests |
plugins/spark/v3.5/spark/src/main/java/.../PolarisCatalogUtils.java | Introduced useHudi , isHudiExtensionEnabled , Hudi load support, SQL builders |
plugins/spark/v3.5/spark/src/main/java/.../HudiHelper.java | New helper for instantiating and delegating to Hudi Catalog |
plugins/spark/v3.5/spark/src/main/java/.../HudiCatalogUtils.java | New utility for syncing namespace operations via SQL |
plugins/spark/v3.5/spark/src/main/java/.../SparkCatalog.java | Routed create/alter/drop to Hudi catalog when appropriate |
plugins/spark/v3.5/spark/src/main/java/.../PolarisSparkCatalog.java | Adjusted calls to pass Identifier through Hudi load API |
plugins/spark/v3.5/spark/build.gradle.kts | Added Hudi dependencies and exclusions |
plugins/spark/v3.5/integration/.../logback.xml | Enabled Hudi loggers for integration tests |
plugins/spark/v3.5/integration/.../SparkHudiIT.java | New integration tests for basic and unsupported Hudi ops |
plugins/spark/v3.5/integration/build.gradle.kts | Added Hive and Hudi bundles to integration dependencies |
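Per the file summary above, `HudiCatalogUtils` syncs namespace operations by replaying them as SQL against the Hudi catalog. A minimal standalone sketch of what such a SQL builder could look like — the class name, method name, and `WITH DBPROPERTIES` fallback are assumptions for illustration, not the actual Polaris code:

```java
import java.util.Map;
import java.util.StringJoiner;

public class NamespaceSyncSketch {

  /**
   * Build a CREATE NAMESPACE statement for a multi-part namespace,
   * back-tick quoting each level and appending any properties.
   */
  public static String createNamespaceSql(
      String catalogName, String[] namespace, Map<String, String> props) {
    StringJoiner ns = new StringJoiner(".");
    for (String level : namespace) {
      ns.add("`" + level + "`");
    }
    StringBuilder sql =
        new StringBuilder("CREATE NAMESPACE IF NOT EXISTS `")
            .append(catalogName)
            .append("`.")
            .append(ns);
    if (!props.isEmpty()) {
      StringJoiner p = new StringJoiner(", ");
      props.forEach((k, v) -> p.add("'" + k + "'='" + v + "'"));
      sql.append(" WITH DBPROPERTIES (").append(p).append(")");
    }
    return sql.toString();
  }

  public static void main(String[] args) {
    // Example: sync a single-level namespace with one property.
    System.out.println(createNamespaceSql("polaris", new String[] {"db1"}, Map.of("owner", "alice")));
  }
}
```

The real utility would presumably execute the statement through the active Spark session rather than just build the string.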
@rahil-c sorry, I made my comment yesterday but forgot to push it. I did a push and added some more comments; please let me know if you have more questions about this!
Thanks @gh-yzou, I have followed the recommendations above and updated the PR. Let me know if the approach looks good to you. If so, then I can try to break this down into smaller PRs.
plugins/spark/README.md (outdated)

> ### Hudi Support
> Currently support for Hudi tables within the Polaris catalog is still under development.
> The Hudi community has made a change to integrate with Polaris, and is planning on doing a minor release.
-> `hudi-spark-xxx` is required for Hudi table support to work end to end, and it is still being released.
will add this line
> return DataSourceV2Utils.getTableFromProvider(
>     provider, new CaseInsensitiveStringMap(tableProperties), scala.Option.empty());
> }
>
> /** Return a Spark V1Table for Hudi tables. */
> public static Table loadV1SparkHudiTable(
Actually, this function doesn't seem very specific to Hudi; maybe we can just call it loadV1SparkTable, and in the comment mention that it is currently only used by Hudi.
will do so
@flyrain @gh-yzou @eric-maynard #1862 (comment) has been resolved now based on recent changes. I was wondering if we can land this, as all comments should be addressed.
LGTM. Thanks @rahil-c !
@eric-maynard I think @rahil-c addressed all the comments. I am going to dismiss the requested change for this PR so that we can move forward. Once you are back, maybe we can follow up with a post review.
I'll take a look soon. It looks like the PR was just updated yesterday to address the comments.
> @@ -124,3 +124,9 @@ Following describes the current functionality limitations of the Polaris Spark c
> 3) Rename a Delta table is not supported.
> 4) ALTER TABLE ... SET LOCATION is not supported for DELTA table.
> 5) For other non-Iceberg tables like csv, it is not supported today.
>
> ### Hudi Support
> Currently support for Hudi tables within the Polaris catalog is still under development.
I'm confused as to why we would need this. If the integration needs changes on the Hudi side to work, why would we merge anything into Polaris now?
First off, we have already landed the initial changes on the Hudi side (apache/hudi#13558) in order to integrate Hudi with Polaris, based on discussion between members of the Polaris and Hudi communities.
We have discussed between both communities that Polaris would need the latest Hudi release artifact, and for that we will need to do a Hudi point release, for which I have already started the thread here: https://lists.apache.org/thread/4ztwgclljojg7r08mzm2dkynwfrvjlqb.
As doing a release for any open source project can take time (with a lot of back and forth), we aligned that before even starting the Hudi point release we should first ensure this initial Polaris-side change is landed, as this change is not dependent on any Hudi release artifact and would give us confidence to start the Hudi point release.
This was already aligned between both communities, hence this discussion thread was started yesterday for our Hudi code freeze process: https://lists.apache.org/thread/k524b5xq7l75tzz6sdzth15wjxdgp3gf, as we had obtained approval of a PMC member and a committer for this PR.
I'm fine with working in parallel with the Hudi community. This PR serves as the first step of the integration, and we can file follow-up PRs once Hudi 1.0.3 is out. With that, I guess this doc section itself is not quite necessary. It's normal for a feature to be split into multiple PRs.
> } catch (ClassCastException e) {
>   throw new IllegalArgumentException(
>       String.format(
>           "Cannot initialize Hudi Catalog, %s does not implement Table Catalog.",
Is this supposed to say TableCatalog?
Let me fix this to say `TableCatalog`; I believe I had gotten it from the setup of DeltaHelper:
https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/utils/DeltaHelper.java#L66
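For context, the pattern under discussion — reflectively loading a configured catalog class and surfacing a clear error when it does not implement `TableCatalog` — can be sketched standalone. The interface below is a stand-in for Spark's `TableCatalog`, and all names are illustrative, not the actual HudiHelper code:

```java
public class CatalogLoaderSketch {

  /** Minimal stand-in for Spark's TableCatalog interface. */
  public interface TableCatalog {}

  /** Instantiate the configured class and verify it implements TableCatalog. */
  public static TableCatalog loadCatalog(String impl) {
    try {
      Object instance = Class.forName(impl).getDeclaredConstructor().newInstance();
      // The cast throws ClassCastException when impl is some unrelated class.
      return (TableCatalog) instance;
    } catch (ClassCastException e) {
      throw new IllegalArgumentException(
          String.format("Cannot initialize Hudi Catalog, %s does not implement TableCatalog.", impl),
          e);
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate " + impl, e);
    }
  }

  /** Example implementation used for the demo below. */
  public static class GoodCatalog implements TableCatalog {}

  public static void main(String[] args) {
    TableCatalog catalog = loadCatalog(GoodCatalog.class.getName());
    System.out.println(catalog.getClass().getSimpleName());
  }
}
```

Wrapping the `ClassCastException` this way keeps the misconfiguration message actionable instead of leaking a raw cast failure.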
> public class PolarisCatalogUtils {
>   private static final Logger LOG = LoggerFactory.getLogger(PolarisCatalogUtils.class);
We use LOGGER elsewhere
Oh, it looks like 2 classes (both in the client) use LOG, hm. I wouldn't fix that here, but maybe just stick with LOGGER
Yes, we have usages of `LOG` in a couple of code paths. Personally I'm not sure the naming between `LOG` and `LOGGER` makes a difference from either a functional or aesthetic perspective, but I can make this change so we can move forward.
> // Currently Polaris generic table does not contain any schema information, partition columns,
> // stats, etc
> // for now we will just use fill the parameters we have from catalog, and let underlying client
`use fill`?
Will fix
> Option.apply(genericTable.getFormat()),
> emptyStringSeq,
> scala.Option.empty(),
> genericTable.getProperties().get("owner"),
This property currently isn't defined anywhere. Are you somehow setting it on writes?
If so, this should be a constant somewhere. If not, you should remove this.
On my side, I am not explicitly setting this property in the Hudi-side changes or in the Polaris changes. This seems to be coming from the Spark engine itself setting this value in the properties map.
For example, this property gets propagated during the Polaris `SparkCatalog#createTable`, which overrides Spark's `TableCatalog` interface; you can see it if you test with Delta in the create table path and examine the `properties` map.

[screenshot: debugger view of the table properties map]

You can see the `owner` for the table is already set before we even make a `createGenericTable` request. The `createGenericTableRequest` will then take those properties and ensure they get persisted in the `GenericTable` object on the catalog side.
If the ask is to just have this `"owner"` be a constant variable, e.g. `public static final String OWNER = "owner";`, I can do that.
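The constant-extraction ask could look like the following standalone sketch; the class name and the fallback behavior are assumptions for illustration, and the `"owner"` key itself is the one Spark populates at create time per the discussion above:

```java
import java.util.Map;

public class TablePropertyKeys {

  /** Property key Spark sets on table properties at create time. */
  public static final String OWNER = "owner";

  /** Read the owner from a table's property map, with a fallback when unset. */
  public static String ownerOf(Map<String, String> tableProperties) {
    return tableProperties.getOrDefault(OWNER, "unknown");
  }

  public static void main(String[] args) {
    // Simulate the properties map Spark passes into createTable.
    Map<String, String> props = Map.of(OWNER, "spark_user", "provider", "hudi");
    System.out.println(ownerOf(props));
  }
}
```

Referencing `TablePropertyKeys.OWNER` instead of a scattered `"owner"` literal keeps the key discoverable and renameable in one place.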
> @@ -418,7 +449,6 @@ void testCreateAndLoadGenericTable(String format) throws Exception {
>     () -> catalog.createTable(identifier, defaultSchema, new Transform[0], newProperties))
>     .isInstanceOf(TableAlreadyExistsException.class);
>
> // drop the iceberg table
Spurious change?
Yeah, will include this line back.
Motivation
Issue: #1896
The Polaris Spark client currently supports Iceberg and Delta tables. This PR aims to add support for Apache Hudi tables as Generic Tables.
Current behavior
Currently, the Polaris Spark client routes Iceberg table requests to Iceberg REST endpoints and Delta table requests to Generic Table REST endpoints. This PR allows Hudi to follow a similar path to what was done for Delta in Polaris.
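That routing can be sketched as a standalone dispatch on the table provider; the enum and method names are invented for the example and do not mirror the real SparkCatalog code:

```java
public class FormatRoutingSketch {

  /** The two families of REST endpoints the Polaris Spark client talks to. */
  public enum Endpoint { ICEBERG_REST, GENERIC_TABLE_REST }

  /** Pick the endpoint family for a table's provider string. */
  public static Endpoint route(String provider) {
    if (provider == null || provider.equalsIgnoreCase("iceberg")) {
      return Endpoint.ICEBERG_REST; // default catalog path
    }
    switch (provider.toLowerCase()) {
      case "delta":
      case "hudi": // newly supported by this PR
        return Endpoint.GENERIC_TABLE_REST;
      default:
        // e.g. csv is not supported today, per the README limitations
        throw new UnsupportedOperationException("Unsupported provider: " + provider);
    }
  }

  public static void main(String[] args) {
    System.out.println(route("hudi"));
  }
}
```

The real client naturally does more than pick an endpoint (it delegates to the Iceberg REST catalog or builds a generic-table request), but the provider-based branching is the core idea.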
Desired Behavior
Enable basic Hudi table operations through the Polaris Spark catalog.
Changes Included
Special note
Will follow up in another PR for integration and regression testing, as they will need to consume the latest Hudi point release artifact once some changes in Hudi land.