[SPARK-55645][SQL][FOLLOWUP] Move serdeName to last parameter and filter empty strings#54860

Open
cloud-fan wants to merge 4 commits into apache:master from cloud-fan:SPARK-55645-followup

Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Two followup improvements to #54467 (SPARK-55645):

  1. Move serdeName to the last parameter of CatalogStorageFormat with a default value of None, so that existing callers that construct CatalogStorageFormat positionally remain source-compatible without code changes.

  2. Filter empty strings when reading serdeName from the Hive Metastore API — Hive returns "" for tables without an explicit serde name, which should map to None rather than Some("").
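The two changes can be sketched as follows. This is a simplified illustration only: the real `CatalogStorageFormat` in `org.apache.spark.sql.catalyst.catalog` has more fields, and `readSerdeName` is a hypothetical helper name, not the actual Spark code.

```scala
// Simplified sketch of CatalogStorageFormat; the real class has more fields.
case class CatalogStorageFormat(
    locationUri: Option[String] = None,
    inputFormat: Option[String] = None,
    outputFormat: Option[String] = None,
    serde: Option[String] = None,
    // Change 1: serdeName moved to the last position with a default value,
    // so existing positional constructor calls compile unchanged.
    serdeName: Option[String] = None)

// Change 2: when reading from the Hive Metastore, "" (or null) means
// "no explicit serde name" and must map to None, not Some("").
def readSerdeName(raw: String): Option[String] =
  Option(raw).filter(_.nonEmpty)

// A positional caller written before the new field still compiles:
val fmt = CatalogStorageFormat(Some("/tmp/t"), None, None, Some("serde.Class"))
```

Because the new field is trailing and defaulted, `fmt.serdeName` is `None` without the caller mentioning it.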

Why are the changes needed?

  1. Adding serdeName as a required positional parameter in the middle of the parameter list breaks source compatibility for all external callers (e.g., third-party connectors) that construct CatalogStorageFormat positionally. Moving it to the last position with a default value avoids this.

  2. The Hive Metastore returns an empty string for SerDeInfo.name when no serde name is explicitly set. Without filtering, Option("") produces Some("") instead of the semantically correct None, which could cause unexpected behavior in downstream code that checks serdeName.isDefined or pattern-matches on it.
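The second point is easy to see in a Scala REPL (illustrative only):

```scala
// Without filtering, an empty string from the metastore looks like a real
// serde name to any code that checks isDefined or pattern-matches on it.
val unfiltered: Option[String] = Option("")
assert(unfiltered == Some(""))
assert(unfiltered.isDefined) // wrongly treated as "has a serde name"

// With filtering, the empty string collapses to None as intended.
val filtered: Option[String] = Option("").filter(_.nonEmpty)
assert(filtered.isEmpty)
```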

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new test, `serdeName should be None for tables without an explicit serde name`, that verifies the empty-string filtering. Existing tests cover the parameter reordering.

Was this patch authored or co-authored using generative AI tooling?

Yes

[SPARK-55645][SQL][FOLLOWUP] Move serdeName to last parameter and filter empty strings

Move serdeName to the last parameter of CatalogStorageFormat with a
default value of None, so that existing callers that construct
CatalogStorageFormat positionally (e.g., third-party connectors like
Hudi) remain source-compatible without code changes.

Also filter empty strings when reading serdeName from the Hive
Metastore API — Hive returns "" for tables without an explicit serde
name, which should map to None rather than Some("").
@cloud-fan
Contributor Author

@sarutak @pan3793 @tagatac

@sarutak
Member

sarutak commented Mar 17, 2026

Adding serdeName as a required positional parameter in the middle of the parameter list breaks source compatibility for all external callers (e.g., third-party connectors) that construct CatalogStorageFormat positionally. Moving it to the last position with a default value avoids this.

I agree, as I was concerned about this here, but what do you think, @pan3793? You were concerned about hidden bugs.

@pan3793
Member

pan3793 commented Mar 17, 2026

I agree with mapping "" to None rather than Some(""), but I don't want a change that remains source-compatible while actually breaking binary compatibility - I used to spend a lot of time diagnosing such issues when developing third-party libraries. Downstream projects like Delta and Iceberg already have a shim layer or Spark version-specific source folders to manage code for different Spark branches, so this should not introduce much burden for them.
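The distinction between source and binary compatibility can be illustrated with a hypothetical simplified class (not the actual Spark code):

```scala
// v1, which a third-party jar was compiled against:
//   case class Storage(serde: Option[String])
// v2 adds a trailing parameter with a default value:
//   case class Storage(serde: Option[String], serdeName: Option[String] = None)

// Source compatibility holds: `Storage(Some("x"))` still compiles against v2,
// because the compiler inserts the default argument at the call site.

// Binary compatibility breaks: the v1 jar's bytecode calls the constructor
//   Storage.<init>(Lscala/Option;)V
// which no longer exists in v2 (the constructor now takes two arguments).
// Running the old jar against the new Spark fails at runtime with
// java.lang.NoSuchMethodError, even though recompiling would succeed.
```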

val options = storage.properties + (ParquetOptions.MERGE_SCHEMA ->
  SQLConf.get.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
storage.copy(
  serdeName = None,
Contributor

Sorry if this change was not correct in the original PR, but doesn't unsetting serdeName match the sentiment of unsetting the serde here?

Contributor Author

It's just a name for display, so it does not really matter.

Contributor Author

but still better to clear it.

val options = storage.properties
if (SQLConf.get.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
  storage.copy(
    serdeName = None,
Contributor

Same question as line 180.

  )
} else {
  storage.copy(
    serdeName = None,
Contributor

Same question as line 180.

@cloud-fan
Contributor Author

This is just a name for display in DESC TABLE; I don't think it's worth people's attention and effort to break source compatibility over it.

@pan3793
Member

pan3793 commented Mar 17, 2026

@cloud-fan, many Spark apps/plugins use the Spark public API plus only a few private APIs, and they tend to ship a single unified jar that is compatible with multiple Spark versions. In such cases, a source-compatible but binary-incompatible change easily causes hidden bugs.

One example is in Apache Kyuubi: we build a thrift server on the driver as a Spark app, similar to STS, shipped as a unified jar that can run on all Spark 3.x versions. A source-compatible change to a Spark method signature broke our assumption. The worst part is that although we have built multi-level tests to check compatibility (compiling and running unit tests against different Spark versions, compiling with the default Spark version and submitting the app jar to all other supported versions, and running integration tests), the affected code was related to the web UI, which is not covered by our integration tests, so the bug was not caught.

For cases like Delta and Iceberg, which use Spark private APIs heavily, a shim layer should make adapting to such a change easy.

@cloud-fan
Contributor Author

I don't get the point - what can go wrong if we forget to set the serde name? I wouldn't have agreed to merge the PR in the first place if it broke compatibility for such a small feature.

Member

@pan3793 pan3793 left a comment

In an offline discussion, @cloud-fan provided a good case (Scala notebooks) for the benefit of keeping source compatibility.

@tagatac @sarutak, apologies for making things complicated.

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>