
Spark 4.1: Separate compaction and main operations #15301

Merged
aokolnychyi merged 4 commits into apache:main from aokolnychyi:refactor-rewrites on Feb 17, 2026

Conversation


@aokolnychyi (Contributor) commented on Feb 11, 2026:

This PR pulls all compaction logic out of the main scans and writes, in preparation for making the main scans and writes versioned.

This is a subset of changes from PR #15240.

@github-actions bot added the spark label on Feb 11, 2026
spark()
    .read()
    .format("iceberg")
    .option(SparkReadOptions.SCAN_TASK_SET_ID, groupId)
@aokolnychyi (author) commented:

No longer needed. Just use group ID passed to table.
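
For context, the snippet above is the pattern being retired: callers threaded the task set ID through a read option. A rough sketch of the replacement flow follows; the identifier and resolution mechanics are assumptions for illustration, not this PR's code:

// Hypothetical sketch: the rewrite catalog resolves the staged scan task
// set from the table it hands out, so SCAN_TASK_SET_ID is no longer
// passed as a read option. `rewriteTableIdent` is an assumed placeholder.
Dataset<Row> scanDF =
    spark()
        .read()
        .format("iceberg")
        .load(rewriteTableIdent); // table already carries the group ID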

import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;

abstract class BaseSparkTable
@aokolnychyi (author) commented:

There will be more extending this in future PRs.
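
A hypothetical illustration of the direction (the subclass names below are assumptions, not code from this PR):

// Sketch: shared plumbing lives in the abstract base; concrete table
// flavors extend it in follow-up PRs.
abstract class BaseSparkTable implements Table {
  // common schema(), partitioning(), properties() implementations
}

class SparkTable extends BaseSparkTable {
  // main (eventually versioned) scans and writes
}

class SparkRewriteTable extends BaseSparkTable {
  // compaction scans and writes driven by a scan task set
}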

    SparkReadOptions.SCAN_TASK_SET_ID,
    options.get(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID));
if (groupId != null) {
  selector = REWRITE_SELECTOR;
@aokolnychyi (author) commented:

Rewrite selectors are no longer required.

}

- private int specId(String fileSetId, List<PositionDeletesScanTask> tasks) {
+ private static int specId(String fileSetId, List<PositionDeletesScanTask> tasks) {
@aokolnychyi (author) commented:

Required to avoid checkstyle failures due to name collision (fileSetId).
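
For readers unfamiliar with the failure mode, a minimal sketch of the collision; the class and second method name are illustrative, and the exact behavior depends on the project's checkstyle HiddenField configuration:

class Example {
  private String fileSetId; // instance field

  // Instance method: the parameter shadows the field above and trips
  // checkstyle's HiddenField check.
  private int specId(String fileSetId) {
    return fileSetId.hashCode();
  }

  // Static method: the instance field is not accessible here, so the
  // parameter hides nothing and the check is satisfied.
  private static int specIdStatic(String fileSetId) {
    return fileSetId.hashCode();
  }
}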

@aokolnychyi force-pushed the refactor-rewrites branch 2 times, most recently from ac8db9d to f7901d9, on February 12, 2026 at 00:10
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class SparkRewriteTableCatalog implements TableCatalog, SupportsFunctions {
@aokolnychyi (author) commented:

This supports the bare minimum for compaction.
Nothing fancy like branch selection is required right now.
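
For orientation, a rough skeleton of what a bare-minimum TableCatalog looks like against Spark's connector API. This is an illustrative sketch under assumed names, not the PR's implementation:

import java.util.Map;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCatalog;
import org.apache.spark.sql.connector.catalog.TableChange;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Sketch only: load tables for compaction jobs, reject everything else.
class MinimalRewriteCatalog implements TableCatalog {
  private String catalogName;

  @Override
  public void initialize(String name, CaseInsensitiveStringMap options) {
    this.catalogName = name;
  }

  @Override
  public String name() {
    return catalogName;
  }

  @Override
  public Table loadTable(Identifier ident) {
    // Resolve the underlying Iceberg table for the rewrite job here.
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public Identifier[] listTables(String[] namespace) {
    throw new UnsupportedOperationException("not needed for compaction");
  }

  @Override
  public Table createTable(
      Identifier ident, StructType schema, Transform[] partitions, Map<String, String> props) {
    throw new UnsupportedOperationException("not needed for compaction");
  }

  @Override
  public Table alterTable(Identifier ident, TableChange... changes) {
    throw new UnsupportedOperationException("not needed for compaction");
  }

  @Override
  public boolean dropTable(Identifier ident) {
    throw new UnsupportedOperationException("not needed for compaction");
  }

  @Override
  public void renameTable(Identifier oldIdent, Identifier newIdent) {
    throw new UnsupportedOperationException("not needed for compaction");
  }
}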

@amogh-jahagirdar self-requested a review on February 15, 2026 at 03:10
@amogh-jahagirdar (reviewer) left a comment:

Overall looks good to me, thanks @aokolnychyi. Just had a question on one of the test changes.

.format("iceberg")
.option(SparkReadOptions.SCAN_TASK_SET_ID, fileSetID)
.load(posDeletesTableName);
.option(SparkReadOptions.FILE_OPEN_COST, Integer.MAX_VALUE)
@amogh-jahagirdar (reviewer) commented:

Not entirely following why the file open cost needs to be set explicitly now?

@aokolnychyi (author) replied:

Oops, typo. Good catch.

@aokolnychyi merged commit 9ce0e6e into apache:main on Feb 17, 2026
22 checks passed
