Spark 4.1: Separate compaction and main operations #15301
aokolnychyi merged 4 commits into apache:main
    spark()
        .read()
        .format("iceberg")
        .option(SparkReadOptions.SCAN_TASK_SET_ID, groupId)
No longer needed. Just use the group ID passed to the table.
    import org.apache.spark.sql.connector.expressions.Transform;
    import org.apache.spark.sql.types.StructType;

    abstract class BaseSparkTable
More classes will extend this in future PRs.
    SparkReadOptions.SCAN_TASK_SET_ID,
    options.get(SparkWriteOptions.REWRITTEN_FILE_SCAN_TASK_SET_ID));
    if (groupId != null) {
      selector = REWRITE_SELECTOR;
Rewrite selectors are no longer required.
    }

-   private int specId(String fileSetId, List<PositionDeletesScanTask> tasks) {
+   private static int specId(String fileSetId, List<PositionDeletesScanTask> tasks) {
Required to avoid checkstyle failures due to name collision (fileSetId).
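One plausible reading of this comment, sketched below: the enclosing class has an instance field named `fileSetId`, and a non-static method with a parameter of the same name shadows it. This assumes the failure comes from Checkstyle's HiddenField check, which, as far as I know, does not count instance fields as hidden inside static methods. The class name, the field, and the body of `specId` (validating that all tasks share one spec ID) are hypothetical and not taken from the PR.

```java
import java.util.List;

// Hypothetical class; only the shape of the static helper mirrors the diff.
final class PositionDeletesRewriteSketch {
  // Instance field with the same name as the helper's parameter (assumed).
  private final String fileSetId;

  PositionDeletesRewriteSketch(String id) {
    this.fileSetId = id;
  }

  // Instance entry point that delegates to the static helper.
  int specId(List<Integer> taskSpecIds) {
    return specId(fileSetId, taskSpecIds);
  }

  // Static: the method cannot read the instance field, so the parameter
  // name does not shadow anything reachable from inside the method body.
  static int specId(String fileSetId, List<Integer> taskSpecIds) {
    if (taskSpecIds.isEmpty()) {
      throw new IllegalArgumentException("No tasks in file set " + fileSetId);
    }
    int first = taskSpecIds.get(0);
    for (int id : taskSpecIds) {
      if (id != first) {
        throw new IllegalArgumentException(
            "Mixed spec IDs in file set " + fileSetId + ": " + first + " and " + id);
      }
    }
    return first;
  }
}
```

Renaming the parameter would also remove the collision; making the helper static has the side benefit of documenting that it does not depend on instance state.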
    import org.apache.spark.sql.types.StructType;
    import org.apache.spark.sql.util.CaseInsensitiveStringMap;

    public class SparkRewriteTableCatalog implements TableCatalog, SupportsFunctions {
This supports the bare minimum for compaction.
Nothing fancy like branch selection is required right now.
amogh-jahagirdar left a comment
Overall looks good to me, thanks @aokolnychyi. Just had a question on one of the test changes.
    .format("iceberg")
    .option(SparkReadOptions.SCAN_TASK_SET_ID, fileSetID)
    .load(posDeletesTableName);
    .option(SparkReadOptions.FILE_OPEN_COST, Integer.MAX_VALUE)
Not entirely following: why does the file open cost need to be set explicitly now?
Oops, typo. Good catch.
This PR pulls all compaction logic out of the main scans and writes, in preparation for making the main scans and writes versioned.
It is a subset of the changes from PR #15240.
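To illustrate the separation described here, a minimal plain-Java sketch follows. It does not reproduce the PR's classes (`BaseSparkTable`, `SparkRewriteTableCatalog`); every name in it is hypothetical. It only shows the general idea that a dedicated rewrite table is bound to a group ID when it is created, so the main table no longer has to check a read option such as `SCAN_TASK_SET_ID` to decide which files to plan.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of file groups staged for compaction.
final class StagedTasks {
  private static final Map<String, List<String>> GROUPS = new ConcurrentHashMap<>();

  private StagedTasks() {}

  static void stage(String groupId, List<String> files) {
    GROUPS.put(groupId, List.copyOf(files));
  }

  static List<String> fetch(String groupId) {
    List<String> files = GROUPS.get(groupId);
    if (files == null) {
      throw new IllegalArgumentException("No staged tasks for group " + groupId);
    }
    return files;
  }
}

// Hypothetical common abstraction shared by both table kinds.
interface SketchTable {
  List<String> planFiles();
}

// Main path: plans the table's current files and knows nothing about compaction.
final class MainTable implements SketchTable {
  private final List<String> currentFiles;

  MainTable(List<String> currentFiles) {
    this.currentFiles = List.copyOf(currentFiles);
  }

  @Override
  public List<String> planFiles() {
    return currentFiles;
  }
}

// Compaction path: the group ID is passed to the table itself,
// so no per-read option or selector is needed.
final class RewriteTable implements SketchTable {
  private final String groupId;

  RewriteTable(String groupId) {
    this.groupId = groupId;
  }

  @Override
  public List<String> planFiles() {
    return StagedTasks.fetch(groupId);
  }
}
```

With this split, the main path can evolve (for example, to become versioned) without carrying compaction-specific branches.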