
@lifulong

…proximate quantile computation, significantly improving merge performance

What changes were proposed in this pull request?

Use the DataSketches quantiles sketch in place of Spark's default GK algorithm to speed up ApproximatePercentile.
https://datasketches.apache.org/
https://github.com/apache/datasketches-java
Spark already uses DataSketches elsewhere, so why not use it for approximate quantiles as well?

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-47836
https://issues.apache.org/jira/browse/SPARK-46706
https://issues.apache.org/jira/browse/SPARK-40499
Multiple issues have reported an ApproximatePercentile performance problem in Spark 3.x, which was introduced by this bug fix: https://issues.apache.org/jira/browse/SPARK-29336
The problem arises because the GK algorithm is not designed for distributed systems: its merge performance is poor, and higher parallelism in the upstream stage makes it worse.
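
To illustrate what a sketch designed for merging looks like, here is a minimal standalone example that builds one quantiles sketch per partition and combines them with a union. It is not the code in this PR. It assumes datasketches-java 3.x or later on the classpath (where DoublesUnion exposes union(sketch)), and the object name, K = 128, and the four synthetic partitions are illustrative choices only.

```scala
import org.apache.datasketches.quantiles.{DoublesSketch, DoublesUnion, UpdateDoublesSketch}

object MergeSketchDemo {
  // Sketch size parameter; 128 is an illustrative choice, not the PR's setting.
  val K = 128

  // Build one sketch from the values of a single partition.
  def sketchOf(values: Iterable[Double]): UpdateDoublesSketch = {
    val sketch = DoublesSketch.builder().setK(K).build()
    values.foreach(v => sketch.update(v))
    sketch
  }

  // Merge the per-partition sketches into one result sketch.
  def merge(partials: Seq[DoublesSketch]): DoublesSketch = {
    val union = DoublesUnion.builder().setMaxK(K).build()
    partials.foreach(s => union.union(s))
    union.getResult
  }

  def main(args: Array[String]): Unit = {
    // Four synthetic partitions covering 0.0 until 4000.0 (hypothetical data).
    val partitions = (0 until 4).map(p => (p * 1000 until (p + 1) * 1000).map(_.toDouble))
    val merged = merge(partitions.map(sketchOf))
    println(s"n=${merged.getN}, approximate median=${merged.getQuantile(0.5)}")
  }
}
```

The merged sketch answers quantile queries for the whole data set without revisiting the individual partitions.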

Taking one of our production Spark jobs as an example: it reads 60 billion records, samples them at a ratio of 0.06, groups by a key with 4 distinct values, and computes the 1st through 100th percentiles with accuracy 999 for 40 columns. The job runs with spark.sql.shuffle.partitions=2000; each executor has 28 GB of memory and 6 cores.
With spark-2.4.3, the final merge stage takes 5 min.
With spark-3.5.2, the final merge stage takes 2.8 h.
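
For a sense of scale, the arithmetic behind this workload can be written out as below. Spark documents 1.0/accuracy as the relative error of percentile_approx. The even split of rows across the 4 keys is an assumption made only for this estimate, and the object and value names are illustrative.

```scala
object WorkloadEstimate {
  val sourceRows: Long = 60000000000L            // 60 billion input records
  val sampledRows: Long = sourceRows * 6 / 100   // sample ratio 0.06
  val rowsPerKey: Long = sampledRows / 4         // assumes the 4 keys are evenly sized

  val accuracy: Int = 999
  val relativeError: Double = 1.0 / accuracy     // relative error implied by the accuracy argument
  val rankTolerance: Double = rowsPerKey * relativeError  // allowed rank deviation per key

  def main(args: Array[String]): Unit = {
    println(s"rows after sampling: $sampledRows")
    println(s"rows per key:        $rowsPerKey")
    println(f"rank tolerance:      $rankTolerance%.1f rows")
  }
}
```

Under these assumptions each key holds about 900 million rows, and an answer may be off by roughly 900 thousand ranks while still meeting accuracy 999.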

After lowering spark.sql.shuffle.partitions to 500:
With spark-3.5.2, the final merge stage takes 11 min, but because the data is large, the upstream stage takes much longer and more data spills to disk.

With the DataSketches quantiles sketch:
With spark-3.5.2, the final merge stage takes less than 1 min with spark.sql.shuffle.partitions=2000.

Does this PR introduce any user-facing change?

No

How was this patch tested?

val values = (1 to 100).toArray
val percents = (1 to 100).toArray
val all_quantiles = percents.indices.map(i => (i+1).toDouble / percents.length).toArray
val all_quantiles_str = s"ARRAY(${all_quantiles.toList.mkString(",")})"
for (n <- 0 until 5) {
  val df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  df.createOrReplaceTempView("data_table")
  val sql = s"select PERCENTILE_APPROX(cast(value as DOUBLE), $all_quantiles_str, 90) as values from data_table"
  val all_answers = spark.sql(sql).collect
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}
With the test code above, max_error is always 1, which is better than expected.

val values = (1 to 10000).toArray
val percents = (1 to 100).toArray
val all_quantiles = percents.indices.map(i => (i+1).toDouble / percents.length).toArray
val all_quantiles_str = s"ARRAY(${all_quantiles.toList.mkString(",")})"
for (n <- 0 until 5) {
  val df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  df.createOrReplaceTempView("data_table")
  val sql = s"select PERCENTILE_APPROX(cast(value as DOUBLE), $all_quantiles_str, 9999) as values from data_table"
  val all_answers = spark.sql(sql).collect
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected*100 - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}

With the test code above, max_error is always 1, which is as expected.
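
One thing to watch in checks like the two above: collect returns an Array[Row] holding a single row whose first column is the whole quantile array, so the answers need to be unpacked before they are compared with ranks. A Spark-independent helper for the rank comparison might look like the sketch below. The object and method names are hypothetical, and it assumes distinct, ascending input values and quantiles q for which q * n is a whole number up to floating-point noise, as in the grids used here.

```scala
object RankErrorCheck {
  /** Largest distance, in 1-based ranks, between each answer and the rank its quantile targets. */
  def maxRankError(sorted: Array[Double], quantiles: Seq[Double], answers: Seq[Double]): Int = {
    require(quantiles.length == answers.length, "need one answer per quantile")
    quantiles.zip(answers).map { case (q, ans) =>
      // Target rank of quantile q; round absorbs noise such as 0.07 * 100 = 7.000000000000001.
      val expectedRank = math.max(1, math.round(q * sorted.length).toInt)
      // Rank of the answer; if it is absent, use the number of values below it.
      val pos = java.util.Arrays.binarySearch(sorted, ans)
      val answeredRank = if (pos >= 0) pos + 1 else -(pos + 1)
      math.abs(expectedRank - answeredRank)
    }.max
  }
}

// Possible use from a Spark shell, reusing `sql`, `values` and `all_quantiles` from above:
//   val answers = spark.sql(sql).head().getSeq[Double](0)
//   val maxError = RankErrorCheck.maxRankError(values.map(_.toDouble), all_quantiles.toSeq, answers)
```

Comparing 1-based ranks on both sides avoids mixing zero-based array indices with quantile positions.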

Also tested against a production job to check performance.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label Oct 23, 2025
@lifulong force-pushed the quantiles_use_doubles_sketch_speedup branch 5 times, most recently from f3e6063 to 799ce36, October 24, 2025 05:02
@lifulong force-pushed the quantiles_use_doubles_sketch_speedup branch from 799ce36 to 1c08239, October 24, 2025 10:11