ClickHouse Performance Optimizations by Tencent #412

Open · wants to merge 2 commits into main

Conversation

@amosbird commented Jun 23, 2025

This submission builds on top of the latest ClickHouse with a series of performance optimizations, developed with support from Tencent. Each optimization has been carefully validated and is intended to be contributed upstream incrementally through individual PRs—some of which have already been merged.

Benchmark results were generated using artifacts built by the official CI pipeline of #81944, with great help from @nickitat — thank you!

The following optimizations are included:

1. Push TopN threshold to MergeTreeSource

Pushes the TopN threshold into MergeTreeSource to enable early filtering during the read phase. By passing the (N-1)th threshold value from the TopN state down to the source, rows that cannot outrank the current threshold can be skipped earlier, reducing I/O and improving performance.
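
To make the skip condition concrete, here is a minimal standalone sketch (assumed shapes only, not the actual MergeTreeSource/TopN code) of how a full TopN heap yields a threshold that lets the reader drop rows before any further processing:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <queue>
#include <vector>

// Sketch: a TopN consumer for `ORDER BY x ASC LIMIT n` keeps the n smallest
// values in a max-heap. Once the heap is full, its top is the current
// threshold: a row with x >= threshold can never enter the result, so a
// reader that knows the threshold can skip it right after decoding x.
struct TopN {
    size_t n;
    std::priority_queue<int64_t> heap;  // max-heap holding the n best rows

    bool full() const { return heap.size() == n; }
    int64_t threshold() const { return heap.top(); }  // valid only when full()

    void add(int64_t x) {
        if (!full()) { heap.push(x); return; }
        if (x < threshold()) { heap.pop(); heap.push(x); }
    }
};

int main() {
    TopN topn{3, {}};
    size_t skipped = 0;
    for (int64_t x : std::vector<int64_t>{9, 1, 8, 2, 7, 3, 6, 4, 5}) {
        if (topn.full() && x >= topn.threshold()) { ++skipped; continue; }
        topn.add(x);
    }
    std::cout << "skipped " << skipped << " rows early\n";
}
```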

2. Precompute hashes and prefetch for prealloc variants (extending the earlier prealloc optimization)

For ColumnsHashing implementations that support the prealloc strategy (a minimal sketch follows the list):

  • Key hashes are precomputed before any potential serialization.
  • These hashes are used to prefetch hash table cells more efficiently.
  • Serialization is skipped when probing by hash fails (e.g., for group-by overflow rows).
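
A minimal standalone sketch of the hash-then-prefetch pattern (illustrative assumptions only, not the ColumnsHashing code; __builtin_prefetch is a GCC/Clang builtin):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t kPrefetchDistance = 8;

    std::vector<uint64_t> keys(100000);
    for (size_t i = 0; i < keys.size(); ++i)
        keys[i] = (i + 1) * 2654435761u;  // synthetic, all-distinct keys

    const size_t mask = (1 << 17) - 1;         // power-of-two table size
    std::vector<uint64_t> table(mask + 1, 0);  // open addressing; 0 == empty

    // Pass 1: precompute all key hashes before any potential serialization.
    std::vector<uint64_t> hashes(keys.size());
    for (size_t i = 0; i < keys.size(); ++i)
        hashes[i] = std::hash<uint64_t>{}(keys[i]);

    // Pass 2: probe; the precomputed hash lets us prefetch the cell that a
    // later iteration will touch, hiding the cache-miss latency.
    size_t inserted = 0;
    for (size_t i = 0; i < keys.size(); ++i) {
        if (i + kPrefetchDistance < keys.size())
            __builtin_prefetch(&table[hashes[i + kPrefetchDistance] & mask]);
        size_t pos = hashes[i] & mask;
        while (table[pos] != 0 && table[pos] != keys[i])
            pos = (pos + 1) & mask;  // linear probing
        if (table[pos] == 0) { table[pos] = keys[i]; ++inserted; }
    }
    std::cout << "inserted " << inserted << " keys\n";
}
```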

Also introduced the optimize_trivial_group_by_limit_query setting, which applies max_rows_to_group_by for trivial GROUP BY LIMIT queries to avoid unnecessary aggregation work.
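
A minimal sketch of the assumed semantics behind that setting: for a trivial `SELECT k FROM t GROUP BY k LIMIT n` (no ORDER BY, no HAVING), any n groups are an acceptable answer, so the aggregation may stop admitting new keys once n distinct keys have been collected, which is what capping with max_rows_to_group_by achieves:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
    const size_t limit = 3;  // the LIMIT of the trivial GROUP BY query
    std::unordered_set<int64_t> groups;
    for (int64_t k : std::vector<int64_t>{5, 1, 5, 2, 9, 2, 7, 1}) {
        if (groups.size() >= limit && !groups.count(k))
            continue;  // group set is full: unseen keys need no work at all
        groups.insert(k);
    }
    for (int64_t k : groups)
        std::cout << k << "\n";  // any `limit` distinct keys are a valid result
}
```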

3. Extend string hash map with inlined hash

The string hash map is optimized by packing the string length and its hash into a single 8-byte value. Since string lengths almost always fit in 4 bytes and a CRC32 hash is exactly 4 bytes, combining them (sketched after the list):

  • Produces a more compact cell representation.
  • Enables faster string comparison through a single comparison of combined length and hash.
  • Improves performance by reducing memory footprint and branching.
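
A minimal standalone sketch of the packed cell comparison (assumed layout, not the actual StringHashMap code; an FNV-style loop stands in for CRC32):

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string_view>

// 32-bit stand-in hash; the real implementation uses CRC32.
static uint32_t hash32(std::string_view s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

// Pack length (high 32 bits) and hash (low 32 bits) into one 8-byte word.
static uint64_t packLenHash(std::string_view s) {
    return (uint64_t{uint32_t(s.size())} << 32) | hash32(s);
}

// One 64-bit compare rejects both length and hash mismatches in a single
// branch; the byte-wise memcmp only runs when both already agree.
static bool equalKeys(std::string_view a, uint64_t a_packed,
                      std::string_view b, uint64_t b_packed) {
    if (a_packed != b_packed)
        return false;
    return std::memcmp(a.data(), b.data(), a.size()) == 0;
}

int main() {
    std::string_view x = "clickhouse", y = "clickbench";
    std::cout << equalKeys(x, packLenHash(x), y, packLenHash(y)) << "\n";  // 0
    std::cout << equalKeys(x, packLenHash(x), x, packLenHash(x)) << "\n";  // 1
}
```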

4. Optimize index analysis with earlier QCC filtering (#82380)

Refactored the integration of Query Condition Cache (QCC) with index analysis:

  • QCC filtering is now applied before primary key and skip index evaluation, reducing redundant index computations.
  • Index analysis now supports multiple range filters and caches the filtering results back into QCC.

This notably accelerates short queries when index analysis is the dominant cost.
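
A minimal sketch of the reordering (the range and cache shapes here are assumptions, not the actual QCC interface): the cheap cache lookup prunes mark ranges first, the expensive index analysis only sees the survivors, and its verdicts flow back into the cache:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct MarkRange { size_t begin, end; };

int main() {
    std::vector<MarkRange> ranges{{0, 8}, {8, 16}, {16, 24}, {24, 32}};
    std::vector<bool> cached_nonempty{true, false, true, true};  // QCC entries

    // Step 1: drop ranges the cache already proved empty (cheap).
    std::vector<MarkRange> survivors;
    for (size_t i = 0; i < ranges.size(); ++i)
        if (cached_nonempty[i])
            survivors.push_back(ranges[i]);

    // Step 2: primary key / skip index analysis runs only on the survivors
    // (simulated here), and its results are written back into the cache.
    std::vector<MarkRange> result;
    for (const auto & r : survivors) {
        bool may_match = r.begin != 16;            // stand-in for PK analysis
        cached_nonempty[r.begin / 8] = may_match;  // feed the verdict back
        if (may_match)
            result.push_back(r);
    }
    std::cout << "ranges left to read: " << result.size() << "\n";  // 2
}
```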

5. Optimize single COUNT() aggregation on NOT NULL columns (#82104)

When an aggregation query only includes a single COUNT() on a NOT NULL column:

  • The aggregation logic is fully inlined during hash table probing.
  • No aggregation state needs to be allocated or maintained.

This reduces memory usage and CPU overhead, significantly speeding up the aggregation.
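
A minimal sketch of the fused form (assumed shape, not the actual Aggregator code): with a single count() over a NOT NULL column, the per-group state degenerates into a counter that can live directly in the hash table cell:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // Rows of the grouping key; the counted column is NOT NULL, so
    // count(c) per group is simply the number of rows in that group.
    std::vector<int64_t> keys{1, 2, 1, 3, 2, 1};

    std::unordered_map<int64_t, uint64_t> counts;  // cell value *is* the state
    for (int64_t k : keys)
        ++counts[k];  // hash probe and aggregation fused into one step

    for (const auto & [k, c] : counts)
        std::cout << "key " << k << " -> count " << c << "\n";
}
```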

6. Rewrite regular expression functions into simplified forms (#81992)

Primarily targets Q28. Introduced the optimize_rewrite_regexp_functions setting (enabled by default), allowing the optimizer to rewrite certain calls to replaceRegexpAll, replaceRegexpOne, and extract into simpler and faster forms when specific patterns are detected.
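
A minimal sketch of the rewrite idea; the two rules below are illustrative assumptions only (the actual rule set behind optimize_rewrite_regexp_functions is richer and operates on the query plan, not on strings):

```cpp
#include <iostream>
#include <string>

// Map a regexp function call to a cheaper equivalent when the pattern has a
// recognizable trivial structure; otherwise keep the original call.
std::string rewrite(const std::string & func, const std::string & pattern) {
    if (func == "extract" && pattern == "^(.*)$")
        return "identity";   // extracting the whole string changes nothing
    if (func == "replaceRegexpAll" && pattern == "^\\s+|\\s+$")
        return "trimBoth";   // anchored whitespace removal is just a trim
    return func;
}

int main() {
    std::cout << rewrite("extract", "^(.*)$") << "\n";                // identity
    std::cout << rewrite("replaceRegexpAll", "^\\s+|\\s+$") << "\n";  // trimBoth
}
```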

Additionally:

  • Enabled count_distinct_optimization by default with several related edge cases fixed.

All of these optimizations have been tested and validated via the ClickHouse CI pipeline. Although benchmarked on ClickBench, they were made possible by the extensive support and the real-world production environment provided by Tencent (TCHouse-C). I'm continuously working on additional improvements and will keep contributing until ClickHouse achieves top-tier performance on ClickBench once more :)

@nickitat self-assigned this Jun 23, 2025
@rschu1ze (Member)

This is excellent, thanks!

This PR against the ClickBench repository is similar in spirit to @kitaisreal's Ursa (i.e. a research fork of ClickHouse). If all PRs are being integrated into the main codebase anyway, perhaps we don't need this PR (or we can keep it open and continuously update it for the time being)?

@amosbird (Author)

Thanks for the feedback!

I'd actually prefer to have this PR merged into the ClickBench repository for a few reasons:

  • A continuously updated tuned baseline: This fork plays a similar role to clickhouse-tuned — a place where performance-oriented patches can be evaluated holistically. It gives us a stable, visible comparison point against upstream ClickHouse, even before all PRs are merged. This helps track net gains across batches of changes.

  • Surfacing trade-offs and non-merged work: Not all optimizations may land upstream immediately; some might be blocked due to generality, compatibility, or maintenance concerns. Keeping this variant in the ClickBench repo allows us to observe and quantify the impact of those changes, even if they're not ultimately accepted upstream. It serves as a real-world benchmark of possible trade-offs.

  • Clear visibility of contribution impact: From my perspective as a contributor, having this fork merged into ClickBench makes the performance gains more tangible and attributable. That is highly motivating and helps justify the continued investment.

> This PR against the ClickBench repository is similar in spirit to @kitaisreal's Ursa (i.e. a research fork of ClickHouse).

Interesting — the string layout modification mentioned there is also implemented in ByConity (as BigString). I’ve encountered a similar need when working on the projection index feature (row-level index), where faster row seeking on string columns is critical. I’ll look into whether we can achieve this in a backward-compatible way.

@alexey-milovidov (Member)

As long as the results are reproducible, let's merge.

@alexey-milovidov (Member)

> I'll look into whether we can achieve this in a backward-compatible way.

Yes, it is entirely possible.

  • name the new String data type String_v2, rename the old one to String_v1;
  • introduce a table-level setting to interpret the String name as either String_v1 or String_v2;
  • creation of a table with String will rewrite it to either String_v1 or String_v2;
  • loading and ATTACHing a table with String will interpret it as String_v1;
  • the native protocol will serialize String_v1 and String_v2 as the old String, with conversion.

@alexey-milovidov (Member)

> which applies max_rows_to_group_by for trivial GROUP BY LIMIT queries to avoid unnecessary aggregation work

Thanks, I've wanted this for a long time!

@amosbird (Author)

> Yes, it is entirely possible.

Hmm, I was actually thinking of a different strategy: keep using the same type, but recognize the underlying streams, and if there is a separate size stream, apply the new serde logic accordingly. This behavior would only apply to MergeTree's wide format, which I believe should be sufficient.

@alexey-milovidov (Member)

Maybe we can try, although having to look up an additional file seems hacky.
Another solution is to approach it as a different column representation, like ColumnSparse (how do we store that a certain column is in the Sparse format?).

@amosbird (Author)

> Maybe we can try, although having to look up an additional file seems hacky.
> Another solution is to approach it as a different column representation, like ColumnSparse (how do we store that a certain column is in the Sparse format?).

Sure, a different serde in serialization.json is definitely better than looking for files.

@amosbird (Author) commented Jun 25, 2025

> As long as the results are reproducible, let's merge.

I've just merged an additional optimization from my team that addresses the Q23 issue. With this fix, the results should now be fully reproducible without any manual post-processing.

I've updated benchmark.sh to use the following binary:

https://clickhouse-builds.s3.amazonaws.com/PRs/81944/cda07f8aca770d97ea149eec6b477dcfd59d134e/build_amd_release/clickhouse-common-static-25.7.1.1-amd64.tgz

@rschu1ze @nickitat Could you help re-run the benchmarks and update the results on both c6a.metal and c6a.4xlarge, as those were mentioned in the Firebolt PR as the most commonly used environments?

Thanks a lot!
