ClickHouse Performance Optimizations by Tencent #412

Open · wants to merge 2 commits into main

Conversation

@amosbird commented Jun 23, 2025

This submission builds on top of the latest ClickHouse with a series of performance optimizations, developed with support from Tencent. Each optimization has been carefully validated and is intended to be contributed upstream incrementally through individual PRs—some of which have already been merged.

Benchmark results were generated using artifacts built by the official CI pipeline of #81944, with great help from @nickitat — thank you!

The following optimizations are included:

1. Push TopN threshold to MergeTreeSource

Pushes the TopN threshold into MergeTreeSource to enable early filtering during the read phase. By passing the (N-1)th threshold value from the TopN state down to the source, rows that cannot outrank the current threshold can be skipped earlier, reducing I/O and improving performance.
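
To make the skip condition concrete, here is a minimal standalone sketch (assumed shapes only, not the actual MergeTreeSource/TopN code) of how a full TopN heap yields a threshold that lets the reader drop rows before any further processing:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <queue>
#include <vector>

// Sketch: a TopN consumer for `ORDER BY x ASC LIMIT n` keeps the n smallest
// values in a max-heap. Once the heap is full, its top is the current
// threshold: a row with x >= threshold can never enter the result, so a
// reader that knows the threshold can skip it right after decoding x.
struct TopN {
    size_t n;
    std::priority_queue<int64_t> heap;  // max-heap holding the n best rows

    bool full() const { return heap.size() == n; }
    int64_t threshold() const { return heap.top(); }  // valid only when full()

    void add(int64_t x) {
        if (!full()) { heap.push(x); return; }
        if (x < threshold()) { heap.pop(); heap.push(x); }
    }
};

int main() {
    TopN topn{3, {}};
    size_t skipped = 0;
    for (int64_t x : std::vector<int64_t>{9, 1, 8, 2, 7, 3, 6, 4, 5}) {
        if (topn.full() && x >= topn.threshold()) { ++skipped; continue; }
        topn.add(x);
    }
    std::cout << "skipped " << skipped << " rows early\n";
}
```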

2. Precompute hashes and prefetch for prealloc variants (extending the earlier prealloc optimization)

For ColumnsHashing implementations that support the prealloc strategy (a minimal sketch follows the list):

  • Key hashes are precomputed before any potential serialization.
  • These hashes are used to prefetch hash table cells more efficiently.
  • Serialization is skipped when probing by hash fails (e.g., for group-by overflow rows).
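
A minimal standalone sketch of the hash-then-prefetch pattern (illustrative assumptions only, not the ColumnsHashing code; __builtin_prefetch is a GCC/Clang builtin):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t kPrefetchDistance = 8;

    std::vector<uint64_t> keys(100000);
    for (size_t i = 0; i < keys.size(); ++i)
        keys[i] = (i + 1) * 2654435761u;  // synthetic, all-distinct keys

    const size_t mask = (1 << 17) - 1;         // power-of-two table size
    std::vector<uint64_t> table(mask + 1, 0);  // open addressing; 0 == empty

    // Pass 1: precompute all key hashes before any potential serialization.
    std::vector<uint64_t> hashes(keys.size());
    for (size_t i = 0; i < keys.size(); ++i)
        hashes[i] = std::hash<uint64_t>{}(keys[i]);

    // Pass 2: probe; the precomputed hash lets us prefetch the cell that a
    // later iteration will touch, hiding the cache-miss latency.
    size_t inserted = 0;
    for (size_t i = 0; i < keys.size(); ++i) {
        if (i + kPrefetchDistance < keys.size())
            __builtin_prefetch(&table[hashes[i + kPrefetchDistance] & mask]);
        size_t pos = hashes[i] & mask;
        while (table[pos] != 0 && table[pos] != keys[i])
            pos = (pos + 1) & mask;  // linear probing
        if (table[pos] == 0) { table[pos] = keys[i]; ++inserted; }
    }
    std::cout << "inserted " << inserted << " keys\n";
}
```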

Also introduced the optimize_trivial_group_by_limit_query setting, which applies max_rows_to_group_by for trivial GROUP BY LIMIT queries to avoid unnecessary aggregation work.
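
A minimal sketch of the assumed semantics behind that setting: for a trivial `SELECT k FROM t GROUP BY k LIMIT n` (no ORDER BY, no HAVING), any n groups are an acceptable answer, so the aggregation may stop admitting new keys once n distinct keys have been collected, which is what capping with max_rows_to_group_by achieves:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
    const size_t limit = 3;  // the LIMIT of the trivial GROUP BY query
    std::unordered_set<int64_t> groups;
    for (int64_t k : std::vector<int64_t>{5, 1, 5, 2, 9, 2, 7, 1}) {
        if (groups.size() >= limit && !groups.count(k))
            continue;  // group set is full: unseen keys need no work at all
        groups.insert(k);
    }
    for (int64_t k : groups)
        std::cout << k << "\n";  // any `limit` distinct keys are a valid result
}
```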

3. Extend string hash map with inlined hash

The string hash map is optimized by packing the string length and its hash into a single 8-byte value. Since string lengths almost always fit in 4 bytes and a CRC32 hash is exactly 4 bytes, combining them (sketched after the list):

  • Produces a more compact cell representation.
  • Enables faster string comparison through a single comparison of combined length and hash.
  • Improves performance by reducing memory footprint and branching.
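
A minimal standalone sketch of the packed cell comparison (assumed layout, not the actual StringHashMap code; an FNV-style loop stands in for CRC32):

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string_view>

// 32-bit stand-in hash; the real implementation uses CRC32.
static uint32_t hash32(std::string_view s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

// Pack length (high 32 bits) and hash (low 32 bits) into one 8-byte word.
static uint64_t packLenHash(std::string_view s) {
    return (uint64_t{uint32_t(s.size())} << 32) | hash32(s);
}

// One 64-bit compare rejects both length and hash mismatches in a single
// branch; the byte-wise memcmp only runs when both already agree.
static bool equalKeys(std::string_view a, uint64_t a_packed,
                      std::string_view b, uint64_t b_packed) {
    if (a_packed != b_packed)
        return false;
    return std::memcmp(a.data(), b.data(), a.size()) == 0;
}

int main() {
    std::string_view x = "clickhouse", y = "clickbench";
    std::cout << equalKeys(x, packLenHash(x), y, packLenHash(y)) << "\n";  // 0
    std::cout << equalKeys(x, packLenHash(x), x, packLenHash(x)) << "\n";  // 1
}
```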

4. Optimize index analysis with earlier QCC filtering (#82380)

Refactored the integration of Query Condition Cache (QCC) with index analysis:

  • QCC filtering is now applied before primary key and skip index evaluation, reducing redundant index computations.
  • Index analysis now supports multiple range filters and caches the filtering results back into QCC.

This notably accelerates short queries when index analysis is the dominant cost.
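
A minimal sketch of the reordering (the range and cache shapes here are assumptions, not the actual QCC interface): the cheap cache lookup prunes mark ranges first, the expensive index analysis only sees the survivors, and its verdicts flow back into the cache:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct MarkRange { size_t begin, end; };

int main() {
    std::vector<MarkRange> ranges{{0, 8}, {8, 16}, {16, 24}, {24, 32}};
    std::vector<bool> cached_nonempty{true, false, true, true};  // QCC entries

    // Step 1: drop ranges the cache already proved empty (cheap).
    std::vector<MarkRange> survivors;
    for (size_t i = 0; i < ranges.size(); ++i)
        if (cached_nonempty[i])
            survivors.push_back(ranges[i]);

    // Step 2: primary key / skip index analysis runs only on the survivors
    // (simulated here), and its results are written back into the cache.
    std::vector<MarkRange> result;
    for (const auto & r : survivors) {
        bool may_match = r.begin != 16;            // stand-in for PK analysis
        cached_nonempty[r.begin / 8] = may_match;  // feed the verdict back
        if (may_match)
            result.push_back(r);
    }
    std::cout << "ranges left to read: " << result.size() << "\n";  // 2
}
```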

5. Optimize single COUNT() aggregation on NOT NULL columns (#82104)

When an aggregation query only includes a single COUNT() on a NOT NULL column:

  • The aggregation logic is fully inlined during hash table probing.
  • No aggregation state needs to be allocated or maintained.

This reduces memory usage and CPU overhead, significantly speeding up the aggregation.
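
A minimal sketch of the fused form (assumed shape, not the actual Aggregator code): with a single count() over a NOT NULL column, the per-group state degenerates into a counter that can live directly in the hash table cell:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // Rows of the grouping key; the counted column is NOT NULL, so
    // count(c) per group is simply the number of rows in that group.
    std::vector<int64_t> keys{1, 2, 1, 3, 2, 1};

    std::unordered_map<int64_t, uint64_t> counts;  // cell value *is* the state
    for (int64_t k : keys)
        ++counts[k];  // hash probe and aggregation fused into one step

    for (const auto & [k, c] : counts)
        std::cout << "key " << k << " -> count " << c << "\n";
}
```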

6. Rewrite regular expression functions into simplified forms (#81992)

Primarily targets Q28. Introduced the optimize_rewrite_regexp_functions setting (enabled by default), allowing the optimizer to rewrite certain calls to replaceRegexpAll, replaceRegexpOne, and extract into simpler and faster forms when specific patterns are detected.
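
A minimal sketch of the rewrite idea; the two rules below are illustrative assumptions only (the actual rule set behind optimize_rewrite_regexp_functions is richer and operates on the query plan, not on strings):

```cpp
#include <iostream>
#include <string>

// Map a regexp function call to a cheaper equivalent when the pattern has a
// recognizable trivial structure; otherwise keep the original call.
std::string rewrite(const std::string & func, const std::string & pattern) {
    if (func == "extract" && pattern == "^(.*)$")
        return "identity";   // extracting the whole string changes nothing
    if (func == "replaceRegexpAll" && pattern == "^\\s+|\\s+$")
        return "trimBoth";   // anchored whitespace removal is just a trim
    return func;
}

int main() {
    std::cout << rewrite("extract", "^(.*)$") << "\n";                // identity
    std::cout << rewrite("replaceRegexpAll", "^\\s+|\\s+$") << "\n";  // trimBoth
}
```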

Additionally:

  • Enabled count_distinct_optimization by default with several related edge cases fixed.

All of these optimizations have been tested and validated via the ClickHouse CI pipeline. Although benchmarked on ClickBench, they were made possible by the extensive support and the real-world production environment provided by Tencent (TCHouse-C). I'm continuously working on additional improvements and will keep contributing until ClickHouse achieves top-tier performance on ClickBench once more :)

@nickitat self-assigned this Jun 23, 2025
@rschu1ze (Member)

This is excellent, thanks!

This PR against the ClickBench repository is similar in spirit to @kitaisreal's Ursa (i.e. a research fork of ClickHouse). If all PRs are being integrated into the main codebase anyway, perhaps we don't need this PR (or we can keep it open and continuously update it for the time being)?

@amosbird (Author)

Thanks for the feedback!

I'd actually prefer to have this PR merged into the ClickBench repository for a few reasons:

  • A continuously updated tuned baseline: This fork plays a similar role to clickhouse-tuned — a place where performance-oriented patches can be evaluated holistically. It gives us a stable, visible comparison point against upstream ClickHouse, even before all PRs are merged. This helps track net gains across batches of changes.

  • Surfacing trade-offs and non-merged work: Not all optimizations may land upstream immediately; some might be blocked due to generality, compatibility, or maintenance concerns. Keeping this variant in the ClickBench repo allows us to observe and quantify the impact of those changes, even if they're not ultimately accepted upstream. It serves as a real-world benchmark of possible trade-offs.

  • Clear visibility of contribution impact: From my perspective as a contributor, having this fork merged into ClickBench makes the performance gains more tangible and attributable. That is highly motivating and helps justify the continued investment.

> This PR against the ClickBench repository is similar in spirit to @kitaisreal's Ursa (i.e. a research fork of ClickHouse).

Interesting — the string layout modification mentioned there is also implemented in ByConity (as BigString). I’ve encountered a similar need when working on the projection index feature (row-level index), where faster row seeking on string columns is critical. I’ll look into whether we can achieve this in a backward-compatible way.

@alexey-milovidov (Member)

As long as the results are reproducible, let's merge.

@alexey-milovidov (Member)

> I'll look into whether we can achieve this in a backward-compatible way.

Yes, it is entirely possible.

  • name the new String data type String_v2, rename the old one to String_v1;
  • introduce a table-level setting to interpret the String name as either String_v1 or String_v2;
  • creation of a table with String will rewrite it to either String_v1 or String_v2;
  • loading and ATTACHing a table with String will interpret it as String_v1;
  • the native protocol will serialize String_v1 and String_v2 as the old String, with conversion.

@alexey-milovidov (Member)

> which applies max_rows_to_group_by for trivial GROUP BY LIMIT queries to avoid unnecessary aggregation work

Thanks, I've wanted this for a long time!

@amosbird (Author)

> Yes, it is entirely possible.

Hmm, I was actually thinking of a different strategy: keep using the same type, but recognize the underlying streams, and if there is a separate size stream, apply the new serde logic accordingly. This behavior would only apply to MergeTree's wide format, which I believe should be sufficient.

@alexey-milovidov (Member)

Maybe we can try, although having to look up an additional file seems hacky.
Another solution is to approach it as a different column representation, like ColumnSparse (how do we store that a certain column is in the Sparse format?).

@amosbird (Author)

> Maybe we can try, although having to look up an additional file seems hacky.
> Another solution is to approach it as a different column representation, like ColumnSparse (how do we store that a certain column is in the Sparse format?).

Sure, a different serde in serialization.json is definitely better than looking for files.

@amosbird (Author) commented Jun 25, 2025

> As long as the results are reproducible, let's merge.

I've just merged an additional optimization from my team that addresses the Q23 issue. With this fix, the results should now be fully reproducible without any manual post-processing.

I've updated benchmark.sh to use the following binary:

https://clickhouse-builds.s3.amazonaws.com/PRs/81944/cda07f8aca770d97ea149eec6b477dcfd59d134e/build_amd_release/clickhouse-common-static-25.7.1.1-amd64.tgz

@rschu1ze @nickitat Could you help re-run the benchmarks and update the results on both c6a.metal and c6a.4xlarge, as those were mentioned in the Firebolt PR as the most commonly used environments?

Thanks a lot!
