
Conversation

@wudidapaopao commented Sep 10, 2025

This PR adds the chdb solution.

I've updated time.csv and logs.csv with test results for the 0.5 GB and 5 GB datasets, run on my local laptop. These results were not generated on a standard benchmark machine; they are only intended to demonstrate the final output of the test script.

Please feel free to contact me if there are any issues or missing information.

cyrusmsk and others added 14 commits July 27, 2025 18:05
- The chdb implementation is based on the polars Python code.
- Current approach uses connection + cursor.
- Draft of the join code is finished.
- Grouping queries were prepared; also added flush=True for all printing operations with LIMIT 3.
- Grouping code is now working.
- Initial working version of chdb for join.
- Added proper logic for on_disk and na_flag identification.
- Fix for local run.
- Added settings and fixed MergeTree for the session branch.
- Added latest query and max_threads.
- Added settings for threads.
- On an M1 Pro chip, 8 threads show better performance.
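For context, the thread-related commits above amount to appending a SETTINGS clause to each query. A minimal sketch of the idea, assuming a chdb session like the `conn` used in the snippets below (the thread count comes from the commit note; the query shape is borrowed from the grouping code further down, and the setup lines are illustrative, not the PR's actual script):

import chdb.session

# Illustrative setup; the PR's script builds its own connection and loads
# the benchmark table db_benchmark.x beforehand.
conn = chdb.session.Session()
threads = 8  # per the commit note, 8 threads performed best on an M1 Pro
settings = f"SETTINGS max_threads = {threads}"
conn.query(f"SELECT id1, sum(v1) AS v1 FROM db_benchmark.x GROUP BY id1 {settings}")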
@szarnyasg requested a review from @Tmonster September 10, 2025 06:16
@Tmonster (Collaborator) left a comment

Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

conn.query("DROP TABLE IF EXISTS ans")
gc.collect()
if compress:
time.sleep(60)
@Tmonster (Collaborator)

why are we putting a sleep between the two queries?

@wudidapaopao (Author) commented Sep 10, 2025

chDB's processing logic is similar to ClickHouse's, so the same handling used in the ClickHouse solution is applied here for the 50 GB dataset to avoid OOM.
I added the following comments to the test script:

# It will take some time for memory freed by Memory engine to be returned back to the system.
# Without a sleep we might get a MEMORY_LIMIT exception during the second run of the query.
# It is done only when compress is true because this variable is set to true only for the largest dataset.
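For context, a minimal sketch of how this handling might sit in the per-run timing loop; `conn`, `compress`, and the 60-second pause come from the snippets in this thread, while the surrounding function is assumed for illustration:

import gc
import time

def timed_run(conn, query, compress):
    # Drop the answer table left over from the previous run.
    conn.query("DROP TABLE IF EXISTS ans")
    gc.collect()
    if compress:
        # Give the Memory engine time to return freed memory to the system;
        # otherwise the second run can hit MEMORY_LIMIT on the 50 GB dataset.
        time.sleep(60)
    start = time.time()
    conn.query(query)
    return time.time() - start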

QUERY=f"CREATE TABLE ans ENGINE = {query_engine} AS SELECT id1, sum(v1) AS v1 FROM db_benchmark.x GROUP BY id1 {settings}"
conn.query(QUERY)
nr = int(str(conn.query("SELECT count(*) AS cnt FROM ans")).strip())
nc = len(str(conn.query("SELECT * FROM ans LIMIT 0", "CSVWITHNAMES")).split(','))
@Tmonster (Collaborator)

What is the significance of "CSVWITHNAMES" here? Just curious.

@Tmonster (Collaborator)

Oh, is it just a way to get the number of columns?

@wudidapaopao (Author)

Yes, it's just a way to get the number of columns.
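To illustrate the trick: with LIMIT 0 the result contains no data rows, so the CSVWITHNAMES output is only the header line, and splitting it on commas yields the column count. A small sketch continuing from the snippet above, where `conn` and the ans table exist (the example header is hypothetical):

# LIMIT 0 returns no data rows, so the output is just the CSV header.
header = str(conn.query("SELECT * FROM ans LIMIT 0", "CSVWITHNAMES"))
# e.g. header == '"id1","v1"\n'
nc = len(header.split(','))  # -> 2

Note this assumes column names contain no commas, which holds for the benchmark tables.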

import time
import sys

solution = str(sys.argv[1])
@Tmonster (Collaborator)

does this file need to be included? Maybe it could live in utils so all solutions can use it?

@wudidapaopao (Author)

This file should be unused and identical to monitor_ram.py in the polars directory, so I have deleted it.

@wudidapaopao (Author) commented

> Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

Thank you very much. I have reverted the changes made to time.csv and logs.csv.
