
Conversation

@wudidapaopao commented Sep 10, 2025

This PR adds the chdb solution.

I've updated time.csv and logs.csv with test results for the 0.5 GB and 5 GB datasets, run on my local laptop. These results were not generated on a standard benchmark machine; they are only intended to demonstrate the final output of the test script.

Please feel free to contact me if there are any issues or missing information.

cyrusmsk and others added 14 commits July 27, 2025 18:05
- The chdb implementation is based on the polars Python code.
- Current approach uses connection + cursor.
- Draft of the join code is finished.
- Grouping queries were prepared; also added flush=True for all printing operations with LIMIT 3.
- Grouping code is now working.
- Initial working version of chdb for join.
- Added proper logic for on_disk and na_flag identification.
- Fix for local run.
- Added settings and fixed MergeTree for the session branch.
- Added latest query and max_threads.
- Added settings for threads.
- On an M1 Pro chip, 8 threads show better performance.
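For context, the thread-related commits above amount to appending a SETTINGS clause to each query. A minimal sketch of the idea, assuming a chdb session like the `conn` used in the snippets below (the thread count comes from the commit note; the query shape is borrowed from the grouping code further down, and the setup lines are illustrative, not the PR's actual script):

import chdb.session

# Illustrative setup; the PR's script builds its own connection and loads
# the benchmark table db_benchmark.x beforehand.
conn = chdb.session.Session()
threads = 8  # per the commit note, 8 threads performed best on an M1 Pro
settings = f"SETTINGS max_threads = {threads}"
conn.query(f"SELECT id1, sum(v1) AS v1 FROM db_benchmark.x GROUP BY id1 {settings}")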
@szarnyasg requested a review from @Tmonster September 10, 2025 06:16
@Tmonster (Collaborator) left a comment

Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

conn.query("DROP TABLE IF EXISTS ans")
gc.collect()
if compress:
time.sleep(60)
@Tmonster (Collaborator)

why are we putting a sleep between the two queries?

@wudidapaopao (Author) commented Sep 10, 2025

chDB's processing logic is similar to ClickHouse's, so the same handling used in the ClickHouse solution is applied here for the 50 GB dataset to avoid OOM.
I added the following comments to the test script:

# It will take some time for memory freed by Memory engine to be returned back to the system.
# Without a sleep we might get a MEMORY_LIMIT exception during the second run of the query.
# It is done only when compress is true because this variable is set to true only for the largest dataset.
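For context, a minimal sketch of how this handling might sit in the per-run timing loop; `conn`, `compress`, and the 60-second pause come from the snippets in this thread, while the surrounding function is assumed for illustration:

import gc
import time

def timed_run(conn, query, compress):
    # Drop the answer table left over from the previous run.
    conn.query("DROP TABLE IF EXISTS ans")
    gc.collect()
    if compress:
        # Give the Memory engine time to return freed memory to the system;
        # otherwise the second run can hit MEMORY_LIMIT on the 50 GB dataset.
        time.sleep(60)
    start = time.time()
    conn.query(query)
    return time.time() - start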

QUERY=f"CREATE TABLE ans ENGINE = {query_engine} AS SELECT id1, sum(v1) AS v1 FROM db_benchmark.x GROUP BY id1 {settings}"
conn.query(QUERY)
nr = int(str(conn.query("SELECT count(*) AS cnt FROM ans")).strip())
nc = len(str(conn.query("SELECT * FROM ans LIMIT 0", "CSVWITHNAMES")).split(','))
@Tmonster (Collaborator)

What is the significance of "CSVWITHNAMES" here? Just curious.

@Tmonster (Collaborator)

Oh, is it just a way to get the number of columns?

@wudidapaopao (Author)

Yes, it's just a way to get the number of columns.
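To illustrate the trick: with LIMIT 0 the result contains no data rows, so the CSVWITHNAMES output is only the header line, and splitting it on commas yields the column count. A small sketch continuing from the snippet above, where `conn` and the ans table exist (the example header is hypothetical):

# LIMIT 0 returns no data rows, so the output is just the CSV header.
header = str(conn.query("SELECT * FROM ans LIMIT 0", "CSVWITHNAMES"))
# e.g. header == '"id1","v1"\n'
nc = len(header.split(','))  # -> 2

Note this assumes column names contain no commas, which holds for the benchmark tables.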

import time
import sys

solution = str(sys.argv[1])
@Tmonster (Collaborator)

does this file need to be included? Maybe it could live in utils so all solutions can use it?

@wudidapaopao (Author)

This file should be unused and identical to monitor_ram.py in the polars directory, so I have deleted it.

@wudidapaopao (Author) commented

> Great thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍

Thank you very much. I have reverted the changes made to time.csv and logs.csv.
