Add chdb solution #131
Conversation
The chdb implementation is based on the polars Python code. The current approach uses a connection + cursor (see the sketch after this commit list).
Draft of the join code is finished
Grouping queries were prepared; also added flush=True for all print operations with LIMIT 3
Grouping code is now working
Fix nc field
Initial working version of chdb for join
Added proper logic for identifying on_disk and na_flag
Fix for local run
Added settings and fixed MergeTree for the session branch
Added latest query and max_threads
Added settings for threads
On an M1 Pro chip, 8 threads show better performance
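A minimal sketch of the connection + cursor approach from the first commit message, assuming chdb's DB-API-style interface (chdb.connect and cursor(); not the exact benchmark code):

import chdb

# Minimal sketch, assuming chdb's DB-API-style connection interface;
# the actual benchmark script differs.
conn = chdb.connect(":memory:")
cur = conn.cursor()
cur.execute("SELECT 1 AS id1, 2 AS v1")
print(cur.fetchall())
cur.close()
conn.close()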
Great, thanks! Just some comments. You can remove your changes to time.csv and logs.csv. I'll run the benchmark again when this gets merged 👍
conn.query("DROP TABLE IF EXISTS ans") | ||
gc.collect() | ||
if compress: | ||
time.sleep(60) |
why are we putting a sleep between the two queries?
The processing logic of chDB is similar to ClickHouse's, so the same handling used for ClickHouse has been applied here for the large 50G dataset to avoid OOM.
I added the following comments to the test script:
# It will take some time for memory freed by Memory engine to be returned back to the system.
# Without a sleep we might get a MEMORY_LIMIT exception during the second run of the query.
# It is done only when compress is true because this variable is set to true only for the largest dataset.
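Putting those comments next to the diff hunk above, the guarded sleep reads roughly like this (a sketch; conn and compress come from the benchmark script):

import gc
import time

# conn and compress are assumed to come from the surrounding benchmark script.
conn.query("DROP TABLE IF EXISTS ans")
gc.collect()
if compress:
    # Memory freed by the Memory engine is returned to the system slowly;
    # without the sleep, the second run of the query can hit a MEMORY_LIMIT
    # exception. compress is true only for the largest (50G) dataset.
    time.sleep(60)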
QUERY=f"CREATE TABLE ans ENGINE = {query_engine} AS SELECT id1, sum(v1) AS v1 FROM db_benchmark.x GROUP BY id1 {settings}" | ||
conn.query(QUERY) | ||
nr = int(str(conn.query("SELECT count(*) AS cnt FROM ans")).strip()) | ||
nc = len(str(conn.query("SELECT * FROM ans LIMIT 0", "CSVWITHNAMES")).split(',')) |
What is the significance of "CSVWITHNAMES" here? Just curious
Oh, is it just a way to get the number of columns?
Yes, it's just to be able to get the number of columns.
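For illustration, here is the trick in isolation (a sketch using chdb's top-level query API; CSVWithNames is ClickHouse's header-bearing CSV format). LIMIT 0 returns no rows, so the output is just the header line, and splitting it on commas gives the column count:

import chdb

# LIMIT 0 returns no rows, so CSVWithNames output is only the header line;
# counting its comma-separated fields gives the number of columns.
res = chdb.query("SELECT 1 AS a, 2 AS b, 3 AS c LIMIT 0", "CSVWithNames")
nc = len(str(res).split(','))
print(nc)  # 3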
chdb/monitor_ram.py
Outdated
import time
import sys

solution = str(sys.argv[1])
does this file need to be included? Maybe it could live in utils, with the idea that all solutions can use it?
This file should be unused and identical to monitor_ram.py in the polars directory, so I have deleted it.
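For context, a hypothetical sketch of what a shared utils/monitor_ram.py could look like (psutil-based; the deleted file matched the polars version, which may differ):

import sys
import time

import psutil

# Hypothetical sketch: launched as `python monitor_ram.py <solution>`,
# it periodically samples and reports system memory usage.
solution = str(sys.argv[1])

while True:
    used_gb = psutil.virtual_memory().used / 1e9
    print(f"{solution}: {used_gb:.2f} GB used", flush=True)
    time.sleep(1)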
Thank you very much. I have reverted the changes made to time.csv and logs.csv.
This PR adds the chdb solution.
I've updated the test results for the 0.5G and 5G datasets, which were run on my local laptop, in time.csv and logs.csv. These results were not generated on a standard machine; they are only intended to demonstrate the final output of the test script. Please feel free to contact me if there are any issues or missing information.