tpcbench.py add --query support to run custom query #84
@@ -31,7 +31,7 @@ def tpch_query(qnum: int) -> str:
 
 
 def main(
-    qnum: int,
+    queries: list[(str, str)],
     data_path: str,
     concurrency: int,
     batch_size: int,
@@ -99,10 +99,7 @@ def main(
     if validate:
         results["validated"] = {}
 
-    queries = range(1, 23) if qnum == -1 else [qnum]
-    for qnum in queries:
-        sql = tpch_query(qnum)
-
+    for (qid, sql) in queries:
         statements = list(
             filter(lambda x: len(x) > 0, map(lambda x: x.strip(), sql.split(";")))
         )
@@ -115,7 +112,7 @@ def main(
             df = ctx.sql(sql)
             all_batches.append(df.collect())
         end_time = time.time()
-        results["queries"][qnum] = end_time - start_time
+        results["queries"][qid] = end_time - start_time
 
         calculated = "\n".join([prettify(b) for b in all_batches])
         print(calculated)
@@ -125,8 +122,8 @@ def main(
                 all_batches.append(local.collect_sql(sql))
             expected = "\n".join([prettify(b) for b in all_batches])
 
-            results["validated"][qnum] = calculated == expected
-        print(f"done with query {qnum}")
+            results["validated"][qid] = calculated == expected
+        print(f"done with query {qid}")
 
         # write the results as we go, so you can peek at them
         results_dump = json.dumps(results, indent=4)
@@ -154,7 +151,10 @@ def main(
     parser.add_argument(
         "--concurrency", required=True, help="Number of concurrent tasks"
     )
-    parser.add_argument("--qnum", type=int, default=-1, help="TPCH query number, 1-22")
+    parser.add_argument("--qnum", type=int, default=-1,
+                        help="TPCH query number, 1-22")
+    parser.add_argument("--query", required=False, type=str,
+                        help="Custom query to run with tpch tables")
     parser.add_argument("--listing-tables", action="store_true")
     parser.add_argument("--validate", action="store_true")
     parser.add_argument(
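The `--qnum` and `--query` flags added in this hunk are checked for mutual exclusion by hand further down in the PR. As a possible alternative, not something the PR does, `argparse` can enforce both the exclusivity and the 1-22 range itself. This is a minimal sketch under that assumption; `build_parser` is a hypothetical name, and it drops the `-1` default in favor of `None` so that "flag not given" is unambiguous:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper, not part of the PR.
    parser = argparse.ArgumentParser()
    # argparse itself rejects a command line that supplies both flags.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--qnum", type=int, choices=range(1, 23),
                       help="TPCH query number, 1-22")
    group.add_argument("--query", type=str,
                       help="Custom query to run with tpch tables")
    return parser
```

With this layout, `build_parser().parse_args(["--qnum", "5"])` yields `qnum == 5` and `query is None`, while passing both flags or an out-of-range number makes argparse print a usage error and exit.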
@@ -186,8 +186,28 @@ def main(
 
     args = parser.parse_args()
 
+    if (args.qnum != -1 and args.query is not None):
+        print("Please specify either --qnum or --query, but not both")
+        exit(1)
+
+    queries = []
+    if (args.qnum != -1):
+        if args.qnum < 1 or args.qnum > 22:
+            print("Invalid query number. Please specify a number between 1 and 22.")
+            exit(1)
+        else:
+            queries.append((str(args.qnum), tpch_query(args.qnum)))
Review comment: Explicitly mention TPCH in the id.
+            print("Executing tpch query ", args.qnum)
+
+    elif (args.query is not None):
+        queries.append(("custom query", args.query))
+        print("Executing custom query: ", args.query)
+    else:
+        print("Executing all tpch queries")
+        queries = [(str(i), tpch_query(i)) for i in range(1, 23)]
Comment on lines +189 to +208
Review comment: Minor suggestion: extract this into its own function, for example:
from typing import List, Optional, Tuple

def get_sql_queries(tpch_qnum: Optional[int] = None, sql_statement: Optional[str] = None) -> List[Tuple[str, str]]:
    """
    Get the list of SQL statements from either the TPCH queries or a user-provided SQL statement.
    At most one of these parameters can be provided.
    :param tpch_qnum: the TPCH query number. If None, return all supported TPCH queries
    :param sql_statement: SQL statement string on available data tables (e.g. ingested through make_data.py)
    :return: a list of tuples with the name of the query and the SQL statement string
    """
     main(
-        args.qnum,
+        queries,
         args.data,
         int(args.concurrency),
         int(args.batch_size),
Review comment: I would recommend standardizing the data file directory to testdata/tpch and adding the correct make_data.py command just above. For example, add more documentation on make_data.py before this:

In the tpch directory, use make_data.py to create a TPCH dataset at a provided scale factor and an output directory, such as the testdata directory:
python make_data.py 1 "../testdata/tpch"

You could also specify an env variable for this in the setup:
TPCH_DATA=../testdata/tpch
and replace the examples with $TPCH_DATA.
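
The comment only asks for the documentation examples to use `$TPCH_DATA`. If the script were also meant to honor that variable, one option is a small fallback chain. This is a sketch under that assumption, and `resolve_data_path` is a hypothetical helper that does not exist in the PR:

```python
import os
from typing import Mapping, Optional


def resolve_data_path(
    cli_value: Optional[str] = None, env: Mapping[str, str] = os.environ
) -> str:
    # Hypothetical helper, not part of the PR: prefer an explicit path,
    # then $TPCH_DATA, then the directory suggested in the review.
    return cli_value or env.get("TPCH_DATA", "../testdata/tpch")
```

Passing the environment in as a parameter keeps the lookup easy to exercise without touching the real process environment.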