Commit ca46932

logan-keede and alamb authored
minor: Add benchmark query and corresponding documentation for Average Duration (#16105)
* ADD query and documentation
* Prettier

---------

Co-authored-by: Andrew Lamb <[email protected]>
1 parent 2ea1e95 commit ca46932

3 files changed: +103 -34 lines changed

benchmarks/README.md

Lines changed: 42 additions & 11 deletions
@@ -23,16 +23,15 @@ This crate contains benchmarks based on popular public data sets and
 open source benchmark suites, to help with performance and scalability
 testing of DataFusion.

-
 ## Other engines

 The benchmarks measure changes to DataFusion itself, rather than
 its performance against other engines. For competitive benchmarking,
 DataFusion is included in the benchmark setups for several popular
 benchmarks that compare performance with other engines. For example:

-* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
-* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)
+- [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
+- [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)

 [ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
 [H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
@@ -65,39 +64,50 @@ Create / download a specific dataset (TPCH)
 ```shell
 ./bench.sh data tpch
 ```
+
 Data is placed in the `data` subdirectory.

 ## Running benchmarks

 Run benchmark for TPC-H dataset
+
 ```shell
 ./bench.sh run tpch
 ```
+
 or for TPC-H dataset scale 10
+
 ```shell
 ./bench.sh run tpch10
 ```

 To run for specific query, for example Q21
+
 ```shell
 ./bench.sh run tpch10 21
 ```

 ## Benchmark with modified configurations
+
 ### Select join algorithm
+
 The benchmark runs with `prefer_hash_join == true` by default, which enforces HASH join algorithm.
 To run TPCH benchmarks with join other than HASH:
+
 ```shell
 PREFER_HASH_JOIN=false ./bench.sh run tpch
 ```

 ### Configure with environment variables
-Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided as environment variables are
+
+Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided as environment variables are
 also considered by the benchmarks.
-The following configuration runs the TPCH benchmark with datafusion configured to *not* repartition join keys.
+The following configuration runs the TPCH benchmark with datafusion configured to _not_ repartition join keys.
+
 ```shell
 DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false ./bench.sh run tpch
 ```
+
 You might want to adjust the results location to avoid overwriting previous results.
 Environment configuration that was picked up by datafusion is logged at `info` level.
 To verify that datafusion picked up your configuration, run the benchmarks with `RUST_LOG=info` or higher.
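
The variable name in the example above appears to follow DataFusion's documented convention for environment configuration: the option key is upper-cased and its dots become underscores. A minimal sketch of that mapping follows. `config_key_to_env` is a hypothetical helper, not part of `bench.sh`, and the command is printed rather than run:

```shell
# Hypothetical helper: derive the environment variable name for a
# DataFusion option key (upper-case it, then turn dots into underscores).
config_key_to_env() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr '.' '_'
}

# Print the invocation instead of running it, so the sketch has no side effects.
var_name=$(config_key_to_env "datafusion.optimizer.repartition_joins")
echo "RUST_LOG=info ${var_name}=false ./bench.sh run tpch"
```

Adding `RUST_LOG=info` to the printed command is one way to check in the logs that the setting was picked up.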
@@ -419,7 +429,7 @@ logs.

 Example

-dfbench parquet-filter --path ./data --scale-factor 1.0
+dfbench parquet-filter --path ./data --scale-factor 1.0

 generates the synthetic dataset at `./data/logs.parquet`. The size
 of the dataset can be controlled through the `size_factor`
@@ -451,6 +461,7 @@ Iteration 2 returned 1781686 rows in 1947 ms
 ```

 ## Sort
+
 Test performance of sorting large datasets

 This test sorts a synthetic dataset generated during the
@@ -474,22 +485,27 @@ Additionally, an optional `--limit` flag is available for the sort benchmark. Wh
 See [`sort_tpch.rs`](src/sort_tpch.rs) for more details.

 ### Sort TPCH Benchmark Example Runs
+
 1. Run all queries with default setting:
+
 ```bash
 cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
 ```

 2. Run a specific query:
+
 ```bash
 cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json' --query 2
 ```

 3. Run all queries as TopK queries on presorted data:
+
 ```bash
 cargo run --release --bin dfbench -- sort-tpch --sorted --limit 10 -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
 ```

 4. Run all queries with `bench.sh` script:
+
 ```bash
 ./bench.sh run sort_tpch
 ```
@@ -527,73 +543,86 @@ External aggregation benchmarks run several aggregation queries with different m
 This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.

 ### External Aggregation Example Runs
+
 1. Run all queries with predefined memory limits:
+
 ```bash
 # Under 'benchmarks/' directory
 cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json'
 ```

 2. Run a query with specific memory limit:
+
 ```bash
 cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json' --query 1 --memory-limit 30M
 ```

 3. Run all queries with `bench.sh` script:
+
 ```bash
 ./bench.sh data external_aggr
 ./bench.sh run external_aggr
 ```

-
 ## h2o.ai benchmarks
+
 The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.

 Reference:
+
 - [H2O AI Benchmark](https://duckdb.org/2023/04/14/h2oai.html)
 - [Extended window benchmark](https://duckdb.org/2024/06/26/benchmarks-over-time.html#window-functions-benchmark)

 ### h2o benchmarks for groupby

 #### Generate data for h2o benchmarks
+
 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.

 1. Generate small data (1e7 rows)
+
 ```bash
 ./bench.sh data h2o_small
 ```

-
 2. Generate medium data (1e8 rows)
+
 ```bash
 ./bench.sh data h2o_medium
 ```

-
 3. Generate large data (1e9 rows)
+
 ```bash
 ./bench.sh data h2o_big
 ```

 #### Run h2o benchmarks
+
 There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
+
 1. Run small data benchmark
+
 ```bash
 ./bench.sh run h2o_small
 ```

 2. Run medium data benchmark
+
 ```bash
 ./bench.sh run h2o_medium
 ```

 3. Run large data benchmark
+
 ```bash
 ./bench.sh run h2o_big
 ```

 4. Run a specific query with a specific data path

 For example, to run query 1 with the small data generated above:
+
 ```bash
 cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
 ```
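
The generate and run steps above differ only in the size suffix, so they can be scripted as a loop. The sketch below prints the `bench.sh` commands for each size rather than executing them. The `h2o_commands` function name is an assumption made for illustration; the row counts are the ones stated above.

```shell
# Hypothetical helper: print the data-generation and run commands for one size.
h2o_commands() {
  size=$1
  case "$size" in
    small)  rows=1e7 ;;
    medium) rows=1e8 ;;
    big)    rows=1e9 ;;
    *) echo "unknown size: $size" >&2; return 1 ;;
  esac
  echo "# ${size}: ${rows} rows"
  echo "./bench.sh data h2o_${size}"
  echo "./bench.sh run h2o_${size}"
}

# Print the commands for every documented size.
for size in small medium big; do
  h2o_commands "$size"
done
```

Piping the output to `sh` from the directory that contains `bench.sh` would execute the commands; printing them first is a cautious default, since the larger sizes go up to 1e9 rows.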
@@ -602,7 +631,7 @@ cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7

 There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.

-Here is an example to generate `small` dataset and run the benchmark. To run other
+Here is an example to generate `small` dataset and run the benchmark. To run other
 dataset size configuration, change the command similar to the previous example.

 ```bash
@@ -616,6 +645,7 @@ dataset size configuration, change the command similar to the previous example.
 To run a specific query with specific join data paths, provide the 4 table files as the data paths.

 For example, to run query 1 with the small data generated above:
+
 ```bash
 cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
 ```
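
The `--join-paths` argument above is a comma-separated list of the four table files. One way to assemble it without retyping the directory is sketched below. The `data_dir` and `join_paths` variable names are illustrative, and the command is printed rather than run:

```shell
# Directory and file names are taken from the example above.
data_dir=./benchmarks/data/h2o

# Join the four table files with commas; ${var:+...} adds the separator
# only once join_paths is non-empty.
join_paths=""
for f in J1_1e7_NA_0.csv J1_1e7_1e1_0.csv J1_1e7_1e4_0.csv J1_1e7_1e7_NA.csv; do
  join_paths="${join_paths:+${join_paths},}${data_dir}/${f}"
done

echo "cargo run --release --bin dfbench -- h2o --join-paths ${join_paths} --queries-path ./benchmarks/queries/h2o/join.sql --query 1"
```

Changing the `--queries-path` value should be the only edit needed to reuse the same list with a different query file.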
@@ -624,7 +654,7 @@ cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1

 This benchmark extends the h2o benchmark suite to evaluate window function performance. H2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`.

-Here is an example to generate `small` dataset and run the benchmark. To run other
+Here is an example to generate `small` dataset and run the benchmark. To run other
 dataset size configuration, change the command similar to the previous example.

 ```bash
@@ -638,6 +668,7 @@ dataset size configuration, change the command similar to the previous example.
 To run a specific query with specific window data paths, provide the 4 table files as the data paths (the same as the h2o-join dataset).

 For example, to run query 1 with the small data generated above:
+
 ```bash
 cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/window.sql --query 1
 ```
