@@ -23,16 +23,15 @@ This crate contains benchmarks based on popular public data sets and
23
23
open source benchmark suites, to help with performance and scalability
24
24
testing of DataFusion.
25
25
26
-
27
26
## Other engines
28
27
29
28
The benchmarks measure changes to DataFusion itself, rather than
30
29
its performance against other engines. For competitive benchmarking,
31
30
DataFusion is included in the benchmark setups for several popular
32
31
benchmarks that compare performance with other engines. For example:
33
32
34
- * [ ClickBench] scripts are in the [ ClickBench repo] ( https://github.com/ClickHouse/ClickBench/tree/main/datafusion )
35
- * [ H2o.ai ` db-benchmark ` ] scripts are in [ db-benchmark] ( https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs )
33
+ - [ ClickBench] scripts are in the [ ClickBench repo] ( https://github.com/ClickHouse/ClickBench/tree/main/datafusion )
34
+ - [ H2o.ai ` db-benchmark ` ] scripts are in [ db-benchmark] ( https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs )
36
35
37
36
[ ClickBench ] : https://github.com/ClickHouse/ClickBench/tree/main
38
37
[ H2o.ai `db-benchmark` ] : https://github.com/h2oai/db-benchmark
@@ -65,39 +64,50 @@ Create / download a specific dataset (TPCH)
65
64
``` shell
66
65
./bench.sh data tpch
67
66
```
67
+
68
68
Data is placed in the ` data ` subdirectory.
69
69
70
70
## Running benchmarks
71
71
72
72
Run benchmark for TPC-H dataset
73
+
73
74
``` shell
74
75
./bench.sh run tpch
75
76
```
77
+
76
78
or for TPC-H dataset scale 10
79
+
77
80
``` shell
78
81
./bench.sh run tpch10
79
82
```
80
83
81
84
To run a specific query, for example Q21:
85
+
82
86
``` shell
83
87
./bench.sh run tpch10 21
84
88
```
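
To time a handful of queries rather than the whole suite, the single-query form above can be wrapped in a loop. This is a minimal sketch, not part of `bench.sh`: the `run_queries` function and the `BENCH` override variable are hypothetical names introduced here so the loop can be dry-run with `echo`.

```shell
# BENCH is a hypothetical override; it defaults to the real script.
BENCH="${BENCH:-./bench.sh}"

# run_queries BENCHMARK QUERY...
# Invokes "$BENCH run BENCHMARK QUERY" once per query id and stops at
# the first failing invocation.
run_queries() {
  benchmark="$1"
  shift
  for q in "$@"; do
    $BENCH run "$benchmark" "$q" || return 1
  done
}

# Usage (not executed here): run_queries tpch10 1 6 21
```

Setting `BENCH=echo` before calling `run_queries` prints the commands instead of running them, which is a cheap way to check the query list first.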
85
89
86
90
## Benchmark with modified configurations
91
+
87
92
### Select join algorithm
93
+
88
94
The benchmark runs with `prefer_hash_join == true` by default, which enforces the HASH join algorithm.
89
95
To run TPCH benchmarks with a join algorithm other than HASH:
96
+
90
97
``` shell
91
98
PREFER_HASH_JOIN=false ./bench.sh run tpch
92
99
```
93
100
94
101
### Configure with environment variables
95
- Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided as environment variables are
102
+
103
+ Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided as environment variables are
96
104
also considered by the benchmarks.
97
- The following configuration runs the TPCH benchmark with datafusion configured to * not* repartition join keys.
105
+ The following configuration runs the TPCH benchmark with datafusion configured to _ not_ repartition join keys.
106
+
98
107
``` shell
99
108
DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false ./bench.sh run tpch
100
109
```
110
+
101
111
You might want to adjust the results location to avoid overwriting previous results.
102
112
Environment configuration that was picked up by datafusion is logged at ` info ` level.
103
113
To verify that datafusion picked up your configuration, run the benchmarks with `RUST_LOG=info` or a more verbose level (`debug`, `trace`).
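
The variable name in the example appears to follow a simple convention: the configuration key from the user guide (`datafusion.optimizer.repartition_joins`) is upper-cased and its dots become underscores. The sketch below assumes that convention holds for other keys as well; `config_key_to_env` is a hypothetical helper, not something `bench.sh` provides.

```shell
# Hypothetical helper: derive the environment variable name from a
# DataFusion config key, assuming "upper-case, dots to underscores".
config_key_to_env() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr '.' '_'
}

# Prints DATAFUSION_OPTIMIZER_REPARTITION_JOINS
config_key_to_env datafusion.optimizer.repartition_joins
```

The resulting name can then be combined with the logging setting, for example `RUST_LOG=info DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false ./bench.sh run tpch`, to check in the log output that the option was actually picked up.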
@@ -419,7 +429,7 @@ logs.
419
429
420
430
Example
421
431
422
- dfbench parquet-filter --path ./data --scale-factor 1.0
432
+ dfbench parquet-filter --path ./data --scale-factor 1.0
423
433
424
434
generates the synthetic dataset at ` ./data/logs.parquet ` . The size
425
435
of the dataset can be controlled through the ` size_factor `
@@ -451,6 +461,7 @@ Iteration 2 returned 1781686 rows in 1947 ms
451
461
```
452
462
453
463
## Sort
464
+
454
465
Test performance of sorting large datasets
455
466
456
467
This test sorts a synthetic dataset generated during the
@@ -474,22 +485,27 @@ Additionally, an optional `--limit` flag is available for the sort benchmark. Wh
474
485
See [ ` sort_tpch.rs ` ] ( src/sort_tpch.rs ) for more details.
475
486
476
487
### Sort TPCH Benchmark Example Runs
488
+
477
489
1 . Run all queries with default setting:
490
+
478
491
``` bash
479
492
cargo run --release --bin dfbench -- sort-tpch -p ' ./datafusion/benchmarks/data/tpch_sf1' -o ' /tmp/sort_tpch.json'
480
493
```
481
494
482
495
2 . Run a specific query:
496
+
483
497
``` bash
484
498
cargo run --release --bin dfbench -- sort-tpch -p ' ./datafusion/benchmarks/data/tpch_sf1' -o ' /tmp/sort_tpch.json' --query 2
485
499
```
486
500
487
501
3 . Run all queries as TopK queries on presorted data:
502
+
488
503
``` bash
489
504
cargo run --release --bin dfbench -- sort-tpch --sorted --limit 10 -p ' ./datafusion/benchmarks/data/tpch_sf1' -o ' /tmp/sort_tpch.json'
490
505
```
491
506
492
507
4 . Run all queries with ` bench.sh ` script:
508
+
493
509
``` bash
494
510
./bench.sh run sort_tpch
495
511
```
@@ -527,73 +543,86 @@ External aggregation benchmarks run several aggregation queries with different m
527
543
This benchmark is inspired by [ DuckDB's external aggregation paper] ( https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf ) , specifically Section VI.
528
544
529
545
### External Aggregation Example Runs
546
+
530
547
1 . Run all queries with predefined memory limits:
548
+
531
549
``` bash
532
550
# Under 'benchmarks/' directory
533
551
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p ' ....../data/tpch_sf1' -o ' /tmp/aggr.json'
534
552
```
535
553
536
554
2 . Run a query with specific memory limit:
555
+
537
556
``` bash
538
557
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p ' ....../data/tpch_sf1' -o ' /tmp/aggr.json' --query 1 --memory-limit 30M
539
558
```
540
559
541
560
3 . Run all queries with ` bench.sh ` script:
561
+
542
562
``` bash
543
563
./bench.sh data external_aggr
544
564
./bench.sh run external_aggr
545
565
```
546
566
547
-
548
567
## h2o.ai benchmarks
568
+
549
569
The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.
550
570
551
571
Reference:
572
+
552
573
- [ H2O AI Benchmark] ( https://duckdb.org/2023/04/14/h2oai.html )
553
574
- [ Extended window benchmark] ( https://duckdb.org/2024/06/26/benchmarks-over-time.html#window-functions-benchmark )
554
575
555
576
### h2o benchmarks for groupby
556
577
557
578
#### Generate data for h2o benchmarks
579
+
558
580
There are three options for generating data for h2o benchmarks: ` small ` , ` medium ` , and ` big ` . The data is generated in the ` data ` directory.
559
581
560
582
1 . Generate small data (1e7 rows)
583
+
561
584
``` bash
562
585
./bench.sh data h2o_small
563
586
```
564
587
565
-
566
588
2 . Generate medium data (1e8 rows)
589
+
567
590
``` bash
568
591
./bench.sh data h2o_medium
569
592
```
570
593
571
-
572
594
3 . Generate large data (1e9 rows)
595
+
573
596
``` bash
574
597
./bench.sh data h2o_big
575
598
```
576
599
577
600
#### Run h2o benchmarks
601
+
578
602
There are three options for running h2o benchmarks: ` small ` , ` medium ` , and ` big ` .
603
+
579
604
1 . Run small data benchmark
605
+
580
606
``` bash
581
607
./bench.sh run h2o_small
582
608
```
583
609
584
610
2 . Run medium data benchmark
611
+
585
612
``` bash
586
613
./bench.sh run h2o_medium
587
614
```
588
615
589
616
3 . Run large data benchmark
617
+
590
618
``` bash
591
619
./bench.sh run h2o_big
592
620
```
593
621
594
622
4 . Run a specific query with a specific data path
595
623
596
624
For example, to run query 1 with the small data generated above:
625
+
597
626
``` bash
598
627
cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
599
628
```
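
The generate and run steps above follow the same `h2o_<size>` naming, so they can be chained per size. The following is a sketch under that assumption only; `h2o_generate_and_run` and the `BENCH` override are hypothetical names, and the function simply forwards to the `bench.sh` subcommands shown earlier.

```shell
# BENCH is a hypothetical override; it defaults to the real script.
BENCH="${BENCH:-./bench.sh}"

# h2o_generate_and_run SIZE...
# For each size (small, medium, big), generate the data and then run
# the matching benchmark. Stops at the first failure or unknown size.
h2o_generate_and_run() {
  for size in "$@"; do
    case "$size" in
      small|medium|big) ;;
      *) echo "unknown h2o size: $size" >&2; return 1 ;;
    esac
    $BENCH data "h2o_$size" || return 1
    $BENCH run "h2o_$size" || return 1
  done
}

# Usage (not executed here): h2o_generate_and_run small medium
```

Rejecting unknown sizes up front avoids passing a misspelled benchmark name to `bench.sh` partway through a long run.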
@@ -602,7 +631,7 @@ cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7
602
631
603
632
There are three options for generating data for h2o benchmarks: ` small ` , ` medium ` , and ` big ` . The data is generated in the ` data ` directory.
604
633
605
- Here is an example to generate the `small` dataset and run the benchmark. To run other
634
+ Here is an example to generate the `small` dataset and run the benchmark. To run other
606
635
dataset size configuration, change the command similar to the previous example.
607
636
608
637
``` bash
@@ -616,6 +645,7 @@ dataset size configuration, change the command similar to the previous example.
616
645
To run a specific query with specific join data paths, pass the paths of the 4 table files.
617
646
618
647
For example, to run query 1 with the small data generated above:
648
+
619
649
``` bash
620
650
cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
621
651
```
@@ -624,7 +654,7 @@ cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1
624
654
625
655
This benchmark extends the h2o benchmark suite to evaluate window function performance. H2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: ` small ` , ` medium ` , and ` big ` .
626
656
627
- Here is an example to generate the `small` dataset and run the benchmark. To run other
657
+ Here is an example to generate the `small` dataset and run the benchmark. To run other
628
658
dataset size configuration, change the command similar to the previous example.
629
659
630
660
``` bash
@@ -638,6 +668,7 @@ dataset size configuration, change the command similar to the previous example.
638
668
To run a specific query with specific window data paths, pass the paths of the 4 table files (the same files as the h2o-join dataset).
639
669
640
670
For example, to run query 1 with the small data generated above:
671
+
641
672
``` bash
642
673
cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/window.sql --query 1
643
674
```