@@ -85,7 +85,7 @@ git checkout main
# Gather baseline data for tpch benchmark
./benchmarks/bench.sh run tpch

- # Switch to the branch the branch name is mybranch and gather data
+ # Switch to the branch named mybranch and gather data
git checkout mybranch
./benchmarks/bench.sh run tpch
@@ -157,22 +157,19 @@ Benchmark tpch_mem.json
└──────────────┴──────────────┴──────────────┴───────────────┘
```

- Note that you can also execute an automatic comparison of the changes in a given PR against the base
- just by including the trigger `/benchmark` in any comment.
-
### Running Benchmarks Manually

- Assuming data in the `data` directory, the `tpch` benchmark can be run with a command like this
+ Assuming data is in the `data` directory, the `tpch` benchmark can be run with a command like this:

```bash
cargo run --release --bin dfbench -- tpch --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
```

- See the help for more details
+ See the help for more details.

### Different features

- You can enable `mimalloc` or `snmalloc` (to use either the mimalloc or snmalloc allocator) as features by passing them in as `--features`. For example
+ You can enable `mimalloc` or `snmalloc` (to use either the mimalloc or snmalloc allocator) as features by passing them in as `--features`. For example:

```shell
cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
@@ -184,6 +181,7 @@ The benchmark program also supports CSV and Parquet input file formats and a uti
```bash
cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-parquet --format parquet
```
+
Or if you want to verify and run all the queries in the benchmark, you can just run `cargo test`.

### Comparing results between runs
@@ -206,7 +204,7 @@ $ cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path .
./compare.py /tmp/output_main/tpch-summary--1679330119.json /tmp/output_branch/tpch-summary--1679328405.json
```

- This will produce output like
+ This will produce output like:

```
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
@@ -243,28 +241,92 @@ The `dfbench` program contains subcommands to run the various
benchmarks. When benchmarking, it should always be built in release
mode using `--release`.

- Full help for each benchmark can be found in the relevant sub
- command. For example to get help for tpch, run
+ Full help for each benchmark can be found in the relevant
+ subcommand. For example, to get help for tpch, run:

```shell
- cargo run --release --bin dfbench --help
+ cargo run --release --bin dfbench -- tpch --help
...
- datafusion-benchmarks 27.0.0
- benchmark command
+ dfbench-tpch 45.0.0
+ Run the tpch benchmark.
+
+ This benchmarks is derived from the [TPC-H][1] version
+ [2.17.1]. The data and answers are generated using `tpch-gen` from
+ [2].
+
+ [1]: http://www.tpc.org/tpch/
+ [2]: https://github.com/databricks/tpch-dbgen.git,
+ [2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

USAGE:
-     dfbench <SUBCOMMAND>
+     dfbench tpch [FLAGS] [OPTIONS] --path <path>
+
+ FLAGS:
+     -d, --debug
+             Activate debug mode to see more details

- SUBCOMMANDS:
-     clickbench       Run the clickbench benchmark
-     help             Prints this message or the help of the given subcommand(s)
-     parquet-filter   Test performance of parquet filter pushdown
-     sort             Test performance of parquet filter pushdown
-     tpch             Run the tpch benchmark.
-     tpch-convert     Convert tpch .slt files to .parquet or .csv files
+     -S, --disable-statistics
+             Whether to disable collection of statistics (and cost based optimizations) or not

+     -h, --help
+             Prints help information
+ ...
```

+ # Writing a new benchmark
+
+ ## Creating or downloading data outside of the benchmark
+
+ If you want to create or download the data with Rust as part of running the benchmark, see the next
+ section on adding a benchmark subcommand and add code to create or download data as part of its
+ `run` function.
+
+ If you want to create or download the data with shell commands, in `benchmarks/bench.sh`, define a
+ new function named `data_[your benchmark name]` and call that function in the `data` command case
+ as a subcommand case named for your benchmark. Also call the new function in the `data all` case.
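+
+ For illustration, a minimal sketch of such a function, assuming a hypothetical benchmark named
+ `mybench` and the script's existing `${DATA_DIR}` variable:
+
+ ```bash
+ # Creates (or downloads) the data files for the hypothetical mybench benchmark
+ data_mybench() {
+     mkdir -p "${DATA_DIR}/mybench"
+     # ... generate or download the data files here ...
+ }
+
+ # ... and in the data command's case statement:
+ #     mybench)
+ #         data_mybench
+ #         ;;
+ ```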
+
+ ## Adding the benchmark subcommand
+
+ In `benchmarks/bench.sh`, define a new function named `run_[your benchmark name]` following the
+ example of existing `run_*` functions. Call that function in the `run` command case as a subcommand
+ case named for your benchmark. Also call the new function in the `run all` case. Add documentation
+ for your benchmark to the text in the `usage` function.
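+
+ A corresponding minimal sketch of the run function, again for a hypothetical `mybench` benchmark;
+ the `${CARGO_COMMAND}`, `${RESULTS_DIR}`, `${DATA_DIR}`, and `${RESULTS_FILE}` variables are
+ assumed to be used the same way the existing `run_*` functions use them:
+
+ ```bash
+ # Runs the hypothetical mybench benchmark via the dfbench binary
+ run_mybench() {
+     RESULTS_FILE="${RESULTS_DIR}/mybench.json"
+     echo "RESULTS_FILE: ${RESULTS_FILE}"
+     echo "Running mybench benchmark..."
+     ${CARGO_COMMAND} --bin dfbench -- mybench --path "${DATA_DIR}" --output "${RESULTS_FILE}"
+ }
+ ```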
+
+ In `benchmarks/src/bin/dfbench.rs`, add a `dfbench` subcommand for your benchmark (a sketch
+ follows this list) by:
+
+ - Adding a new variant to the `Options` enum
+ - Adding corresponding code to handle the new variant in the `main` function, similar to the other
+   variants
+ - Adding a module to the `use datafusion_benchmarks::{}` statement
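+
+ A sketch of those three changes, assuming a hypothetical `mybench` module (existing variants are
+ elided, and this is not the exact current file):
+
+ ```rust
+ // benchmarks/src/bin/dfbench.rs (sketch)
+ use datafusion::error::Result;
+ use structopt::StructOpt;
+
+ use datafusion_benchmarks::{mybench, tpch}; // add the new module to this import
+
+ #[derive(Debug, StructOpt)]
+ #[structopt(name = "dfbench", about = "benchmark command")]
+ enum Options {
+     Tpch(tpch::RunOpt),
+     Mybench(mybench::RunOpt), // new variant for the new benchmark
+ }
+
+ #[tokio::main]
+ async fn main() -> Result<()> {
+     env_logger::init();
+     match Options::from_args() {
+         Options::Tpch(opt) => opt.run().await,
+         Options::Mybench(opt) => opt.run().await, // dispatch to the new benchmark
+     }
+ }
+ ```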
+
+ In `benchmarks/src/lib.rs`, declare the new module you imported in `dfbench.rs` and create the
+ corresponding file(s) for the module's code.
+
+ In the module, following the pattern of other existing benchmarks, define a `RunOpt` struct
+ (a sketch follows this list) with:
+
+ - A doc comment that will become the `--help` output for the subcommand.
+ - A `run` method that the `dfbench` `main` function will call.
+ - A `--path` structopt field that the `bench.sh` script should use with `${DATA_DIR}` to define
+   where the input data should be stored.
+ - An `--output` structopt field that the `bench.sh` script should use with `"${RESULTS_FILE}"` to
+   define where the benchmark's results should be stored.
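+
+ A sketch of such a struct for the hypothetical `mybench` module; names and fields are
+ illustrative, following the pattern of the existing benchmarks:
+
+ ```rust
+ use std::path::PathBuf;
+
+ use datafusion::error::Result;
+ use structopt::StructOpt;
+
+ /// Run the mybench benchmark (this doc comment becomes the --help text)
+ #[derive(Debug, StructOpt)]
+ pub struct RunOpt {
+     /// Path to the data files; bench.sh passes ${DATA_DIR} here
+     #[structopt(parse(from_os_str), long = "path")]
+     path: PathBuf,
+
+     /// If present, write JSON results here; bench.sh passes "${RESULTS_FILE}"
+     #[structopt(parse(from_os_str), long = "output")]
+     output_path: Option<PathBuf>,
+ }
+
+ impl RunOpt {
+     pub async fn run(self) -> Result<()> {
+         // ... set up, measure, and record the benchmark here ...
+         Ok(())
+     }
+ }
+ ```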
+
+ ### Creating or downloading data as part of the benchmark
+
+ Use the `--path` structopt field defined on the `RunOpt` struct to know where to store or look for
+ the data. Generate the data using whatever Rust code you'd like, before the code that will be
+ measuring an operation.
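+
+ For example, a sketch of guarding the generation step before any timed code runs, with a
+ hypothetical `generate_data` helper and an illustrative file name:
+
+ ```rust
+ // Sketch: ensure the input data exists before the measured section.
+ fn ensure_data(opt: &RunOpt) -> datafusion::error::Result<()> {
+     let data_file = opt.path.join("mybench.csv"); // illustrative file name
+     if !data_file.exists() {
+         generate_data(&data_file)?; // hypothetical helper; not part of the measured time
+     }
+     Ok(())
+ }
+ ```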
+
+ ### Collecting data
+
+ Your benchmark should create and use an instance of `BenchmarkRun`, defined in
+ `benchmarks/src/util/run.rs`, as follows (a sketch follows this list):
+
+ - Call its `start_new_case` method with a string that will appear in the "Query" column of the
+   compare output.
+ - Use `write_iter` to record elapsed times for the behavior you're benchmarking.
+ - When all cases are done, call the `BenchmarkRun`'s `maybe_write_json` method, giving it the value
+   of the `--output` structopt field on `RunOpt`.
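+
+ Putting those calls together, a sketch of a measurement loop that expands the `run` method
+ sketched earlier; `run_my_case()` stands in for whatever operation you are measuring:
+
+ ```rust
+ use std::time::Instant;
+
+ use crate::util::BenchmarkRun;
+
+ impl RunOpt {
+     pub async fn run(self) -> Result<()> {
+         let mut benchmark_run = BenchmarkRun::new();
+         benchmark_run.start_new_case("my case"); // appears in the "Query" column
+         for _ in 0..3 {
+             let start = Instant::now();
+             let row_count = run_my_case().await?; // hypothetical measured operation
+             benchmark_run.write_iter(start.elapsed(), row_count); // record one iteration
+         }
+         // Writes the results as JSON if --output was given
+         benchmark_run.maybe_write_json(self.output_path.as_ref())?;
+         Ok(())
+     }
+ }
+ ```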
+

# Benchmarks

The output of `dfbench` help includes a description of each benchmark, which is reproduced here for convenience