Description
Is your feature request related to a problem or challenge?
I am mostly writing this up to record what I think is ongoing work by @jayzhan211, @Rachelint, @korowa, and myself.
TLDR: we are working on (and getting pretty close to) making DataFusion the fastest single-node engine for querying Parquet files in ClickBench.
Background:
https://benchmark.clickhouse.com/ shows the results of ClickBench
The ClickBench benchmark itself is described here: https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading.
Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:
Describe the solution you'd like
I would like DataFusion to be the fastest
Describe alternatives you've considered
No response
Additional context
This is also inspired by @ozankabak's call to action on #11442
The scripts to run with datafusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
Last update is here: ClickHouse/ClickBench#210