[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

Closed
@alamb

Is your feature request related to a problem or challenge?

I am mostly writing this up to record what I think is ongoing work by @jayzhan211, @Rachelint, @korowa, and myself.

TLDR: we are working on (and getting pretty close to) making DataFusion the fastest single-node engine for querying parquet files in ClickBench.

Background:

https://benchmark.clickhouse.com/ shows the results of ClickBench

ClickBench is the benchmark; it is described at https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading, so the focus here is on querying parquet files directly.

Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:

[Screenshot of the ClickBench leaderboard, taken 2024-10-08 at 4:45:16 PM]

Describe the solution you'd like

I would like DataFusion to be the fastest single-node engine for querying parquet data in ClickBench.

Describe alternatives you've considered

No response

Additional context

This is also inspired by @ozankabak's call to action on #11442.

The scripts to run ClickBench with DataFusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

Last update is here: ClickHouse/ClickBench#210

Labels: enhancement
