Blog post about parquet vs custom file formats #16149

Open
Tracked by #14836
alamb opened this issue May 22, 2025 · 3 comments
Labels
enhancement New feature or request

Comments


alamb commented May 22, 2025

Is your feature request related to a problem or challenge?

https://x.com/andrewlamb1111/status/1925537738360504663

ClickBench keeps me convinced that Parquet can be quite fast. There is only a 2.3x performance difference between @duckdb's own format and unoptimized parquet: https://tinyurl.com/5aexvsfw. I am surprised that the (closed source) Umbra reports being only 3.3x faster than DuckDB on parquet

Describe the solution you'd like

I would love to make a blog post about how much faster/slower custom file formats are compared to parquet. I am typing this ticket now that it is on my mind so I don't forget it.

The basic thesis is that

  • Custom file formats only get you XX% more performance than parquet
  • Many of the historic performance differences are due to engineering investment rather than format
  • Parquet has many other benefits (like a very large ecosystem)

==> therefore parquet is the format that really matters

Describe alternatives you've considered

The core of the post would be to compare

  1. A proprietary format (like DuckDB's or Umbra's)
  2. Normal parquet
  3. "Optimized parquet"

I think we could basically use the https://github.com/ClickHouse/ClickBench dataset and queries (and results from the proprietary systems)

What is needed is to generate the "optimized parquet" numbers.

The partitioned parquet files from ClickBench are not optimized. Specifically they:

  1. Are not sorted in any way
  2. Do not have a page index (Offset index)
  3. Use snappy compression

A fun experiment might be to "fix" the clickbench partitioned dataset by

  1. resorting and writing with page indexes (could use a bunch of DataFusion COPY commands pretty easily to do this). The sort order should be some subset of the predicate columns. Perhaps EventTime and then maybe SearchPhrase / URL.
  2. disabling compression

Additional context

No response

@alamb alamb added the enhancement New feature or request label May 22, 2025
@alamb alamb mentioned this issue May 22, 2025

Dandandan commented May 22, 2025

Interestingly, ClickHouse's ClickBench results being quite a bit faster again for 1.3 (ClickHouse/ClickBench#376) seem mostly related to using predicate pushdown more effectively during Parquet decoding (which they might already have implemented for their own format).


alamb commented May 22, 2025

Interestingly, ClickHouse's ClickBench results being quite a bit faster again for 1.3 (ClickHouse/ClickBench#376) seem mostly related to using predicate pushdown more effectively during Parquet decoding (which they might already have implemented for their own format).

Indeed -- unsurprisingly, the more effort that is put into Parquet readers, the faster they go 😆, and the open nature and widespread adoption of the format make it easier to gather that required effort.

BTW, I am working on the same for DataFusion with @zhuqi-lucas in apache/arrow-rs#7456

I hope we will have some major improvements to share in another week or two


zhuqi-lucas commented May 23, 2025

A fun experiment might be to "fix" the clickbench partitioned dataset by

resorting and writing with page indexes (could use a bunch of DataFusion COPY commands pretty easily to do this). The sort order should be some subset of the predicate columns. Perhaps EventTime and then maybe SearchPhrase / URL.
disabling compression

This is very interesting -- maybe we can also do this for the arrow-rs ClickBench benchmark to see the results.
