Is your feature request related to a problem or challenge?

https://x.com/andrewlamb1111/status/1925537738360504663

> ClickBench keeps me convinced that Parquet can be quite fast. There is only a 2.3x performance difference between @duckdb's own format and unoptimized Parquet: https://tinyurl.com/5aexvsfw. I am surprised that the (closed source) Umbra reports being only 3.3x faster than DuckDB on Parquet.
Describe the solution you'd like

I would love to write a blog post about how much faster/slower custom file formats are compared to Parquet. I am typing up this ticket now that it is on my mind so I don't forget it.

The basic thesis is that:

- Custom file formats only get you XX% more performance than Parquet
- Many of the historic performance differences are due to engineering investment rather than the format itself
- Parquet has many other benefits (like a very large ecosystem)
- ==> therefore Parquet is the format that really matters
Describe alternatives you've considered

The core of the post would be to compare custom file formats with Parquet on the same queries. I think we could basically use the https://github.com/ClickHouse/ClickBench dataset and queries (and results from the proprietary systems). The thing that is needed is to generate "optimized parquet" numbers, since the partitioned Parquet files from ClickBench are not optimized.

A fun experiment might be to "fix" the ClickBench partitioned dataset by:

- re-sorting and writing with page indexes (could use a bunch of DataFusion COPY commands pretty easily to do this; see the sketch after this list). The sort order should be some subset of the predicate columns, perhaps EventTime and then maybe SearchPhrase / URL.
- disabling compression
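A minimal sketch of those COPY commands for datafusion-cli. The paths, sort keys, and especially the OPTIONS spelling are illustrative assumptions; the Parquet writer option names have changed across DataFusion versions, so check the COPY documentation for the release you use.

```sql
-- Register the unoptimized ClickBench partitioned dataset
-- (the path is hypothetical)
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits_partitioned/';

-- Rewrite it sorted by likely predicate columns, with compression
-- disabled. The option key below follows recent DataFusion releases;
-- older versions spell it differently.
COPY (
    SELECT * FROM hits
    ORDER BY "EventTime", "SearchPhrase"
)
TO 'hits_optimized.parquet'
STORED AS PARQUET
OPTIONS ('format.compression' 'uncompressed');
```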
Additional context
No response
Interestingly, ClickBench being quite a bit faster again for 1.3 (ClickHouse/ClickBench#376) seems mostly related to using predicate pushdown more effectively during Parquet decoding (which they might already have implemented for their own format).
Indeed -- unsurprisingly, the more effort that is put into Parquet readers, the faster they go 😆 and the open nature / widespread adoption of the format makes it easier to gather that required effort.
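For reference, DataFusion gates a similar decoder-level filter evaluation behind a config setting; a minimal datafusion-cli sketch, assuming the `datafusion.execution.parquet.pushdown_filters` setting exposed by recent releases (the query itself is illustrative):

```sql
-- Evaluate WHERE predicates inside the Parquet decoder, so rows that
-- fail the filter are never fully decoded (off by default)
SET datafusion.execution.parquet.pushdown_filters = true;

-- Illustrative ClickBench-style selective query
SELECT COUNT(*)
FROM hits
WHERE "URL" LIKE '%google%';
```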
> A fun experiment might be to "fix" the ClickBench partitioned dataset by:
> - re-sorting and writing with page indexes (could use a bunch of DataFusion COPY commands pretty easily to do this). The sort order should be some subset of the predicate columns, perhaps EventTime and then maybe SearchPhrase / URL.
> - disabling compression
This is very interesting; maybe we could also do this for the arrow-rs ClickBench benchmark to see the result.