The basic idea is that you have a parquet file with some projections and a TopK sort on some (ideally small) subset of those projections. So you can:
1. Read the columns required for the TopK sort along with their row offsets.
2. Build the TopK and discard everything else.
3. Use the row ids from the TopK rows to build a RowSelection for reading the remaining columns.
4. Read the remaining columns using that row selection.
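To make the third step concrete, here is a minimal sketch of turning a set of row offsets into alternating skip/select runs. The `Run` enum and `runs_from_row_ids` are hypothetical stand-ins written for illustration; with the parquet crate, the equivalent runs would presumably be expressed as `RowSelector::skip(n)` / `RowSelector::select(n)` and collected into a `RowSelection`.

```rust
/// Hypothetical stand-in for one run of a row selection:
/// either skip `n` rows or select `n` rows.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert file row offsets (in any order) into alternating skip/select runs.
/// The runs cover the file from row 0 up to the last selected row.
pub fn runs_from_row_ids(mut row_ids: Vec<usize>) -> Vec<Run> {
    row_ids.sort_unstable();
    row_ids.dedup();

    let mut runs: Vec<Run> = Vec::new();
    let mut next = 0usize; // first row not yet covered by any run

    for id in row_ids {
        if id > next {
            // Gap between the previous selected row and this one.
            runs.push(Run::Skip(id - next));
        }
        // If the last run is a select, this row is adjacent to it: extend it.
        if let Some(Run::Select(n)) = runs.last_mut() {
            *n += 1;
        } else {
            runs.push(Run::Select(1));
        }
        next = id + 1;
    }
    runs
}
```

For example, the row ids 7, 2, 3 and 9 become: skip 2, select 2, skip 3, select 1, skip 1, select 1. Merging adjacent rows into a single select run keeps the selection compact when the TopK rows cluster together.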
The current implementation of the parquet reader can't support this if you are pushing row filters down to the scan, because the offsets of the rows returned by the scan in step 1 will not align with the offsets of those rows in the file.
But it is relatively straightforward to keep track of the file offsets during the scan and return them alongside the data.
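The misalignment can be illustrated with a small self-contained sketch that does not use the parquet crate at all. The function `top_k_with_file_offsets` and its parameters are hypothetical: a slice stands in for the sort column, a closure stands in for the pushed-down row filter, and the offset is taken from the position in the file rather than the position in the filtered output.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Scan `sort_col` (one value per file row), apply `keep` as a stand-in for a
/// pushed-down row filter, and return the `k` largest surviving values
/// together with their offsets in the file, largest first.
pub fn top_k_with_file_offsets(
    sort_col: &[i64],
    keep: impl Fn(i64) -> bool,
    k: usize,
) -> Vec<(i64, usize)> {
    // Min-heap (via Reverse) holding at most k (value, file_offset) pairs.
    let mut heap: BinaryHeap<Reverse<(i64, usize)>> = BinaryHeap::new();

    // `enumerate` counts every row in the file, including filtered-out rows,
    // so `file_offset` stays aligned with the file.
    for (file_offset, &value) in sort_col.iter().enumerate() {
        if !keep(value) {
            continue;
        }
        heap.push(Reverse((value, file_offset)));
        if heap.len() > k {
            heap.pop(); // drop the current smallest entry
        }
    }

    let mut out: Vec<(i64, usize)> = heap.into_iter().map(|Reverse(p)| p).collect();
    out.sort_unstable_by(|a, b| b.cmp(a)); // largest value first
    out
}
```

With the column `[5, 42, 7, 13, 99, 8]`, a filter that keeps odd values, and `k = 2`, the two winners are 99 and 13 at file offsets 4 and 3. Their positions in the filtered output would be 3 and 2, which is why offsets counted after filtering cannot be used to build a selection against the file.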
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Add a method that will project a column named field_name into the output of the reader, containing the row offset in the parquet file of each row.
Describe the solution you'd like
A prototype implementation can be found here: coralogix@3d4a09f
If this seems like something we can merge upstream, I can create a PR to master in the upstream repo.
Describe alternatives you've considered
Not do it :)
Additional context
I'm trying to implement something like https://clickhouse.com/blog/clickhouse-gets-lazier-and-faster-introducing-lazy-materialization in a way that does not require re-scanning metadata or re-scanning fields that have already been read and decoded.