
Avoid buffering the entire row group in Async parquet reader #6946


Description


Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The async_reader ParquetRecordBatchStreamBuilder is awesome and fast.

However, it is currently quite aggressive when reading and will buffer an entire row group in memory during the read. For data reorganization operations that read all columns, such as merging multiple files together, we see significant memory used buffering this input.
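
For context, here is a minimal sketch of the read pattern we have in mind, assuming the parquet crate's async feature and a local tokio file for simplicity (the file name and batch size are illustrative):

```rust
use futures::TryStreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One of the ~60MB input files described below
    let file = tokio::fs::File::open("input.parquet").await?;

    // Build a stream of RecordBatches. Today, each row group is fetched
    // into an InMemoryRowGroup in its entirety before decoding begins.
    let stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_batch_size(8192)
        .build()?;

    // Drain the stream: peak memory is roughly one full row group per open
    // reader, which is what this issue is about.
    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```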

For example, we have some files at InfluxData that each contain 1M rows with 10 columns. Using the default WriterProperties with a row group size of 1M and 20k rows per page (spelled out in the sketch after this list) results in a parquet file with

  1. a single 1M row group
  2. 10 column chunks, each with 50 pages
  3. a total file size of 60MB
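
For reference, a sketch of that writer configuration; these setters exist on WriterProperties, and the values mirror the defaults mentioned above, spelled out explicitly for clarity:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Row group / page sizing that yields the layout described above.
    let props = WriterProperties::builder()
        .set_max_row_group_size(1_000_000)     // one 1M row group per file
        .set_data_page_row_count_limit(20_000) // ~50 pages per column chunk
        .build();

    assert_eq!(props.max_row_group_size(), 1_000_000);
}
```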

Thus merging 10 such files requires 600MB of memory just to buffer the parquet data; 100 such files require a 6GB buffer, and so on.

We can reduce the memory required by reducing the number of files merged concurrently (the fan-in), as well as by reducing the number of rows in each row group. However, I also think there is potential for improvement in the parquet reader itself so that it does not buffer the entire file while reading.

The root cause of this memory use is that all the pages of the current RowGroup are read into memory via the InMemoryRowGroup.

In pictures, if we have a parquet file on disk/object store:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─          ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┃
┃    ┌──────────────┐  │     ┌──────────────┐  │            ┌──────────────┐  │ ┃
┃ │  │    Page 1    │     │  │    Page 1    │            │  │    Page 1    │    ┃
┃    └──────────────┘  │     └──────────────┘  │            └──────────────┘  │ ┃
┃ │  ┌──────────────┐     │  ┌──────────────┐            │  ┌──────────────┐    ┃
┃    │    Page 2    │  │     │    Page 2    │  │            │    Page 2    │  │ ┃
┃ │  └──────────────┘     │  └──────────────┘      ...   │  └──────────────┘    ┃
┃          ...         │           ...         │                  ...         │ ┃
┃ │                       │                              │                      ┃
┃    ┌──────────────┐  │     ┌──────────────┐  │            ┌──────────────┐  │ ┃
┃ │  │    Page N    │     │  │    Page N    │            │  │    Page N    │    ┃
┃    └──────────────┘  │     └──────────────┘  │            └──────────────┘  │ ┃
┃ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─          └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┃
┃      ColumnChunk             ColumnChunk                    ColumnChunk       ┃
┃                                                                               ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━Row Group 1━━┛

Reading it requires loading all pages from all columns being read into the InMemoryRowGroup (a sketch of the resulting I/O pattern follows this diagram):

┏━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ┓ 
                                                                                    ┃ 
┃ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ 
┃ ┃ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─          ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┃   
┃ ┃    ┌──────────────┐  │     ┌──────────────┐  │            ┌──────────────┐  │ ┃ ┃ 
  ┃ │  │    Page 1    │     │  │    Page 1    │            │  │    Page 1    │    ┃ ┃ 
┃ ┃    └──────────────┘  │     └──────────────┘  │            └──────────────┘  │ ┃ ┃ 
┃ ┃ │  ┌──────────────┐     │  ┌──────────────┐            │  ┌──────────────┐    ┃   
┃ ┃    │    Page 2    │  │     │    Page 2    │  │            │    Page 2    │  │ ┃ ┃ 
  ┃ │  └──────────────┘     │  └──────────────┘      ...   │  └──────────────┘    ┃ ┃ 
┃ ┃          ...         │           ...         │                  ...         │ ┃ ┃ 
┃ ┃ │                       │                              │                      ┃   
┃ ┃    ┌──────────────┐  │     ┌──────────────┐  │            ┌──────────────┐  │ ┃ ┃ 
  ┃ │  │    Page N    │     │  │    Page N    │            │  │    Page N    │    ┃ ┃ 
┃ ┃    └──────────────┘  │     └──────────────┘  │            └──────────────┘  │ ┃ ┃ 
┃ ┃ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─          └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┃   
┃ ┃      ColumnChunk             ColumnChunk                    ColumnChunk       ┃ ┃ 
  ┃                                                                               ┃ ┃ 
┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━Row Group 1━━┛ ┃ 
┃                                                                                     
┗ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━┛ 
                                                                                      
                                                                  InMemoryRowGroup    
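
To make the I/O side concrete, here is a simplified sketch of the fetch pattern this buffering implies; this is not the actual InMemoryRowGroup code, just the shape of the request it makes, assuming a recent object_store release where byte ranges are expressed as Range<u64> (older releases use usize):

```rust
use std::ops::Range;

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

/// Sketch only: fetch the byte range of every projected column chunk in the
/// row group in one shot, so all of its pages are buffered before decoding
/// starts (~60MB per file in the example above).
async fn fetch_whole_row_group(
    store: &dyn ObjectStore,
    location: &Path,
    column_chunk_ranges: Vec<Range<u64>>, // taken from the parquet footer metadata
) -> object_store::Result<Vec<Bytes>> {
    // One coalesced get_ranges call, but the returned buffers together hold
    // every page of the row group.
    store.get_ranges(location, &column_chunk_ranges).await
}
```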
                                                                                      

Describe the solution you'd like

Basically, the TLDR is: only read a subset of the pages into memory at any one time, and do additional I/O to fetch the others as needed:

┏━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ┓  
┃                                                                                   ┃  
┃ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃  
  ┃ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─          ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  ┃    
┃ ┃    ┌──────────────┐  │     ┌──────────────┐  │            ┌──────────────┐  │ ┃ ┃  
┃ ┃ │  │    Page 1    │     │  │    Page 1    │            │  │    Page 1    │    ┃ ┃  
┃ ┃    └──────────────┘  │     └──────────────┘  │            └──────────────┘  │ ┃ ┃  
  ┃ │  ┌──────────────┐     │  ┌──────────────┐            │  ┌──────────────┐    ┃    
┃ ┃    │    Page 2    │  │     │    Page 2    │  │            │    Page 2    │  │ ┃ ┃  
┃ ┃ │  └──────────────┘     │  └──────────────┘      ...   │  └──────────────┘    ┃ ┃  
┃ ┃  ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘          ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ ┃ ┃  
  ┃      ColumnChunk             ColumnChunk                    ColumnChunk       ┃    
┃ ┃                                                                               ┃ ┃  
┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━Row Group 1━━┛ ┃  
┃                                                                                   ┃  
 ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━ ━━━   
                                                                                       
                                                                   InMemoryRowGroup    
                                                                                       

Describe alternatives you've considered

Of course, there is a tradeoff between buffer usage and the number of I/O requests, so there is unlikely to be one setting that works for all use cases. For the example above, the two extremes (and one intermediate point) are:

| Target | I/Os required | Buffer required |
|--------|---------------|-----------------|
| Minimize I/O requests (behavior today) | 1 request (ObjectStore::get_ranges call) | 60MB buffer |
| Minimize buffering | 500 requests (one for each page) | ~1MB buffer (60MB / 50 pages) |
| Intermediate buffering | 10 requests (get 5 pages per column per request) | 10MB buffer |

So I would propose implementing some new optional setting, such as row_group_page_buffer_size, that the reader would respect: it would try to limit its buffer usage to row_group_page_buffer_size, ideally still fetching multiple pages at a time.

Note that large pages (e.g. large string data) are likely to make it impossible to always stay within a particular buffer size.
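
To illustrate how this might look from the user's side, here is a hypothetical sketch; with_row_group_page_buffer_size does not exist today, and both the name and the builder it hangs off of are only placeholders for whatever API is ultimately chosen:

```rust
use futures::TryStreamExt;
use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open("input.parquet").await?;

    let stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        // Hypothetical setting: cap the page data buffered per row group at
        // ~10MB, issuing extra fetches for later pages instead of holding the
        // whole row group in memory.
        .with_row_group_page_buffer_size(10 * 1024 * 1024)
        .build()?;

    let _batches: Vec<_> = stream.try_collect().await?;
    Ok(())
}
```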

Additional context
