[BUG] Chunked parquet reader with small chunk_read_limit
does not correctly read nested large string columns.
#17692
Labels: bug
Describe the bug
Discovered while investigating #17693.
The chunked parquet reader fails to correctly read nested large string columns for smaller `chunk_read_limit` values (multiple output table chunks per subpass). This is likely because `cudf/cpp/src/io/utilities/column_buffer_strings.cu` (line 40 at 30c6caa) sets the first strings column offset to 0, which may not be true due to shared output buffers between input columns, as described in `cudf/cpp/src/io/parquet/reader_impl.cpp` (lines 170 to 179 at 30c6caa). To avoid this, we must take `str_offset` into account when computing the offsets for large-string columns, similar to how we do for non-large-string columns in `cudf/cpp/src/io/parquet/page_string_decode.cu` (lines 1130 to 1138 at 30c6caa).
Note that it is not trivial to extract the correct `str_offset` for output chunks in this case.

Steps/Code to reproduce bug
Insert the following GTest in `tests/large_strings/parquet_tests.cpp`:
Expected behavior
The chunked parquet reader should correctly read nested large string columns.
Environment overview (please complete the following information)
RDS Lab dgx-05 machine. Devcontainer: cudf25.02, cuda 12.5, conda
Environment details
Devcontainer: cudf25.02, cuda 12.5, conda
Additional context
Fixing this issue will improve the work in #17207.