
[BUG] Chunked parquet reader with small chunk_read_limit does not correctly read nested large string columns. #17692

Open · mhaseeb123 opened this issue on Jan 8, 2025 · 3 comments · May be fixed by #17702
Labels: bug (Something isn't working)

mhaseeb123 (Member) commented on Jan 8, 2025

Describe the bug
Discovered while investigating #17693.

The chunked parquet reader fails to correctly read nested large string columns when chunk_read_limit is small (i.e., when a subpass produces multiple output table chunks). This is likely caused by the following call:

cudf::detail::sizes_to_offsets(offsets_ptr, offsets_ptr + size + 1, d_offsets64, stream);

sets the first strings column offset to 0, which may not be correct because output buffers can be shared between input columns, as described here:

// fill in the arrays on the host. there are some important considerations to
// take into account here for nested columns. specifically, with structs
// there is sharing of output buffers between input columns. consider this schema
//
// required group field_id=1 name {
// required binary field_id=2 firstname (String);
// required binary field_id=3 middlename (String);
// required binary field_id=4 lastname (String);
// }
//
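
For illustration, here is a minimal standalone sketch (plain C++ with hypothetical names, not libcudf code) of why a zero-based sizes-to-offsets scan goes wrong when a chunk's strings actually begin at a nonzero byte position in a shared chars buffer:

#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Convert per-string sizes to offsets, starting from `base` instead of 0.
std::vector<int64_t> sizes_to_offsets_from(std::vector<int64_t> const& sizes,
                                           int64_t base)
{
  std::vector<int64_t> offsets(sizes.size() + 1);
  offsets.front() = base;
  std::inclusive_scan(
    sizes.begin(), sizes.end(), offsets.begin() + 1, std::plus<>{}, base);
  return offsets;
}

int main()
{
  std::vector<int64_t> const sizes{3, 3, 3, 3};
  auto const zero_based = sizes_to_offsets_from(sizes, 0);   // {0, 3, 6, 9, 12}
  auto const rebased    = sizes_to_offsets_from(sizes, 42);  // {42, 45, 48, 51, 54}
  // cudf::detail::sizes_to_offsets behaves like the first call: the scan
  // always starts at 0, which is wrong when this chunk's string data begins
  // at str_offset != 0 within an output buffer shared with other columns.
}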

To avoid this, we must account for the str_offset when computing offsets for large-string columns, similar to how we do for non-large-string columns here:

if (not s->col.is_large_string_col) {
int value_count = nesting_info_base[leaf_level_index].value_count;
// if no repetition we haven't calculated start/end bounds and instead just skipped
// values until we reach first_row. account for that here.
if (!has_repetition) { value_count -= s->first_row; }
auto const offptr = reinterpret_cast<size_type*>(nesting_info_base[leaf_level_index].data_out);
block_excl_sum<decode_block_size>(offptr, value_count, s->page.str_offset);

Note that it is not trivial to extract the correct str_offset for output chunks in this case.
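
For concreteness, here is a rough sketch (Thrust; a hypothetical helper, not the proposed patch) of what "considering the str_offset" could look like for the large-string path, assuming the correct per-chunk str_offset were available:

#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cstdint>

// Scan `size + 1` per-value string sizes into 64-bit offsets, seeded with
// str_offset instead of the implicit 0 that the current scan uses.
// d_sizes and d_offsets64 are device pointers; this mirrors the
// block_excl_sum(offptr, value_count, s->page.str_offset) pattern above.
void sizes_to_offsets64_with_base(int32_t const* d_sizes,
                                  int64_t* d_offsets64,
                                  int64_t size,
                                  int64_t str_offset)
{
  thrust::exclusive_scan(
    thrust::device, d_sizes, d_sizes + size + 1, d_offsets64, str_offset);
}

The scan itself is the easy part; as noted above, deriving the right str_offset for each output chunk is what makes the fix non-trivial.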

Steps/Code to reproduce bug
Insert the following GTest in tests/large_strings/parquet_tests.cpp:

#include <cudf_test/column_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>

TEST_F(ParquetStringsTest, ChunkedReadNestedLargeStrings)
{
  using int32s_col  = cudf::test::fixed_width_column_wrapper<int32_t>;
  using strings_col = cudf::test::strings_column_wrapper;
  using structs_col = cudf::test::structs_column_wrapper;

  auto constexpr num_rows = 100'000;

  std::vector<std::unique_ptr<cudf::column>> input_columns;
  auto const int_iter = thrust::make_counting_iterator(0);
  input_columns.emplace_back(int32s_col(int_iter, int_iter + num_rows).release());

  auto offsets = std::vector<cudf::size_type>{};
  offsets.reserve(num_rows * 2);
  cudf::size_type num_structs = 0;
  for (int i = 0; i < num_rows; ++i) {
    offsets.push_back(num_structs);
    auto const new_list_size = i % 4;
    num_structs += new_list_size;
  }
  offsets.push_back(num_structs);

  auto const make_structs_col = [=] {
    auto child1 = int32s_col(int_iter, int_iter + num_structs);
    auto child2 = int32s_col(int_iter + num_structs, int_iter + num_structs * 2);

    auto const str_iter = cudf::detail::make_counting_transform_iterator(
      0, [&](int32_t i) { return std::to_string(i) + std::to_string(i) + std::to_string(i); });
    auto child3 = strings_col{str_iter, str_iter + num_structs};

    return structs_col{{child1, child2, child3}}.release();
  };

  input_columns.emplace_back(
    cudf::make_lists_column(static_cast<cudf::size_type>(offsets.size() - 1),
                            int32s_col(offsets.begin(), offsets.end()).release(),
                            make_structs_col(),
                            0,
                            rmm::device_buffer{}));

  // Expected table
  auto const table    = cudf::table{std::move(input_columns)};
  auto const expected = table.view();

  auto str_col_view = expected.column(1).child(1).child(2);
  EXPECT_TRUE(str_col_view.type().id() == cudf::type_id::STRING);
  auto const column_size =
    cudf::strings_column_view(str_col_view).chars_size(cudf::get_default_stream());
  auto const threshold = column_size / 10;  // ensure all output chunks are large strings
  // set a smaller threshold to reduce file size and execution time
  setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold).c_str(), 1);

  // Host buffer to write Parquet
  auto filepath = g_temp_env->get_temp_filepath("chunked_nested_large_strings.parquet");

  // Writer options
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .max_page_size_bytes(512 * 1024)
      .max_page_size_rows(20000)
      .dictionary_policy(cudf::io::dictionary_policy::ALWAYS)
      .write_v2_headers(false);

  // Write to Parquet
  cudf::io::write_parquet(out_opts);

  // Reader options
  cudf::io::parquet_reader_options in_opts =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info(filepath));

  // Chunked parquet reader with chunk_read_limit = 1 byte and no pass read
  // limit, forcing the smallest possible output chunks per subpass
  auto reader = cudf::io::chunked_parquet_reader(1, 0, in_opts);

  // Read chunked
  auto tables = std::vector<std::unique_ptr<cudf::table>>{};
  while (reader.has_next()) {
    tables.emplace_back(reader.read_chunk().tbl);
  }
  auto table_views = std::vector<cudf::table_view>{};
  std::transform(tables.begin(), tables.end(), std::back_inserter(table_views), [](auto& tbl) {
    return tbl->view();
  });
  auto result = cudf::concatenate(table_views);

  // Verify tables to be equal
  CUDF_TEST_EXPECT_TABLES_EQUAL(result->view(), expected);

  // go back to normal threshold
  unsetenv("LIBCUDF_LARGE_STRINGS_THRESHOLD");
}

Expected behavior
The chunked parquet reader should correctly read nested large string columns.

Environment overview (please complete the following information)
RDS Lab dgx-05 machine. Devcontainer: cudf25.02, cuda 12.5, conda

Environment details
Devcontainer: cudf25.02, cuda 12.5, conda

Additional context
Fixing this issue will improve the work in #17207.

mhaseeb123 added the bug (Something isn't working) label on Jan 8, 2025
mhaseeb123 (Member, Author) commented:

CC: @GregoryKimball

GregoryKimball (Contributor) commented:

Would you please post the written file from this test, plus a second file by writing the corrupted table?

mhaseeb123 (Member, Author) commented:

Here are the parquet files: nested-large-strings.zip
The column of interest is _col1.list.element._col1_1_2.
