
[BUG] Chunked parquet reader with small chunk_read_limit does not correctly read nested large string columns. #17692

Open · mhaseeb123 opened this issue on Jan 8, 2025 · 3 comments · May be fixed by #17702
Labels: bug (Something isn't working)

mhaseeb123 (Member) commented on Jan 8, 2025

Describe the bug
Discovered while investigating #17693.

The chunked parquet reader fails to correctly read nested large string columns when chunk_read_limit is small (i.e., when a subpass produces multiple output table chunks). This is likely caused by the following call:

cudf::detail::sizes_to_offsets(offsets_ptr, offsets_ptr + size + 1, d_offsets64, stream);

sets the first strings column offset to 0, which may not be correct because output buffers can be shared between input columns, as described here:

// fill in the arrays on the host. there are some important considerations to
// take into account here for nested columns. specifically, with structs
// there is sharing of output buffers between input columns. consider this schema
//
// required group field_id=1 name {
// required binary field_id=2 firstname (String);
// required binary field_id=3 middlename (String);
// required binary field_id=4 lastname (String);
// }
//
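
For illustration, here is a minimal standalone sketch (plain C++ with hypothetical names, not libcudf code) of why a zero-based sizes-to-offsets scan goes wrong when a chunk's strings actually begin at a nonzero byte position in a shared chars buffer:

#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Convert per-string sizes to offsets, starting from `base` instead of 0.
std::vector<int64_t> sizes_to_offsets_from(std::vector<int64_t> const& sizes,
                                           int64_t base)
{
  std::vector<int64_t> offsets(sizes.size() + 1);
  offsets.front() = base;
  std::inclusive_scan(
    sizes.begin(), sizes.end(), offsets.begin() + 1, std::plus<>{}, base);
  return offsets;
}

int main()
{
  std::vector<int64_t> const sizes{3, 3, 3, 3};
  auto const zero_based = sizes_to_offsets_from(sizes, 0);   // {0, 3, 6, 9, 12}
  auto const rebased    = sizes_to_offsets_from(sizes, 42);  // {42, 45, 48, 51, 54}
  // cudf::detail::sizes_to_offsets behaves like the first call: the scan
  // always starts at 0, which is wrong when this chunk's string data begins
  // at str_offset != 0 within an output buffer shared with other columns.
}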

To avoid this, we must account for the str_offset when computing offsets for large-string columns, similar to how we do for non-large-string columns here:

if (not s->col.is_large_string_col) {
int value_count = nesting_info_base[leaf_level_index].value_count;
// if no repetition we haven't calculated start/end bounds and instead just skipped
// values until we reach first_row. account for that here.
if (!has_repetition) { value_count -= s->first_row; }
auto const offptr = reinterpret_cast<size_type*>(nesting_info_base[leaf_level_index].data_out);
block_excl_sum<decode_block_size>(offptr, value_count, s->page.str_offset);

Note that it is not trivial to extract the correct str_offset for output chunks in this case.
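
For concreteness, here is a rough sketch (Thrust; a hypothetical helper, not the proposed patch) of what "considering the str_offset" could look like for the large-string path, assuming the correct per-chunk str_offset were available:

#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <cstdint>

// Scan `size + 1` per-value string sizes into 64-bit offsets, seeded with
// str_offset instead of the implicit 0 that the current scan uses.
// d_sizes and d_offsets64 are device pointers; this mirrors the
// block_excl_sum(offptr, value_count, s->page.str_offset) pattern above.
void sizes_to_offsets64_with_base(int32_t const* d_sizes,
                                  int64_t* d_offsets64,
                                  int64_t size,
                                  int64_t str_offset)
{
  thrust::exclusive_scan(
    thrust::device, d_sizes, d_sizes + size + 1, d_offsets64, str_offset);
}

The scan itself is the easy part; as noted above, deriving the right str_offset for each output chunk is what makes the fix non-trivial.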

Steps/Code to reproduce bug
Insert the following GTest in tests/large_strings/parquet_tests.cpp:

#include <cudf_test/column_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>

TEST_F(ParquetStringsTest, ChunkedReadNestedLargeStrings)
{
  using int32s_col  = cudf::test::fixed_width_column_wrapper<int32_t>;
  using strings_col = cudf::test::strings_column_wrapper;
  using structs_col = cudf::test::structs_column_wrapper;

  auto constexpr num_rows = 100'000;

  std::vector<std::unique_ptr<cudf::column>> input_columns;
  auto const int_iter = thrust::make_counting_iterator(0);
  input_columns.emplace_back(int32s_col(int_iter, int_iter + num_rows).release());

  auto offsets = std::vector<cudf::size_type>{};
  offsets.reserve(num_rows * 2);
  cudf::size_type num_structs = 0;
  for (int i = 0; i < num_rows; ++i) {
    offsets.push_back(num_structs);
    auto const new_list_size = i % 4;
    num_structs += new_list_size;
  }
  offsets.push_back(num_structs);

  auto const make_structs_col = [=] {
    auto child1 = int32s_col(int_iter, int_iter + num_structs);
    auto child2 = int32s_col(int_iter + num_structs, int_iter + num_structs * 2);

    auto const str_iter = cudf::detail::make_counting_transform_iterator(
      0, [&](int32_t i) { return std::to_string(i) + std::to_string(i) + std::to_string(i); });
    auto child3 = strings_col{str_iter, str_iter + num_structs};

    return structs_col{{child1, child2, child3}}.release();
  };

  input_columns.emplace_back(
    cudf::make_lists_column(static_cast<cudf::size_type>(offsets.size() - 1),
                            int32s_col(offsets.begin(), offsets.end()).release(),
                            make_structs_col(),
                            0,
                            rmm::device_buffer{}));

  // Expected table
  auto const table    = cudf::table{std::move(input_columns)};
  auto const expected = table.view();

  auto str_col_view = expected.column(1).child(1).child(2);
  EXPECT_TRUE(str_col_view.type().id() == cudf::type_id::STRING);
  auto const column_size =
    cudf::strings_column_view(str_col_view).chars_size(cudf::get_default_stream());
  auto const threshold = column_size / 10;  // ensure all output chunks are large strings
  // set a smaller threshold to reduce file size and execution time
  setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", std::to_string(threshold).c_str(), 1);

  // Host buffer to write Parquet
  auto filepath = g_temp_env->get_temp_filepath("chunked_nested_large_strings.parquet");

  // Writer options
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .max_page_size_bytes(512 * 1024)
      .max_page_size_rows(20000)
      .dictionary_policy(cudf::io::dictionary_policy::ALWAYS)
      .write_v2_headers(false);

  // Write to Parquet
  cudf::io::write_parquet(out_opts);

  // Reader options
  cudf::io::parquet_reader_options in_opts =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info(filepath));

  // Chunked parquet reader with chunk_read_limit = 1 byte and no pass read
  // limit, forcing the smallest possible output chunks per subpass
  auto reader = cudf::io::chunked_parquet_reader(1, 0, in_opts);

  // Read chunked
  auto tables = std::vector<std::unique_ptr<cudf::table>>{};
  while (reader.has_next()) {
    tables.emplace_back(reader.read_chunk().tbl);
  }
  auto table_views = std::vector<cudf::table_view>{};
  std::transform(tables.begin(), tables.end(), std::back_inserter(table_views), [](auto& tbl) {
    return tbl->view();
  });
  auto result = cudf::concatenate(table_views);

  // Verify tables to be equal
  CUDF_TEST_EXPECT_TABLES_EQUAL(result->view(), expected);

  // go back to normal threshold
  unsetenv("LIBCUDF_LARGE_STRINGS_THRESHOLD");
}

Expected behavior
The chunked parquet reader should correctly read nested large string columns.

Environment overview (please complete the following information)
RDS Lab dgx-05 machine. Devcontainer: cudf25.02, cuda 12.5, conda

Environment details
Devcontainer: cudf25.02, cuda 12.5, conda

Additional context
Fixing this issue will improve the work in #17207.

mhaseeb123 added the bug (Something isn't working) label on Jan 8, 2025
mhaseeb123 (Member, Author) commented:

CC: @GregoryKimball

GregoryKimball (Contributor) commented:

Would you please post the written file from this test, plus a second file by writing the corrupted table?

mhaseeb123 (Member, Author) commented:

Here are the parquet files: nested-large-strings.zip
The column of interest is _col1.list.element._col1_1_2.
