[BUG] Invalid UTF8 can be read from parquet (and probably other formats) #12177

revans2 · 2025-02-19T16:12:07Z

Describe the bug
A customer recently hit a hang when they read a parquet file containing invalid UTF-8 and then processed those strings with a regular expression. cuDF is looking into the hang itself, but we should be checking/normalizing UTF-8 input from parquet files, and possibly other formats, in a way that is compatible with what Java/Spark does. Java finds invalid bytes and replaces them with U+FFFD. Technically, old IBM JDKs behaved differently from Sun JDKs, but any JVM that can run Spark should show this behavior.
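
For reference, here is a minimal sketch of the JVM behavior described above. It only uses the standard `String(byte[], Charset)` constructor; the class and method names (`Utf8ReplacementDemo`, `decodeLenient`) are made up for illustration and are not part of Spark or the plugin.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ReplacementDemo {
    // Hypothetical helper: decode raw bytes the way the JDK String
    // constructor does. Malformed sequences are replaced with U+FFFD
    // instead of raising an error.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', a lone continuation byte 0x80 (invalid on its own), 'b'
        byte[] raw = {(byte) 'a', (byte) 0x80, (byte) 'b'};
        String decoded = decodeLenient(raw);
        // The invalid byte has been replaced by U+FFFD.
        System.out.println(decoded.equals("a\uFFFDb"));
    }
}
```

Longer malformed sequences (truncated multi-byte characters, overlong encodings) may map to one or several U+FFFD characters, so a GPU implementation would need to be compared against the JVM on those cases rather than assumed to match.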

We probably want to write two custom kernels to do this as fast as possible. Because invalid UTF-8 should be rare, the first should be a simple kernel that validates the data in a byte-parallel way and just flags whether any string in the input is invalid. The fixup kernel only runs in the rare invalid case, so we can be less concerned about its performance, or possibly even do the fixup on the CPU. Either way, we don't want to copy the data for nested types unless we know that the input data needs to be changed in some way.
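
A rough CPU-side sketch of that two-pass shape, assuming the column is modeled as a plain `byte[][]` with no nulls. This is not the proposed GPU kernel and not cuDF API; `Utf8TwoPass`, `anyInvalid`, and `fixUp` are hypothetical names. It only illustrates "validate first, copy only if something is wrong".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8TwoPass {
    // Pass 1: validation only. Returns true if any string in the
    // column contains malformed UTF-8. Nothing is modified.
    static boolean anyInvalid(byte[][] column) {
        for (byte[] value : column) {
            try {
                StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(value));
            } catch (CharacterCodingException e) {
                return true; // one bad string is enough to flag the column
            }
        }
        return false;
    }

    // Pass 2: fix-up, run only when pass 1 flagged the column.
    // In the common all-valid case the original array is returned
    // untouched, so no data is copied.
    static byte[][] fixUp(byte[][] column) {
        if (!anyInvalid(column)) {
            return column;
        }
        byte[][] out = new byte[column.length][];
        for (int i = 0; i < column.length; i++) {
            // Decode with replacement (U+FFFD), then re-encode as UTF-8.
            String repaired = new String(column[i], StandardCharsets.UTF_8);
            out[i] = repaired.getBytes(StandardCharsets.UTF_8);
        }
        return out;
    }
}
```

Since U+FFFD encodes to three bytes in UTF-8, a repaired string can be longer than the original. A real fix-up kernel would therefore have to recompute string offsets instead of patching bytes in place.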

sameerz · 2025-02-19T21:27:20Z

Fix for the hang in cudf: rapidsai/cudf#18039

revans2 added the "? - Needs Triage" and "bug" labels on Feb 19, 2025

