Describe the bug
A customer recently ran into a hang when they read a Parquet file containing invalid UTF-8 byte sequences and then tried to process those strings with a regular expression. cuDF is looking into the hang itself, but we should be checking/normalizing UTF-8 input from Parquet files, and possibly from other formats, in a way that is compatible with what Java/Spark does. Java finds invalid bytes and replaces them with U+FFFD. Technically, old IBM JDKs behaved differently from Sun JDKs, but with any JVM that can run Spark this is the behavior we would expect to see.
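For reference, the Java behavior we need to match can be reproduced with a small sketch like the one below. The class and method names are made up for illustration; the only real API used is `new String(byte[], Charset)`, which is documented to replace malformed input with the charset's default replacement string (U+FFFD for UTF-8). The input here is deliberately the simplest case, a single `0xFF` byte, which is never valid in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any existing code base.
class Utf8ReplacementDemo {
    // Decode the way the JVM does by default: malformed bytes become U+FFFD.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', an invalid byte, 'b'
        byte[] raw = {'a', (byte) 0xFF, 'b'};
        String decoded = decodeLenient(raw);

        // The invalid byte was replaced by a single U+FFFD character.
        System.out.println(decoded.equals("a\uFFFDb")); // prints "true"

        // Re-encoding grows the data: U+FFFD is 3 bytes (EF BF BD) in UTF-8.
        byte[] fixed = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(fixed.length); // prints "5"
    }
}
```

Note that the fixed-up string is longer in bytes than the input, so a fixup cannot be done in place and has to rewrite offsets as well as character data.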
We probably want to write two custom kernels to do this as fast as possible. Because invalid UTF-8 should be rare, we should probably have a simple kernel that does the validation in a byte-parallel way and just flags whether any string in the sequence was invalid. For the fixup kernel we can be less concerned about performance, or possibly even do it on the CPU. Either way, we don't want to copy the data for nested types unless we know that the input data needs to be changed in some way.
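The validate-then-fix split could look roughly like the CPU-side sketch below. This is not the proposed GPU kernel and it is not byte-parallel; it only illustrates the control flow, assuming a strict `CharsetDecoder` is an acceptable stand-in for the validation pass. `isValidUtf8` and `fixIfNeeded` are hypothetical names. The point is that the common (valid) path returns the original buffer untouched, and a copy is made only when the check fails.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class sketching the two-phase approach on the CPU.
class Utf8ValidateThenFix {
    // Phase 1: cheap yes/no check. A decoder configured with REPORT throws
    // on the first malformed sequence instead of replacing it.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Phase 2: only runs when phase 1 flagged a problem. Valid input is
    // returned as the same array, so nothing is copied in the common case.
    static byte[] fixIfNeeded(byte[] raw) {
        if (isValidUtf8(raw)) {
            return raw;
        }
        // Slow path: decode with U+FFFD replacement, then re-encode.
        return new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        System.out.println(fixIfNeeded(good) == good); // prints "true"
        System.out.println(fixIfNeeded(bad).length);   // prints "5"
    }
}
```

A GPU version would presumably reduce the per-byte or per-string result to a single flag for the whole column and skip the fixup entirely when the flag is clear.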