Handle multi-byte decode errors in as_string/as_bytes #1721

Open
krrishapatel wants to merge 1 commit into Supervisor:main from krrishapatel:fix/multibyte-decode-error-handling

@krrishapatel

Summary

When tail -f reads an initial chunk from a log file, the byte boundary can split a multi-byte UTF-8 character (e.g. Korean, Chinese, Japanese). This causes as_string() to either raise UnicodeDecodeError or corrupt the entire chunk.

Adding errors='replace' to the encode()/decode() calls in compat.py ensures that only the truncated character at the boundary shows as , while the rest of the text decodes correctly.
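For illustration, the boundary problem can be reproduced with plain `bytes.decode()` outside of Supervisor. This is a minimal sketch, not code from the PR: the sample text and the one-byte cut are assumptions chosen to simulate a read that starts in the middle of a character.

```python
# Korean "한글" is two characters, each encoded as three bytes in UTF-8.
data = "한글".encode("utf-8")
assert len(data) == 6

# Simulate a chunk that starts one byte into the first character
# (an assumed cut point, picked only for this demonstration).
chunk = data[1:]

# Strict decoding (the default) rejects the orphaned continuation bytes.
try:
    chunk.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the damaged bytes become U+FFFD and the
# intact character that follows still decodes normally.
decoded = chunk.decode("utf-8", errors="replace")
print(strict_failed, repr(decoded))
```

The second character survives because the decoder resynchronizes at the next valid lead byte; only the bytes belonging to the split character are replaced.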

Fixes #1693

Changes

  • supervisor/compat.py: Add errors='replace' to all encode/decode calls (both Python 2 and 3 branches)
  • supervisor/tests/test_compat.py: New test file covering valid UTF-8, Korean text, and incomplete byte sequences
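
The three test categories listed above could look roughly like the sketch below. The `as_string` and `as_bytes` functions here are simplified stand-ins written for this example (Python 3 only), not the actual contents of supervisor/compat.py, and the test names are hypothetical.

```python
import unittest


def as_bytes(s, encoding="utf-8"):
    # Stand-in helper (assumption): pass bytes through, encode text leniently.
    if isinstance(s, bytes):
        return s
    return s.encode(encoding, errors="replace")


def as_string(s, encoding="utf-8"):
    # Stand-in helper (assumption): pass text through, decode bytes leniently.
    if isinstance(s, str):
        return s
    return s.decode(encoding, errors="replace")


class LenientDecodeTests(unittest.TestCase):
    def test_valid_utf8(self):
        self.assertEqual(as_string(b"hello"), "hello")

    def test_korean_round_trip(self):
        text = "안녕하세요"
        self.assertEqual(as_string(as_bytes(text)), text)

    def test_incomplete_sequence(self):
        # Drop the last byte so the final character is truncated.
        truncated = "로그".encode("utf-8")[:-1]
        result = as_string(truncated)
        self.assertTrue(result.startswith("로"))
        self.assertIn("\ufffd", result)
```

Running the file with `python -m unittest` would exercise all three cases.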

When tail -f reads an initial chunk from a log file, the byte boundary
can split a multi-byte UTF-8 character (e.g. Korean). This causes the
entire chunk to decode incorrectly.

Add errors='replace' to encode/decode calls in compat.py so that
incomplete byte sequences produce the Unicode replacement character
instead of corrupting the rest of the output.

Development

Successfully merging this pull request may close these issues.

tail -f garbled output when log has Korean characters
