Update _patch.py #39341
The current line-splitting logic does not handle input that is not valid UTF-8. Specifically, the following line:
line_list: List[str] = re.split(r"(?<=\n)", element.decode("utf-8"))
This line decodes the element byte string as UTF-8 using the default strict error handling. Well-formed non-ASCII text decodes without problems, but if element contains a byte sequence that is not valid UTF-8, the call raises a UnicodeDecodeError.
To address this, we should pass an error handler to the decode method. There are several options available:
errors="replace": Replace each undecodable byte sequence with the replacement character U+FFFD (�).
errors="ignore": Silently drop undecodable bytes.
errors="backslashreplace": Replace undecodable bytes with \xNN escape sequences.
errors="surrogateescape": Map each undecodable byte to a lone surrogate code point so the original bytes can be recovered later by encoding with the same handler.
For this pull request, I propose using errors="replace" or errors="ignore".
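To make the trade-offs between these handlers concrete, here is a minimal sketch that decodes the same invalid input with each one. The sample bytes and variable names are illustrative assumptions and do not come from the SDK.

```python
# Hypothetical sample: Latin-1 encoded "café\n". The lone byte 0xE9 is not valid UTF-8.
raw = b"caf\xe9\n"

# The default handler (errors="strict") raises on invalid input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Each alternative handler trades the exception for a different kind of output.
replaced = raw.decode("utf-8", errors="replace")            # U+FFFD marks the bad byte
ignored = raw.decode("utf-8", errors="ignore")              # the bad byte is dropped
escaped = raw.decode("utf-8", errors="backslashreplace")    # the bad byte becomes the text \xe9
smuggled = raw.decode("utf-8", errors="surrogateescape")    # the bad byte becomes a lone surrogate

# Only surrogateescape lets the original bytes be recovered afterwards.
round_trip = smuggled.encode("utf-8", errors="surrogateescape")
```

Note that "replace" and "ignore" both lose information: once the bad byte is replaced or dropped, the original payload cannot be reconstructed from the decoded string.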
Thank you for your contribution @952446418! We will review the pull request and get back to you soon.
API change check: API changes are not detected in this pull request.
@@ -338,7 +338,7 @@ def _deserialize_and_add_to_queue(self, element: bytes) -> bool:

 # Convert `bytes` to string and split the string by newline, while keeping the new line char.
 # the last may be a partial "line" that does not contain a newline char at the end.
-line_list: List[str] = re.split(r"(?<=\n)", element.decode("utf-8"))
+line_list: List[str] = re.split(r"(?<=\n)", element.decode("utf-8", errors="replace"))  # or errors="ignore"
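For readers unfamiliar with the lookbehind split used in this hunk, the following sketch shows how it behaves on a complete and on a partial chunk. The helper name split_keep_newlines and the sample strings are hypothetical, and splitting on a zero-width match assumes Python 3.7 or later.

```python
import re
from typing import List


def split_keep_newlines(text: str) -> List[str]:
    """Split after every newline, keeping the newline at the end of each piece."""
    # (?<=\n) is a zero-width lookbehind: it matches the position right after "\n",
    # so no characters are consumed by the split.
    return re.split(r"(?<=\n)", text)


# A chunk that ends on a newline yields a trailing empty string.
complete = split_keep_newlines("data: a\ndata: b\n")

# A chunk that ends mid-line yields a final partial "line" without a newline.
partial = split_keep_newlines("data: a\ndata: b")
```

The last element therefore tells the caller whether the chunk ended cleanly: it is empty after a complete line and holds the leftover text otherwise.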
@952446418 can you please share a test case where the current handling fails for you, and tell me which package version you're using? This may already have been fixed in a different way by a recent change, as I am unable to force an error, so I would like to validate that first before merging this change.
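One scenario that could trigger the error is a streamed chunk boundary falling in the middle of a multi-byte UTF-8 character. This is an assumption offered as a possible test case, not a confirmed reproduction of the reported failure, and the incremental decoder shown here is a standard-library alternative, not necessarily how the SDK resolved the issue.

```python
import codecs

# Hypothetical stream: "héllo\n" where the two-byte encoding of "é" (0xC3 0xA9)
# is split across two chunks.
payload = "héllo\n".encode("utf-8")
chunks = [payload[:2], payload[2:]]  # b"h\xc3" and b"\xa9llo\n"

# Strict per-chunk decoding fails on the first chunk (truncated character).
try:
    chunks[0].decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" avoids the exception but corrupts the split character.
lossy = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# An incremental decoder buffers the incomplete bytes until the next chunk arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
exact = "".join(decoder.decode(c) for c in chunks)
```

In this scenario errors="replace" silences the exception at the cost of turning a valid character into two replacement characters, while the incremental decoder preserves the text.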
Closing this PR, as the latest release of the azure-ai-inference SDK includes a fix for this issue.