Fix handling of unbuffered byte streams that split UTF-8 char across read() calls #105

smheidrich · 2025-01-08T01:10:25Z

daggaz/json-stream#59 ran into an issue with the json-stream feature that lets users pass an iterable of bytes to load. json-stream wraps such an iterable in a RawIOBase interface that pulls data from it whenever it is read.

That exposed a more general bug in json-stream-rs-tokenizer: it mishandled unbuffered byte streams in which a single UTF-8 character is split across multiple read() calls.
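To illustrate the failure mode, here is a minimal Python sketch. It is not the actual fix, which lives in the tokenizer itself, and `IterableRawStream` and `decode_stream` are hypothetical names that only stand in for the kind of wrapper json-stream builds around an iterable of bytes. Decoding each read() result on its own fails when a character straddles two reads, while an incremental decoder holds the incomplete bytes back until the rest arrives.

```python
import codecs
import io


class IterableRawStream(io.RawIOBase):
    """Hypothetical unbuffered stream that serves at most one chunk per read().

    Assumption for this sketch: an empty chunk in the iterable is treated
    like end of stream.
    """

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Pull the next chunk only when the previous one is used up.
        if not self._leftover:
            self._leftover = next(self._chunks, b"")
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n  # 0 signals end of stream


def decode_stream(stream, read_size=1024):
    """Decode a byte stream whose reads may end in the middle of a character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    while True:
        data = stream.read(read_size)
        if not data:
            break
        # Incomplete trailing bytes are kept inside the decoder.
        parts.append(decoder.decode(data))
    # final=True raises if the stream ended in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


payload = '{"k": "ä"}'.encode("utf-8")
split = payload.index(b"\xc3") + 1  # cut between the two bytes of "ä"
chunks = [payload[:split], payload[split:]]

print(decode_stream(IterableRawStream(chunks)))
```

Calling `chunks[0].decode("utf-8")` directly raises `UnicodeDecodeError`, which is the per-read decoding mistake this sketch is meant to show.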

This PR adds a regression test for the specific case that surfaced the bug, and fixes the bug.

In the future, I should add lower-level tests to ensure that situations like this are handled correctly for all other kinds of streams as well, not just unbuffered bytes (#107, #108).
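As a rough idea of how such lower-level tests could get their inputs, the following sketch enumerates every way to cut a payload in the middle of a UTF-8 character. `mid_char_split_points` and `split_cases` are hypothetical helpers, not part of this repository's test suite, and they assume the input is valid UTF-8.

```python
def mid_char_split_points(data: bytes) -> list[int]:
    """Return offsets i such that data[:i] ends in the middle of a character.

    In valid UTF-8, a cut before offset i is mid-character exactly when
    data[i] is a continuation byte, i.e. has the bit pattern 0b10xxxxxx.
    """
    return [i for i in range(1, len(data)) if data[i] & 0xC0 == 0x80]


def split_cases(text: str) -> list[tuple[bytes, bytes]]:
    """Build (first_read, second_read) pairs that split a character in two."""
    data = text.encode("utf-8")
    return [(data[:i], data[i:]) for i in mid_char_split_points(data)]


# One 1-byte, one 2-byte, one 3-byte and one 4-byte character.
for first, second in split_cases("aä€😀"):
    print(first, second)
```

Each pair could then be fed through every stream type under test (buffered and unbuffered, bytes and text) to check that the parsed result matches the unsplit input.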

smheidrich added 2 commits January 8, 2025 03:01

Add test for load() iterable split in UTF-8 char

9b45fd0

Fix UTF-8 split across read() calls (unbuf. bytes)

fd946ca

smheidrich force-pushed the fix-json-stream-gh-59-iterable-of-bytes branch from f3b4311 to fd946ca Compare January 8, 2025 02:01

smheidrich merged commit d419f3b into main Jan 8, 2025
4 checks passed

This was referenced Jan 8, 2025

Should have tests to verify that UTF-8 chars split across read() calls work for all types of streams #107

Open

Buffering tests should have cases for BytesIO/StringIO variants that return less data on read() #108

Open
