C and Pure parsers accept invalid UTF-8 strings; the Java parser doesn't. #138

Open
@headius

I know, I know...if it's bad content it's bad content. But this represents a difference from MRI.

Here's the case, again a reduced version of one I got from @rkh:

# encoding: utf-8
require 'json'
x = "{\"foo\":\"\xC3\"}"
h = JSON.parse(x)
p h['foo']
p h['foo'].encoding

So basically there's a bad byte in a UTF-8 string: \xC3 starts a two-byte sequence, but no continuation byte follows it. The MRI version walks right past it and lets it through into the parsed JSON structure, while the Java parser rejects it.
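
To make the bad byte concrete, here is a minimal sketch that uses only core String methods to locate it before the string reaches any parser. first_invalid_byte_offset is a hypothetical helper written for illustration; it is not part of the json library, and it says nothing about how either parser handles the input internally.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of json): returns the byte offset of the
# first malformed UTF-8 sequence in str, or nil if the string is valid.
def first_invalid_byte_offset(str)
  offset = 0
  str.each_char do |ch|
    # each_char yields a malformed byte as its own one-byte "character"
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

doc = "{\"foo\":\"\xC3\"}"

puts doc.encoding          # the string is tagged as UTF-8 ...
puts doc.valid_encoding?   # ... but its bytes are not well-formed UTF-8
p first_invalid_byte_offset(doc)
```

A caller who wants identical behavior from every backend could run a check like this first and reject the input (or repair it with String#scrub) before calling JSON.parse.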

I have a totally broken patch for this:

diff --git a/java/src/json/ext/ByteListTranscoder.java b/java/src/json/ext/ByteListTranscoder.java
index ed9e54b..a7e42ba 100644
--- a/java/src/json/ext/ByteListTranscoder.java
+++ b/java/src/json/ext/ByteListTranscoder.java
@@ -78,9 +78,10 @@ abstract class ByteListTranscoder {
             return head;
         }
         if (head <= 0xbf) { // 0b10xxxxxx
-            throw invalidUtf8(); // tail byte with no head
+            return head; //throw invalidUtf8(); // tail byte with no head
         }
         if (head <= 0xdf) { // 0b110xxxxx
+            if (pos + 1 > srcEnd) return head;
             ensureMin(1);
             int cp = ((head  & 0x1f) << 6)
                      | nextPart();
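
The thresholds visible in the hunk above (0xbf for tail bytes, 0xdf for two-byte heads) follow the UTF-8 bit layout. The sketch below restates that classification in Ruby to show why the reproduction trips the check: \xC3 is a two-byte head, but the next byte is a plain ASCII quote. classify_utf8_byte is a hypothetical name, and the 0x7f, 0xef and 0xf7 bounds are filled in from the UTF-8 layout rather than taken from the diff. It classifies by bit pattern only and ignores overlong and out-of-range forms.

```ruby
# Hypothetical classifier for illustration; not code from ByteListTranscoder.
def classify_utf8_byte(byte)
  case byte
  when 0x00..0x7f then :ascii         # 0b0xxxxxxx
  when 0x80..0xbf then :continuation  # 0b10xxxxxx, only valid after a head byte
  when 0xc0..0xdf then :lead_2        # 0b110xxxxx, needs 1 continuation byte
  when 0xe0..0xef then :lead_3        # 0b1110xxxx, needs 2 continuation bytes
  when 0xf0..0xf7 then :lead_4        # 0b11110xxx, needs 3 continuation bytes
  else :invalid
  end
end

# The JSON string value from the reproduction, including its quotes.
value_bytes = "\"\xC3\"".bytes
p value_bytes.map { |b| classify_utf8_byte(b) }
```

The head byte is followed by an :ascii byte instead of a :continuation byte, which is the truncated-sequence case the patch tries to let through.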

Again, I'm not sure this is actually something that needs to be fixed, but because the MRI version of json does not blow up on this content, there's something to be addressed.
