MXParser fails to correctly parse the ampersand after the character from the #x10000-#x10FFFD Unicode range #7

aliaksei-burlakou · 2025-01-14T07:11:42Z

Expected Behavior

An XML document containing encoded Unicode characters from the #x10000-#x10FFFD range (like 𐀀 or 􏰍) should be parsed by MXParser without any issues, both for these characters and for any other valid characters, regardless of where they appear in the document.

Actual Behavior

MXParser erroneously appends a replacement character (�) after the ampersand during parsing if the XML document contains a character from the #x10000-#x10FFFD Unicode range somewhere before the ampersand in the XML.

Steps to reproduce

Java 17 (Amazon Corretto JDK, build 17.0.11+9-LTS), XStream v1.4.21
The &#1113101; encoded character (􏰍, U+10FC0D, UTF-8 hex: F4 8F B0 8D) must be present somewhere in the XML document before the encoded ampersand (&amp;).
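
The values quoted above for &#1113101; can be reproduced with plain JDK calls. The following is only a minimal sketch: the class name SupplementaryEncoding and its helper methods are illustrative and are not part of XStream or of the original report.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryEncoding {

    // Returns the UTF-8 bytes of one code point as space-separated upper-case hex pairs.
    public static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns the UTF-16 code units of one code point as space-separated 4-digit hex values.
    public static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int codePoint = 1113101; // decimal value taken from the numeric character reference
        System.out.println(String.format("U+%X", codePoint)); // U+10FC0D
        System.out.println(utf8Hex(codePoint));               // F4 8F B0 8D
        System.out.println(utf16Hex(codePoint));              // DBFF DC0D
    }
}
```

The two UTF-16 code units are the surrogate pair that a Java String uses to hold this character, which is why the test further down compares against a two-char escape sequence.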

Simple code example:

RootTag class:

import com.thoughtworks.xstream.annotations.XStreamAlias;

@XStreamAlias("rootTag")
public class RootTag {
    @XStreamAlias("text")
    private TextTag text;

    public TextTag getText() {
        return text;
    }
}

TextTag class:

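import com.thoughtworks.xstream.annotations.XStreamAlias;
import com.thoughtworks.xstream.annotations.XStreamConverter;
import com.thoughtworks.xstream.converters.extended.ToAttributedValueConverter;

@XStreamConverter(value = ToAttributedValueConverter.class, strings = {"value"})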
@XStreamAlias("textTag")
public class TextTag {
    private String value;

    public String getValue() {
        return value;
    }
}

Test class with the simple XML input:

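import static org.junit.jupiter.api.Assertions.assertEquals;

import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.security.ExplicitTypePermission;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.junit.jupiter.api.Test;

// JUnit 5 is assumed here; the report does not name the test framework.
class XStreamTest {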

    @Test
    void testXStreamFailsToParseAmpersandAfterSupplementaryCharacter() throws Exception {
        String input = """
                <?xml version="1.0" encoding="UTF-8"?>
                <rootTag>
                    <text>Test: &amp; ampersand before, supplementary character &#1113101;, ampersand &amp; after</text>
                </rootTag>""";

        XStream xStream = new XStream();
        xStream.processAnnotations(RootTag.class);
        xStream.addPermission(new ExplicitTypePermission(new Class[]{RootTag.class}));

        try (InputStream is = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8))) {
            RootTag rootTag = (RootTag) xStream.fromXML(is);
            assertEquals("Test: & ampersand before, supplementary character \uDBFF\uDC0D, ampersand & after",
                    rootTag.getText().getValue());
        }
    }
}
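
For comparison, the same document can be run through the JDK's built-in DOM parser to confirm what the decoded text should look like. This is a hedged cross-check and says nothing about MXParser internals; the class name JdkReferenceParse and the method parseText are made up for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class JdkReferenceParse {

    // Parses the XML with the JDK's default DOM parser and returns the text content
    // of the first <text> element. Checked exceptions are wrapped to keep callers simple.
    public static String parseText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("text").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rootTag><text>Test: &amp; ampersand before, supplementary character &#1113101;,"
                + " ampersand &amp; after</text></rootTag>";

        // Build the expected value from the code point instead of typing the surrogate pair.
        String expected = "Test: & ampersand before, supplementary character "
                + new String(Character.toChars(1113101))
                + ", ampersand & after";

        System.out.println(expected.equals(parseText(xml)));
    }
}
```

If this check passes while the XStream test fails, the difference is in how the two parsers handle the document rather than in the document itself.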

Output:

Expected :Test: & ampersand before, supplementary character 􏰍, ampersand & after
Actual   :Test: & ampersand before, supplementary character 􏰍, ampersand &� after
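
The extra character in the actual output is displayed as �, which does not show which UTF-16 code unit was really appended. A small helper like the one below could be used to dump the parsed value and to look for an unpaired surrogate. It is a diagnostic sketch only: the class CodeUnitDump is hypothetical, and the stray code unit used in main is a stand-in chosen for illustration, not a value taken from the report.

```java
public class CodeUnitDump {

    // Renders every UTF-16 code unit of the string as four upper-case hex digits.
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Returns the index of the first surrogate that is not part of a well-formed pair,
    // or -1 if the string contains none.
    public static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return i;
                }
            } else if (Character.isLowSurrogate(c)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x10FC0D));
        String wellFormed = "& " + supplementary + " &";
        System.out.println(dump(wellFormed));
        System.out.println(firstUnpairedSurrogate(wellFormed));

        // Hypothetical stand-in for a damaged value: an ampersand followed by a lone low surrogate.
        String suspicious = "&" + (char) 0xDC0D;
        System.out.println(dump(suspicious));
        System.out.println(firstUnpairedSurrogate(suspicious));
    }
}
```

Calling dump(rootTag.getText().getValue()) in the failing test would show whether the appended unit is U+FFFD itself or something else that merely renders as a replacement character.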

NOTE

This issue was initially reported here: x-stream/xstream#368

aliaksei-burlakou mentioned this issue Jan 14, 2025

XStream fails to correctly parse the ampersand after the character from the Unicode Supplementary Private Use Area-B x-stream/xstream#368

Open

aliaksei-burlakou changed the title ~~MXParser fails to correctly parse the ampersand after the character from the Unicode Supplementary Private Use Area-B~~ MXParser fails to correctly parse the ampersand after the character from the #x10000-#x10FFFD Unicode range Jan 20, 2025

joehni self-assigned this Jan 29, 2025

joehni added the bug label Jan 29, 2025
