Inconsistent whitespace treatment #35

@k0001

Leading whitespace treatment in xmlhtml is inconsistent.

Consider the following behavior of parseXML (parseHTML behaves the same way), where top-level leading whitespace is always dropped and trailing whitespace is always kept:

> parseXML "x" ""
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = []})
> parseXML "x" " "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = []})
> parseXML "x" "x"
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x"]})
> parseXML "x" " x"
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x"]})
> parseXML "x" "x "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x "]})
> parseXML "x" " x "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x "]})

See what happens, however, when the “leading whitespace” comes after some element:

> parseXML "x" "<a/> b "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [Element {elementTag = "a", elementAttrs = [], elementChildren = []},TextNode " b "]})

The two cases behave differently: leading whitespace at the top level is dropped, while leading whitespace after an element is kept. I think the latter is the correct behavior, since xmlhtml should not be discarding the contents of a text node.

So, my proposal is:

  • Keep the behavior of leading whitespace after an element as it is today.

  • Keep the behavior of trailing whitespace everywhere as it is today.

  • Fix top-level text node parsing so that it doesn't discard leading whitespace.
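
To make the proposal concrete, here is a minimal sketch of a check that could accompany a fix. It assumes the xmlhtml and bytestring packages are available. The names topLevel, proposed and agrees are hypothetical helpers introduced only for this illustration, and the expected values for " x" and " x " are what the proposal implies, not what the parser returns today.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.ByteString (ByteString)
import Text.XmlHtml    (Document (..), Node (..), parseXML)

-- Top-level nodes of a parsed fragment.
-- "x" is only the source name that xmlhtml uses in error messages.
topLevel :: ByteString -> Either String [Node]
topLevel = fmap docContent . parseXML "x"

-- (input, nodes expected under the proposal).
-- Assumption: leading whitespace is preserved at the top level,
-- exactly as it already is after an element.
proposed :: [(ByteString, [Node])]
proposed =
  [ ("x",       [TextNode "x"])
  , (" x",      [TextNode " x"])
  , ("x ",      [TextNode "x "])
  , (" x ",     [TextNode " x "])
  , ("<a/> b ", [Element "a" [] [], TextNode " b "])
  ]

-- True when the parser yields the proposed nodes for this input.
agrees :: (ByteString, [Node]) -> Bool
agrees (input, want) = topLevel input == Right want

-- Print one line per case, flagging the inputs where the installed
-- xmlhtml version differs from the proposed behavior.
main :: IO ()
main = mapM_ report proposed
  where
    report c@(input, want) =
      putStrLn $ (if agrees c then "ok      " else "differs ")
              ++ show input ++ " -> want " ++ show want
```

With the behavior described above, the " x" and " x " cases would be the ones reported as differing. The empty and whitespace-only inputs are left out on purpose, since the proposal does not say what they should produce.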
