Leading whitespace treatment in xmlhtml is inconsistent.
Consider the following behavior of parseXML (the same happens with parseHTML), where leading whitespace in a top-level text node is always dropped and trailing whitespace is always kept:
> parseXML "x" ""
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = []})
> parseXML "x" " "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = []})
> parseXML "x" "x"
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x"]})
> parseXML "x" " x"
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x"]})
> parseXML "x" "x "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x "]})
> parseXML "x" " x "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x "]})
See what happens, however, when the “leading whitespace” comes after some element:
> parseXML "x" "<a/> b "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [Element {elementTag = "a", elementAttrs = [], elementChildren = []},TextNode " b "]})
These two examples behave differently, and I think the correct behavior is the one from the latter example, since xmlhtml should not be discarding the contents of a text node.
So, my proposal is:
- Keep the behavior of leading whitespace after an element as it is today.
- Keep the behavior of trailing whitespace everywhere as it is today.
- Fix top-level text node parsing so that it doesn't discard leading whitespace.
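
To make the proposal concrete, here is a rough sketch of the results I would expect after the fix, written as a small check against `parseXML`. The helper `topLevel` and the `expectations` table are names made up for this illustration, and the expected values for `" x"` and `" x "` are my reading of the proposal, not what the library returns today (per the transcript above, those two entries should currently report a mismatch).

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.ByteString (ByteString)
import Text.XmlHtml    (Node (..), docContent, parseXML)

-- | Hypothetical helper: the top-level nodes of a parsed document,
-- or the parse error. "x" is just a source name for error messages.
topLevel :: ByteString -> Either String [Node]
topLevel = fmap docContent . parseXML "x"

-- | (input, top-level nodes expected under the proposal).
-- The first two entries are the proposed change; the last two
-- restate behavior that the proposal keeps as it is.
expectations :: [(ByteString, [Node])]
expectations =
  [ (" x",      [TextNode " x"])
  , (" x ",     [TextNode " x "])
  , ("x ",      [TextNode "x "])
  , ("<a/> b ", [Element "a" [] [], TextNode " b "])
  ]

main :: IO ()
main = mapM_ check expectations
  where
    check (input, wanted) =
      let got = topLevel input
      in putStrLn $ show input ++ ": " ++
           if got == Right wanted
             then "ok"
             else "MISMATCH, got " ++ show got
```

Whitespace-only input such as `" "` is left out on purpose, since the proposal above does not say whether it should yield a text node.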