Malformed HTML parsed differently from browsers #147

Open
@demurgos

I have the following input HTML file:

<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>

Notice the unclosed <a tag. This is a minimal repro; in my case the input comes from an accidentally truncated DB value.

If I open it in a browser (Firefox/Chrome) and print its DOM with document.getElementsByTagName("html")[0].outerHTML, I get:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>

With scraper, if I parse it with Html::parse_document and print it with doc.root_element().html(), I get:

<html><head></head><body><div><a hr<="" div=""></a><div><a hr<="" div=""><div></div>
</div>
</div></body></html>

Notice that the anchor tag with text bar is missing!

Running this input through html5ever's example sinks, I get an output close to what browsers produce (but still not identical, see servo/html5ever#512).

This suggests that the issue lies in scraper's TreeSink implementation rather than in html5ever itself.
