@brauliobo
Switches the default HTML parser from `Nokogiri::HTML` (HTML4) to `Nokogiri::HTML5`. This provides better parsing accuracy for modern web pages but is stricter regarding encoding.

Changes include:
- Set `Mechanize.html_parser` default to `Nokogiri::HTML5`.
- Update `Mechanize::Util.from_native_charset` and `html_unescape` to handle `Nokogiri::HTML5` by checking the parser type and falling back to `Nokogiri::HTML::NamedCharacters` when needed (the HTML5 parser lacks this constant).
- Update `Mechanize::Page#parser` to rescue `Encoding::UndefinedConversionError` and `ArgumentError` during encoding detection, because `Nokogiri::HTML5` raises errors for invalid encodings where the HTML4 parser was lenient.
- Fix `test/htdocs/frame_test.html` to be valid HTML5 (move content into NOFRAMES) so that it can be tested properly.
- Update tests to expect UTF-8 encoding when using the HTML5 parser (which enforces UTF-8).
- Update legacy frameset tests to use compatible pages when running with the HTML5 parser, as it handles NOFRAMES content differently.
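
For readers unfamiliar with the rescue pattern mentioned for `Mechanize::Page#parser`, here is a minimal stdlib-only sketch of the idea: try each candidate charset, and treat an unknown name or an unconvertible byte as "try the next one" rather than a fatal error. This is not the code in this PR. `first_usable_encoding` and its `fallback:` parameter are hypothetical names used only for illustration, and the sketch also rescues `Encoding::InvalidByteSequenceError`, which the description above does not mention.

```ruby
# Hypothetical helper (not part of Mechanize or Nokogiri): return the first
# candidate charset that can represent `body`, or `fallback` if none can.
def first_usable_encoding(body, candidates, fallback: Encoding::UTF_8)
  candidates.each do |name|
    begin
      encoding = Encoding.find(name) # raises ArgumentError for unknown names
      candidate = body.dup.force_encoding(encoding)
      # Same-encoding transcoding is a no-op, so check validity explicitly.
      next unless candidate.valid_encoding?

      # Transcoding surfaces bytes that have no UTF-8 equivalent.
      candidate.encode(Encoding::UTF_8)
      return encoding
    rescue ArgumentError, Encoding::UndefinedConversionError,
           Encoding::InvalidByteSequenceError
      next # lenient behaviour: skip this candidate instead of aborting
    end
  end
  fallback
end

body = "caf\xE9".b # "café" in ISO-8859-1, held as raw bytes
puts first_usable_encoding(body, ["no-such-charset", "ISO-8859-1"])
```

The unknown charset name is skipped via `ArgumentError`, and the second candidate is accepted. Whether the fallback should be UTF-8 is an assumption here, chosen because the description notes that the HTML5 parser enforces UTF-8.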

Closes sparklemotion#592
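
The `NamedCharacters` fallback described for `html_unescape` can be sketched without either gem installed. The modules `LegacyParser` and `ModernParser` below are hypothetical stand-ins for a parser that exposes an entity table and one that does not. The plain `Hash` is an assumption made for the sketch, and the real lookup object in Nokogiri may behave differently. This illustrates the constant check only and is not the implementation in this PR.

```ruby
# Hypothetical stand-ins: only LegacyParser exposes a NamedCharacters table.
module LegacyParser
  NamedCharacters = { "amp" => 38, "lt" => 60, "gt" => 62, "eacute" => 233 }.freeze
end

module ModernParser; end

# Use the active parser's entity table when it defines one,
# otherwise take the table from the fallback module.
def named_characters(parser, fallback)
  owner = parser.const_defined?(:NamedCharacters, false) ? parser : fallback
  owner.const_get(:NamedCharacters)
end

# Replace named and decimal entities; unknown entities are left untouched.
def html_unescape(text, parser:, fallback: LegacyParser)
  table = named_characters(parser, fallback)
  text.gsub(/&(?:#(\d+)|(\w+));/) do |match|
    codepoint = $1 ? $1.to_i : table[$2]
    codepoint ? [codepoint].pack("U") : match
  end
end

puts html_unescape("caf&eacute; &amp; th&#233; &bogus;", parser: ModernParser)
```

Checking for the constant, rather than comparing against a specific parser class, keeps the sketch working for any parser module that happens to provide its own table.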