Feature: Callback like onRemoveNode before a node is being removed #856

@zirkelc

I have a use case similar to #799: a node is being removed because its class name contains the keyword header, which is matched by REGEXPS.unlikelyCandidates:

readability/Readability.js

Lines 122 to 125 in d64951b

REGEXPS: {
  // NOTE: These two regular expressions are duplicated in
  // Readability-readerable.js. Please keep both copies in sync.
  unlikelyCandidates: /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i,
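To illustrate the match, here is a small standalone check. It uses the regex quoted above; the helper name `matchesUnlikelyCandidate` and the class names are hypothetical, and Readability applies further conditions before actually removing a node, which this sketch does not reproduce.

```javascript
// Regex copied from the snippet above (Readability.js at d64951b).
var unlikelyCandidates = /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i;

// Hypothetical helper: reports whether a class name or id would be
// flagged by the regex alone.
function matchesUnlikelyCandidate(className, id) {
  var matchString = className + " " + (id || "");
  return unlikelyCandidates.test(matchString);
}

// "entry-header" contains the keyword "header", so it is flagged.
var flagged = matchesUnlikelyCandidate("entry-header");
// "page-title" contains none of the keywords, so it is not.
var notFlagged = matchesUnlikelyCandidate("page-title");
```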

Of course, I could fork the library and adapt the regex. However, I think a generic, dynamic way to influence the algorithm would be better: for example, a callback that is invoked every time the algorithm is about to remove a node, something like this:

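var article = new Readability(document, {
    // Proposed option: return true to let the node be removed, false to keep it.
    onRemoveNode: (node) => {
        // get all heading elements inside the node
        // (`this` is not the Readability instance inside an arrow function,
        // so use the DOM API instead of the internal _getAllNodesWithTag)
        const headings = node.querySelectorAll("h1, h2, h3, h4, h5, h6");

        // remove node only if it doesn't contain any heading elements
        return headings.length === 0;
    }
});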

This callback could be invoked directly from _removeAndGetNext:

readability/Readability.js

Lines 793 to 797 in d64951b

_removeAndGetNext: function(node) {
  var nextNode = this._getNextNode(node, true);
  node.parentNode.removeChild(node);
  return nextNode;
},
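A minimal sketch of how the hook might be wired in, under stated assumptions: `onRemoveNode` is the option proposed in this issue and does not exist in Readability today, `createTraversal` is a hypothetical wrapper used only to make the example self-contained, and the `_getNextNode` shown here is a simplified stand-in for the library's own. The sketch assumes that returning `false` from the callback vetoes the removal and that traversal then continues into the retained node's children.

```javascript
// Sketch only: not Readability's actual code.
function createTraversal(options) {
  // Proposed (hypothetical) option; null when the caller did not supply one.
  var onRemoveNode = (options && options.onRemoveNode) || null;

  return {
    // Simplified depth-first walk: children first (unless ignored),
    // then the next sibling, then the next sibling of the nearest ancestor.
    _getNextNode: function (node, ignoreSelfAndKids) {
      if (!ignoreSelfAndKids && node.firstElementChild) {
        return node.firstElementChild;
      }
      if (node.nextElementSibling) {
        return node.nextElementSibling;
      }
      do {
        node = node.parentNode;
      } while (node && !node.nextElementSibling);
      return node && node.nextElementSibling;
    },

    _removeAndGetNext: function (node) {
      // Assumed contract: an explicit `false` keeps the node in the document.
      if (onRemoveNode && onRemoveNode(node) === false) {
        // Node is retained, so continue the walk into its children.
        return this._getNextNode(node, false);
      }
      // Default behaviour, unchanged from the snippet above.
      var nextNode = this._getNextNode(node, true);
      node.parentNode.removeChild(node);
      return nextNode;
    },
  };
}
```

With no callback supplied, the behaviour is identical to the current `_removeAndGetNext`. One open design question is what the caller should do with a retained node: this sketch simply moves on to its children, whereas a fuller implementation might let the node fall through to the remaining checks in the main loop.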

If there is any interest in this, I'd be willing to submit a PR.
