evaluation: add data, rewrite script, update packages (adbar#606)
* new functions, add output, include small

* further code clean-up

* add output files

* start object-oriented restructuring

* add pandas

* add eval data as json file

* adjust print statements

* fix outputs

* Update eval-requirements.txt

* add results dir

* round in output files

* add new evaluation files

* adjust new files

* add further files

* re-run evaluation with trafilatura 1.9.0 and new data

* finalize changes

* review structure and setup

* adapt evaluation

* simplify code, test and improve usability

* replace empty file

* rm double entry

* rm further duplicates

* add html2txt and update docs

* regroup, use binary as input, test

* fixes

---------

Co-authored-by: Adrien Barbaresi <[email protected]>
lykoerber and adbar authored Jun 4, 2024
1 parent b36b6fa commit 950c348
Showing 301 changed files with 474,233 additions and 2,175 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -17,6 +17,10 @@ build/
.tox/
.coverage

# evaluation
results/
venv/

# docs
docs/_autosummary/
docs/_build/
2 changes: 1 addition & 1 deletion README.md
@@ -85,7 +85,7 @@ limiting noise and including all valid parts.

For more information see the [benchmark section](https://trafilatura.readthedocs.io/en/latest/evaluation.html)
and the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst)
to reproduce the results.
to run the evaluation with the latest data and packages.

**750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8**

4 changes: 4 additions & 0 deletions docs/evaluation.rst
@@ -14,6 +14,10 @@ Although text is ubiquitous on the Web, extracting information from web pages ca
The extraction focuses on the main content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and (optionally) comments. This task is also known as web scraping, boilerplate removal, DOM-based content extraction, main content identification, or web page cleaning.


.. hint::
   To run the evaluation with the latest data and packages, see the `corresponding readme <https://github.com/adbar/trafilatura/blob/master/tests/README.rst>`_.


External evaluations
--------------------

28 changes: 20 additions & 8 deletions tests/README.rst
@@ -11,6 +11,8 @@ The multilingual evaluation features a wide array of different websites: news ou

The benchmark focuses on decisive text parts, mostly at the beginning and the end of the main text where errors often happen. Other difficult segments throughout the document are chosen to enhance detection of false positives, and segments in particular sections (e.g. quotes or lists) are taken to see if all necessary parts of a document are present in the output.

These decisions are prompted by the need to find cost-efficient ways to define a gold standard and annotate a series of documents.
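The segment-based scoring described above can be sketched as follows; this is a minimal, hypothetical example rather than the project's actual evaluation code. Each gold-standard entry lists segments that must appear in the extracted text ("with") and boilerplate segments that must not ("without"), from which true/false positives and negatives and an F-score are derived:

```python
def score(extracted, with_segments, without_segments):
    """Count hits and misses for one document (hypothetical sketch)."""
    tp = sum(seg in extracted for seg in with_segments)     # wanted and found
    fn = len(with_segments) - tp                            # wanted but missing
    fp = sum(seg in extracted for seg in without_segments)  # boilerplate kept
    tn = len(without_segments) - fp                         # boilerplate removed
    return tp, fn, fp, tn

def f_score(tp, fn, fp):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

tp, fn, fp, tn = score(
    "Main article text. More content.",
    with_segments=["Main article text.", "More content."],
    without_segments=["Subscribe now!", "Related posts"],
)
```

Summing these counts over all documents and algorithms yields the comparison tables published in the docs.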


Caveats
-------
@@ -19,32 +21,38 @@ This type of evaluation does not probe for duplicate segments, but Trafilatura f

It is not evaluated whether the extracted segments are in the right order, although they are generally few and far apart.

These decisions are prompted by the need to find cost-efficient ways to define a gold standard and annotate a series of documents. More comprehensive evaluations are available, mostly focusing on English and/or a particular text type.


Running the code
^^^^^^^^^^^^^^^^

The results and a list of comparable benchmarks are available on the `evaluation page of the docs <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_.


Trafilatura evaluation
----------------------
Evaluation
----------

The following steps allow for comparing changes made to Trafilatura, for example those in a new version or pull request:

1. Install Trafilatura
2. Run the script ``comparison_small.py``
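For example (assuming the script is run from the ``tests`` directory of a repository checkout; the install command may vary with your setup):

```shell
# install the version of trafilatura to be tested,
# e.g. the local checkout in editable mode
pip install -e ..

# run the small comparison on the bundled evaluation data
python3 comparison_small.py
```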


Full evaluation
---------------

A comparison with similar software is run periodically. As the packages tend to evolve, the script may not always be up to date and some packages may be unavailable. If that happens, commenting out the corresponding sections is the most efficient solution. Fixes to the file can be submitted as pull requests.

Note: As numerous packages are installed, it is recommended to create a virtual environment, for example with ``pyenv`` or ``venv``.

1. Install the packages specified in ``eval-requirements.txt``
2. Run the script ``comparison.py`` (some packages are slow, it can be a while)
2. Run the script ``evaluate.py``

Options:

- ``--all``: Run all the supported algorithms (some packages are slow, so this can take a while)
- ``--small``: Run Trafilatura-based components
- ``--algorithms "html2txt" "html_text"`` (for example): Compare Trafilatura's ``html2txt`` extractor with the ``html_text`` package

``python3 evaluate.py --help``: Display all algorithms and further options.
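Put together, a full run might look like this; the commands are standard ``venv``/``pip`` usage plus the script and options described above:

```shell
# create an isolated environment (the evaluation pulls in many packages)
python3 -m venv venv
source venv/bin/activate
pip install -r eval-requirements.txt

# run all supported algorithms (some are slow) ...
python3 evaluate.py --all
# ... or only a subset, e.g. trafilatura's html2txt vs the html_text package
python3 evaluate.py --algorithms "html2txt" "html_text"
```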

More comprehensive evaluations are available, mostly focusing on English and/or a particular text type. With minimal adaptations, the evaluation can support the use of gold-standard files in JSON format.
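As an illustration, such a JSON gold standard could be structured as sketched below. The file name and field names are assumptions chosen to mirror the segment logic used by the benchmark, not necessarily the format shipped with the tests:

```python
import json

# hypothetical gold-standard entry: per source file, segments which must
# ("with") or must not ("without") appear in the extractor output
GOLD = json.loads("""
{
    "example.html": {
        "with": ["First paragraph of the article.", "Closing remarks."],
        "without": ["Cookie banner text", "Footer navigation"]
    }
}
""")

entry = GOLD["example.html"]
```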


Sources
@@ -61,3 +69,7 @@ HTML archives

- Additional German news sites: diskursmonitor.de, courtesy of Jan Oliver Rüdiger.

Evaluation scripts
------------------

Adrien Barbaresi, Lydia Körber.
