evaluation: add data, rewrite script, update packages (adbar#606)
* new functions, add output, include small

* further code clean-up

* add output files

* start object-oriented restructuring

* add pandas

* add eval data as json file

* adjust print statements

* fix outputs

* Update eval-requirements.txt

* add results dir

* round in output files

* add new evaluation files

* adjust new files

* add further files

* re-run evaluation with trafilatura 1.9.0 and new data

* finalize changes

* review structure and setup

* adapt evaluation

* simplify code, test and improve usability

* replace empty file

* rm double entry

* rm further duplicates

* add html2txt and update docs

* regroup, use binary as input, test

* fixes

---------

Co-authored-by: Adrien Barbaresi <[email protected]>
lykoerber and adbar authored Jun 4, 2024
1 parent b36b6fa commit 950c348
Showing 301 changed files with 474,233 additions and 2,175 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -17,6 +17,10 @@ build/
.tox/
.coverage

# evaluation
results/
venv/

# docs
docs/_autosummary/
docs/_build/
2 changes: 1 addition & 1 deletion README.md
@@ -85,7 +85,7 @@ limiting noise and including all valid parts.

For more information see the [benchmark section](https://trafilatura.readthedocs.io/en/latest/evaluation.html)
and the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst)
to reproduce the results.
to run the evaluation with the latest data and packages.

**750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8**

4 changes: 4 additions & 0 deletions docs/evaluation.rst
@@ -14,6 +14,10 @@ Although text is ubiquitous on the Web, extracting information from web pages ca
The extraction focuses on the main content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and (optionally) comments. This task is also known as web scraping, boilerplate removal, DOM-based content extraction, main content identification, or web page cleaning.


.. hint::
   To run the evaluation with the latest data and packages, see the `corresponding readme <https://github.com/adbar/trafilatura/blob/master/tests/README.rst>`_.


External evaluations
--------------------

28 changes: 20 additions & 8 deletions tests/README.rst
@@ -11,6 +11,8 @@ The multilingual evaluation features a wide array of different websites: news ou

The benchmark focuses on decisive text parts, mostly at the beginning and the end of the main text where errors often happen. Other difficult segments throughout the document are chosen to enhance detection of false positives, and segments in particular sections (e.g. quotes or lists) are taken to see if all necessary parts of a document are present in the output.

These decisions are prompted by the need to find cost-efficient ways to define a gold standard and annotate a series of documents.
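The segment-based scoring described above can be sketched as follows; this is a minimal, hypothetical example rather than the project's actual evaluation code. Each gold-standard entry lists segments that must appear in the extracted text ("with") and boilerplate segments that must not ("without"), from which true/false positives and negatives and an F-score are derived:

```python
def score(extracted, with_segments, without_segments):
    """Count hits and misses for one document (hypothetical sketch)."""
    tp = sum(seg in extracted for seg in with_segments)     # wanted and found
    fn = len(with_segments) - tp                            # wanted but missing
    fp = sum(seg in extracted for seg in without_segments)  # boilerplate kept
    tn = len(without_segments) - fp                         # boilerplate removed
    return tp, fn, fp, tn

def f_score(tp, fn, fp):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

tp, fn, fp, tn = score(
    "Main article text. More content.",
    with_segments=["Main article text.", "More content."],
    without_segments=["Subscribe now!", "Related posts"],
)
```

Summing these counts over all documents and algorithms yields the comparison tables published in the docs.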


Caveats
-------
@@ -19,32 +21,38 @@ This type of evaluation does not probe for duplicate segments, but Trafilatura f

It is not evaluated whether the extracted segments are in the right order, although they are generally few and far apart.

These decisions are prompted by the need to find cost-efficient ways to define a gold standard and annotate a series of documents. More comprehensive evaluations are available, mostly focusing on English and/or a particular text type.


Running the code
^^^^^^^^^^^^^^^^

The results and a list of comparable benchmarks are available on the `evaluation page of the docs <https://trafilatura.readthedocs.io/en/latest/evaluation.html>`_.


Trafilatura evaluation
----------------------
Evaluation
----------

The following steps allow for comparing changes made to Trafilatura, for example those in a new version or pull request:

1. Install Trafilatura
2. Run the script ``comparison_small.py``
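For example (assuming the script is run from the ``tests`` directory of a repository checkout; the install command may vary with your setup):

```shell
# install the version of trafilatura to be tested,
# e.g. the local checkout in editable mode
pip install -e ..

# run the small comparison on the bundled evaluation data
python3 comparison_small.py
```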


Full evaluation
---------------

A comparison with similar software is run periodically. As the packages tend to evolve, the script may not always be up to date and some packages may be unavailable. If that happens, commenting out the corresponding sections is the most efficient solution. Fixes to the file can be submitted as pull requests.

Note: As numerous packages are installed, it is recommended to create a virtual environment, for example with ``pyenv`` or ``venv``.

1. Install the packages specified in ``eval-requirements.txt``
2. Run the script ``comparison.py`` (some packages are slow, it can be a while)
2. Run the script ``evaluate.py``

Options:

- ``--all``: Run all the supported algorithms (some packages are slow, so this can take a while)
- ``--small``: Run Trafilatura-based components
- ``--algorithms "html2txt" "html_text"`` (for example): Compare Trafilatura's ``html2txt`` extractor with the ``html_text`` package

``python3 evaluate.py --help``: Display all algorithms and further options.
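Put together, a full run might look like this; the commands are standard ``venv``/``pip`` usage plus the script and options described above:

```shell
# create an isolated environment (the evaluation pulls in many packages)
python3 -m venv venv
source venv/bin/activate
pip install -r eval-requirements.txt

# run all supported algorithms (some are slow) ...
python3 evaluate.py --all
# ... or only a subset, e.g. trafilatura's html2txt vs the html_text package
python3 evaluate.py --algorithms "html2txt" "html_text"
```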

More comprehensive evaluations are available, mostly focusing on English and/or a particular text type. With minimal adaptations, the evaluation can support the use of gold-standard files in JSON format.
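As an illustration, such a JSON gold standard could be structured as sketched below. The file name and field names are assumptions chosen to mirror the segment logic used by the benchmark, not necessarily the format shipped with the tests:

```python
import json

# hypothetical gold-standard entry: per source file, segments which must
# ("with") or must not ("without") appear in the extractor output
GOLD = json.loads("""
{
    "example.html": {
        "with": ["First paragraph of the article.", "Closing remarks."],
        "without": ["Cookie banner text", "Footer navigation"]
    }
}
""")

entry = GOLD["example.html"]
```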


Sources
@@ -61,3 +69,7 @@ HTML archives

- Additional German news sites: diskursmonitor.de, courtesy of Jan Oliver Rüdiger.

Evaluation scripts
------------------

Adrien Barbaresi, Lydia Körber.
