
prepare new version: 1.2.0
adbar committed Mar 7, 2022
1 parent b877cac commit daf5d8d
Showing 10 changed files with 321 additions and 322 deletions.
HISTORY.md (6 changes: 6 additions & 0 deletions)
@@ -1,5 +1,11 @@
## History / Changelog

+### 1.2.0
+- efficiency: replaced module readability-lxml by a trimmed fork
+- bugs fixed (#179, #180, #183, #184)
+- improved baseline extraction
+- cleaner metadata (with @felipehertzer)
+

### 1.1.0
- encodings: better detection, output NFC-normalized Unicode
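The "improved baseline extraction" entry in the 1.2.0 changelog above refers to trafilatura's last-resort extraction pass, which pulls text directly from paragraph-like nodes when the main algorithm yields nothing. A minimal sketch, not part of this commit, assuming the baseline() helper is exposed at the package level as in the documented API and returns a (body element, text, length) tuple:

import trafilatura

# a toy document; any HTML string should work here
html = "<html><body><article><p>Here is the main text of the page.</p></article></body></html>"
body_elem, text, text_len = trafilatura.baseline(html)
print(text, text_len)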
MANIFEST.in (15 changes: 7 additions & 8 deletions)
@@ -1,17 +1,16 @@
include CITATION.cff CONTRIBUTING.md HISTORY.md README.rst LICENSE pytest.ini
graft trafilatura/data/
include trafilatura/settings.cfg

include tests/__init__.py
include tests/*test*.py
include tests/eval-requirements.txt tests/README.rst
graft tests/cache/
graft tests/resources/
exclude tests/realworld_tests.py
recursive-exclude tests/cache/

recursive-exclude * __pycache__
recursive-exclude * *.py[co]

recursive-include conf.py Makefile make.bat *.jpg *.png

recursive-include docs/ conf.py Makefile make.bat *.rst *.gif *.jpg *.png
include docs/requirements.txt
recursive-include docs *.rst *.gif *.jpg *.png
recursive-include docs/_build/ *.gif *.jpg *.png

recursive-exclude * __pycache__
recursive-exclude * *.py[co]
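Among other things, the manifest above swaps the grafted tests/cache/ directory for tests/resources/. A quick way to sanity-check the packaged source distribution, as a sketch only (it assumes an sdist such as dist/trafilatura-1.2.0.tar.gz has already been built with the standard build tooling):

import tarfile

# hypothetical path to a freshly built source distribution
with tarfile.open("dist/trafilatura-1.2.0.tar.gz") as sdist:
    names = sdist.getnames()

# the grafted test resources should be packaged, the old cache directory should not
assert any("tests/resources/" in name for name in names)
assert not any("tests/cache/" in name for name in names)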
docs/index.rst (2 changes: 1 addition & 1 deletion)
@@ -39,7 +39,7 @@ Description

Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.).

-The extractor aims to be precise enough in order not to miss texts or discard valid documents. In addition, it must be robust and reasonably fast. With these objectives in mind, it is designed to run in production on millions of web documents. It is based on `lxml <http://lxml.de/>`_ as well as `readability <https://github.com/buriy/python-readability>`_ and `jusText <http://corpus.tools/wiki/Justext>`_ used as fallback.
+The extractor aims to be precise enough in order not to miss texts or discard valid documents. In addition, it must be robust and reasonably fast. With these objectives in mind, it is designed to run in production on millions of web documents. It is based on `lxml <http://lxml.de/>`_ and on generic algorithms used as fallback (`jusText <http://corpus.tools/wiki/Justext>`_ and a fork of readability-lxml).

The intended audience encompasses disciplines where collecting web pages represents an important step for data collection, notably linguistics, natural language processing and social sciences. In general, it is relevant for anyone interested in gathering texts from the Web, e.g. web crawling and scraping-intensive fields like information security and search engine optimization.

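As a concrete illustration of the description above (not part of this diff), a minimal usage sketch assuming the documented top-level helpers fetch_url() and extract(); the no_fallback flag is assumed here to switch off the generic fallback algorithms:

import trafilatura

# hypothetical URL; any article-like page works
downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    # default run: main extractor, with jusText and the readability fork as fallbacks
    text = trafilatura.extract(downloaded)
    # skipping the fallbacks trades some accuracy for speed
    fast_text = trafilatura.extract(downloaded, no_fallback=True)
    print(text)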
tests/eval-requirements.txt (4 changes: 2 additions & 2 deletions)
@@ -1,7 +1,7 @@
-trafilatura==1.0.0
+trafilatura==1.2.0

# alternatives
-boilerpy3==1.0.5
+boilerpy3==1.0.6
dragnet==2.0.4
goose3==3.1.11
html2text==2020.1.16
tests/metadata_tests.py (305 changes: 0 additions & 305 deletions)

Large diffs are not rendered by default.

tests/realworld_tests.py (303 changes: 301 additions & 2 deletions)

Large diffs are not rendered by default.

File renamed without changes.
File renamed without changes.
tests/unit_tests.py (6 changes: 3 additions & 3 deletions)
@@ -54,12 +54,12 @@
def load_mock_page(url, xml_flag=False, langcheck=None, tei_output=False):
    '''load mock page from samples'''
    try:
-        with open(os.path.join(TEST_DIR, 'cache', MOCK_PAGES[url]), 'r') as inputf:
+        with open(os.path.join(TEST_DIR, 'resources', MOCK_PAGES[url]), 'r') as inputf:
            htmlstring = inputf.read()
    # encoding/windows fix for the tests
    except UnicodeDecodeError:
        # read as binary
-        with open(os.path.join(TEST_DIR, 'cache', MOCK_PAGES[url]), 'rb') as inputf:
+        with open(os.path.join(TEST_DIR, 'resources', MOCK_PAGES[url]), 'rb') as inputf:
            htmlbinary = inputf.read()
        guessed_encoding = detect(htmlbinary)['encoding']
        if guessed_encoding is not None:
@@ -151,7 +151,7 @@ def test_exotic_tags(xmloutput=False):
    # cover some edge cases with a specially crafted file
    result = load_mock_page('http://exotic_tags', xml_flag=xmloutput, tei_output=True)
    assert 'Teletype text' in result and 'My new car is silver.' in result
-    filepath = os.path.join(TEST_DIR, 'cache', 'exotic_tags_tei.html')
+    filepath = os.path.join(TEST_DIR, 'resources', 'exotic_tags_tei.html')
    with open(filepath) as f:
        content = etree.fromstring(f.read())
    res = xml.check_tei(content, 'http://dummy')
trafilatura/__init__.py (2 changes: 1 addition & 1 deletion)
@@ -8,7 +8,7 @@
__author__ = 'Adrien Barbaresi and contributors'
__license__ = 'GNU GPL v3+'
__copyright__ = 'Copyright 2019-2022, Adrien Barbaresi'
-__version__ = '1.1.0'
+__version__ = '1.2.0'


import logging
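For completeness, the bumped version string can be checked from an installed copy; a trivial sanity check assuming this release is installed:

import trafilatura

# should print 1.2.0 for this release
print(trafilatura.__version__)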
