Releases: adbar/trafilatura
trafilatura-0.8.0
- improved link discovery and handling
- fixes in metadata extraction, feeds and sitemaps processing
- breaking change: the `extract` function now reads target format from `output_format` argument only
- new extraction option: preserve links, CLI options re-ordered
- more opportunistic backup extraction
trafilatura-0.7.0
- customizable configuration file to parametrize extraction and downloads
- better handling of feeds and sitemaps
- additional CLI options: cryptographic hash for file name, use Internet Archive as backup
- more precise extraction
- faster downloads: `requests` replaced with bare `urllib3` and custom decoding
- consolidation: bug fixes and improvements, many thanks to the issue reporters!
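The idea behind hash-based file names can be sketched with the standard library alone. This is not trafilatura's own implementation: the function name `hashed_filename`, the choice of SHA-256, and the truncation length are all assumptions made for illustration.

```python
import hashlib


def hashed_filename(content: str, extension: str = ".txt", length: int = 16) -> str:
    """Derive a stable file name from the SHA-256 digest of the content.

    Hypothetical helper: the same content always maps to the same name,
    so repeated runs do not create duplicate files for identical text.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Keep only the first `length` hex characters to get a short name.
    return digest[:length] + extension


name = hashed_filename("abc")
print(name)  # ba7816bf8f01cfea.txt
```

Truncating the digest keeps names short at the cost of a slightly higher collision risk, which is usually acceptable for a local output directory.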
trafilatura-0.6.1
- added `bare_extraction` function returning Python variables
- improved link discovery in feeds and sitemaps
- option to preserve image info
- fixes (many thanks to bug reporters!)
trafilatura-0.6.0
- link discovery in sitemaps
- compatibility with Python 3.9
- extraction coverage improved
- deduplication now optional
- bug fixes
trafilatura-0.5.2
- optional language detector changed: `langid` → `pycld3`
- helper function `bare_extraction()`
- optional deduplication off by default
- better URL handling (`courlan`), more complete metadata
- code consolidation (cleaner and shorter)
trafilatura-0.5.1
- extended and more convenient command-line options
- output in JSON format
- bug fixes
trafilatura-0.5.0
- faster and more robust text and metadata extraction
- more efficient batch processing (parallel processing, URL queues)
- support for ATOM/RSS feeds
- complete command-line tool with corresponding options
trafilatura-0.4.1
- better metadata extraction and integration (XML & XML-TEI)
- more efficient processing
- output directory as CLI-option
trafilatura-0.1.0
First release used in production and meant to be archived on Zenodo for reproducibility and citability.