Releases: adbar/trafilatura

trafilatura-0.8.0

19 Feb 18:01
  • improved link discovery and handling
  • fixes in metadata extraction, feeds and sitemaps processing
  • breaking change: the extract function now reads the target format from the output_format argument only
  • new extraction option to preserve links; CLI options re-ordered
  • more opportunistic backup extraction

trafilatura-0.7.0

04 Jan 14:00
  • customizable configuration file to parametrize extraction and downloads
  • better handling of feeds and sitemaps
  • additional CLI options: cryptographic hash for file name, use Internet Archive as backup
  • more precise extraction
  • faster downloads: requests replaced with bare urllib3 and custom decoding
  • consolidation: bug fixes and improvements, many thanks to the issue reporters!
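
The "cryptographic hash for file name" option above can be illustrated with a minimal sketch. It does not reproduce trafilatura's own code: the hash function (SHA-256), the truncation length, and the helper name hashed_file_name are all assumptions chosen for illustration. The idea shown is only that a file name derived from the content is stable across runs and differs between documents.

```python
import hashlib


def hashed_file_name(text: str, extension: str = ".txt", length: int = 16) -> str:
    """Derive a deterministic file name from document text.

    Hypothetical helper: SHA-256 and the 16-character truncation are
    assumptions for this sketch, not trafilatura's actual scheme.
    """
    # Hash the UTF-8 bytes of the text and keep a short hexadecimal prefix.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest[:length] + extension


if __name__ == "__main__":
    # The same input always maps to the same name; different inputs differ.
    print(hashed_file_name("First extracted document."))
    print(hashed_file_name("Second extracted document."))
```

Naming files by content hash means that re-processing the same text overwrites the earlier file instead of creating a duplicate; a longer prefix lowers the chance of collisions.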

trafilatura-0.6.1

02 Dec 14:26
  • added bare_extraction function returning Python variables
  • improved link discovery in feeds and sitemaps
  • option to preserve image info
  • fixes (many thanks to bug reporters!)

trafilatura-0.6.0

06 Nov 15:18
  • link discovery in sitemaps
  • compatibility with Python 3.9
  • extraction coverage improved
  • deduplication now optional
  • bug fixes

trafilatura-0.5.2

22 Sep 11:31
  • optional language detector changed: langid → pycld3
  • helper function bare_extraction()
  • optional deduplication off by default
  • better URL handling (courlan), more complete metadata
  • code consolidation (cleaner and shorter)

trafilatura-0.5.1

15 Jul 11:59
  • extended and more convenient command-line options
  • output in JSON format
  • bug fixes

trafilatura-0.5.0

02 Jun 17:10
  • faster and more robust text and metadata extraction
  • more efficient batch processing (parallel processing, URL queues)
  • support for ATOM/RSS feeds
  • complete command-line tool with corresponding options

trafilatura-0.4.1

24 Apr 10:46
  • better metadata extraction and integration (XML & XML-TEI)
  • more efficient processing
  • output directory as CLI option

trafilatura-0.1.0

25 Sep 17:54

First release used in production and meant to be archived on Zenodo for reproducibility and citability.