Features
- Nicer messages explaining why an article was marked as invalid
- Added `saveSnapshotsOfInvalidArticles` option to input
Features
- Added `enqueueFromArticles` option to enqueue articles from article pages to get even more articles from the website. You need to enable it in input.
- Added `scanSitemaps` and `sitemapUrls` parameters. `scanSitemaps` automatically searches sitemaps for articles for each start URL, and `sitemapUrls` allows you to add the sitemaps manually if necessary. Be careful: `scanSitemaps` may dump a huge amount of (sometimes old) article URLs into the scraping process.
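
To illustrate the warning about `scanSitemaps` pulling in old article URLs, here is a minimal sketch of reading a sitemap and keeping only recently modified entries. It is not the actor's implementation: the function name `recent_urls_from_sitemap` and the `max_age_days` parameter are hypothetical, and it assumes a standard `urlset` sitemap with optional `lastmod` values.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_urls_from_sitemap(xml_text, max_age_days, now=None):
    """Hypothetical helper: return <loc> values modified within max_age_days.

    Entries with a missing or unparseable <lastmod> are kept, because their
    age cannot be judged from the sitemap alone.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    urls = []
    root = ET.fromstring(xml_text)
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not lastmod:
            urls.append(loc)
            continue
        try:
            # Normalise a trailing "Z" so older Python versions can parse it.
            parsed = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        except ValueError:
            urls.append(loc)
            continue
        if parsed.tzinfo is None:
            # Date-only values carry no timezone; assume UTC for comparison.
            parsed = parsed.replace(tzinfo=timezone.utc)
        if parsed >= cutoff:
            urls.append(loc)
    return urls
```

A filter like this could be applied to a sitemap before passing the remaining URLs on, which keeps the scraping queue from filling up with years-old articles.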
Fixes
- `onlyNewArticles` and `onlyNewArticlesPerDomain` were loading duplicate items, which caused excess dataset reads.
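
As a rough illustration of what per-domain deduplication (the idea behind `onlyNewArticlesPerDomain`) can look like, the sketch below keeps one set of already-seen URLs per domain and skips any candidate that is already recorded. This is an assumption-laden example, not the actor's code: `group_by_domain` and `filter_new_articles` are hypothetical names, and a real run would load the seen URLs from stored data.

```python
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Hypothetical helper: build {domain: set of URLs} from known article URLs."""
    seen = {}
    for url in urls:
        domain = urlsplit(url).hostname or ""
        seen.setdefault(domain, set()).add(url)
    return seen


def filter_new_articles(candidate_urls, seen_by_domain):
    """Return candidates not yet seen for their domain, recording them as seen.

    Recording each accepted URL immediately also removes duplicates that
    occur within the same batch of candidates.
    """
    new_urls = []
    for url in candidate_urls:
        domain = urlsplit(url).hostname or ""
        seen = seen_by_domain.setdefault(domain, set())
        if url in seen:
            continue
        seen.add(url)
        new_urls.append(url)
    return new_urls
```

Because lookups only touch the set for the candidate's own domain, a run that covers a few sites would not need to consult the history of every other site.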
Features
- Added new input option `onlyNewArticlesPerDomain`. This is a much more efficient way to deduplicate articles, so use it instead of `onlyNewArticles`. `onlyNewArticlesPerDomain` also works on local datasets.
- Fix: Now works with Start URLs from a public spreadsheet
- Upgraded Apify version `0.21.0`, which sometimes crashed at the start of the run
- Added `currentItem` param to `extendOutputFunction`
- Improved logs
- Increased request timeouts to work better on very slow sites
- Added option to run with browser (Puppeteer)
- Added option to wait for page load or for selector (browser only)
- Added `articleUrls` as an input option to parse article pages directly