Improve HTML to markdown processing. It seems that Website Content Crawler handles this better #46

jirispilka · 2025-02-04T20:32:25Z

Context:

The RAG Web Browser is configured slightly differently from Website Content Crawler. To keep settings simple, it outputs raw page content without transformation, whereas Website Content Crawler uses the readableText option. That option can sometimes remove content and isn’t 100% reliable, so in RAG Web Browser we set "htmlTransformer": "none" and let the LLM determine what content is useful.

When I run Website Content Crawler with "htmlTransformer": "none", I receive similar output to the RAG Web Browser.
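
For anyone who wants to repeat that comparison, here is a minimal sketch of the input involved. Only the "htmlTransformer": "none" setting comes from this issue. The helper name `build_crawler_input`, the `startUrls` field, and the example URL are assumptions for illustration and should be checked against the Actor's input schema.

```python
import json


def build_crawler_input(url: str) -> dict:
    """Build a hypothetical Website Content Crawler input with HTML transformation disabled."""
    return {
        # Assumed field name for the pages to crawl; verify it against the Actor's input schema.
        "startUrls": [{"url": url}],
        # Setting quoted in this issue: skip readableText and keep the page content untransformed.
        "htmlTransformer": "none",
    }


# Placeholder URL, not the page used in the runs quoted in this issue.
print(json.dumps(build_crawler_input("https://example.com"), indent=2))
```
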

RAG Web Browser: run
"Apify: Full-stack web scraping and data extraction platformStar apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon\n\nSkip to content
Website Content Crawler: run
"Apify: Full-stack web scraping and data extraction platform\n\nSkip to content
Interestingly, Website Content Crawler still does a bit more processing: its output omits the button and icon labels that appear in the RAG Web Browser output above. It should be possible to make both Actors produce identical output. However, I encountered an issue when testing this and couldn't quickly figure out the cause.

https://console.apify.com/actors/3ox4R101TgZz67sLr/issues/m0PskMduUcizPeTVn
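
To pin down exactly what differs between the two outputs quoted above, a quick diff helps. This is only a sketch: the strings are the truncated fragments from this issue, and `extra_text` is a hypothetical helper built on Python's standard `difflib`, not part of either Actor.

```python
import difflib

# Leading fragments of the two outputs quoted in this issue (both are truncated there).
rag_web_browser = (
    "Apify: Full-stack web scraping and data extraction platform"
    "Star apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon"
    "\n\nSkip to content"
)
website_content_crawler = (
    "Apify: Full-stack web scraping and data extraction platform"
    "\n\nSkip to content"
)


def extra_text(longer: str, shorter: str) -> list[str]:
    """Return the substrings that appear in `longer` but not in `shorter`."""
    matcher = difflib.SequenceMatcher(None, shorter, longer, autojunk=False)
    return [
        longer[j1:j2]
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
    ]


# Prints the text that only the RAG Web Browser output contains.
print(extra_text(rag_web_browser, website_content_crawler))
```

On these fragments the only extra text is the run of link, button, and icon labels, which suggests the remaining difference is in how such elements are stripped before conversion.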
