Description
Context:
The RAG Web Browser is configured slightly differently from Website Content Crawler. To keep its settings simple, it outputs raw page content without transformation, whereas Website Content Crawler uses the readableText option. That option can sometimes remove content and isn’t 100% reliable. Instead, in RAG Web Browser, we set "htmlTransformer": "none" and let the LLM determine what content is useful.
When I run Website Content Crawler with "htmlTransformer": "none", I receive similar output to the RAG Web Browser.
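For reference, here is a minimal sketch of the input I mean. Only "htmlTransformer": "none" comes from this issue; the `startUrls` field name, the example URL, and the `build_crawler_input` helper are assumptions for illustration, so check them against the Actor's input schema before relying on them.

```python
import json


def build_crawler_input(start_url: str) -> dict:
    """Build a run input that disables HTML transformation.

    Hypothetical helper: only the "htmlTransformer" setting is taken
    from this issue; "startUrls" is an assumed field name.
    """
    return {
        "startUrls": [{"url": start_url}],  # assumed field name
        "htmlTransformer": "none",  # skip readableText-style cleanup
    }


run_input = build_crawler_input("https://apify.com")
print(json.dumps(run_input, indent=2))
```

The resulting JSON can be pasted into the Actor's input editor in Apify Console; passing it through an API client is not shown here.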
RAG Web Browser: run
"Apify: Full-stack web scraping and data extraction platformStar apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon\n\nSkip to content
Website Content Crawler: run
"Apify: Full-stack web scraping and data extraction platform\n\nSkip to content
Interestingly, Website Content Crawler still does a bit more processing: its output above lacks the button and icon labels that the RAG Web Browser output includes. It should be possible to make both Actors produce identical output. However, I ran into an issue when testing this and couldn't quickly figure out the cause:
https://console.apify.com/actors/3ox4R101TgZz67sLr/issues/m0PskMduUcizPeTVn
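To see exactly what the extra processing drops, the two excerpts above can be diffed. This is only a rough sketch using Python's standard `difflib`; the `extra_segments` helper is a hypothetical name, and the two strings are just the truncated excerpts quoted in this issue, not full Actor output.

```python
from difflib import SequenceMatcher
from typing import List


def extra_segments(longer: str, shorter: str) -> List[str]:
    """Return the pieces of `longer` that have no counterpart in `shorter`."""
    matcher = SequenceMatcher(None, shorter, longer, autojunk=False)
    extras = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark text present only in `longer`.
        if tag in ("insert", "replace"):
            extras.append(longer[j1:j2])
    return extras


# Excerpts copied from the two runs quoted in this issue.
rag_output = (
    "Apify: Full-stack web scraping and data extraction platform"
    "Star apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon"
    "\n\nSkip to content"
)
wcc_output = (
    "Apify: Full-stack web scraping and data extraction platform"
    "\n\nSkip to content"
)

for segment in extra_segments(rag_output, wcc_output):
    print(repr(segment))
```

On these excerpts the only difference is the run of button and icon labels, which suggests Website Content Crawler strips some UI elements even with "htmlTransformer": "none".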