Context:
The RAG Web Browser is configured slightly differently. To keep settings simple, it outputs raw page content without transformation, whereas the Website Content Crawler uses the readableText option. That option can sometimes remove content and isn’t 100% reliable. In RAG Web Browser, we instead set "htmlTransformer": "none" and let the LLM determine what content is useful.
When I run Website Content Crawler with "htmlTransformer": "none", I receive similar output to the RAG Web Browser.
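For anyone who wants to reproduce this, here is a minimal sketch of the Website Content Crawler run with the transformer disabled. It assumes the `apify-client` Python package and the Actor ID `apify/website-content-crawler`. The helper names `build_wcc_input` and `run_wcc` are hypothetical, and the `startUrls` and `maxCrawlPages` fields are my assumption about a reasonable single-page input, not something stated in this issue.

```python
def build_wcc_input(url: str, max_pages: int = 1) -> dict:
    """Build a Website Content Crawler input with HTML transformation turned off."""
    return {
        "startUrls": [{"url": url}],   # assumed input shape for a single start URL
        "htmlTransformer": "none",     # the setting discussed in this issue
        "maxCrawlPages": max_pages,    # assumption: keep the test run to one page
    }


def run_wcc(token: str, url: str) -> list:
    """Run the Actor and return its dataset items (needs network access and an API token)."""
    # Third-party dependency, imported lazily: pip install apify-client
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_wcc_input(url)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_wcc("<YOUR_API_TOKEN>", "https://apify.com")` should return the crawled items, whose text can then be compared with the RAG Web Browser output.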
RAG Web Browser: run
"Apify: Full-stack web scraping and data extraction platformStar apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon\n\nSkip to content
Website Content Crawler: run
"Apify: Full-stack web scraping and data extraction platform\n\nSkip to content
Interestingly, Website Content Crawler does a bit more processing than RAG Web Browser. It should be possible to make both Actors produce identical output if you want that. However, I ran into an issue when testing this and couldn't quickly figure out the cause.
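To see exactly what the extra processing strips out, the two outputs can be diffed. The sketch below uses only the standard library and the truncated snippets quoted above. `extra_segments` is a hypothetical helper name, and this is just one way to compare the texts, not how either Actor works internally.

```python
from difflib import SequenceMatcher


def extra_segments(reference: str, candidate: str) -> list:
    """Return the pieces of `candidate` that have no counterpart in `reference`."""
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    extras = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark text that exists only on the candidate side
        if tag in ("insert", "replace"):
            extras.append(candidate[j1:j2])
    return extras


# Truncated samples copied from the two runs quoted in this issue
wcc_text = "Apify: Full-stack web scraping and data extraction platform\n\nSkip to content"
rag_text = (
    "Apify: Full-stack web scraping and data extraction platform"
    "Star apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon"
    "\n\nSkip to content"
)

for segment in extra_segments(wcc_text, rag_text):
    print(repr(segment))
```

On these samples the only difference is the run of button and icon labels that RAG Web Browser keeps and Website Content Crawler drops.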
https://console.apify.com/actors/3ox4R101TgZz67sLr/issues/m0PskMduUcizPeTVn