Improve HTML to Markdown processing. It seems that Website Content Crawler handles this better #46

Open
jirispilka opened this issue Feb 4, 2025 · 0 comments

@jirispilka
Collaborator

Context:

The RAG Web Browser has a slightly different configuration. To keep settings simple, it outputs raw page content without transformation, unlike the Website Content Crawler, which uses the readableText option. This option can sometimes remove content and isn’t 100% reliable. Instead, in RAG Web Browser, we let the LLM determine what content is useful by setting "htmlTransformer": "none".

When I run Website Content Crawler with "htmlTransformer": "none", I receive similar output to the RAG Web Browser.

RAG Web Browser: run
"Apify: Full-stack web scraping and data extraction platformStar apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon\n\nSkip to content
Website Content Crawler: run
"Apify: Full-stack web scraping and data extraction platform\n\nSkip to content
Interestingly, Website Content Crawler does a bit more processing than RAG Web Browser, even with the same setting. It should be possible to make both Actors produce identical output. However, I ran into an issue when testing this and couldn't quickly figure out the cause.
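To make the difference between the two excerpts above concrete, here is a minimal sketch that diffs them with Python's standard `difflib`. The strings are the truncated excerpts quoted in this issue, not full Actor outputs, and the helper name `extra_segments` is made up for illustration.

```python
from difflib import SequenceMatcher


def extra_segments(longer: str, shorter: str) -> list[str]:
    """Return the pieces of `longer` that have no counterpart in `shorter`."""
    # autojunk=False keeps the character-level comparison exact for long inputs.
    matcher = SequenceMatcher(None, longer, shorter, autojunk=False)
    return [
        longer[i1:i2]
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("delete", "replace")
    ]


# Truncated excerpts copied from the two runs quoted in this issue.
rag_web_browser = (
    "Apify: Full-stack web scraping and data extraction platform"
    "Star apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon"
    "\n\nSkip to content"
)
website_content_crawler = (
    "Apify: Full-stack web scraping and data extraction platform"
    "\n\nSkip to content"
)

for segment in extra_segments(rag_web_browser, website_content_crawler):
    print(repr(segment))
```

On these excerpts, the only extra text in the RAG Web Browser output is the run of button and icon labels glued onto the page title, which suggests Website Content Crawler drops those elements somewhere before conversion.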

https://console.apify.com/actors/3ox4R101TgZz67sLr/issues/m0PskMduUcizPeTVn
