
Improve HTML to markdown processing. It seems that Website Content Crawler handles this better #46

@jirispilka

Description


Context:

The RAG Web Browser has a slightly different configuration. To keep settings simple, it outputs raw page content without transformation, unlike the Website Content Crawler, which uses the readableText option. This option can sometimes remove content and isn’t 100% reliable. Instead, in RAG Web Browser, we let the LLM determine what content is useful by setting "htmlTransformer": "none".

When I run Website Content Crawler with "htmlTransformer": "none", I receive similar output to the RAG Web Browser.
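For reference, the input for such a run might look like the sketch below. Only the `"htmlTransformer": "none"` setting comes from this issue; the `startUrls` field name, the example URL, and the `build_run_input` helper are assumptions for illustration and should be checked against the Actor's input schema.

```python
import json


def build_run_input(url: str, html_transformer: str = "none") -> dict:
    """Build a hypothetical Actor run input (field names other than
    htmlTransformer are assumptions, not taken from the issue)."""
    return {
        # Assumed field name for the pages to crawl.
        "startUrls": [{"url": url}],
        # Setting discussed in the issue: "none" skips the HTML transformation.
        "htmlTransformer": html_transformer,
    }


run_input = build_run_input("https://apify.com")
print(json.dumps(run_input, indent=2))
```

Passing a different `html_transformer` value to the same helper makes it easy to run both configurations side by side.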

RAG Web Browser: run
"Apify: Full-stack web scraping and data extraction platformStar apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon\n\nSkip to content
Website Content Crawler: run
"Apify: Full-stack web scraping and data extraction platform\n\nSkip to content
Interestingly, Website Content Crawler applies a bit more processing than RAG Web Browser, even with the transformer disabled. It should be possible to make both Actors produce identical output. However, I ran into an issue when testing this and couldn't quickly figure out the cause.
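To pin down what the extra processing removes, the two outputs can be diffed. This is a minimal sketch using Python's standard `difflib`; the two strings are the truncated snippets quoted above, and `extra_segments` is a hypothetical helper name.

```python
import difflib


def extra_segments(a: str, b: str) -> list[str]:
    """Return the pieces of `a` that have no counterpart in `b`."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[i1:i2]
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("delete", "replace")
    ]


# Truncated snippets copied from the two runs quoted in this issue.
rag_output = (
    "Apify: Full-stack web scraping and data extraction platform"
    "Star apify/crawlee on GitHubRib StichBack ButtonSearch IconFilter Icon"
    "\n\nSkip to content"
)
wcc_output = (
    "Apify: Full-stack web scraping and data extraction platform"
    "\n\nSkip to content"
)

for segment in extra_segments(rag_output, wcc_output):
    print(repr(segment))
```

On these snippets, the only difference is the run of button and icon labels that appears in the RAG Web Browser output.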

https://console.apify.com/actors/3ox4R101TgZz67sLr/issues/m0PskMduUcizPeTVn

Metadata


Labels: t-ai (Issues owned by the AI team.)
