
feat: Add PydanticAiCrawler with AI-powered HTML extraction #1964

Open
Mantisus wants to merge 9 commits into apify:master from Mantisus:llm-html-crawler

@Mantisus Mantisus commented Jun 14, 2026

Collaborator

Description

  • Adds PydanticAiCrawler - a new HTTP crawler that parses pages with parsel and uses pydantic-ai as the layer for LLM interaction.
  • PydanticAiHtmlDistiller is a protocol for distillers that clean HTML and convert it to a compact format (e.g., cleaned HTML, Markdown) for an LLM.
    • PydanticAiCleanHtmlDistiller removes comments, noisy attributes, and scripts, returning a compact HTML version.
    • PydanticAiSkeletonDistiller extends PydanticAiCleanHtmlDistiller by truncating text and collapsing repeated siblings.
  • PydanticAiHtmlExtractor is a protocol for extractors that turn a page into structured data using a distiller and an LLM.
    • PydanticAiDirectExtractor sends the distilled page to an LLM together with a Pydantic schema describing the target data and returns the validated result.
    • PydanticAiSelectorExtractor asks the LLM for CSS selectors once and caches them in a KeyValueStore, so later pages are extracted without an LLM call.
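To make the distiller idea concrete, here is a minimal sketch of the kind of cleaning described for PydanticAiCleanHtmlDistiller. It is not the PR's implementation: it uses only the standard library's `html.parser`, and the names `CleanHtmlSketch`, `distill`, `DROPPED_TAGS`, `KEPT_ATTRS`, and `VOID_TAGS` are hypothetical, as is the choice of which attributes count as noisy.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical choices for this sketch, not the PR's actual defaults.
DROPPED_TAGS = {"script", "style"}
KEPT_ATTRS = {"class", "id", "href"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CleanHtmlSketch(HTMLParser):
    """Re-emit HTML without comments, scripts/styles, or noisy attributes.

    Comments are dropped because HTMLParser routes them to handle_comment,
    which does nothing unless overridden.
    """

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in KEPT_ATTRS
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text only; whitespace-only runs are dropped.
        if not self._skip_depth and data.strip():
            self.parts.append(escape(data.strip(), quote=False))


def distill(html: str) -> str:
    """Return a compact version of `html` suitable for a prompt."""
    parser = CleanHtmlSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

A real distiller would also need size caps and handling for embedded JSON, which the file summary later in this thread lists as part of the actual implementation.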

Issues

Testing

  • Added new unit tests for PydanticAiCrawler, PydanticAiCleanHtmlDistiller, PydanticAiSkeletonDistiller, PydanticAiDirectExtractor, and PydanticAiSelectorExtractor.

@Mantisus Mantisus self-assigned this Jun 15, 2026
@Mantisus Mantisus marked this pull request as ready for review June 17, 2026 19:27
@vdusek vdusek requested a review from Copilot June 23, 2026 06:40

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces an experimental AiCrawler (HTTP-based, Parsel-backed) plus a small AI-extraction subsystem that can either (a) directly extract structured data via an LLM or (b) learn & cache CSS selectors via an LLM and reuse them on later pages to avoid repeated model calls. This adds a native AI/LLM extraction path for HTTP crawlers (issue #1593) and integrates it into the public crawlee.crawlers API and docs.
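The "learn and cache selectors" strategy in (b) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `CachedSelectorSketch` and `fake_model` are hypothetical, a plain dict stands in for the persistent store, and ElementTree path expressions stand in for CSS selectors so the example runs on the standard library alone.

```python
import xml.etree.ElementTree as ET
from collections.abc import Callable
from typing import Optional

# A "plan" maps field names to selectors. The PR describes CSS selectors;
# this sketch uses ElementTree paths as a stdlib-only stand-in.
Plan = dict[str, str]


class CachedSelectorSketch:
    def __init__(self, ask_model: Callable[[str, list[str]], Plan]) -> None:
        self._ask_model = ask_model  # hypothetical LLM call: (page, fields) -> plan
        self._plans: dict[str, Plan] = {}  # stand-in for a persistent key-value store
        self.model_calls = 0

    def extract(self, page: str, fields: list[str]) -> dict[str, Optional[str]]:
        key = ",".join(sorted(fields))  # one cached plan per requested field set
        if key not in self._plans:
            # Only the first page for a given field set reaches the model.
            self.model_calls += 1
            self._plans[key] = self._ask_model(page, fields)
        plan = self._plans[key]
        root = ET.fromstring(page)
        return {name: root.findtext(plan[name]) for name in fields}


def fake_model(page: str, fields: list[str]) -> Plan:
    # A scripted answer in place of a real model response.
    return {"title": ".//h1", "price": ".//span[@class='price']"}


extractor = CachedSelectorSketch(fake_model)
first = extractor.extract(
    "<html><body><h1>Mug</h1><span class='price'>12</span></body></html>",
    ["title", "price"],
)
second = extractor.extract(
    "<html><body><h1>Cup</h1><span class='price'>7</span></body></html>",
    ["title", "price"],
)
```

Both pages are extracted, but `extractor.model_calls` stays at 1 because the second page reuses the cached plan. The sketch leaves out what happens when a cached plan stops matching a page, which is where validation, retries, and a fallback come in.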

Changes:

  • Add AiCrawler + AiCrawlingContext (context.extract(...)) and new AI distiller/extractor abstractions (AiCleanHtmlDistiller, AiSkeletonDistiller, AiDirectExtractor, AiSelectorExtractor, AiUsageStats).
  • Add new optional dependency extra ai (parsel + lxml clean + pydantic-ai-slim[openai]) and include it in the all extra.
  • Add extensive unit tests and new documentation/guide + runnable code examples for common AI crawler setups.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `uv.lock` | Adds locked deps for the new ai optional dependency group (and includes in all). |
| `pyproject.toml` | Defines ai extra and adds it to all. |
| `src/crawlee/crawlers/__init__.py` | Re-exports AI crawler/distiller/extractor APIs behind optional imports. |
| `src/crawlee/crawlers/_ai/__init__.py` | AI module public surface with optional-import handling. |
| `src/crawlee/crawlers/_ai/_ai_crawler.py` | Implements AiCrawler wiring + experimental warning + context pipeline integration. |
| `src/crawlee/crawlers/_ai/_ai_crawling_context.py` | Adds AiCrawlingContext with extract helper and shared usage stats. |
| `src/crawlee/crawlers/_ai/_base_distiller.py` | Base distiller + JSON-script protect/unprotect helpers. |
| `src/crawlee/crawlers/_ai/_base_extractor.py` | Base extractor: model resolution, instruction composition, usage accumulation, scope helpers. |
| `src/crawlee/crawlers/_ai/_clean_html_distiller.py` | Clean/distill HTML for direct LLM extraction (size caps, attr filtering, JSON handling). |
| `src/crawlee/crawlers/_ai/_direct_extractor.py` | Direct extraction strategy using pydantic-ai output validation + usage tracking. |
| `src/crawlee/crawlers/_ai/_prompts.py` | Shared prompt instructions/notes and truncation marker constants. |
| `src/crawlee/crawlers/_ai/_selector_extractor.py` | Selector-learning extractor with caching, persistence, validation, retries, and fallback support. |
| `src/crawlee/crawlers/_ai/_skeleton_distiller.py` | Skeleton distiller for selector generation (text truncation + sibling collapsing + max-size tightening). |
| `src/crawlee/crawlers/_ai/_types.py` | Protocols for distillers/extractors and AiUsageStats. |
| `src/crawlee/crawlers/_ai/_utils.py` | Utility to build a default lxml_html_clean.Cleaner for distillers. |
| `tests/unit/crawlers/_ai/test_ai_crawler.py` | Unit tests for AiCrawler behavior and context/extractor forwarding. |
| `tests/unit/crawlers/_ai/test_clean_html_distiller.py` | Unit tests for AiCleanHtmlDistiller reduction, truncation, and size enforcement. |
| `tests/unit/crawlers/_ai/test_direct_extractor.py` | Unit tests for AiDirectExtractor prompt composition, scoping, retries, and usage. |
| `tests/unit/crawlers/_ai/test_selector_extractor.py` | Unit tests for selector caching, concurrency, invalid plans/data retries, fallback, persistence. |
| `tests/unit/crawlers/_ai/test_skeleton_distiller.py` | Unit tests for skeleton truncation, sibling collapsing, and oversize handling. |
| `docs/guides/architecture_overview.mdx` | Updates architecture diagrams/text to include AiCrawler + AiCrawlingContext. |
| `docs/guides/ai_crawler.mdx` | New user guide for installing and using AiCrawler, extractors, distillers, usage limits. |
| `docs/guides/code_examples/ai_crawler/basic_example.py` | Example: basic AiCrawler usage. |
| `docs/guides/code_examples/ai_crawler/additional_instructions_example.py` | Example: per-call additional_instructions. |
| `docs/guides/code_examples/ai_crawler/custom_distiller_example.py` | Example: custom Markdown distiller. |
| `docs/guides/code_examples/ai_crawler/selector_extractor_example.py` | Example: selector extractor + direct fallback. |
| `docs/guides/code_examples/ai_crawler/usage_limit_example.py` | Example: per-run usage limits + cumulative token budget stopping. |


Comment thread src/crawlee/crawlers/_ai/_selector_extractor.py Outdated
Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated
Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py
@B4nan

B4nan commented Jun 23, 2026

Member

I kinda hate the name AiCrawler 🙃

@vdusek

vdusek commented Jun 23, 2026

Collaborator

> I kinda hate the name AiCrawler 🙃

Why? Is it too generic / "buzzword-y"?

If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext. This is built on top of Pydantic and PydanticAI, so the name might make sense; it also highlights that selectors can be defined using Pydantic models (similar to ParselCrawler, BeautifulSoupCrawler, ...).

@B4nan

B4nan commented Jun 23, 2026

Member

Yes, exactly, I would rather name it this way so it's clear what it does. AiCrawler sounds like a product name, not a library feature to me.

@Mantisus

Collaborator Author

> I kinda hate the name AiCrawler 🙃

Yeah, me too 😄. I chose it just as a "hyped" marketing name.

> If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext.

I agree.

@vdusek vdusek left a comment

Collaborator

Looks good! A few comments...

Comment thread src/crawlee/crawlers/_pydantic_ai/_skeleton_distiller.py
Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py
Comment thread docs/guides/ai_crawler.mdx Outdated
Comment on lines +86 to +96
### AiDirectExtractor

<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> sends the distilled page to the model in one call. The schema is the model's output type. Pydantic AI validates the result; on a mismatch, it sends the error back to the model to fix, bounded by `retries`.

It reads each page on its own, so extraction is accurate per page. It accepts schemas of any shape: nested models, lists, dictionaries, unions, and deep nesting. The cost is one model call per page, which scales poorly on a large site.

Use `additional_instructions` to focus the model on the data you want:

<CodeBlock className="language-python">
{AdditionalInstructionsExample}
</CodeBlock>

Collaborator

We should make it clear that AiDirectExtractor is the default, or explicitly include it in the code example. While reading, I wasn't sure whether it was intentional or if something was missing.
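For readers unfamiliar with the validate-and-retry behaviour described in the AiDirectExtractor excerpt above, the loop can be sketched without any LLM dependency. Everything here is illustrative: `extract_with_retries`, `validate_product`, and `fake_model` are hypothetical names, a hand-written check stands in for Pydantic validation, and treating `retries` as extra attempts after the first is an assumption of this sketch.

```python
from collections.abc import Callable


class ExtractionFailed(Exception):
    """Raised when no valid result arrives within the retry budget."""


def validate_product(data: dict) -> dict:
    # Stand-in for validating a schema with `title: str` and `price: float`.
    if not isinstance(data.get("title"), str):
        raise ValueError("title must be a string")
    if not isinstance(data.get("price"), (int, float)):
        raise ValueError("price must be a number")
    return data


def extract_with_retries(
    ask_model: Callable[[str], dict],  # hypothetical LLM call: prompt -> raw output
    page: str,
    retries: int = 1,
) -> dict:
    prompt = page
    for _ in range(retries + 1):  # one initial attempt plus `retries` corrections
        raw = ask_model(prompt)
        try:
            return validate_product(raw)
        except ValueError as error:
            # Feed the validation error back so the model can correct itself.
            prompt = f"{page}\n\nPrevious output was rejected: {error}"
    raise ExtractionFailed(f"no valid output after {retries} retries")


# Scripted model: the first answer has a string price, the second is valid.
answers = iter([{"title": "Mug", "price": "12"}, {"title": "Mug", "price": 12.0}])
prompts: list[str] = []


def fake_model(prompt: str) -> dict:
    prompts.append(prompt)
    return next(answers)


result = extract_with_retries(fake_model, "<p>Mug 12</p>", retries=1)
```

The second prompt carries the validation message from the first attempt, which is the mechanism that lets the model repair its own output. Each retry is another model call, so the per-page cost noted in the excerpt grows with every mismatch.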

    extractor=AiSelectorExtractor(
        model=model,
        # Pages the cached selectors cannot handle fall back to direct extraction.
        fallback=AiDirectExtractor(model=model),

Collaborator

What is the default behaviour? How can I provide a custom fallback? What can it be?

I have these questions after reading the doc guide.

If there are no other options for fallbacks, it can be just a boolean flag.

Collaborator Author

Without a fallback, the extractor raises UnexpectedModelBehavior if it fails to generate selectors, or ValueError if a complex, unsupported schema is passed.

A fallback could be another AiSelectorExtractor with a different model, distiller, or instructions, or an AiDirectExtractor with custom settings. We can't reduce it to a boolean flag.
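A small sketch may help show why the fallback is an object rather than a flag. It is not the PR's code: `Extractor`, `PrimarySketch`, `DirectSketch`, and `SelectorGenerationError` are hypothetical stand-ins, and the failure condition is contrived for the demonstration.

```python
from typing import Optional, Protocol


class Extractor(Protocol):
    def extract(self, page: str) -> dict: ...


class SelectorGenerationError(Exception):
    """Stand-in for the UnexpectedModelBehavior error mentioned above."""


class PrimarySketch:
    def __init__(self, fallback: Optional[Extractor] = None) -> None:
        # Any object with an `extract` method works, including another
        # PrimarySketch configured differently.
        self._fallback = fallback

    def _extract_with_selectors(self, page: str) -> dict:
        # Contrived failure: pretend selectors cannot be generated without an <h1>.
        if "<h1>" not in page:
            raise SelectorGenerationError("could not generate selectors")
        return {"source": "selectors"}

    def extract(self, page: str) -> dict:
        try:
            return self._extract_with_selectors(page)
        except SelectorGenerationError:
            if self._fallback is None:
                raise  # no fallback configured: the error reaches the caller
            return self._fallback.extract(page)


class DirectSketch:
    def extract(self, page: str) -> dict:
        return {"source": "direct"}


with_fallback = PrimarySketch(fallback=DirectSketch())
handled = with_fallback.extract("<p>no heading here</p>")
chained = PrimarySketch(fallback=with_fallback).extract("<p>still no heading</p>")
```

Because the fallback only has to satisfy the extractor protocol, fallbacks can be chained or swapped for a differently configured extractor, which a boolean could not express.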

Comment thread docs/guides/ai_crawler.mdx Outdated

A distiller reduces raw HTML to a compact representation the model reads cheaply. Each extractor uses one. Replace it with the extractor's `distiller` argument (the crawler itself has no `distiller` argument).

<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> defaults to an <ApiLink to="class/AiCleanHtmlDistiller">`AiCleanHtmlDistiller`</ApiLink>: cleaned, structure-preserving HTML that keeps the full page text. <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> uses an <ApiLink to="class/AiSkeletonDistiller">`AiSkeletonDistiller`</ApiLink> internally to ask the model for selectors; you rarely set it yourself.

Collaborator

What happens if I use AiCleanHtmlDistiller in AiSelectorExtractor? And/or AiSkeletonDistiller in AiDirectExtractor? I also don't know what the difference between them is. I would expect this section to follow this structure:

## Distillers

...

### AiCleanHtmlDistiller

...

### AiSkeletonDistiller

...

### Custom distiller

...

Collaborator Author

> What happens if I use AiCleanHtmlDistiller in AiSelectorExtractor?

Using AiCleanHtmlDistiller with AiSelectorExtractor can improve extraction because the page keeps more data. That likely makes selector generation easier for the model. The price is more tokens that are wasted if the generation still fails.

> And/or AiSkeletonDistiller in AiDirectExtractor?

AiSkeletonDistiller with AiDirectExtractor is trickier. If the data is in the text, the model can't read it correctly because the text is truncated. But if the data is in attributes, it works and saves tokens.

The best distiller depends on the site and the task. For example, for news sites where the goal is to extract and organize the visible text, a MarkdownDistiller (the custom distiller example) with PydanticAiDirectExtractor will probably work best.

Mixing different extractors and distillers lets users find the optimal balance between extraction quality and the resources it costs.
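The trade-off described above (truncated text, intact attributes) can be seen in a toy skeletonizer. This is only a sketch under assumptions: it operates on well-formed markup via `xml.etree.ElementTree`, `skeletonize`, `MAX_TEXT`, and `MAX_SIBLINGS` are hypothetical, and siblings count as repeats when they share a tag and `class`, which may differ from the PR's rule.

```python
import xml.etree.ElementTree as ET

MAX_TEXT = 12  # hypothetical cap on characters kept per text node
MAX_SIBLINGS = 2  # hypothetical number of look-alike siblings to keep


def _shorten(text):
    """Cut long text down to MAX_TEXT characters and mark the cut."""
    if text and len(text.strip()) > MAX_TEXT:
        return text.strip()[:MAX_TEXT] + "..."
    return text


def skeletonize(element: ET.Element) -> None:
    """Truncate text and drop repeated siblings in place; attributes are untouched."""
    element.text = _shorten(element.text)
    element.tail = _shorten(element.tail)
    seen: dict[tuple, int] = {}
    for child in list(element):
        signature = (child.tag, child.get("class"))
        seen[signature] = seen.get(signature, 0) + 1
        if seen[signature] > MAX_SIBLINGS:
            element.remove(child)  # collapse: later look-alikes add no structure
        else:
            skeletonize(child)


page = ET.fromstring(
    "<ul>"
    "<li class='item' data-price='10'>A very long product description</li>"
    "<li class='item' data-price='20'>Another long product description</li>"
    "<li class='item' data-price='30'>Yet another long description</li>"
    "</ul>"
)
skeletonize(page)
skeleton = ET.tostring(page, encoding="unicode")
```

After the call, the list keeps two items with their `data-price` attributes, while the descriptions are cut to twelve characters. A selector-generating model still sees the structure, but a direct extractor asked for the full description would only get the truncated text.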

@Mantisus Mantisus changed the title from "feat: Add AiCrawler with AI-powered HTML extraction" to "feat: Add PydanticAiCrawler with AI-powered HTML extraction" Jun 23, 2026

Development

Successfully merging this pull request may close these issues.

Add support for AI/LLM-based HTML parsing (selectors)

5 participants