feat: Add PydanticAiCrawler with AI-powered HTML extraction #1964
Mantisus wants to merge 9 commits into
Pull request overview
This PR introduces an experimental AiCrawler (HTTP-based, Parsel-backed) plus a small AI-extraction subsystem that can either (a) directly extract structured data via an LLM or (b) learn & cache CSS selectors via an LLM and reuse them on later pages to avoid repeated model calls. This adds a native AI/LLM extraction path for HTTP crawlers (issue #1593) and integrates it into the public crawlee.crawlers API and docs.
Changes:
- Add `AiCrawler` + `AiCrawlingContext` (`context.extract(...)`) and new AI distiller/extractor abstractions (`AiCleanHtmlDistiller`, `AiSkeletonDistiller`, `AiDirectExtractor`, `AiSelectorExtractor`, `AiUsageStats`).
- Add new optional dependency extra `ai` (parsel + lxml clean + pydantic-ai-slim[openai]) and include it in the `all` extra.
- Add extensive unit tests and new documentation/guide + runnable code examples for common AI crawler setups.
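The selector-caching strategy summarized above can be illustrated with a minimal, dependency-free sketch. It is not the crawlee implementation: `CachedSelectorExtractor`, `fake_model`, and the `cache_key` argument are hypothetical names, `xml.etree.ElementTree` paths stand in for CSS selectors, and an in-memory dict stands in for persistent storage.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# Hypothetical type: a "plan" maps each output field to an ElementTree path.
SelectorPlan = Dict[str, str]


class CachedSelectorExtractor:
    """Toy stand-in for a selector-learning extractor (not the crawlee API)."""

    def __init__(self, ask_model: Callable[[str], SelectorPlan]) -> None:
        self._ask_model = ask_model
        self._cache: Dict[str, SelectorPlan] = {}
        self.model_calls = 0

    def extract(self, html: str, cache_key: str) -> Dict[str, str]:
        plan = self._cache.get(cache_key)
        if plan is None:
            # Cache miss: pay for one model call, then remember the plan.
            plan = self._ask_model(html)
            self.model_calls += 1
            self._cache[cache_key] = plan
        # Cache hit or fresh plan: apply the paths without touching the model.
        root = ET.fromstring(html)
        return {field: (root.findtext(path) or "").strip() for field, path in plan.items()}


def fake_model(html: str) -> SelectorPlan:
    # A real model would inspect the page; this stub returns a fixed plan.
    return {"title": ".//h1", "price": ".//span[@class='price']"}


pages = [
    "<html><body><h1>Red mug</h1><span class='price'>4.50</span></body></html>",
    "<html><body><h1>Blue mug</h1><span class='price'>5.00</span></body></html>",
]
extractor = CachedSelectorExtractor(fake_model)
results = [extractor.extract(page, cache_key="product") for page in pages]
print(results, extractor.model_calls)
```

Both pages are extracted, but the stub model is consulted only for the first one. That is the cost saving the selector path is aiming for; the direct path would call the model once per page.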
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| uv.lock | Adds locked deps for the new ai optional dependency group (and includes in all). |
| pyproject.toml | Defines ai extra and adds it to all. |
| src/crawlee/crawlers/__init__.py | Re-exports AI crawler/distiller/extractor APIs behind optional imports. |
| src/crawlee/crawlers/_ai/__init__.py | AI module public surface with optional-import handling. |
| src/crawlee/crawlers/_ai/_ai_crawler.py | Implements AiCrawler wiring + experimental warning + context pipeline integration. |
| src/crawlee/crawlers/_ai/_ai_crawling_context.py | Adds AiCrawlingContext with extract helper and shared usage stats. |
| src/crawlee/crawlers/_ai/_base_distiller.py | Base distiller + JSON-script protect/unprotect helpers. |
| src/crawlee/crawlers/_ai/_base_extractor.py | Base extractor: model resolution, instruction composition, usage accumulation, scope helpers. |
| src/crawlee/crawlers/_ai/_clean_html_distiller.py | Clean/distill HTML for direct LLM extraction (size caps, attr filtering, JSON handling). |
| src/crawlee/crawlers/_ai/_direct_extractor.py | Direct extraction strategy using pydantic-ai output validation + usage tracking. |
| src/crawlee/crawlers/_ai/_prompts.py | Shared prompt instructions/notes and truncation marker constants. |
| src/crawlee/crawlers/_ai/_selector_extractor.py | Selector-learning extractor with caching, persistence, validation, retries, and fallback support. |
| src/crawlee/crawlers/_ai/_skeleton_distiller.py | Skeleton distiller for selector generation (text truncation + sibling collapsing + max-size tightening). |
| src/crawlee/crawlers/_ai/_types.py | Protocols for distillers/extractors and AiUsageStats. |
| src/crawlee/crawlers/_ai/_utils.py | Utility to build a default lxml_html_clean.Cleaner for distillers. |
| tests/unit/crawlers/_ai/test_ai_crawler.py | Unit tests for AiCrawler behavior and context/extractor forwarding. |
| tests/unit/crawlers/_ai/test_clean_html_distiller.py | Unit tests for AiCleanHtmlDistiller reduction, truncation, and size enforcement. |
| tests/unit/crawlers/_ai/test_direct_extractor.py | Unit tests for AiDirectExtractor prompt composition, scoping, retries, and usage. |
| tests/unit/crawlers/_ai/test_selector_extractor.py | Unit tests for selector caching, concurrency, invalid plans/data retries, fallback, persistence. |
| tests/unit/crawlers/_ai/test_skeleton_distiller.py | Unit tests for skeleton truncation, sibling collapsing, and oversize handling. |
| docs/guides/architecture_overview.mdx | Updates architecture diagrams/text to include AiCrawler + AiCrawlingContext. |
| docs/guides/ai_crawler.mdx | New user guide for installing and using AiCrawler, extractors, distillers, usage limits. |
| docs/guides/code_examples/ai_crawler/basic_example.py | Example: basic AiCrawler usage. |
| docs/guides/code_examples/ai_crawler/additional_instructions_example.py | Example: per-call additional_instructions. |
| docs/guides/code_examples/ai_crawler/custom_distiller_example.py | Example: custom Markdown distiller. |
| docs/guides/code_examples/ai_crawler/selector_extractor_example.py | Example: selector extractor + direct fallback. |
| docs/guides/code_examples/ai_crawler/usage_limit_example.py | Example: per-run usage limits + cumulative token budget stopping. |
I kinda hate the name
Why? Is it too generic / "buzzword-y"? If so, maybe we could consider something like
Yes, exactly, I would rather name it this way so it's clear what it does. AiCrawler sounds like a product name, not a library feature to me.
Yeah, me too 😄. I chose it just as a "hyped" marketing name
I agree.
vdusek left a comment
Looks good! A few comments...
### AiDirectExtractor

<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> sends the distilled page to the model in one call. The schema is the model's output type. Pydantic AI validates the result; on a mismatch, it sends the error back to the model to fix, bounded by `retries`.

It reads each page on its own, so extraction is accurate per page. It accepts schemas of any shape: nested models, lists, dictionaries, unions, and deep nesting. The cost is one model call per page, which scales poorly on a large site.

Use `additional_instructions` to focus the model on the data you want:

<CodeBlock className="language-python">
{AdditionalInstructionsExample}
</CodeBlock>
We should make it clear that AiDirectExtractor is the default, or explicitly include it in the code example. While reading, I wasn't sure whether it was intentional or if something was missing.
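For readers of this thread, the validate-and-retry behaviour described in the quoted guide text can be sketched without any LLM library. This is an illustration only: `extract_with_retries`, `validate`, and `fake_model` are hypothetical, a dataclass stands in for the Pydantic schema, and the sketch assumes `retries` counts corrective attempts after the first call.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Product:
    title: str
    price: float


def validate(raw: str) -> Product:
    """Parse the model's JSON reply and enforce the schema by hand."""
    data = json.loads(raw)
    return Product(title=str(data["title"]), price=float(data["price"]))


def extract_with_retries(
    ask_model: Callable[[str, Optional[str]], str], page: str, retries: int
) -> Product:
    error: Optional[str] = None
    # One initial attempt plus up to `retries` corrective attempts (assumption).
    for _ in range(retries + 1):
        raw = ask_model(page, error)
        try:
            return validate(raw)
        except (KeyError, ValueError, TypeError) as exc:
            # Feed the validation error back so the model can fix its reply.
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"no valid output after {retries} retries: {error}")


attempts: List[Optional[str]] = []


def fake_model(page: str, error: Optional[str]) -> str:
    # Stub model: the first reply has a bad price, the corrected reply is valid.
    attempts.append(error)
    if error is None:
        return '{"title": "Red mug", "price": "cheap"}'
    return '{"title": "Red mug", "price": 4.5}'


product = extract_with_retries(fake_model, "<h1>Red mug</h1>", retries=1)
print(product, len(attempts))
```

The second call receives the validation error from the first, which is the feedback loop the guide attributes to Pydantic AI. With `retries=0` the same stub would end in the `RuntimeError` branch.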
extractor=AiSelectorExtractor(
    model=model,
    # Pages the cached selectors cannot handle fall back to direct extraction.
    fallback=AiDirectExtractor(model=model),
What is the default behaviour? How can I provide a custom fallback? What can it be?
I have these questions after reading the doc guide.
If there are no other options for fallbacks, it can be just a boolean flag.
Without a fallback, the default behaviour is to raise UnexpectedModelBehavior if the extractor fails to generate selectors, or ValueError if a complex, unsupported schema is passed.
A fallback could be another AiSelectorExtractor with a different model, distiller, or instructions, or an AiDirectExtractor with custom settings. We can't reduce it to a boolean flag.
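A small sketch of why the fallback is an object and not a flag: any extractor that satisfies the same interface can be plugged in. Everything below is hypothetical and only mirrors the shape of the discussion. `ExtractionFailed` is a local stand-in for the errors named above, and the two classes are toy extractors, not the crawlee ones.

```python
import re
from typing import Dict, Optional, Protocol


class ExtractionFailed(Exception):
    """Local stand-in for the errors named in the reply above."""


class Extractor(Protocol):
    def extract(self, html: str) -> Dict[str, str]: ...


class DirectLikeExtractor:
    """Toy 'direct' strategy: always produces a result from the whole page."""

    def extract(self, html: str) -> Dict[str, str]:
        # Stand-in for a model call: strip tags and keep the visible text.
        return {"title": re.sub(r"<[^>]+>", "", html).strip(), "via": "direct"}


class SelectorLikeExtractor:
    """Toy 'selector' strategy with an optional fallback extractor."""

    def __init__(self, fallback: Optional[Extractor] = None) -> None:
        self._fallback = fallback

    def _extract_with_selectors(self, html: str) -> Dict[str, str]:
        # Pretend the learned selector only matches pages that contain <h1>.
        if "<h1>" not in html:
            raise ExtractionFailed("could not generate working selectors")
        title = html.split("<h1>", 1)[1].split("</h1>", 1)[0]
        return {"title": title, "via": "selectors"}

    def extract(self, html: str) -> Dict[str, str]:
        try:
            return self._extract_with_selectors(html)
        except ExtractionFailed:
            if self._fallback is None:
                raise  # No fallback configured: surface the error.
            return self._fallback.extract(html)


with_fallback = SelectorLikeExtractor(fallback=DirectLikeExtractor())
print(with_fallback.extract("<h1>Red mug</h1>"))
print(with_fallback.extract("<div>Odd page</div>"))
```

Because `fallback` only needs an `extract` method, it could just as well be a second `SelectorLikeExtractor` configured differently, which is the case a boolean flag cannot express.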
A distiller reduces raw HTML to a compact representation the model reads cheaply. Each extractor uses one. Replace it with the extractor's `distiller` argument (the crawler itself has no `distiller` argument).

<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> defaults to an <ApiLink to="class/AiCleanHtmlDistiller">`AiCleanHtmlDistiller`</ApiLink>: cleaned, structure-preserving HTML that keeps the full page text. <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> uses an <ApiLink to="class/AiSkeletonDistiller">`AiSkeletonDistiller`</ApiLink> internally to ask the model for selectors; you rarely set it yourself.
What happens if I use AiCleanHtmlDistiller in AiSelectorExtractor? And/or AiSkeletonDistiller in AiDirectExtractor? I also don't know what the difference between them is. I would expect this section to follow this structure:
## Distillers
...
### AiCleanHtmlDistiller
...
### AiSkeletonDistiller
...
### Custom distiller
...
> What happens if I use AiCleanHtmlDistiller in AiSelectorExtractor?
Using AiCleanHtmlDistiller with AiSelectorExtractor can improve extraction because the page keeps more data. That likely makes selector generation easier for the model. The price is more tokens that are wasted if the generation still fails.
> And/or AiSkeletonDistiller in AiDirectExtractor?
AiSkeletonDistiller with AiDirectExtractor is trickier. If the data is in the text, the model can't read it correctly because the text is truncated. But if the data is in attributes, it works and saves tokens.
The best distiller depends on the site and the task. For example, for news sites where the goal is to extract and organize the visible text, a MarkdownDistiller (the custom distiller example) with PydanticAiDirectExtractor will probably work best.
Mixing different extractors and distillers lets users find the optimal balance between extraction quality and the resources it costs.
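To make the trade-off above concrete, here is a rough sketch of what a skeleton-style distiller does: it truncates text and collapses runs of repeated siblings while leaving attributes intact. It is not the `AiSkeletonDistiller` code. `skeletonize`, `truncate`, the `"..."` marker, and the limits are assumptions, and `xml.etree.ElementTree` is used only because it handles this well-formed snippet without extra dependencies.

```python
import xml.etree.ElementTree as ET
from typing import Optional

MARKER = "..."  # Hypothetical truncation marker.


def truncate(text: Optional[str], limit: int) -> Optional[str]:
    if text is None:
        return None
    text = text.strip()
    return text if len(text) <= limit else text[:limit] + MARKER


def skeletonize(element: ET.Element, text_limit: int = 12, keep_siblings: int = 2) -> None:
    """Shrink the tree in place: short text, at most N siblings per run of one tag."""
    element.text = truncate(element.text, text_limit)
    element.tail = truncate(element.tail, text_limit)
    run_tag, run_len = None, 0
    for child in list(element):  # Copy the child list because we remove while iterating.
        run_len = run_len + 1 if child.tag == run_tag else 1
        run_tag = child.tag
        if run_len > keep_siblings:
            element.remove(child)  # Collapse the rest of a repeated run.
        else:
            skeletonize(child, text_limit, keep_siblings)


html = (
    '<ul><li data-sku="A1">First item with a long description</li>'
    '<li data-sku="B2">Second</li><li data-sku="C3">Third</li></ul>'
)
root = ET.fromstring(html)
skeletonize(root)
skeleton = ET.tostring(root, encoding="unicode")
print(skeleton)
```

The structure and the `data-sku` attributes survive, which is enough to derive selectors or read attribute data. The visible text is cut, and that is why pairing such a distiller with direct extraction works poorly when the target data lives in the text.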
Title changed from "AiCrawler with AI-powered HTML extraction" to "PydanticAiCrawler with AI-powered HTML extraction"
Description
- `PydanticAiCrawler` - a new HTTP crawler that parses pages with `parsel` and uses `pydantic-ai` as the layer for LLM interaction.
- `PydanticAiHtmlDistiller` is a protocol for distillers that clean HTML and convert it to a compact format (e.g., cleaned HTML, Markdown) for an LLM.
- `PydanticAiCleanHtmlDistiller` removes comments, noisy attributes, and scripts, returning a compact HTML version.
- `PydanticAiSkeletonDistiller` extends `PydanticAiCleanHtmlDistiller` by truncating text and collapsing repeated siblings.
- `PydanticAiHtmlExtractor` is a protocol for extractors that turn a page into structured data using a distiller and an LLM.
- `PydanticAiDirectExtractor` sends the distilled page to an LLM together with a Pydantic schema describing the target data and returns the validated result.
- `PydanticAiSelectorExtractor` asks the LLM for CSS selectors once and caches them in a `KeyValueStore`, so later pages are extracted without an LLM call.
Issues
Testing
`PydanticAiCrawler`, `PydanticAiCleanHtmlDistiller`, `PydanticAiSkeletonDistiller`, `PydanticAiDirectExtractor`, and `PydanticAiSelectorExtractor`.