feat: Add `PydanticAiCrawler` with AI-powered HTML extraction by Mantisus · Pull Request #1964 · apify/crawlee-python

Mantisus · 2026-06-14T17:44:22Z

Description

Adds PydanticAiCrawler - a new HTTP crawler that parses pages with parsel and uses pydantic-ai as the layer for LLM interaction.
PydanticAiHtmlDistiller is a protocol for distillers that clean HTML and convert it to a compact format (e.g., cleaned HTML, Markdown) for an LLM.
- PydanticAiCleanHtmlDistiller removes comments, noisy attributes, and scripts, returning a compact HTML version.
- PydanticAiSkeletonDistiller extends PydanticAiCleanHtmlDistiller by truncating text and collapsing repeated siblings.
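The kind of cleanup a clean-HTML distiller performs can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation (the file summary indicates the PR builds on an `lxml_html_clean.Cleaner`); the class name `CleanHtmlSketch`, the function `distill`, and the set of kept attributes are assumptions made for this example.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: which attributes are worth keeping is a choice made for this sketch.
KEPT_ATTRS = {"id", "class", "href", "src"}
DROPPED_TAGS = {"script", "style"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CleanHtmlSketch(HTMLParser):
    """Re-emit HTML without comments, script/style blocks, or noisy attributes."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEPT_ATTRS and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace runs and drop whitespace-only text nodes.
        text = re.sub(r"\s+", " ", data)
        if not self._skip_depth and text.strip():
            self.parts.append(escape(text, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def distill(markup: str) -> str:
    parser = CleanHtmlSketch()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Feeding it a `div` that carries `style` and `onclick` attributes, a comment, and a `script` block returns only the structural tags, the kept attributes, and the visible text.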
PydanticAiHtmlExtractor is a protocol for extractors that turn a page into structured data using a distiller and an LLM.
- PydanticAiDirectExtractor sends the distilled page to an LLM together with a Pydantic schema describing the target data and returns the validated result.
- PydanticAiSelectorExtractor asks the LLM for CSS selectors once and caches them in a KeyValueStore, so later pages are extracted without an LLM call.
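The "learn selectors once, reuse them afterwards" idea can be sketched as follows. Everything here is a stand-in chosen for illustration: a plain `dict` replaces the KeyValueStore, `fake_planner` replaces the LLM call, and ElementTree path expressions replace CSS selectors so the example runs without third-party packages. It is not the PR's `PydanticAiSelectorExtractor`.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional

# Hypothetical type: field name -> ElementTree path (a stand-in for a CSS selector).
SelectorPlan = Dict[str, str]


class CachedSelectorExtractor:
    """Ask a planner for selectors once per cache key, then reuse them."""

    def __init__(self, planner: Callable[[str], SelectorPlan]) -> None:
        self._planner = planner                     # stand-in for the LLM call
        self._cache: Dict[str, SelectorPlan] = {}   # stand-in for a KeyValueStore
        self.planner_calls = 0

    def extract(self, cache_key: str, page: str) -> Dict[str, Optional[str]]:
        plan = self._cache.get(cache_key)
        if plan is None:
            # Cache miss: this is the only place the "model" is consulted.
            self.planner_calls += 1
            plan = self._planner(page)
            self._cache[cache_key] = plan
        root = ET.fromstring(page)
        return {field: root.findtext(path) for field, path in plan.items()}


def fake_planner(page: str) -> SelectorPlan:
    # A real planner would send a distilled page plus a schema to a model.
    return {"title": ".//h1", "price": ".//span[@class='price']"}


extractor = CachedSelectorExtractor(fake_planner)
page_a = '<html><body><h1>Book A</h1><span class="price">10</span></body></html>'
page_b = '<html><body><h1>Book B</h1><span class="price">12</span></body></html>'
first = extractor.extract("product", page_a)
second = extractor.extract("product", page_b)
```

After both calls `extractor.planner_calls` is 1: the second page is extracted purely from the cached plan.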

Issues

Closes: Add support for AI/LLM-based HTML parsing (selectors) #1593

Testing

Added new unit tests for PydanticAiCrawler, PydanticAiCleanHtmlDistiller, PydanticAiSkeletonDistiller, PydanticAiDirectExtractor, and PydanticAiSelectorExtractor.

Copilot

Pull request overview

This PR introduces an experimental AiCrawler (HTTP-based, Parsel-backed) plus a small AI-extraction subsystem that can either (a) directly extract structured data via an LLM or (b) learn & cache CSS selectors via an LLM and reuse them on later pages to avoid repeated model calls. This adds a native AI/LLM extraction path for HTTP crawlers (issue #1593) and integrates it into the public crawlee.crawlers API and docs.

Changes:

- Add AiCrawler + AiCrawlingContext (context.extract(...)) and new AI distiller/extractor abstractions (AiCleanHtmlDistiller, AiSkeletonDistiller, AiDirectExtractor, AiSelectorExtractor, AiUsageStats).
- Add new optional dependency extra ai (parsel + lxml clean + pydantic-ai-slim[openai]) and include it in the all extra.
- Add extensive unit tests and new documentation/guide + runnable code examples for common AI crawler setups.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `uv.lock` | Adds locked deps for the new `ai` optional dependency group (and includes in `all`). |
| `pyproject.toml` | Defines `ai` extra and adds it to `all`. |
| `src/crawlee/crawlers/__init__.py` | Re-exports AI crawler/distiller/extractor APIs behind optional imports. |
| `src/crawlee/crawlers/_ai/__init__.py` | AI module public surface with optional-import handling. |
| `src/crawlee/crawlers/_ai/_ai_crawler.py` | Implements `AiCrawler` wiring + experimental warning + context pipeline integration. |
| `src/crawlee/crawlers/_ai/_ai_crawling_context.py` | Adds `AiCrawlingContext` with `extract` helper and shared usage stats. |
| `src/crawlee/crawlers/_ai/_base_distiller.py` | Base distiller + JSON-script protect/unprotect helpers. |
| `src/crawlee/crawlers/_ai/_base_extractor.py` | Base extractor: model resolution, instruction composition, usage accumulation, scope helpers. |
| `src/crawlee/crawlers/_ai/_clean_html_distiller.py` | Clean/distill HTML for direct LLM extraction (size caps, attr filtering, JSON handling). |
| `src/crawlee/crawlers/_ai/_direct_extractor.py` | Direct extraction strategy using pydantic-ai output validation + usage tracking. |
| `src/crawlee/crawlers/_ai/_prompts.py` | Shared prompt instructions/notes and truncation marker constants. |
| `src/crawlee/crawlers/_ai/_selector_extractor.py` | Selector-learning extractor with caching, persistence, validation, retries, and fallback support. |
| `src/crawlee/crawlers/_ai/_skeleton_distiller.py` | Skeleton distiller for selector generation (text truncation + sibling collapsing + max-size tightening). |
| `src/crawlee/crawlers/_ai/_types.py` | Protocols for distillers/extractors and `AiUsageStats`. |
| `src/crawlee/crawlers/_ai/_utils.py` | Utility to build a default `lxml_html_clean.Cleaner` for distillers. |
| `tests/unit/crawlers/_ai/test_ai_crawler.py` | Unit tests for `AiCrawler` behavior and context/extractor forwarding. |
| `tests/unit/crawlers/_ai/test_clean_html_distiller.py` | Unit tests for `AiCleanHtmlDistiller` reduction, truncation, and size enforcement. |
| `tests/unit/crawlers/_ai/test_direct_extractor.py` | Unit tests for `AiDirectExtractor` prompt composition, scoping, retries, and usage. |
| `tests/unit/crawlers/_ai/test_selector_extractor.py` | Unit tests for selector caching, concurrency, invalid plans/data retries, fallback, persistence. |
| `tests/unit/crawlers/_ai/test_skeleton_distiller.py` | Unit tests for skeleton truncation, sibling collapsing, and oversize handling. |
| `docs/guides/architecture_overview.mdx` | Updates architecture diagrams/text to include `AiCrawler` + `AiCrawlingContext`. |
| `docs/guides/ai_crawler.mdx` | New user guide for installing and using `AiCrawler`, extractors, distillers, usage limits. |
| `docs/guides/code_examples/ai_crawler/basic_example.py` | Example: basic `AiCrawler` usage. |
| `docs/guides/code_examples/ai_crawler/additional_instructions_example.py` | Example: per-call `additional_instructions`. |
| `docs/guides/code_examples/ai_crawler/custom_distiller_example.py` | Example: custom Markdown distiller. |
| `docs/guides/code_examples/ai_crawler/selector_extractor_example.py` | Example: selector extractor + direct fallback. |
| `docs/guides/code_examples/ai_crawler/usage_limit_example.py` | Example: per-run usage limits + cumulative token budget stopping. |


B4nan · 2026-06-23T07:27:56Z

I kinda hate the name AiCrawler 🙃

vdusek · 2026-06-23T10:07:23Z

> I kinda hate the name AiCrawler 🙃

Why? Is it too generic / "buzzword-y"?

If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext. Since this is built on top of Pydantic and PydanticAI. Which might make sense, because it also highlights that selectors can be defined using Pydantic models (similar to ParselCrawler, BeautifulSoupCrawler, ...).

B4nan · 2026-06-23T10:16:26Z

Yes, exactly, I would rather name it this way so it's clear what it does. AiCrawler sounds like a product name, not a library feature to me.

Mantisus · 2026-06-23T11:08:21Z

> I kinda hate the name AiCrawler 🙃

Yeah, me too 😄. I chose it just as a "hyped" marketing name.

> If so, maybe we could consider something like PydanticAiCrawler + PydanticAiCrawlingContext.

I agree.

vdusek

Looks good! A few comments...

vdusek · 2026-06-23T10:13:45Z

+### AiDirectExtractor
+
+<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> sends the distilled page to the model in one call. The schema is the model's output type. Pydantic AI validates the result; on a mismatch, it sends the error back to the model to fix, bounded by `retries`.
+
+It reads each page on its own, so extraction is accurate per page. It accepts schemas of any shape: nested models, lists, dictionaries, unions, and deep nesting. The cost is one model call per page, which scales poorly on a large site.
+
+Use `additional_instructions` to focus the model on the data you want:
+
+<CodeBlock className="language-python">
+    {AdditionalInstructionsExample}
+</CodeBlock>
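The validate-and-retry behaviour described in the guide text above can be sketched without any third-party package. This is an illustration of the general technique only, not how Pydantic AI implements it: `validate`, `extract_with_retries`, and `fake_model` are hypothetical names, a dataclass stands in for a Pydantic schema, and the "model" is a stub that corrects itself once it receives feedback.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Product:
    name: str
    price: float


def validate(raw: dict) -> Product:
    """Stand-in for schema validation; raises ValueError with a readable message."""
    if not isinstance(raw.get("name"), str):
        raise ValueError("name must be a string")
    price = raw.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise ValueError("price must be a number")
    return Product(name=raw["name"], price=float(price))


def extract_with_retries(
    model: Callable[[str, Optional[str]], dict], page: str, retries: int = 1
) -> Product:
    """Call the model, validate the output, and feed the error back on failure."""
    feedback: Optional[str] = None
    for _ in range(retries + 1):  # one initial attempt plus `retries` corrections
        raw = model(page, feedback)
        try:
            return validate(raw)
        except ValueError as exc:
            feedback = str(exc)
    raise RuntimeError(f"output still invalid after {retries} retries: {feedback}")


def fake_model(page: str, feedback: Optional[str]) -> dict:
    # Stub: answers wrongly at first, correctly once it has seen an error message.
    if feedback is None:
        return {"name": "Book", "price": "ten"}
    return {"name": "Book", "price": 10}


result = extract_with_retries(fake_model, "<html/>", retries=1)
```

With `retries=0` the same stub never gets a second attempt, so the call raises instead of returning a `Product`.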


We should make it clear that AiDirectExtractor is the default, or explicitly include it in the code example. While reading, I wasn't sure whether it was intentional or if something was missing.

vdusek · 2026-06-23T10:14:38Z

+        extractor=AiSelectorExtractor(
+            model=model,
+            # Pages the cached selectors cannot handle fall back to direct extraction.
+            fallback=AiDirectExtractor(model=model),


What is the default behaviour? How can I provide a custom fallback? What can it be?

I have these questions after reading the doc guide.

If there are no other options for fallbacks, it can be just a boolean flag.

Mantisus · 2026-06-23

Without a fallback, the extractor raises UnexpectedModelBehavior if it fails to generate selectors, or ValueError if a complex, unsupported schema is passed.

A fallback could be another AiSelectorExtractor with a different model, distiller, or instructions, or an AiDirectExtractor with custom settings. We can't reduce it to a boolean flag.
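Why the fallback is an extractor rather than a flag can be shown with a small sketch. The names below (`with_fallback`, `SelectorGenerationError`, and the two stub extractors) are hypothetical and only model the control flow described in this reply; the real classes and exception types live in the PR.

```python
from typing import Callable, Optional

Extractor = Callable[[str], dict]


class SelectorGenerationError(Exception):
    """Hypothetical error standing in for a failed selector generation."""


def with_fallback(primary: Extractor, fallback: Optional[Extractor] = None) -> Extractor:
    """Wrap `primary` so that its failures are handed to `fallback`, if one is given."""

    def extract(page: str) -> dict:
        try:
            return primary(page)
        except (SelectorGenerationError, ValueError):
            if fallback is None:
                raise  # no fallback configured: the error reaches the caller
            return fallback(page)

    return extract


def selector_extractor(page: str) -> dict:
    # Stub: pretends selectors only work on pages exposing a data-price attribute.
    if "data-price" not in page:
        raise SelectorGenerationError("no usable selectors for this page")
    return {"source": "selectors"}


def direct_extractor(page: str) -> dict:
    # Stub: pretends a model read the page directly.
    return {"source": "direct"}


chain = with_fallback(selector_extractor, fallback=direct_extractor)
strict = with_fallback(selector_extractor)
```

Because the fallback is itself an extractor, wrappers nest: `with_fallback(a, with_fallback(b, direct_extractor))` tries a second selector-based extractor before resorting to direct extraction, which a boolean could not express.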

vdusek · 2026-06-23T10:17:16Z

+
+A distiller reduces raw HTML to a compact representation the model reads cheaply. Each extractor uses one. Replace it with the extractor's `distiller` argument (the crawler itself has no `distiller` argument).
+
+<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> defaults to an <ApiLink to="class/AiCleanHtmlDistiller">`AiCleanHtmlDistiller`</ApiLink>: cleaned, structure-preserving HTML that keeps the full page text. <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> uses an <ApiLink to="class/AiSkeletonDistiller">`AiSkeletonDistiller`</ApiLink> internally to ask the model for selectors; you rarely set it yourself.


What happens if I use AiCleanHtmlDistiller in AiSelectorExtractor? And/or AiSkeletonDistiller in AiDirectExtractor? I also don't know what the difference between them is. I would expect this section to follow this structure:

## Distillers ... ### AiCleanHtmlDistiller ... ### AiSkeletonDistiller ... ### Custom distiller ...

Mantisus · 2026-06-23

> What happens if I use AiCleanHtmlDistiller in AiSelectorExtractor?

Using AiCleanHtmlDistiller with AiSelectorExtractor can improve extraction because the page keeps more data, which likely makes selector generation easier for the model. The price is a larger prompt, and those tokens are wasted if generation still fails.

> And/or AiSkeletonDistiller in AiDirectExtractor?

AiSkeletonDistiller with AiDirectExtractor is trickier. If the data is in the text, the model can't read it correctly because the text is truncated. But if the data is in attributes, it works and saves tokens.

The best distiller depends on the site and the task. For example, for news sites where the goal is to extract and organize the visible text, a MarkdownDistiller (the custom distiller example) with PydanticAiDirectExtractor will probably work best.

Mixing different extractors and distillers lets users find the optimal balance between extraction quality and the resources it costs.
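The trade-off described above (truncated text, intact attributes) can be made concrete with a small sketch of skeleton-style distillation. It is an assumption-laden illustration, not the PR's `AiSkeletonDistiller`: `skeletonize`, `MAX_TEXT`, and `MAX_SIBLINGS` are hypothetical, and ElementTree is used only to keep the example free of third-party dependencies.

```python
import xml.etree.ElementTree as ET

MAX_TEXT = 12      # assumed character cap per text node
MAX_SIBLINGS = 2   # assumed number of consecutive same-tag siblings to keep


def _truncate(text):
    """Shorten long text; leave short or missing text untouched."""
    if text and len(text.strip()) > MAX_TEXT:
        return text.strip()[:MAX_TEXT] + "..."
    return text


def skeletonize(element: ET.Element) -> None:
    """Truncate text in place and drop long runs of same-tag siblings."""
    element.text = _truncate(element.text)
    run_tag, run_len = None, 0
    for child in list(element):  # iterate over a copy because children are removed
        run_len = run_len + 1 if child.tag == run_tag else 1
        run_tag = child.tag
        if run_len > MAX_SIBLINGS:
            element.remove(child)
            continue
        child.tail = _truncate(child.tail)
        skeletonize(child)


page = (
    '<ul>'
    '<li data-sku="A1">A very long product description</li>'
    '<li data-sku="B2">Short</li>'
    '<li data-sku="C3">Third</li>'
    '</ul>'
)
root = ET.fromstring(page)
skeletonize(root)
```

After the call, the first item still carries `data-sku="A1"` but its description is cut off, and the third repeated `li` is gone. That is enough structure to derive selectors from, but not enough text for a model to read the description directly.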

Add AiCrawler with AI-powered HTML extraction

06f78f2

Mantisus self-assigned this Jun 15, 2026

Mantisus added 5 commits June 15, 2026 15:17

Merge branch 'master' into llm-html-crawler

f350c8a

Merge branch 'master' into llm-html-crawler

5c25ab3

add tests

490b645

add docs

7f496a6

Merge branch 'master' into llm-html-crawler

2a15e6e

Mantisus marked this pull request as ready for review June 17, 2026 19:27

Mantisus requested review from Pijukatel, janbuchar and vdusek June 17, 2026 19:27

vdusek requested a review from Copilot June 23, 2026 06:40

Copilot started reviewing on behalf of vdusek June 23, 2026 06:40

Copilot AI reviewed Jun 23, 2026


Comment thread src/crawlee/crawlers/_ai/_selector_extractor.py Outdated

Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py Outdated

Comment thread src/crawlee/crawlers/_pydantic_ai/_selector_extractor.py

vdusek reviewed Jun 23, 2026


Mantisus added 2 commits June 23, 2026 16:14

rename

f7cfe52

fix

624b181

Mantisus changed the title ~~feat: Add AiCrawler with AI-powered HTML extraction~~ feat: Add PydanticAiCrawler with AI-powered HTML extraction Jun 23, 2026

update docs

66cd04f


		A distiller reduces raw HTML to a compact representation the model reads cheaply. Each extractor uses one. Replace it with the extractor's `distiller` argument (the crawler itself has no `distiller` argument).

		<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> defaults to an <ApiLink to="class/AiCleanHtmlDistiller">`AiCleanHtmlDistiller`</ApiLink>: cleaned, structure-preserving HTML that keeps the full page text. <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> uses an <ApiLink to="class/AiSkeletonDistiller">`AiSkeletonDistiller`</ApiLink> internally to ask the model for selectors; you rarely set it yourself.
