feat: Add StagehandCrawler with AI-powered browser automation #1854
Mantisus wants to merge 18 commits into apify:master
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
Adds first-class Stagehand integration to Crawlee Python by introducing a StagehandCrawler (built on PlaywrightCrawler) plus corresponding browser-pool plugin/controller, enabling AI-driven page actions (act, extract, observe, execute) while keeping Crawlee’s existing routing/sessions/proxy/navigation-hook features.
Changes:
- Introduces `StagehandCrawler` + Stagehand-specific crawling contexts and exports them from `crawlee.crawlers`.
- Adds `StagehandBrowserPlugin`/`StagehandBrowserController`, `StagehandOptions`, and `StagehandPage`, integrated with `BrowserPool`.
- Adds Stagehand documentation + examples, updates architecture docs, and replaces the older “Playwright with Stagehand” guide; updates dependencies and adds unit tests.
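One way to picture a page object that offers AI methods while still behaving like a regular browser page is plain delegation. The sketch below is illustrative only: `FakePage`, `FakeSession`, and `AIPage` are hypothetical stand-ins, not the classes from this PR, and the PR may well use inheritance or a different mechanism for `StagehandPage`.

```python
from __future__ import annotations

import asyncio
from typing import Any


class FakePage:
    """Hypothetical stand-in for a browser page; only what the sketch needs."""

    def __init__(self) -> None:
        self.url = 'about:blank'

    async def goto(self, url: str) -> None:
        self.url = url


class FakeSession:
    """Hypothetical stand-in for an AI session; records instructions it receives."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    async def act(self, instruction: str) -> str:
        self.calls.append(('act', instruction))
        return f'done: {instruction}'


class AIPage:
    """Hypothetical wrapper: AI methods go to the session, everything else to the page."""

    def __init__(self, page: FakePage, session: FakeSession) -> None:
        self._page = page
        self._session = session

    async def act(self, instruction: str) -> str:
        return await self._session.act(instruction)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails, so `act` above takes
        # precedence and every other attribute falls through to the wrapped page.
        return getattr(self._page, name)


async def demo() -> tuple[str, str]:
    page = AIPage(FakePage(), FakeSession())
    await page.goto('https://example.com')  # delegated to the wrapped page
    result = await page.act('click the first link')  # handled by the AI session
    return page.url, result
```

Running `asyncio.run(demo())` exercises both paths: the navigation call reaches the wrapped page, and the natural-language instruction reaches the session.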
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| `uv.lock` | Locks new optional Stagehand dependency set and adds stagehand extra resolution entries. |
| `pyproject.toml` | Adds stagehand optional dependency group and includes it in all. |
| `src/crawlee/browsers/__init__.py` | Exposes Stagehand browser plugin/controller and types via optional imports. |
| `src/crawlee/browsers/_stagehand_types.py` | Defines StagehandOptions and StagehandPage AI-method wrappers. |
| `src/crawlee/browsers/_stagehand_browser_plugin.py` | Implements StagehandBrowserPlugin lifecycle and Stagehand client initialization. |
| `src/crawlee/browsers/_stagehand_browser_controller.py` | Implements CDP connection + lazy session start, page creation, and header injection for Stagehand. |
| `src/crawlee/crawlers/__init__.py` | Exposes Stagehand crawler + contexts via optional imports. |
| `src/crawlee/crawlers/_stagehand/__init__.py` | Adds Stagehand crawler module exports with optional-deps handling. |
| `src/crawlee/crawlers/_stagehand/_stagehand_crawler.py` | Adds StagehandCrawler built on PlaywrightCrawler and auto-configures a Stagehand BrowserPool. |
| `src/crawlee/crawlers/_stagehand/_stagehand_crawling_context.py` | Adds Stagehand-specific crawling context dataclasses and type-narrowed page. |
| `src/crawlee/crawlers/_playwright/_playwright_crawler.py` | Refactors Playwright crawler to support overridable context classes and generic context typing via _build_context. |
| `tests/unit/browsers/test_stagehand_browser_plugin.py` | Adds unit tests for plugin activation and Stagehand client init parameter wiring. |
| `tests/unit/browsers/test_stagehand_browser_controller.py` | Adds unit tests for lazy session start, concurrency behavior, proxies, and header behavior. |
| `tests/unit/crawlers/_stagehand/test_stagehand_crawler.py` | Adds unit tests verifying context types, hook contexts, and StagehandPage AI-method delegation. |
| `docs/guides/stagehand_crawler.mdx` | New guide documenting StagehandCrawler, options, AI methods, and Browserbase usage. |
| `docs/guides/code_examples/stagehand_crawler/basic_example.py` | Example demonstrating act() + extract() with JSON schema. |
| `docs/guides/code_examples/stagehand_crawler/browserbase_example.py` | Example demonstrating Browserbase environment configuration. |
| `docs/guides/playwright_crawler_stagehand.mdx` | Removes old guide that described manual Stagehand integration with PlaywrightCrawler. |
| `docs/guides/code_examples/playwright_crawler_stagehand/support_classes.py` | Removes old example support classes for the manual Stagehand integration. |
| `docs/guides/code_examples/playwright_crawler_stagehand/browser_classes.py` | Removes old example browser plugin/controller classes for the manual Stagehand integration. |
| `docs/guides/code_examples/playwright_crawler_stagehand/stagehand_run.py` | Removes old “manual integration” runnable example. |
| `docs/guides/architecture_overview.mdx` | Updates architecture diagrams/text to include StagehandCrawler + contexts. |
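The summary mentions a "lazy session start" in the controller and tests for its concurrency behavior. A common way to get that property in asyncio code is a lock with a second check inside it, so that several pages requested at once trigger only one session start. The sketch below assumes that pattern; `LazySession` and its methods are hypothetical names, and the actual controller in this PR may be implemented differently.

```python
from __future__ import annotations

import asyncio
from typing import Optional


class LazySession:
    """Hypothetical holder that starts a remote session on first use only."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._session_id: Optional[str] = None
        self.start_count = 0  # exposed so the demo can verify single start

    async def _start(self) -> str:
        # Stand-in for a slow call that opens the real session.
        self.start_count += 1
        await asyncio.sleep(0.01)
        return 'session-1'

    async def get(self) -> str:
        # Fast path: session already started, no locking needed.
        if self._session_id is None:
            async with self._lock:
                # Re-check inside the lock: another task may have started
                # the session while this one was waiting.
                if self._session_id is None:
                    self._session_id = await self._start()
        return self._session_id


async def demo() -> tuple[set[str], int]:
    lazy = LazySession()
    # Five concurrent callers, as if five pages were opened at once.
    ids = await asyncio.gather(*(lazy.get() for _ in range(5)))
    return set(ids), lazy.start_count
```

Without the second check inside the lock, every waiting task would start its own session once it acquired the lock.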
Docs check fails due to the current versioning logic.
vdusek left a comment
Mostly doc-related / style things. Maybe you could also align the `.rules.md` file (about the double backticks and line width for docstrings).
    """Browserbase project ID, required when `env='BROWSERBASE'`. If not provided, read from
    the `BROWSERBASE_PROJECT_ID` environment variable."""

    model: str = 'openai/gpt-4.1-mini'
That's a fairly dated model, wouldn't 5.4-nano work better?
I used the same model as JS. But if we're ready to upgrade the model, then yes, I think the 5.4-nano would be better.
I guess we should do that on both sides, any thoughts @B4nan?
Description
Adds `StagehandCrawler` - a new browser crawler powered by Stagehand that lets users interact with pages using natural language instead of CSS selectors or XPath. Extends `PlaywrightCrawler` and inherits all of its features: routing, sessions, autoscaling, proxies, and navigation hooks.
- `StagehandPage` extends Playwright `Page` with four AI methods: `act()`, `extract()`, `observe()`, and `execute()`.
- `StagehandOptions` configures the AI model, execution environment (LOCAL/BROWSERBASE), and session parameters.
- `StagehandBrowserPlugin` and `StagehandBrowserController` integrate Stagehand into the browser pool, managing session lifecycle and fingerprint header injection.
- `BrowserLaunchOptions`.
Issues
Testing
`StagehandBrowserController`, `StagehandBrowserPlugin`, and `StagehandCrawler` are tested with Stagehand mocked out - no real LLM connection is required to run the test suite.
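As a rough illustration of what "mocked out" can look like, the sketch below replaces an AI client with `unittest.mock.AsyncMock` from the standard library. The `summarize` function and the `extract(instruction=..., url=...)` call shape are hypothetical and chosen only for the demo; they are not taken from the PR's test suite or from the Stagehand API.

```python
import asyncio
from unittest.mock import AsyncMock


async def summarize(client, url: str) -> str:
    """Hypothetical code under test: asks an AI client for the page title."""
    data = await client.extract(instruction='get the page title', url=url)
    return data['title'].strip()


async def demo() -> str:
    # AsyncMock creates awaitable child mocks on attribute access,
    # so no network call or LLM is involved.
    client = AsyncMock()
    client.extract.return_value = {'title': '  Example Domain '}

    title = await summarize(client, 'https://example.com')

    # Verify the code under test called the client exactly once as expected.
    client.extract.assert_awaited_once_with(
        instruction='get the page title',
        url='https://example.com',
    )
    return title
```

The assertion on the mock checks the wiring (which arguments were passed), while the return value checks how the response is processed.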