Expose crawler_config on all MCP scrape tools #1965

Open
SohamKukreti wants to merge 1 commit into develop from fix/mcp-crawler-config-passthrough
@SohamKukreti
Collaborator

Summary

The MCP tools (md, html, screenshot, pdf, execute_js) each built a hardcoded CrawlerRunConfig() and accepted no user input for it, so wait_until, delay_before_return_html, cache_mode, and all other CrawlerRunConfig fields were silently ignored. /crawl already had full passthrough; this PR brings the remaining tools to parity.

Addresses #1963

List of files changed and why

  • schemas.py: add crawler_config: Optional[Dict] to all five request schemas so mcp_bridge.py exposes the field in MCP tool inputSchemas automatically
  • server.py: handlers now load via CrawlerRunConfig.load() then stamp endpoint-required fields on top (screenshot, pdf, js_code); fix screenshot_wait_for/wait_for_images defaults from 2/False to None so they only override crawler_config when explicitly passed
  • api.py: handle_markdown_request accepts crawler_config kwarg; cache_mode precedence uses key-presence check instead of falsy check so crawler_config.cache_mode correctly wins over legacy c
  • tests/mcp/test_mcp_crawler_config.py: 7 MCP SSE tests proving delay_before_return_html is honoured server-side on all tools
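
For reviewers, the two precedence rules described in the list above can be sketched roughly as follows. This is a simplified illustration using plain dicts, not the code from the PR: the function names, the `legacy_cache_mode` parameter, and the string cache-mode values are all hypothetical, and the real handlers load the dict through CrawlerRunConfig.load() rather than keeping it as a dict.

```python
from typing import Any, Dict, Optional


def resolve_cache_mode(
    crawler_config: Optional[Dict[str, Any]],
    legacy_cache_mode: Any,  # hypothetical name for the older cache parameter
) -> Any:
    """Key-presence check: an explicit crawler_config value wins, even if falsy."""
    config = crawler_config or {}
    if "cache_mode" in config:
        return config["cache_mode"]
    return legacy_cache_mode


def resolve_cache_mode_falsy(
    crawler_config: Optional[Dict[str, Any]],
    legacy_cache_mode: Any,
) -> Any:
    """Falsy check, shown for contrast: a falsy explicit value is lost."""
    config = crawler_config or {}
    return config.get("cache_mode") or legacy_cache_mode


def build_screenshot_config(
    crawler_config: Optional[Dict[str, Any]],
    screenshot_wait_for: Optional[float] = None,
) -> Dict[str, Any]:
    """Start from the user's config, then stamp endpoint-required fields on top."""
    config = dict(crawler_config or {})  # copy so the caller's dict is untouched
    config["screenshot"] = True  # required by the endpoint, always set
    # A None default means "not passed", so crawler_config keeps its own value.
    if screenshot_wait_for is not None:
        config["screenshot_wait_for"] = screenshot_wait_for
    return config
```

With a default of None, `build_screenshot_config({"screenshot_wait_for": 5})` keeps the user's 5, whereas a non-None default such as 2 would have overwritten it on every call. The same reasoning applies to the cache-mode lookup: only the key-presence variant can tell "not provided" apart from "provided but falsy".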

How Has This Been Tested?

Ran the tests in the tests/mcp/ folder and the Docker tests to verify that no existing functionality broke.
Created new tests to verify that crawler_config is respected when passed via MCP.
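
The idea behind the new tests can be illustrated with a small timing check. The sketch below is not taken from tests/mcp/test_mcp_crawler_config.py: `fake_md_tool` is a hypothetical in-process stand-in that simply sleeps for delay_before_return_html, whereas the real tests call the tools over MCP SSE against a running server.

```python
import asyncio
import time
from typing import Any, Dict, Optional


async def fake_md_tool(
    url: str, crawler_config: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Hypothetical stand-in for an MCP tool that honours the delay field."""
    config = crawler_config or {}
    await asyncio.sleep(config.get("delay_before_return_html", 0.0))
    return {"url": url, "success": True}


async def timed_call(crawler_config: Optional[Dict[str, Any]]) -> float:
    """Return the wall-clock seconds one tool call takes."""
    start = time.monotonic()
    await fake_md_tool("https://example.com", crawler_config)
    return time.monotonic() - start


def check_delay_is_honoured(delay: float = 0.3) -> bool:
    """True if passing the delay makes the call measurably slower."""
    baseline = asyncio.run(timed_call(None))
    delayed = asyncio.run(timed_call({"delay_before_return_html": delay}))
    # Allow some slack for timer resolution and scheduling jitter.
    return delayed - baseline >= delay * 0.8
```

If a tool ignored crawler_config, the delayed call would take about as long as the baseline and the check would return False.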

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
