[feat]: add ignoreSelectors to extract()#2084

Merged: seanmcguire12 merged 6 commits into main from feat/ignore-selectors-extract on May 6, 2026

@seanmcguire12 seanmcguire12 commented May 6, 2026

why

  • to allow users to specify selectors whose content should be omitted from the LLM request inside of extract()

what changed

  • added ignoreSelectors to extract() so callers can exclude parts of the page from the extraction snapshot
  • packages/core/lib/v3/understudy/a11y/snapshot/capture.ts:
    • added resolveIgnoreSelectorRoots() to resolve ignore selectors into frameId + backendNodeId
    • added buildFrameExclusionIntervals() to turn those roots into per-frame subtree ranges
    • updated collectPerFrameMaps() / tryScopedSnapshot() to skip ignored nodes when building snapshot data
  • packages/core/lib/v3/understudy/a11y/snapshot/domTree.ts:
    • updated buildSessionDomIndex() to record dfs entry/exit positions for each backend node so ignored subtrees can be filtered by interval membership instead of having to re-traverse the tree
  • packages/core/lib/v3/understudy/a11y/snapshot/a11yTree.ts:
    • updated a11yForFrame() to accept an ignore predicate and remove ignored nodes from the outline (the stringified tree) and urlMap
  • packages/core/lib/v3/handlers/extractHandler.ts and packages/core/lib/v3/v3.ts:
    • passed ignoreSelectors down through the extract path into captureHybridSnapshot()
  • kept xpathMap values based on the original dom structure and only skipped ignored entries, so surviving selectors do not get renumbered
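The interval idea behind buildSessionDomIndex() and buildFrameExclusionIntervals() can be sketched as follows. This is a minimal illustration under assumptions, not the code from this PR: the `DomNode` shape and the helper names `buildDfsIndex`, `buildExclusionIntervals`, and this `isIgnoredBackendNode` signature are hypothetical. Only the `enterByBe` / `exitByBe` names and the half-open `[start, end)` convention are taken from the review diagram later in this thread.

```typescript
// Hypothetical, simplified node shape (a real CDP DOM node carries more fields).
interface DomNode {
  backendNodeId: number;
  children?: DomNode[];
}

interface DfsIndex {
  enterByBe: Map<number, number>; // backendNodeId -> DFS entry position
  exitByBe: Map<number, number>; // backendNodeId -> one past its last descendant
}

type Interval = [number, number]; // half-open: [start, end)

// One pre-order walk records where each node's subtree starts and ends.
function buildDfsIndex(root: DomNode): DfsIndex {
  const enterByBe = new Map<number, number>();
  const exitByBe = new Map<number, number>();
  let position = 0;
  const visit = (node: DomNode): void => {
    enterByBe.set(node.backendNodeId, position++);
    for (const child of node.children ?? []) visit(child);
    exitByBe.set(node.backendNodeId, position);
  };
  visit(root);
  return { enterByBe, exitByBe };
}

// Turn ignored subtree roots into intervals; ids missing from the index are skipped.
function buildExclusionIntervals(
  index: DfsIndex,
  ignoredRoots: Iterable<number>,
): Interval[] {
  const intervals: Interval[] = [];
  for (const be of ignoredRoots) {
    const start = index.enterByBe.get(be);
    const end = index.exitByBe.get(be);
    if (start !== undefined && end !== undefined) intervals.push([start, end]);
  }
  return intervals;
}

// A node is ignored when its entry position falls inside any exclusion interval.
function isIgnoredBackendNode(
  index: DfsIndex,
  intervals: Interval[],
  be: number,
): boolean {
  const pos = index.enterByBe.get(be);
  if (pos === undefined) return false;
  return intervals.some(([start, end]) => pos >= start && pos < end);
}

// Example tree: 1 -> [2 -> [3, 4], 5]; ignoring node 2 also hides 3 and 4.
const exampleTree: DomNode = {
  backendNodeId: 1,
  children: [
    { backendNodeId: 2, children: [{ backendNodeId: 3 }, { backendNodeId: 4 }] },
    { backendNodeId: 5 },
  ],
};
const exampleIndex = buildDfsIndex(exampleTree);
const exampleIntervals = buildExclusionIntervals(exampleIndex, [2]);
```

The membership check here is a linear scan over the intervals to keep the sketch short; sorting and merging the intervals first would allow a binary search instead.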

test plan

  • updated packages/core/tests/unit/public-api/public-types.test.ts to cover the new ignoreSelectors option on extract()
  • updated packages/core/tests/unit/snapshot-a11y-resolvers.test.ts to cover ignored backend nodes being removed from the a11y outline and urlMap, and to cover scoped snapshots dropping ignored xpath entries
  • updated packages/core/tests/unit/snapshot-capture-orchestration.test.ts to cover the new capture flow, including filtered per-frame maps and merged snapshot output when ignoreSelectors is used

Summary by cubic

Adds ignoreSelectors to extract() so callers can exclude elements and their subtrees from snapshots and the LLM input. Supports CSS and XPath (with iframe hops) across shadow DOM and iframes, without renumbering existing XPaths; also adds a fast path for scoped snapshots and excludes child frame subtrees when an iframe host is ignored.

  • New Features
    • extract({ ignoreSelectors: string[] }): drops matching nodes and descendants from DOM maps, a11y outline, and urlMap; xpathMap keeps original values but omits ignored entries.
    • Applies to full-page and scoped snapshots; filters resolved per frame, excludes child frames when their iframe host is ignored, and uses DFS enter/exit intervals for speed.
    • A11y tree accepts an ignore predicate and filters before building the outline and links.
    • Scoped snapshots return early when only selector is provided (no ignoreSelectors) for faster capture.
    • Plumbed through extractHandler, v3, and captureHybridSnapshot(); public types and OpenAPI updated in @browserbasehq/stagehand; unit tests added.
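To illustrate why surviving XPaths are not renumbered, here is a small sketch that filters an existing map instead of recomputing paths from a pruned tree. The helper name `omitIgnoredEntries` and the use of plain backendNodeId numbers as map keys are assumptions made for illustration; the key format used by the real xpathMap is not shown in this PR description.

```typescript
// Hypothetical helper: keeps every surviving XPath exactly as computed from the original DOM.
function omitIgnoredEntries(
  xpathMap: Map<number, string>,
  isIgnored: (backendNodeId: number) => boolean,
): Map<number, string> {
  const filtered = new Map<number, string>();
  for (const [backendNodeId, xpath] of xpathMap) {
    if (isIgnored(backendNodeId)) continue; // drop the entry, do not rewrite its siblings
    filtered.set(backendNodeId, xpath);
  }
  return filtered;
}

// Two sibling divs; the first one (id 10) and its child (id 11) are ignored.
const originalXpaths = new Map<number, string>([
  [10, "/html[1]/body[1]/div[1]"],
  [11, "/html[1]/body[1]/div[1]/span[1]"],
  [20, "/html[1]/body[1]/div[2]"],
]);
const ignoredIds = new Set<number>([10, 11]);
const survivingXpaths = omitIgnoredEntries(originalXpaths, (be) => ignoredIds.has(be));
```

In this sketch the second div keeps its `div[2]` index. Because the ignored element is only left out of the snapshot and not removed from the page, a path computed from the original structure should still point at the same element.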

Written for commit 0ad2f67. Summary will update on new commits.

changeset-bot Bot commented May 6, 2026

🦋 Changeset detected

Latest commit: 0ad2f67

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@browserbasehq/stagehand Minor
@browserbasehq/stagehand-evals Patch
@browserbasehq/stagehand-server-v3 Patch
@browserbasehq/stagehand-server-v4 Patch

github-actions Bot commented May 6, 2026

✱ Stainless preview builds

This PR will update the stagehand SDKs with the following commit message.

feat: [feat]: add `ignoreSelectors` to `extract()`

⚠️ stagehand-python studio · conflict
Your SDK build had at least one warning diagnostic.

⚠️ stagehand-typescript studio · code
Your SDK build had at least one "warning" diagnostic.
generate ⚠️ · build ✅ · lint ✅ · test ✅
npm install https://pkg.stainless.com/s/stagehand-typescript/8129516138e28fe6c07617a1d4de584b80633651/dist.tar.gz

stagehand-go studio · code
Your SDK build had at least one "note" diagnostic.
generate ✅ · build ⏭️ · lint ✅ · test ✅
go get github.com/stainless-sdks/stagehand-go@547119099d9201d53820f12345e44acd940cccc6

⚠️ stagehand-php studio · conflict
Your SDK build had at least one warning diagnostic.

stagehand-java studio · code
Your SDK build had at least one "note" diagnostic.
generate ✅ · build ✅ · lint ✅ · test ✅
Add the following URL as a Maven source: 'https://pkg.stainless.com/s/stagehand-java/ab25ab1c249cb689ea39cb2e3970927d616ddefd/mvn'

stagehand-openapi studio · code
Your SDK build had at least one "note" diagnostic.
generate ✅

stagehand-ruby studio · code
Your SDK build had at least one "note" diagnostic.
generate ✅ · build ⏭️ · lint ✅ · test ✅

⚠️ stagehand-csharp studio · code
Your SDK build had at least one "warning" diagnostic.
generate ⚠️ · build ✅ · lint ✅ · test ✅

stagehand-kotlin studio · code
Your SDK build had at least one "note" diagnostic.
generate ✅ · build ✅ · lint ✅ · test ✅


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-05-06 17:10:27 UTC

@seanmcguire12

@cubic-dev-ai

cubic-dev-ai Bot commented May 6, 2026

@cubic-dev-ai

@seanmcguire12 I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 14 files

Confidence score: 3/5

  • There is concrete regression risk in packages/core/lib/v3/understudy/a11y/snapshot/capture.ts: ignored iframes may still contribute child-frame content to merged snapshots, which can leak excluded regions into output.
  • The ignore-selector handling in packages/core/lib/v3/understudy/a11y/snapshot/capture.ts appears to process only the first match per selector, so additional matching nodes (for example broad selectors) may remain unexpectedly in snapshots.
  • Given the medium-high severity (6-7/10) and high confidence (8-9/10) on behavior that affects snapshot correctness, this is mergeable with caution but carries meaningful user-facing risk.
  • Pay close attention to packages/core/lib/v3/understudy/a11y/snapshot/capture.ts - iframe ignore propagation and multi-match selector exclusion need to be validated.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/understudy/a11y/snapshot/capture.ts">

<violation number="1" location="packages/core/lib/v3/understudy/a11y/snapshot/capture.ts:418">
P2: This resolves only the first element matched by each ignore selector, so broad selectors like `.ad` leave later matching subtrees in the snapshot. Iterate all matches for each selector or otherwise collect every matching backendNodeId.</violation>

<violation number="2" location="packages/core/lib/v3/understudy/a11y/snapshot/capture.ts:550">
P1: Ignoring an iframe element does not propagate the exclusion to that child frame, so the iframe subtree can still appear in the merged snapshot. Add child-frame intervals for ignored iframe roots before collecting per-frame maps.</violation>
</file>
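
A sketch of the fix suggested in violation 1: collect every match for each selector and group the results by frame. The `resolveAll` callback is a hypothetical stand-in for the CDP-backed resolution in capture.ts, and it is synchronous here only to keep the example short (real CDP calls are asynchronous). Skipping selectors that throw mirrors the "silently skips unresolvable selectors" note in the diagram that follows.

```typescript
interface ResolvedNode {
  frameId: string;
  backendNodeId: number;
}

// Hypothetical resolver: returns ALL nodes matching a selector, or throws if it is invalid.
type ResolveAll = (selector: string) => ResolvedNode[];

function collectIgnoreRoots(
  selectors: string[],
  resolveAll: ResolveAll,
): Map<string, Set<number>> {
  const rootsByFrame = new Map<string, Set<number>>();
  for (const selector of selectors) {
    let matches: ResolvedNode[];
    try {
      matches = resolveAll(selector);
    } catch {
      continue; // invalid or unresolvable selector: skip it and keep going
    }
    // Record every match, not just the first one.
    for (const { frameId, backendNodeId } of matches) {
      const roots = rootsByFrame.get(frameId) ?? new Set<number>();
      roots.add(backendNodeId);
      rootsByFrame.set(frameId, roots);
    }
  }
  return rootsByFrame;
}

// Fake page: ".ad" matches two nodes in the main frame and one in a child frame.
const fakeResolveAll: ResolveAll = (selector) => {
  if (selector === ".ad") {
    return [
      { frameId: "main", backendNodeId: 7 },
      { frameId: "main", backendNodeId: 42 },
      { frameId: "child", backendNodeId: 3 },
    ];
  }
  if (selector === "#missing") return [];
  throw new Error(`cannot resolve selector: ${selector}`);
};
const ignoreRoots = collectIgnoreRoots([".ad", "#missing", "!!bad"], fakeResolveAll);
```

Grouping by frame first keeps the later interval computation per frame, which matches the `Map<frameId, Set<backendNodeId>>` shape shown in the diagram.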
Architecture diagram
sequenceDiagram
    participant Client as Client (extract caller)
    participant Handler as ExtractHandler
    participant V3 as V3 Orchestrator
    participant Capture as captureHybridSnapshot()
    participant Selector as resolveIgnoreSelectorRoots()
    participant Frame as per-frame logic
    participant DomIndex as buildSessionDomIndex()
    participant A11y as a11yForFrame()
    participant Page as CDP Page / Frames

    Client->>Handler: extract({ instruction, schema, ignoreSelectors, ... })
    Handler->>V3: resolveLlmClient()
    V3-->>Handler: llmClient

    opt focusSelector present
        Handler->>Handler: compute focusSelector for LLM
    end

    Handler->>Capture: captureHybridSnapshot(page, { ignoreSelectors, focusSelector, ... })

    Capture->>Capture: buildFrameContext(page)

    alt ignoreSelectors provided
        Capture->>Capture: resolveIgnoreSelectorRoots(page, ignoreSelectors, context, sessionToIndex)
        Note over Capture,Selector: Resolves each ignore selector to frameId + backendNodeId
        Selector->>Page: resolveFocusFrameAndTail() / resolveCssFocusFrameAndTail()
        Page-->>Selector: { targetFrameId, tailXPath/tailSelector }
        Selector->>Page: resolveObjectIdForXPath() / resolveObjectIdForCss()
        Page-->>Selector: objectId
        Selector->>Page: DOM.describeNode({ objectId })
        Page-->>Selector: backendNodeId
        Selector-->>Capture: ignoreRootsByFrame (Map<frameId, Set<backendNodeId>>)

        Capture->>Capture: buildFrameExclusionIntervals(page, sessionToIndex, ignoreRootsByFrame)
        Note over Capture: Uses enterByBe / exitByBe from SessionDomIndex to compute [start, end) intervals per frame
        Capture-->>Capture: exclusionIntervalsByFrame

        Capture->>Capture: tryScopedSnapshot() with exclusionIntervalsByFrame
        Note over Capture: Filters xpathMap entries whose backendNodeId falls inside any exclusion interval
    else no ignoreSelectors
        Capture->>Capture: tryScopedSnapshot() without filter
    end

    alt scoped snapshot generated
        Capture->>Capture: buildSessionIndexes() / domMapsForSession()
        DomIndex->>DomIndex: buildSessionDomIndex() (records enterByBe, exitByBe)
        Note over DomIndex: DFS traversal now records enter/exit indices for interval-based filtering
        DomIndex-->>Capture: sessionToIndex

        Capture->>Capture: computeFramePrefixes()
        Capture->>Frame: a11yForFrame() with isIgnoredBackendNode predicate
        A11y->>A11y: filter accessibility nodes via isIgnoredBackendNode
        A11y-->>Capture: { outline, urlMap } (ignored nodes removed)
        Capture-->>Handler: HybridSnapshot (with xpathMap, urlMap, outline filtered)

        Handler->>Handler: call LLM with snapshot artifacts
        LLM-->>Handler: extracted data
        Handler-->>Client: result

    else full page capture required
        Capture->>Capture: collectPerFrameMaps() with exclusionIntervalsByFrame
        loop each frame in scope
            Frame->>Frame: resolveFrameDocRootBackendId()
            Frame->>Frame: iterate absByBe, skip if isIgnoredBackendNode(be) or not in frame's docRoot
            Frame->>A11y: a11yForFrame() with isIgnoredBackendNode predicate
            A11y->>A11y: filter accessibility nodes, outline, urlMap
            A11y-->>Frame: per-frame outline, urlMap, tagNameMap, xpathMap
        end
        Capture->>Capture: merge per-frame data into combinedTree, combinedXpathMap, combinedUrlMap
        Capture-->>Handler: HybridSnapshot (merged, ignoring selectors applied)

        Handler->>Handler: call LLM with merged snapshot
        LLM-->>Handler: extracted data
        Handler-->>Client: result
    end

    alt invalid/not-found ignore selector
        Selector->>Selector: catch error, continue to next selector
        Note over Selector: Silently skips unresolvable selectors
    end

    alt backendNodeId falls inside exclusion interval
        Frame->>Frame: skip node from xpathMap, tagNameMap, scrollableMap
        A11y->>A11y: skip node from outline and urlMap
    else backendNodeId outside exclusion intervals or not ignored
        Frame->>Frame: include node normally
    end
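
The last branch of the diagram above, where ignored nodes are left out of the outline and urlMap, could look roughly like this. The `AxNode` shape, its `url` field, and the `renderOutline` helper are simplified assumptions; the real a11yForFrame() works on CDP accessibility nodes and is not reproduced here.

```typescript
// Hypothetical, simplified accessibility node.
interface AxNode {
  backendDOMNodeId?: number;
  role: string;
  name?: string;
  url?: string;
  children?: AxNode[];
}

interface Outline {
  lines: string[];
  urlMap: Map<number, string>;
}

function renderOutline(
  root: AxNode,
  isIgnoredBackendNode: (be: number) => boolean,
): Outline {
  const lines: string[] = [];
  const urlMap = new Map<number, string>();
  const visit = (node: AxNode, depth: number): void => {
    const be = node.backendDOMNodeId;
    // Stop here so the ignored node and all of its descendants are left out.
    if (be !== undefined && isIgnoredBackendNode(be)) return;
    const label = node.name ? `${node.role}: ${node.name}` : node.role;
    lines.push("  ".repeat(depth) + label);
    if (be !== undefined && node.url !== undefined) urlMap.set(be, node.url);
    for (const child of node.children ?? []) visit(child, depth + 1);
  };
  visit(root, 0);
  return { lines, urlMap };
}

// The "Ads" region (id 3) is ignored, so its link (id 4) never reaches the urlMap.
const examplePage: AxNode = {
  backendDOMNodeId: 1,
  role: "main",
  children: [
    { backendDOMNodeId: 2, role: "link", name: "Docs", url: "https://example.com/docs" },
    {
      backendDOMNodeId: 3,
      role: "complementary",
      name: "Ads",
      children: [
        { backendDOMNodeId: 4, role: "link", name: "Buy", url: "https://example.com/buy" },
      ],
    },
  ],
};
const exampleOutline = renderOutline(examplePage, (be) => be === 3);
```

Filtering before the outline string is built means the ignored text never has to be stripped from the serialized tree afterwards.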

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread packages/core/lib/v3/understudy/a11y/snapshot/capture.ts
Comment thread packages/core/lib/v3/understudy/a11y/snapshot/capture.ts Outdated
@seanmcguire12

@cubic-dev-ai

cubic-dev-ai Bot commented May 6, 2026

@cubic-dev-ai

@seanmcguire12 I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 14 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Client as Caller (extract())
    participant EH as ExtractHandler
    participant V3 as V3
    participant CHS as captureHybridSnapshot()
    participant FC as FrameContext
    participant SR as SelectorResolver
    participant DI as buildSessionDomIndex()
    participant RI as resolveIgnoredNodes()
    participant FE as buildFrameExclusionIntervals()
    participant CFM as collectPerFrameMaps()
    participant TS as tryScopedSnapshot()
    participant AT as a11yForFrame()
    participant DT as domMapsForSession()
    participant CDP as CDP Session (Frame)

    Note over Client,CDP: NEW: ignoreSelectors Flow

    Client->>EH: extract({ ignoreSelectors, ... })
    EH->>V3: pass ignoreSelectors
    V3->>CHS: captureHybridSnapshot(page, { ignoreSelectors })

    CHS->>FC: buildFrameContext(page)
    CHS->>DI: buildSessionDomIndex() per frame
    
    Note over DI: NEW: records enterByBe/exitByBe<br/>for fast subtree interval checks

    CHS->>RI: resolveIgnoredNodes(page, ignoreSelectors, context, sessionToIndex)
    activate RI
    RI->>SR: resolveAll() for each selector
    
    alt Selector is XPath
        RI->>CDP: resolveFocusFrameAndTail(page, xpath, ...)
        CDP-->>RI: targetFrameId + tailXPath
        RI->>CDP: DOM.querySelector / evaluate
        CDP-->>RI: backendNodeId
    else Selector is CSS
        RI->>CDP: resolveCssFocusFrameAndTail(page, selector, ...)
        CDP-->>RI: targetFrameId + tailSelector
        RI->>CDP: FrameSelectorResolver.resolveAll()
        CDP-->>RI: resolved nodes
    end
    
    alt Node is iframe host
        RI->>RI: resolve child frame content document backendNodeId
        RI->>RI: add child frame root node to ignored set
    end
    
    RI-->>CHS: ignoredNodesByFrame (Map<frameId, Set<backendNodeId>>)
    deactivate RI

    CHS->>FE: buildFrameExclusionIntervals(page, context, sessionToIndex, ignoredNodesByFrame)
    activate FE
    FE->>FE: for each frame, compute DFS intervals from enterByBe/exitByBe
    FE-->>CHS: exclusionIntervalsByFrame (Map<frameId, Interval[]>)
    deactivate FE

    alt focusSelector present AND ignoreSelectors present
        CHS->>TS: tryScopedSnapshot(..., sessionToIndex, exclusionIntervalsByFrame)
        activate TS
        TS->>DT: domMapsForSession()
        DT-->>TS: xpathMap, tagNameMap
        
        TS->>TS: filter xpathMap entries using isIgnoredBackendNode
        Note over TS: skip ignored nodes via interval membership
        
        TS->>AT: a11yForFrame(..., { isIgnoredBackendNode })
        activate AT
        AT->>AT: filter AX nodes using isIgnoredBackendNode
        AT-->>TS: filtered outline + urlMap
        deactivate AT
        
        TS-->>CHS: HybridSnapshot (scoped, with ignored nodes removed)
        deactivate TS
        CHS-->>V3: result
        V3-->>EH: result
        EH-->>Client: extracted data
    else No focusSelector OR no ignoreSelectors
        CHS->>CFM: collectPerFrameMaps(..., exclusionIntervalsByFrame)
        activate CFM
        Note over CFM: NEW: skip ignored nodes when building xpathMap/scrollableMap
        
        CFM->>CFM: for each frame, iterate absByBe entries
        alt isIgnoredBackendNode(be) is true
            CFM->>CFM: continue (skip entry)
        else
            CFM->>CFM: add to xpathMap, tagNameMap, scrollableMap
        end
        
        CFM->>AT: a11yForFrame(..., { isIgnoredBackendNode })
        activate AT
        AT->>AT: filter AX nodes using isIgnoredBackendNode
        AT-->>CFM: filtered outline + urlMap
        deactivate AT
        
        alt Child frame host is ignored
            CFM->>CFM: exclude child frame's entire subtree
            Note over CFM: contentDocRootByIframe + DFS intervals
        end
        
        CFM-->>CHS: perFrameMaps, perFrameOutlines
        deactivate CFM
        
        CHS->>CHS: merge filtered per-frame data
        CHS-->>V3: HybridSnapshot
        V3-->>EH: result
        EH-->>Client: extracted data
    end
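
The "Node is iframe host" branch above addresses the earlier P1 finding. One way to express that propagation is sketched below, assuming a precomputed list of iframe hosts. `IframeHost`, `propagateToChildFrames`, and the toy predicate are hypothetical names for illustration; the actual implementation in capture.ts may be structured differently.

```typescript
// Hypothetical description of one <iframe> element and the frame it hosts.
interface IframeHost {
  parentFrameId: string;
  hostBackendNodeId: number; // the <iframe> element in the parent frame
  childFrameId: string;
  childDocRootBackendNodeId: number; // content document root of the child frame
}

type IgnoredRoots = Map<string, Set<number>>;
type IsIgnored = (roots: IgnoredRoots, frameId: string, backendNodeId: number) => boolean;

// Repeat until stable so iframes nested inside an ignored iframe are excluded too.
function propagateToChildFrames(
  roots: IgnoredRoots,
  hosts: IframeHost[],
  isIgnored: IsIgnored,
): IgnoredRoots {
  const result: IgnoredRoots = new Map();
  for (const [frameId, ids] of roots) result.set(frameId, new Set(ids)); // copy input
  let changed = true;
  while (changed) {
    changed = false;
    for (const host of hosts) {
      if (!isIgnored(result, host.parentFrameId, host.hostBackendNodeId)) continue;
      const childRoots = result.get(host.childFrameId) ?? new Set<number>();
      if (childRoots.has(host.childDocRootBackendNodeId)) continue;
      childRoots.add(host.childDocRootBackendNodeId);
      result.set(host.childFrameId, childRoots);
      changed = true;
    }
  }
  return result;
}

// main hosts frame "a" via node 10; frame "a" hosts frame "b" via node 110.
// The nested host is listed first on purpose to show why the loop repeats.
const exampleHosts: IframeHost[] = [
  { parentFrameId: "a", hostBackendNodeId: 110, childFrameId: "b", childDocRootBackendNodeId: 200 },
  { parentFrameId: "main", hostBackendNodeId: 10, childFrameId: "a", childDocRootBackendNodeId: 100 },
];
const docRootByFrame = new Map<string, number>([["a", 100], ["b", 200]]);

// Toy predicate: a node is ignored if listed directly or if its frame's document root is listed.
const toyIsIgnored: IsIgnored = (roots, frameId, be) => {
  const ids = roots.get(frameId);
  if (!ids) return false;
  const docRoot = docRootByFrame.get(frameId);
  return ids.has(be) || (docRoot !== undefined && ids.has(docRoot));
};

const initialRoots: IgnoredRoots = new Map([["main", new Set([10])]]);
const propagatedRoots = propagateToChildFrames(initialRoots, exampleHosts, toyIsIgnored);
```

In a real capture the predicate would presumably be the interval check rather than this toy version, so that an iframe nested anywhere under an ignored element is picked up.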

@seanmcguire12 seanmcguire12 marked this pull request as ready for review May 6, 2026 02:09
@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 14 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

@seanmcguire12 seanmcguire12 merged commit 0641d44 into main May 6, 2026
207 checks passed
@github-actions github-actions Bot mentioned this pull request May 6, 2026