[feat]: add ignoreSelectors to extract() #2084
🦋 Changeset detected
Latest commit: 0ad2f67
The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
@seanmcguire12 I have started the AI code review. It will take a few minutes to complete.
2 issues found across 14 files
Confidence score: 3/5
- There is concrete regression risk in `packages/core/lib/v3/understudy/a11y/snapshot/capture.ts`: ignored iframes may still contribute child-frame content to merged snapshots, which can leak excluded regions into output.
- The ignore-selector handling in `packages/core/lib/v3/understudy/a11y/snapshot/capture.ts` appears to process only the first match per selector, so additional matching nodes (for example, from broad selectors) may remain unexpectedly in snapshots.
- Given the medium-high severity (6-7/10) and high confidence (8-9/10) on behavior that affects snapshot correctness, this is mergeable with caution but carries meaningful user-facing risk.
- Pay close attention to `packages/core/lib/v3/understudy/a11y/snapshot/capture.ts`: iframe ignore propagation and multi-match selector exclusion need to be validated.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/understudy/a11y/snapshot/capture.ts">
<violation number="1" location="packages/core/lib/v3/understudy/a11y/snapshot/capture.ts:418">
P2: This resolves only the first element matched by each ignore selector, so broad selectors like `.ad` leave later matching subtrees in the snapshot. Iterate all matches for each selector or otherwise collect every matching backendNodeId.</violation>
<violation number="2" location="packages/core/lib/v3/understudy/a11y/snapshot/capture.ts:550">
P1: Ignoring an iframe element does not propagate the exclusion to that child frame, so the iframe subtree can still appear in the merged snapshot. Add child-frame intervals for ignored iframe roots before collecting per-frame maps.</violation>
</file>
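The fix suggested in violation 1 (collect every match, not only the first) can be sketched as follows. This is a minimal illustration, not the PR's code: `ResolveAll`, `collectIgnoreRoots`, and the fake resolver are hypothetical names, and the real lookup would go through CDP rather than an in-memory table.

```typescript
// Hypothetical resolver: returns the backendNodeId of every node matching a
// selector inside one frame, and may throw for an invalid or unmatched selector.
type ResolveAll = (selector: string) => number[];

function collectIgnoreRoots(
  selectors: string[],
  resolveAll: ResolveAll,
): Set<number> {
  const roots = new Set<number>();
  for (const selector of selectors) {
    try {
      // Keep every match, not just the first one.
      for (const backendNodeId of resolveAll(selector)) {
        roots.add(backendNodeId);
      }
    } catch {
      // Unresolvable selectors are skipped; the remaining selectors still apply.
    }
  }
  return roots;
}

// Usage with a fake resolver standing in for the CDP-backed lookup.
const fakeMatches: Record<string, number[]> = {
  ".ad": [11, 27, 42],
  "#banner": [5],
};
const fakeResolveAll: ResolveAll = (selector) => {
  const hit = fakeMatches[selector];
  if (!hit) throw new Error(`no match for ${selector}`);
  return hit;
};
const ignoreRoots = collectIgnoreRoots([".ad", "??bad", "#banner"], fakeResolveAll);
```

With a broad selector such as `.ad`, all three matching roots end up in the set, and the invalid selector is skipped without aborting the others.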
Architecture diagram
sequenceDiagram
participant Client as Client (extract caller)
participant Handler as ExtractHandler
participant V3 as V3 Orchestrator
participant Capture as captureHybridSnapshot()
participant Selector as resolveIgnoreSelectorRoots()
participant Frame as per-frame logic
participant DomIndex as buildSessionDomIndex()
participant A11y as a11yForFrame()
participant Page as CDP Page / Frames
Client->>Handler: extract({ instruction, schema, ignoreSelectors, ... })
Handler->>V3: resolveLlmClient()
V3-->>Handler: llmClient
opt focusSelector present
Handler->>Handler: compute focusSelector for LLM
end
Handler->>Capture: captureHybridSnapshot(page, { ignoreSelectors, focusSelector, ... })
Capture->>Capture: buildFrameContext(page)
alt ignoreSelectors provided
Capture->>Capture: resolveIgnoreSelectorRoots(page, ignoreSelectors, context, sessionToIndex)
Note over Capture,Selector: Resolves each ignore selector to frameId + backendNodeId
Selector->>Page: resolveFocusFrameAndTail() / resolveCssFocusFrameAndTail()
Page-->>Selector: { targetFrameId, tailXPath/tailSelector }
Selector->>Page: resolveObjectIdForXPath() / resolveObjectIdForCss()
Page-->>Selector: objectId
Selector->>Page: DOM.describeNode({ objectId })
Page-->>Selector: backendNodeId
Selector-->>Capture: ignoreRootsByFrame (Map<frameId, Set<backendNodeId>>)
Capture->>Capture: buildFrameExclusionIntervals(page, sessionToIndex, ignoreRootsByFrame)
Note over Capture: Uses enterByBe / exitByBe from SessionDomIndex to compute [start, end) intervals per frame
Capture-->>Capture: exclusionIntervalsByFrame
Capture->>Capture: tryScopedSnapshot() with exclusionIntervalsByFrame
Note over Capture: Filters xpathMap entries whose backendNodeId falls inside any exclusion interval
else no ignoreSelectors
Capture->>Capture: tryScopedSnapshot() without filter
end
alt scoped snapshot generated
Capture->>Capture: buildSessionIndexes() / domMapsForSession()
DomIndex->>DomIndex: buildSessionDomIndex() (records enterByBe, exitByBe)
Note over DomIndex: DFS traversal now records enter/exit indices for interval-based filtering
DomIndex-->>Capture: sessionToIndex
Capture->>Capture: computeFramePrefixes()
Capture->>Frame: a11yForFrame() with isIgnoredBackendNode predicate
A11y->>A11y: filter accessibility nodes via isIgnoredBackendNode
A11y-->>Capture: { outline, urlMap } (ignored nodes removed)
Capture-->>Handler: HybridSnapshot (with xpathMap, urlMap, outline filtered)
Handler->>Handler: call LLM with snapshot artifacts
LLM-->>Handler: extracted data
Handler-->>Client: result
else full page capture required
Capture->>Capture: collectPerFrameMaps() with exclusionIntervalsByFrame
loop each frame in scope
Frame->>Frame: resolveFrameDocRootBackendId()
Frame->>Frame: iterate absByBe, skip if isIgnoredBackendNode(be) or not in frame's docRoot
Frame->>A11y: a11yForFrame() with isIgnoredBackendNode predicate
A11y->>A11y: filter accessibility nodes, outline, urlMap
A11y-->>Frame: per-frame outline, urlMap, tagNameMap, xpathMap
end
Capture->>Capture: merge per-frame data into combinedTree, combinedXpathMap, combinedUrlMap
Capture-->>Handler: HybridSnapshot (merged, ignoring selectors applied)
Handler->>Handler: call LLM with merged snapshot
LLM-->>Handler: extracted data
Handler-->>Client: result
end
alt invalid/not-found ignore selector
Selector->>Selector: catch error, continue to next selector
Note over Selector: Silently skips unresolvable selectors
end
alt backendNodeId falls inside exclusion interval
Frame->>Frame: skip node from xpathMap, tagNameMap, scrollableMap
A11y->>A11y: skip node from outline and urlMap
else backendNodeId outside exclusion intervals or not ignored
Frame->>Frame: include node normally
end
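The interval-based filtering in the diagram above (`enterByBe` / `exitByBe` producing `[start, end)` ranges) can be sketched on a toy tree. The function names mirror the diagram, but the bodies, the `DomNode` shape, and the signatures are assumptions made for illustration, not the PR's implementation.

```typescript
interface DomNode {
  backendNodeId: number;
  children?: DomNode[];
}

interface DfsIndex {
  enterByBe: Map<number, number>;
  exitByBe: Map<number, number>;
}

// One DFS pass: `enter` is the node's preorder position, `exit` is the
// position just past its last descendant, so a subtree covers [enter, exit).
function buildDfsIndex(root: DomNode): DfsIndex {
  const enterByBe = new Map<number, number>();
  const exitByBe = new Map<number, number>();
  let counter = 0;
  const visit = (node: DomNode): void => {
    enterByBe.set(node.backendNodeId, counter++);
    for (const child of node.children ?? []) visit(child);
    exitByBe.set(node.backendNodeId, counter);
  };
  visit(root);
  return { enterByBe, exitByBe };
}

type Interval = [number, number];

// Turn each ignored root into its subtree interval; unknown ids are skipped.
function buildExclusionIntervals(index: DfsIndex, ignoredRoots: number[]): Interval[] {
  const intervals: Interval[] = [];
  for (const be of ignoredRoots) {
    const start = index.enterByBe.get(be);
    const end = index.exitByBe.get(be);
    if (start !== undefined && end !== undefined) intervals.push([start, end]);
  }
  return intervals;
}

// A node is ignored when its preorder position falls inside any interval.
function isIgnoredBackendNode(index: DfsIndex, intervals: Interval[], be: number): boolean {
  const pos = index.enterByBe.get(be);
  if (pos === undefined) return false;
  return intervals.some(([start, end]) => pos >= start && pos < end);
}

// Toy tree: 1 -> [2 -> [3, 4], 5]; ignoring node 2 should also drop 3 and 4.
const tree: DomNode = {
  backendNodeId: 1,
  children: [
    { backendNodeId: 2, children: [{ backendNodeId: 3 }, { backendNodeId: 4 }] },
    { backendNodeId: 5 },
  ],
};
const dfsIndex = buildDfsIndex(tree);
const exclusions = buildExclusionIntervals(dfsIndex, [2]);
```

The point of the design is that membership becomes a position comparison per node, so ignored subtrees do not need to be re-traversed for every lookup.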
@seanmcguire12 I have started the AI code review. It will take a few minutes to complete.
No issues found across 14 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Client as Caller (extract())
participant EH as ExtractHandler
participant V3 as V3
participant CHS as captureHybridSnapshot()
participant FC as FrameContext
participant SR as SelectorResolver
participant DI as buildSessionDomIndex()
participant RI as resolveIgnoredNodes()
participant FE as buildFrameExclusionIntervals()
participant CFM as collectPerFrameMaps()
participant TS as tryScopedSnapshot()
participant AT as a11yForFrame()
participant DT as domMapsForSession()
participant CDP as CDP Session (Frame)
Note over Client,CDP: NEW: ignoreSelectors Flow
Client->>EH: extract({ ignoreSelectors, ... })
EH->>V3: pass ignoreSelectors
V3->>CHS: captureHybridSnapshot(page, { ignoreSelectors })
CHS->>FC: buildFrameContext(page)
CHS->>DI: buildSessionDomIndex() per frame
Note over DI: NEW: records enterByBe/exitByBe<br/>for fast subtree interval checks
CHS->>RI: resolveIgnoredNodes(page, ignoreSelectors, context, sessionToIndex)
activate RI
RI->>SR: resolveAll() for each selector
alt Selector is XPath
RI->>CDP: resolveFocusFrameAndTail(page, xpath, ...)
CDP-->>RI: targetFrameId + tailXPath
RI->>CDP: DOM.querySelector / evaluate
CDP-->>RI: backendNodeId
else Selector is CSS
RI->>CDP: resolveCssFocusFrameAndTail(page, selector, ...)
CDP-->>RI: targetFrameId + tailSelector
RI->>CDP: FrameSelectorResolver.resolveAll()
CDP-->>RI: resolved nodes
end
alt Node is iframe host
RI->>RI: resolve child frame content document backendNodeId
RI->>RI: add child frame root node to ignored set
end
RI-->>CHS: ignoredNodesByFrame (Map<frameId, Set<backendNodeId>>)
deactivate RI
CHS->>FE: buildFrameExclusionIntervals(page, context, sessionToIndex, ignoredNodesByFrame)
activate FE
FE->>FE: for each frame, compute DFS intervals from enterByBe/exitByBe
FE-->>CHS: exclusionIntervalsByFrame (Map<frameId, Interval[]>)
deactivate FE
alt FocusSelector present AND ignoreSelectors present
CHS->>TS: tryScopedSnapshot(..., sessionToIndex, exclusionIntervalsByFrame)
activate TS
TS->>DT: domMapsForSession()
DT-->>TS: xpathMap, tagNameMap
TS->>TS: filter xpathMap entries using isIgnoredBackendNode
Note over TS: skip ignored nodes via interval membership
TS->>AT: a11yForFrame(..., { isIgnoredBackendNode })
activate AT
AT->>AT: filter AX nodes using isIgnoredBackendNode
AT-->>TS: filtered outline + urlMap
deactivate AT
TS-->>CHS: HybridSnapshot (scoped, with ignored nodes removed)
deactivate TS
CHS-->>V3: result
V3-->>EH: result
EH-->>Client: extracted data
else No focusSelector OR no ignoreSelectors
CHS->>CFM: collectPerFrameMaps(..., exclusionIntervalsByFrame)
activate CFM
Note over CFM: NEW: skip ignored nodes when building xpathMap/scrollableMap
CFM->>CFM: for each frame, iterate absByBe entries
alt isIgnoredBackendNode(be) is true
CFM->>CFM: continue (skip entry)
else
CFM->>CFM: add to xpathMap, tagNameMap, scrollableMap
end
CFM->>AT: a11yForFrame(..., { isIgnoredBackendNode })
activate AT
AT->>AT: filter AX nodes using isIgnoredBackendNode
AT-->>CFM: filtered outline + urlMap
deactivate AT
alt Child frame host is ignored
CFM->>CFM: exclude child frame's entire subtree
Note over CFM: contentDocRootByIframe + DFS intervals
end
CFM-->>CHS: perFrameMaps, perFrameOutlines
deactivate CFM
CHS->>CHS: merge filtered per-frame data
CHS-->>V3: HybridSnapshot
V3-->>EH: result
EH-->>Client: extracted data
end
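The "Node is iframe host" branch in the diagram above can be sketched as a propagation step over a per-frame ignored set. Everything here is hypothetical (the `IframeHost` shape, the function name, the parent-before-child ordering assumption). For brevity it only detects hosts that were selected directly or that sit in an already fully ignored frame; a host nested inside some other ignored element would need the interval check instead.

```typescript
type FrameId = string;
type IgnoredByFrame = Map<FrameId, Set<number>>;

// Hypothetical description of an iframe host element: the frame it lives in,
// the child frame it hosts, and that child frame's content document root.
interface IframeHost {
  hostFrameId: FrameId;
  hostBackendNodeId: number;
  childFrameId: FrameId;
  childDocRootBackendNodeId: number;
}

// Assumes `hosts` is ordered parent-before-child so nested frames cascade.
function propagateIgnoredIframes(
  ignored: IgnoredByFrame,
  hosts: IframeHost[],
): IgnoredByFrame {
  const result: IgnoredByFrame = new Map();
  ignored.forEach((ids, frameId) => result.set(frameId, new Set(ids)));

  const fullyIgnoredFrames = new Set<FrameId>();
  for (const host of hosts) {
    const selectedDirectly =
      result.get(host.hostFrameId)?.has(host.hostBackendNodeId) ?? false;
    if (selectedDirectly || fullyIgnoredFrames.has(host.hostFrameId)) {
      // Ignoring the child's document root excludes the whole child frame.
      const childSet = result.get(host.childFrameId) ?? new Set<number>();
      childSet.add(host.childDocRootBackendNodeId);
      result.set(host.childFrameId, childSet);
      fullyIgnoredFrames.add(host.childFrameId);
    }
  }
  return result;
}

const iframeHosts: IframeHost[] = [
  { hostFrameId: "main", hostBackendNodeId: 10, childFrameId: "child", childDocRootBackendNodeId: 100 },
  { hostFrameId: "child", hostBackendNodeId: 120, childFrameId: "grandchild", childDocRootBackendNodeId: 200 },
  { hostFrameId: "main", hostBackendNodeId: 30, childFrameId: "other", childDocRootBackendNodeId: 300 },
];
const initialIgnored: IgnoredByFrame = new Map([["main", new Set([10])]]);
const propagated = propagateIgnoredIframes(initialIgnored, iframeHosts);
```

In this toy setup, ignoring host `10` in the main frame also excludes the `child` and `grandchild` frames, while the unrelated `other` frame is left alone.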
why
`extract()`

what changed
- adds `ignoreSelectors` to `extract()` so callers can exclude parts of the page from the extraction snapshot
- `packages/core/lib/v3/understudy/a11y/snapshot/capture.ts`:
  - `resolveIgnoreSelectorRoots()` resolves ignore selectors into `frameId + backendNodeId`
  - `buildFrameExclusionIntervals()` turns those roots into per-frame subtree ranges
  - `collectPerFrameMaps()` / `tryScopedSnapshot()` skip ignored nodes when building snapshot data
- `packages/core/lib/v3/understudy/a11y/snapshot/domTree.ts`: `buildSessionDomIndex()` now records DFS entry/exit positions for each backend node, so ignored subtrees can be filtered by interval membership instead of re-traversing the tree
- `packages/core/lib/v3/understudy/a11y/snapshot/a11yTree.ts`: `a11yForFrame()` accepts an ignore predicate and removes ignored nodes from the outline (the stringified tree) and `urlMap`
- `packages/core/lib/v3/handlers/extractHandler.ts` and `packages/core/lib/v3/v3.ts`: `ignoreSelectors` is passed down through the extract path into `captureHybridSnapshot()`
- `xpathMap` values stay based on the original DOM structure and ignored entries are only skipped, so surviving selectors do not get renumbered

test plan
- `packages/core/tests/unit/public-api/public-types.test.ts`: covers the new `ignoreSelectors` option on `extract()`
- `packages/core/tests/unit/snapshot-a11y-resolvers.test.ts`: covers ignored backend nodes being removed from the a11y outline and `urlMap`, and scoped snapshots dropping ignored xpath entries
- `packages/core/tests/unit/snapshot-capture-orchestration.test.ts`: covers the new capture flow, including filtered per-frame maps and merged snapshot output when `ignoreSelectors` is used

Summary by cubic
Adds `ignoreSelectors` to `extract()` so callers can exclude elements and their subtrees from snapshots and the LLM input. Supports CSS and XPath (with iframe hops) across shadow DOM and iframes, without renumbering existing XPaths; also adds a fast path for scoped snapshots and excludes child frame subtrees when an iframe host is ignored.
- `extract({ ignoreSelectors: string[] })`: drops matching nodes and descendants from DOM maps, a11y outline, and `urlMap`; `xpathMap` keeps original values but omits ignored entries.
- Scoped-snapshot fast path when `selector` is provided (no `ignoreSelectors`) for faster capture.
- `ignoreSelectors` is passed through `extractHandler`, `v3`, and `captureHybridSnapshot()`; public types and OpenAPI updated in `@browserbasehq/stagehand`; unit tests added.

Written for commit 0ad2f67. Summary will update on new commits.
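The "keeps original values but omits ignored entries" behavior can be illustrated with a small filter over a map. This is a sketch under assumptions: the `XPathMap` shape (keyed by a plain backend node id) and the `omitIgnored` helper are hypothetical and simplified relative to whatever key encoding the real snapshot uses.

```typescript
// Hypothetical shape: backendNodeId -> absolute XPath computed from the full DOM.
type XPathMap = Record<number, string>;

// Drop ignored entries without touching the XPath strings of the survivors.
function omitIgnored(
  xpathMap: XPathMap,
  isIgnored: (backendNodeId: number) => boolean,
): XPathMap {
  const kept: XPathMap = {};
  for (const key of Object.keys(xpathMap)) {
    const be = Number(key);
    const xpath = xpathMap[be];
    if (xpath !== undefined && !isIgnored(be)) kept[be] = xpath;
  }
  return kept;
}

const fullMap: XPathMap = {
  1: "/html[1]/body[1]/div[1]", // ignored subtree root
  2: "/html[1]/body[1]/div[1]/span[1]", // descendant of the ignored root
  3: "/html[1]/body[1]/div[2]", // survives with its original index
};
const ignoredIds = new Set<number>([1, 2]);
const filteredMap = omitIgnored(fullMap, (be) => ignoredIds.has(be));
```

The surviving entry is still `div[2]`, not `div[1]`, so an XPath taken from the filtered snapshot continues to resolve against the live page, where the ignored `div[1]` still exists.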