
feat: add query owner context metadata #560

Merged: buger merged 6 commits into main from feat/semantic-owner-context-557-558 on May 6, 2026

Conversation

buger (Collaborator) commented May 6, 2026

Summary

  • add probe query --with-context / --owner-context for opt-in JSON owner-block metadata
  • introduce a shared language-agnostic source context layer for AST match, owner, comments, enclosing symbols, and enclosing calls
  • reuse parsed source context per file while formatting query/search JSON, avoiding parse-per-match behavior
  • improve search JSON metadata with language, owner symbols, qualified owners, enclosing symbols/calls, leading comments, and classified text matches
  • improve JS/TS symbols output by unwrapping exported declarations to their semantic inner names
  • handle multiline block comments, HTML comments, and multiline strings when classifying search match metadata
  • document the implemented query/search JSON context surfaces
  • add mixed TS/Go req-id search JSON regressions for the --no-merge evidence path, including negative string/comment hit classification

Implemented: probe query --with-context

probe query now has a new opt-in flag:

probe query 'fetch($$$ARGS)' ./src -l typescript --format json --with-context

The default query JSON remains backward-compatible. With --with-context, each result keeps the existing fields and additionally includes:

  • schema_version: "probe.query.context.v1" at the top level
  • language: inferred source language
  • pattern: source pattern and nullable pattern ID
  • match: exact AST match node type, content, line range, and column range
  • owner: smallest useful enclosing source block where available
  • owner.symbol and owner.qualified_symbol
  • owner.node_type, owner.scope, owner.lines, and owner.columns
  • owner.comments: attached leading/source comments as raw source facts
  • owner.enclosing_symbols: containing class/module/impl-style symbols where knowable
  • owner.enclosing_call / owner.enclosing_calls: generic call context for callback/call nesting where knowable
  • owner.content: owning block source text when available
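Collapsed into a single sketch, a --with-context result might look like the following. Field values, the results wrapper, and the file field are illustrative guesses, not output from a real run; only the field names bulleted above are taken from the PR.

```json
{
  "schema_version": "probe.query.context.v1",
  "results": [
    {
      "file": "src/api/client.ts",
      "language": "typescript",
      "pattern": { "source": "fetch($$$ARGS)", "id": null },
      "match": {
        "node_type": "call_expression",
        "content": "fetch(url, options)",
        "lines": [42, 42],
        "columns": [10, 29]
      },
      "owner": {
        "symbol": "loadUser",
        "qualified_symbol": "ApiClient.loadUser",
        "node_type": "method_definition",
        "scope": "function",
        "lines": [40, 48],
        "columns": [2, 3],
        "comments": [
          { "lines": [38, 39], "text": "// Fetches the user record." }
        ],
        "enclosing_symbols": [
          { "kind": "class", "name": "ApiClient", "line": 12 }
        ],
        "enclosing_calls": [],
        "content": "loadUser(id: string) { ... }"
      }
    }
  ]
}
```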

This is intentionally generic. Probe reports source facts only; it does not parse requirement IDs, policy annotations, checklist semantics, test frameworks, or security meanings.

Implemented: search JSON source metadata

Search JSON keeps its existing shape but now includes additional source facts useful for text-first evidence discovery:

  • language
  • owner_symbol for common owner blocks, including TS/JS methods and exported const arrows
  • owner_qualified_symbol, e.g. PolicyService.evaluatePolicy
  • enclosing_symbols, e.g. containing class/module/impl symbols
  • enclosing_call / enclosing_calls, e.g. generic describe(...) / it(...) callback call chains without framework interpretation
  • leading_comments with line ranges and raw text
  • matches with text, line/column range, kind (comment, string, code), and comment_role when applicable
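For one hypothetical search hit, those additive fields might look like this. The shape and values are illustrative only; the field names follow the list above.

```json
{
  "file": "fixture/policy.service.ts",
  "language": "typescript",
  "owner_symbol": "evaluatePolicy",
  "owner_qualified_symbol": "PolicyService.evaluatePolicy",
  "enclosing_symbols": [
    { "kind": "class", "name": "PolicyService", "line": 1 }
  ],
  "enclosing_calls": [],
  "leading_comments": [
    { "lines": [7, 8], "text": "// Implements SYS-REQ-424." }
  ],
  "matches": [
    {
      "text": "SYS-REQ-424",
      "lines": [7, 7],
      "kind": "comment",
      "comment_role": "leading"
    }
  ]
}
```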

The added regressions cover the intended search path:

probe search --allow-tests --strict-elastic-syntax --max-results 20 --no-merge \
  --format json '"SYS-REQ-424" OR "SYS-REQ-425"' ./fixture

They verify TS method ownership, exported const arrow ownership, qualified class-method identity, nested callback call chains, test callback scope, leading comments, classified comment matches, string-literal hits, and loose comment hits. Probe still does not interpret which comments are valid evidence; downstream tools decide that from the raw source facts.

Performance / maintainability notes

  • Query/search context generation now uses a reusable ParsedSourceContext abstraction.
  • JSON formatting keeps a per-command HashMap<PathBuf, Option<ParsedSourceContext>>, so each source file is parsed at most once per formatting pass.
  • Unsupported or unparseable files are cached as None, so repeated results from the same unsupported file do not retry parsing.
  • This keeps the PR scoped: no new product mode or global parser cache, just efficient reuse inside the existing output path.
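The parse-once reuse described above can be sketched as follows. The types and file names here are hypothetical stand-ins for the real ParsedSourceContext and its parse method, not the actual implementation:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

// Hypothetical stand-in for the real ParsedSourceContext, which holds a
// tree-sitter tree plus the file's source text.
struct ParsedSourceContext;

// Stand-in for ParsedSourceContext::parse: unreadable or unsupported
// files yield None, and that None is cached too, so they are not retried.
fn parse(path: &PathBuf) -> Option<ParsedSourceContext> {
    std::fs::read_to_string(path).ok().map(|_| ParsedSourceContext)
}

// Counts how many parse attempts a stream of per-match file paths triggers
// when results are formatted through a per-command cache.
fn parse_attempts(result_paths: &[PathBuf]) -> usize {
    let mut cache: HashMap<PathBuf, Option<ParsedSourceContext>> = HashMap::new();
    let mut attempts = 0;
    for path in result_paths {
        // or_insert_with runs the closure only on the first miss per path,
        // so each file is parsed at most once per formatting pass.
        cache.entry(path.clone()).or_insert_with(|| {
            attempts += 1;
            parse(path)
        });
    }
    attempts
}

fn main() {
    let results = vec![
        PathBuf::from("a.ts"),
        PathBuf::from("a.ts"), // second match in the same file: cache hit
        PathBuf::from("missing.go"),
    ];
    // Two distinct paths, so exactly two parse attempts.
    assert_eq!(parse_attempts(&results), 2);
    println!("ok");
}
```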

JS/TS fixes

  • probe search -o json can now report owner symbols for TS/JS method_definition blocks.
  • probe search -o json can now report owner symbols for exported const arrow functions even when the returned block is an export_statement.
  • probe symbols now unwraps common TS/JS export_statement / declare_statement wrappers and returns semantic names like PolicyService and normalizeDecision instead of generic export_statement names.

Architecture notes

  • Keeps existing default output compatible; richer query metadata is opt-in via --with-context.
  • Uses existing parser/language abstractions rather than adopting the issue contract verbatim.
  • Shares source-context helpers between query and search while keeping their JSON contracts compatible with existing command behavior.
  • Treats this as the strong Probe-native slice: structural query context and neutral search/source metadata.

Out of scope for this PR

  • batch pattern execution / --patterns-file
  • new search --semantic-blocks behavior
  • new extract --semantic-block behavior
  • wildcard query syntax changes beyond existing ast-grep support
  • policy interpretation of comments, requirement IDs, frameworks, or checklist semantics

Tests

  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --lib
  • cargo test --test integration_tests
  • cargo test --test query_command_tests
  • cargo test --test query_command_json_tests test_query_json_with_context_reports_owner_block
  • cargo test --test json_format_tests test_search_json_no_merge_reports_req_id_source_metadata
  • cargo test --test json_format_tests test_search_json_classifies_req_id_noise_without_policy_interpretation
  • npm test -- --runInBand npm/tests/unit/search-delegate.test.js
  • git diff --check

Refs #557
Refs #558

buger added 2 commits May 6, 2026 06:05
Add --with-context for query JSON output, backed by a shared language-agnostic source context helper.

Also expose compatible search JSON metadata for language, leading comments, and classified text matches, and unwrap JS/TS export statements in symbols output.
Track block comments, HTML comments, and multiline quoted strings when adding search match metadata so the new JSON fields do not regress existing search comment handling.
probelabs Bot (Contributor) commented May 6, 2026

PR Overview: Add Query Owner Context Metadata

Summary

This PR introduces a language-agnostic source context layer for Probe's query and search commands, enabling rich owner-block metadata extraction without interpreting domain semantics. The implementation adds opt-in JSON context for structural queries and enhances search JSON output with source facts.

Key Changes

1. New --with-context Flag for Query Command

Usage:

probe query 'fetch($$$ARGS)' ./src -l typescript --format json --with-context

Adds to each result:

  • schema_version: "probe.query.context.v1"
  • language: Inferred source language
  • pattern: Source pattern and nullable pattern ID
  • match: Exact AST match node type, content, line/column ranges
  • owner: Smallest enclosing source block with:
    • symbol and qualified_symbol (e.g., PolicyService.evaluatePolicy)
    • node_type, scope, lines, columns
    • comments: Leading/source comments as raw facts
    • enclosing_symbols: Containing class/module/impl symbols
    • enclosing_call / enclosing_calls: Generic call context for callbacks
    • content: Owning block source text

2. Enhanced Search JSON Output

New fields added to existing search JSON:

  • language: Inferred language when available
  • owner_symbol: Owning symbol (methods, exported const arrows)
  • owner_qualified_symbol: Qualified owner (e.g., PolicyService.evaluatePolicy)
  • enclosing_symbols: Containing class/module/impl symbols
  • enclosing_call / enclosing_calls: Generic call chains (e.g., describe(...) / it(...))
  • leading_comments: Raw leading comments with line ranges
  • matches: Classified text matches with:
    • kind: "comment", "string", or "code"
    • comment_role: "leading" when applicable

3. New semantic_context.rs Module (982 lines)

Core responsibilities:

  • AST traversal for owner node detection
  • Symbol extraction from tree-sitter nodes
  • Comment parsing (block, line, HTML)
  • String/comment classification for matches
  • Scope classification (function, module, declaration, test)
  • Enclosing symbol and call chain collection

Key types:

  • ParsedSourceContext: Caches parsed tree for a file
  • QuerySourceContext: Context for query matches
  • OwnerContext: Rich owner block metadata
  • SourceComment, SourceMatch, EnclosingSymbol, EnclosingCall

4. JS/TS Symbol Extraction Fixes

Changes in src/extract/symbols.rs:

  • Added semantic_symbol_node() to unwrap export_statement / declare_statement wrappers
  • Now returns semantic names like PolicyService instead of generic export_statement
  • Handles exported const arrow functions correctly

Test coverage:

test_exported_typescript_symbols_use_inner_declaration_names()

5. CLI and NPM Integration

CLI changes (src/cli.rs):

  • Added --with-context / --owner-context flag to query command

NPM changes:

  • npm/src/query.js: Added withContext flag mapping
  • npm/src/tools/common.js: Updated querySchema with with_context field
  • npm/src/tools/vercel.js: Pass withContext through to query execution

6. Comprehensive Test Coverage

New tests in tests/json_format_tests.rs:

  • test_search_json_no_merge_reports_req_id_source_metadata: Validates TS method ownership, exported const arrows, qualified symbols, nested callback chains, leading comments, and classified matches
  • test_search_json_classifies_req_id_noise_without_policy_interpretation: Ensures string literals, loose comments, and non-annotation comments are classified correctly without domain interpretation

New test in tests/query_command_json_tests.rs:

  • test_query_json_with_context_reports_owner_block: Validates query context output with owner block, comments, and match metadata

Architecture & Impact

Component Relationships

graph TD
    A[query --with-context] --> B[format_and_print_query_results]
    B --> C[ParsedSourceContext::parse]
    C --> D[query_source_context]
    D --> E[find_owner_node]
    D --> F[build_owner_context]
    F --> G[extract_symbol_name_from_node]
    F --> H[collect_enclosing_symbols]
    F --> I[collect_enclosing_calls]
    F --> J[leading_comments_before_node]
    
    K[search -o json] --> L[format_and_print_json_results]
    L --> C
    L --> M[search_owner_context]
    M --> F
    L --> N[leading_comments_from_block]
    L --> O[classify_text_matches_in_block]
    
    P[semantic_context.rs] -.-> C
    P -.-> D
    P -.-> M
    P -.-> F

Data Flow for Query Context

sequenceDiagram
    participant User
    participant CLI
    participant Query
    participant SemanticContext
    participant TreeSitter
    
    User->>CLI: probe query 'pattern' --with-context
    CLI->>Query: handle_query(with_context=true)
    Query->>Query: query_file() - find matches
    Query->>SemanticContext: ParsedSourceContext::parse(file)
    SemanticContext->>TreeSitter: parse source code
    TreeSitter-->>SemanticContext: AST tree
    Query->>SemanticContext: query_source_context(byte_start, byte_end)
    SemanticContext->>SemanticContext: find_smallest_covering_node()
    SemanticContext->>SemanticContext: find_owner_node()
    SemanticContext->>SemanticContext: build_owner_context()
    SemanticContext-->>Query: QuerySourceContext
    Query->>Query: Build JSON with owner fields
    Query-->>User: JSON with schema_version, match, owner

Affected System Components

  1. Query Command (src/query.rs)

    • Added with_context to QueryOptions
    • Modified format_and_print_query_results() to include context when enabled
    • Caches parsed files per query run
  2. Search Output (src/search/search_output.rs)

    • Enhanced JsonResult struct with new fields
    • Integrated semantic_context functions for owner extraction
    • Added match classification and comment extraction
  3. Symbol Extraction (src/extract/symbols.rs)

    • Added semantic_symbol_node() for wrapper unwrapping
    • Fixed TS/JS exported declaration handling
  4. NPM Package

    • Updated query tool schema and execution
    • Added withContext parameter support
  5. Documentation

    • Updated CLI reference (docs/probe-cli/query.md)
    • Updated output formats reference (docs/reference/output-formats.md)
    • Updated SDK tools reference (docs/probe-agent/sdk/tools-reference.md)

Scope Discovery & Context Expansion

Immediate Impact

Directly affected modules:

  • src/query.rs - Query formatting and options
  • src/search/search_output.rs - JSON output formatting
  • src/extract/symbols.rs - Symbol extraction logic
  • src/semantic_context.rs - NEW - Core context extraction
  • src/cli.rs - CLI argument definitions
  • src/main.rs - Main entry point wiring
  • src/lib.rs - Module exports

Test files:

  • tests/json_format_tests.rs - Search JSON regressions
  • tests/query_command_json_tests.rs - Query JSON regressions
  • tests/query_command_tests.rs - Updated for new option
  • tests/lib_usage.rs - Updated for new option

Related Files to Review

Language implementations (for owner node detection):

  • src/language/rust.rs - Rust-specific node handling
  • src/language/typescript.rs - TS/JS-specific node handling
  • src/language/go.rs - Go-specific node handling
  • src/language/language_trait.rs - Language trait definitions

Search integration:

  • src/search/file_processing.rs - SearchResult creation
  • src/models.rs - SearchResult and ParentContext models

NPM integration:

  • npm/src/tools/common.js - Shared schemas
  • npm/src/tools/vercel.js - Vercel AI SDK integration
  • npm/src/query.js - Query execution wrapper

Design Decisions

  1. Opt-in via flag: Default behavior unchanged; richer metadata requires explicit --with-context
  2. Source facts only: Probe reports raw metadata; downstream tools interpret semantics
  3. Shared context layer: Both query and search use semantic_context module
  4. Backward compatible: Existing JSON output structure preserved; new fields are additive
  5. Language-agnostic: Works across all supported languages using tree-sitter

Out of Scope (Explicitly Deferred)

  • Batch pattern execution / --patterns-file
  • New search --semantic-blocks behavior
  • New extract --semantic-block behavior
  • Wildcard query syntax changes
  • Policy interpretation of comments, requirements, or test frameworks

Review Notes

Key areas to focus on:

  1. src/semantic_context.rs:130-180 - AST traversal and owner detection logic
  2. src/semantic_context.rs:200-350 - Multi-language symbol extraction patterns
  3. src/semantic_context.rs:750-880 - Lexical scanner for match classification
  4. src/query.rs:396-443 - Query JSON formatting with context
  5. src/search/search_output.rs:574-680 - Search JSON integration
  6. src/extract/symbols.rs:144-165 - TS/JS export wrapper handling

Performance considerations:

  • File parsing is cached per query run via HashMap<PathBuf, Option<ParsedSourceContext>>
  • Each file is parsed once, then reused for all matches in that file
  • Comment and string classification uses lexical scanning as a fallback when AST node info is insufficient

Testing strategy:

  • Mixed TS/Go test fixtures validate cross-language behavior
  • Negative test cases ensure string literals and loose comments are classified correctly
  • Tests verify Probe does not add domain-specific interpretation

References

Core implementation:

  • src/semantic_context.rs:1-982 - Complete new module
  • src/query.rs:327-443 - Query context integration
  • src/search/search_output.rs:563-727 - Search JSON enhancements
  • src/extract/symbols.rs:106-165 - Symbol extraction fixes

CLI and integration:

  • src/cli.rs:370-380 - CLI flag definition
  • src/main.rs:846-871 - Main wiring
  • npm/src/query.js:18-27 - NPM flag mapping

Tests:

  • tests/json_format_tests.rs:95-365 - Search regressions
  • tests/query_command_json_tests.rs:311-401 - Query regressions
  • src/extract/symbols.rs:629-665 - Symbol extraction tests

Documentation:

  • docs/probe-cli/query.md:119-225 - CLI documentation
  • docs/reference/output-formats.md:56-183 - Output format reference
  • docs/probe-agent/sdk/tools-reference.md:163-190 - SDK reference
Metadata
  • Review Effort: 4 / 5
  • Primary Label: feature

Powered by Visor from Probelabs

Last updated: 2026-05-06T10:57:22.776Z | Triggered by: pr_updated | Commit: 647705b


probelabs Bot (Contributor) commented May 6, 2026


Architecture Issues (10)

Severity Location Issue
🟠 Error src/semantic_context.rs:570
extract_owner_symbol_from_source() duplicates logic already present in search_output.rs::extract_owner_symbol(). Both functions parse code strings to extract symbol names using similar patterns (func, fn, def, class, etc.). This creates maintenance burden and inconsistency risks.
💡 Suggestion: Consolidate symbol extraction logic into a single shared utility. The search_output.rs version could be refactored to call semantic_context::extract_owner_symbol_from_source, or both could delegate to a common parser abstraction.
🟡 Warning src/semantic_context.rs:982
The new semantic_context.rs module is 982 lines with significant complexity. It implements multiple responsibilities: AST node traversal, comment extraction, lexical analysis for string/comment classification, symbol extraction, and context building. This violates single responsibility principle and creates a large, difficult-to-maintain module.
💡 Suggestion: Consider splitting this module into smaller, focused modules: (1) AST traversal utilities, (2) comment extraction, (3) lexical analysis for match classification, (4) symbol/name extraction, (5) context building orchestration. Each would have clear boundaries and be independently testable.
🟡 Warning src/semantic_context.rs:680
The classify_match_kind_at() function implements a custom lexical state machine to distinguish between code, comments, and strings. This is a special-case solution that reimplements language-specific parsing logic already handled by tree-sitter parsers.
💡 Suggestion: Use tree-sitter's existing node type information to determine if a match is in a comment or string literal. The AST already knows node types like 'comment', 'string', and 'string_fragment'; leverage that instead of reimplementing lexical analysis.
🟡 Warning src/semantic_context.rs:820
normalize_owner_node() contains special-case logic for variable_declarator and arguments nodes. This pattern of checking specific parent/child relationships creates fragile code that breaks with language grammar changes.
💡 Suggestion: Replace hardcoded parent/child checks with a more general abstraction. Consider a trait or strategy pattern where each language can define its own owner normalization rules, or use tree-sitter's field name queries more generically.
🟡 Warning src/search/search_output.rs:915
classify_scope() was partially refactored to use classify_scope_from_node() from semantic_context, but only for the shared_scope case. The remaining function-specific logic (comment blocks attached to functions, etc.) creates inconsistency where some scope classification is centralized but some remains local.
💡 Suggestion: Complete the refactoring by moving all scope classification logic into semantic_context. The local checks for function signatures in code content should also be centralized to ensure consistent behavior across search and query commands.
🟡 Warning src/query.rs:388
format_and_print_query_results() now accepts pattern and with_context parameters that are only used for JSON output. This creates unnecessary coupling - the function signature changes for all formats when only JSON needs these parameters.
💡 Suggestion: Refactor to use a context object or builder pattern for JSON-specific metadata. Alternatively, split JSON formatting into a separate function that takes the additional parameters, keeping the main formatter signature simple.
🟡 Warning src/semantic_context.rs:140
extract_owner_symbol_from_source() contains extensive language-specific pattern matching with hardcoded prefixes like 'export function ', 'async function ', 'export const ', etc. This creates a maintenance burden and doesn't scale well across languages.
💡 Suggestion: Replace hardcoded string patterns with a data-driven approach using language-specific configuration. Define patterns in a structured format (e.g., HashMap<Language, Vec<Pattern>>) that can be extended without modifying core logic.
🟡 Warning src/semantic_context.rs:320
leading_comments_from_block() implements custom comment parsing logic with state tracking for block comments. This duplicates tree-sitter's comment node tracking and creates edge cases (e.g., nested block comments, HTML comments).
💡 Suggestion: Use tree-sitter's comment nodes directly instead of manual parsing. Query the AST for comment nodes preceding the target node, which would be more reliable and handle all language-specific comment syntaxes correctly.
🟡 Warning src/extract/symbols.rs:109
semantic_symbol_node() specifically handles 'export_statement' and 'declare_statement' wrapper nodes for TypeScript/JavaScript. This is a language-specific special case that doesn't generalize to other languages with similar wrapper patterns.
💡 Suggestion: Generalize this pattern by adding a language trait method like 'unwrap_semantic_node()' that each language implementation can override. This would make the abstraction consistent across all languages rather than hardcoding TS/JS behavior.
🟡 Warning src/semantic_context.rs:1
The ParsedSourceContext struct and its methods mix parsing, caching, and context extraction concerns. The parse() method returns Option but doesn't distinguish between parse failures and unsupported languages, making error handling opaque.
💡 Suggestion: Separate parsing from context extraction. Create a Result-based parse function that can distinguish errors, and a separate context builder that takes the parsed AST. Add explicit language support checking rather than failing silently with None.
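The suggested language-trait hook could be sketched as below. The trait and method names are hypothetical (taken from the suggestion, not from Probe's actual language trait), and the kind-name mapping is simplified; the real implementation would operate on tree-sitter nodes rather than kind strings:

```rust
// Hypothetical trait hook: each language may unwrap wrapper node kinds
// into the semantic declaration they contain; the default is a no-op.
trait LanguageSupport {
    fn unwrap_semantic_node<'a>(&self, kind: &'a str) -> &'a str {
        kind
    }
}

struct TypeScript;

impl LanguageSupport for TypeScript {
    fn unwrap_semantic_node<'a>(&self, kind: &'a str) -> &'a str {
        // TS/JS wraps declarations in export/declare statements.
        match kind {
            "export_statement" | "declare_statement" => "declaration",
            other => other,
        }
    }
}

fn main() {
    assert_eq!(TypeScript.unwrap_semantic_node("export_statement"), "declaration");
    assert_eq!(TypeScript.unwrap_semantic_node("class_declaration"), "class_declaration");
    println!("ok");
}
```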

Performance Issues (7)

Severity Location Issue
🟡 Warning src/query.rs:398-445
Query JSON output with --with-context may parse the same file multiple times. The HashMap cache stores Option<ParsedSourceContext> and calls ParsedSourceContext::parse() inside or_insert_with for each match. For queries returning many matches from the same file, this causes redundant file I/O and tree-sitter parsing.
💡 Suggestion: Restructure the caching to parse each file only once. The current implementation calls parse() repeatedly for each result from the same file. Consider parsing once and storing the actual ParsedSourceContext result rather than calling parse() multiple times.
🟡 Warning src/search/search_output.rs:630-670
Search JSON output may parse the same file multiple times when processing multiple results from the same file. The parsed_files HashMap stores Option<ParsedSourceContext> and calls ParsedSourceContext::parse() for each result, potentially causing redundant file I/O and parsing.
💡 Suggestion: Restructure the caching to parse each file only once. The current implementation calls parse() inside the or_insert_with closure for every result from the same file, which is inefficient.
🟡 Warning src/semantic_context.rs:818-870
classify_text_matches_in_block performs repeated string allocations and to_lowercase() conversions for each keyword on each line. For large code blocks with many keywords, this creates significant temporary string allocations.
💡 Suggestion: Consider using case-insensitive matching without allocating new strings, or normalize the code once and search against it. The current approach calls to_lowercase() on both needle and haystack for every match attempt.
🟡 Warning src/semantic_context.rs:642-680
find_search_owner_node collects all owner nodes in a line range into a Vec, then finds the minimum. This traverses potentially large portions of the AST and allocates memory for all candidates before selecting one.
💡 Suggestion: Consider tracking the best candidate during traversal instead of collecting all candidates. This would avoid allocating the Vec and storing all intermediate nodes.
🟡 Warning src/semantic_context.rs:770-810
classify_match_kind_at performs character-by-character scanning of all code lines up to the target position. For large code blocks, this scans potentially thousands of characters to classify a single match.
💡 Suggestion: Consider using the tree-sitter AST to determine if a position is in a comment or string literal, rather than manual lexical scanning. The AST already has this information.
🟡 Warning src/semantic_context.rs:872-920
scan_line_until performs character-by-character scanning with manual state tracking. This is reimplementing lexical analysis that tree-sitter already provides.
💡 Suggestion: Use tree-sitter's node information to determine if a byte offset is within a comment or string node. The AST already has this information parsed.
🟡 Warning src/semantic_context.rs:410-440
extract_owner_symbol_from_source iterates through lines with multiple string operations for many different patterns. This is called frequently and performs many temporary string allocations.
💡 Suggestion: Consider using regex compilation or a more efficient pattern matching approach. The current approach performs many string operations even when no match is found.
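One of the allocation-free alternatives suggested above (avoiding to_lowercase() per match attempt) can be sketched with ASCII-case-insensitive byte comparison. This is an illustration of the suggestion, not code from the PR, and it assumes the req-id keywords are ASCII:

```rust
// Case-insensitive substring check that allocates no lowered copies:
// compare fixed-size byte windows with eq_ignore_ascii_case instead of
// calling to_lowercase() on both needle and haystack.
fn contains_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
    if needle.is_empty() {
        return true;
    }
    haystack
        .as_bytes()
        .windows(needle.len())
        .any(|window| window.eq_ignore_ascii_case(needle.as_bytes()))
}

fn main() {
    assert!(contains_ignore_ascii_case("// SYS-REQ-424 handled here", "sys-req-424"));
    assert!(!contains_ignore_ascii_case("plain code line", "sys-req"));
    println!("ok");
}
```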

Quality Issues (11)

Severity Location Issue
🟡 Warning src/semantic_context.rs:898
Function extract_owner_symbol_from_source iterates over first 20 lines using take(20) without documenting why this limit is appropriate. This arbitrary cutoff could miss symbol declarations in longer function prologues or complex decorators.
💡 Suggestion: Add documentation explaining that the 20-line limit is a heuristic for performance, or make it a configurable constant. Consider whether this should instead search until finding the first non-comment, non-blank line with a declaration.
🟡 Warning tests/json_format_tests.rs:263
Test assertion expects exactly 5 results but does not document why this specific count is expected. If fixture data changes, this test will fail without clear indication of what the correct count should be.
💡 Suggestion: Add a comment explaining why 5 results are expected, or use a more flexible assertion that validates the presence of specific expected results rather than an exact count.
🟡 Warning src/semantic_context.rs:898
Function extract_owner_symbol_from_source has high cyclomatic complexity with multiple nested if-let chains and pattern matching. This makes the function difficult to test, maintain, and extend for new language patterns.
💡 Suggestion: Refactor into smaller helper functions for each language or pattern type. Consider a strategy pattern where language-specific extractors are registered and tried in sequence.
🟡 Warning src/semantic_context.rs:898
The extract_owner_symbol_from_source function contains duplicated logic for extracting names after keywords. Multiple calls to extract_name_after_keyword with different prefixes could be consolidated.
💡 Suggestion: Create a list of prefix patterns and iterate over them, or use a macro to generate the repeated pattern matching code.
🟡 Warning src/semantic_context.rs:898
The test module in semantic_context.rs lacks comprehensive negative test cases. While test_classify_text_matches_in_comments_and_strings tests positive cases, there are no tests for edge cases like empty code, code with only comments, or malformed source that could cause panics.
💡 Suggestion: Add tests for negative cases: empty strings, code with no symbols, code with only whitespace/comments, and malformed syntax that should not crash the parser.
🟡 Warning src/semantic_context.rs:898
No tests verify error handling in ParsedSourceContext::parse when file reading fails or parsing fails. The function uses Option extensively but tests do not verify graceful failure.
💡 Suggestion: Add tests that verify parse() returns None for non-existent files, unreadable files, or files with syntax errors that cause tree parsing to fail.
🟡 Warning src/semantic_context.rs:898
The semantic_context module mixes concerns: language-specific symbol extraction, comment parsing, lexical analysis for string/comment classification, and AST traversal. This violates single responsibility principle.
💡 Suggestion: Consider splitting into submodules: language (symbol extraction), comments (comment parsing), lexical (string/comment classification), and ast (tree traversal utilities).
🟡 Warning src/semantic_context.rs:898
The classify_match_kind_at function performs complex lexical state machine parsing but has no dedicated unit tests. It is only tested indirectly through test_classify_text_matches_in_comments_and_strings.
💡 Suggestion: Add dedicated tests for classify_match_kind_at covering: nested quotes, escaped characters, block comments spanning multiple lines, mixed comment and string on the same line, and edge cases like unclosed comments/strings.
🟡 Warning tests/json_format_tests.rs:263
Test test_search_json_no_merge_reports_req_id_source_metadata has complex fixture setup with multiple files and many assertions. This makes the test brittle - changes to output format or fixture structure will require updating many assertions.
💡 Suggestion: Break this test into smaller, focused tests. Create helper functions to reduce duplication. Consider using table-driven tests for the different result type validations.
🟡 Warning src/semantic_context.rs:898
The scan_line_until function uses byte indexing with idx += ch.len_utf8() but does not validate that idx never exceeds line.len(). While the loop condition should prevent this, there is no explicit bounds checking.
💡 Suggestion: Add explicit bounds checking or use a safer iteration pattern. Consider adding debug assertions to catch potential out-of-bounds access in testing.
🟡 Warning src/extract/symbols.rs:109
Test test_exported_typescript_symbols_use_inner_declaration_names only tests positive cases where export_statement wrappers are successfully unwrapped. It does not test edge cases like nested exports, re-exports, or export statements without semantic declarations.
💡 Suggestion: Add tests for: export of already-exported symbols, re-export statements (export { X } from './y'), type-only exports, and default exports to ensure semantic_symbol_node handles all cases correctly.

Powered by Visor from Probelabs

Last updated: 2026-05-06T10:50:52.789Z | Triggered by: pr_updated | Commit: 647705b


buger (Collaborator, Author) commented May 6, 2026

Thanks, this PR now covers the important first slice for Proof's source-discovery path. The scoped split in the description is right: Probe should expose source facts, while Proof keeps requirement/test-policy interpretation and language-specific instrumentation.

For Proof's planned cleanup, this is enough to move req-id text discovery/autolink toward Probe:

  • search --allow-tests --no-merge -o json can return annotation-bearing blocks.
  • Search JSON now exposes language, owner_symbol, leading_comments, and classified matches.
  • That lets Proof reject string-literal hits and apply its own Implements / Verifies / MCDC comment policy without reparsing every file.
  • Common TS/JS owners like methods and exported const arrows are covered.

What is still missing before Proof can remove the remaining non-instrumentation AST/source readers:

  1. Search-level enclosing context

    query --with-context exposes owner.qualified_symbol, owner.enclosing_symbols, and owner.enclosing_calls, but search only exposes owner_symbol.

    For Proof, req-id discovery starts with text search (SYS-REQ-*, SW-REQ-*, etc.), not structural query patterns. When a req-id appears in a JS/TS callback block, Proof still needs generic context like:

    {
      "owner_symbol": null,
      "enclosing_calls": [
        {"callee": "describe", "first_arg_literal": "normalization", "line": 13},
        {"callee": "it", "first_arg_literal": "normalizes decisions", "line": 15}
      ]
    }

    Probe should not interpret this as a test framework. It only needs to expose the AST call chain. Proof can decide whether describe / it matters.

  2. Search-level qualified owners / containing symbols

    Search can now return owner_symbol, but Proof still cannot reliably distinguish PolicyService.evaluatePolicy from another evaluatePolicy in the same file/module without either reparsing or accepting weaker evidence identity.

    A lightweight search-side equivalent of the query owner fields would be enough:

    {
      "owner_symbol": "evaluatePolicy",
      "owner_qualified_symbol": "PolicyService.evaluatePolicy",
      "enclosing_symbols": [
        {"kind": "class", "name": "PolicyService", "line": 1}
      ]
    }
  3. Search regression for negative text hits

    The new search regression covers the positive path well. It would be useful to also assert the negative/source-classification cases through the real search JSON path:

    • req id inside a string -> matches[].kind == "string"
    • req id in a loose non-annotation comment -> matches[].kind == "comment" but no domain interpretation from Probe
    • leading annotation comment -> matches[].kind == "comment" and comment_role == "leading"

    Proof can then make its policy decision entirely from Probe JSON.

  4. Semantic block mode remains a later gap

    --no-merge is good enough for the first integration. Longer term, an explicit search mode that guarantees one semantic owner per result would let evidence tooling avoid relying on merge heuristics:

    probe search --semantic-blocks --allow-tests -o json '"SYS-REQ-424"'

    Not asking for that in this PR, just calling out that it remains the thing that would let Proof remove more local source-block cleanup.

With those additions, Proof could use Probe as the generic source-reader for trace/evidence discovery across Go/JS/TS and reserve local AST parsing for cases that truly require transformation or language execution semantics, such as MC/DC instrumentation and coverage/tool-specific processing.

@buger buger merged commit 6839a0c into main May 6, 2026
18 of 19 checks passed