feat(detector): malicious-file detection rules engine + telemetry wiring #135
Merged
ashishkurmi merged 5 commits on Jun 9, 2026
Add a data-driven, rule-based file-detection engine to the agent. On each enterprise run the agent fetches backend-authored detection rules, evaluates them against the directories it already walks, and reports matches as an additive rule_scan field on the existing telemetry payload. New IOCs ship as backend rule data, not agent releases.

- New internal/detector/rules package: RuleSet/Rule/group/condition types with Prepare() (validate + compile RE2), glob->RE2 conversion, root resolution with symlink-escape guard, condition eval (regex / sha256 / negate, booleans only; never uploads or logs file content), and the Engine seam (pure Scan over a RuleSet + search dirs, fully unit-testable with no backend).
- Mandatory vs optional conditions: a group is satisfied only when all its mandatory conditions match; a file is flagged only if the rule has no conditions or some group is satisfied. Optional conditions only affect confidence.
- Caps + completeness signals (per-rule match cap, global file/time budget), size guard, and rule fetch that returns an empty RuleSet on any failure so a missing or non-200 rules API never fails the run.
- model.RuleScan result types; telemetry phase wiring (enterprise-only, no new feature flag) + phase budget.
- Hidden dev flags --rules-file / --telemetry-out for offline testing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
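The mandatory/optional semantics in the commit message can be sketched as below. This is a minimal illustration, not the code in internal/detector/rules: the `Condition` and `Group` types, their field names, and the helper names are hypothetical. Treating a group with no mandatory conditions as satisfied is also an assumption, since the commit message does not say how that case is handled.

```go
package main

// Condition is a hypothetical, already-evaluated condition. Matched holds the
// boolean outcome of a regex or SHA-256 check after any negation is applied.
type Condition struct {
	ID        string
	Mandatory bool
	Matched   bool
}

// Group is a hypothetical set of conditions that are evaluated together.
type Group struct {
	ID         string
	Conditions []Condition
}

// groupSatisfied reports whether every mandatory condition in g matched.
// Optional conditions never influence this result.
// Assumption: a group without mandatory conditions counts as satisfied.
func groupSatisfied(g Group) bool {
	for _, c := range g.Conditions {
		if c.Mandatory && !c.Matched {
			return false
		}
	}
	return true
}

// flagged reports whether a file is flagged for a rule with these groups:
// either the rule carries no conditions at all, or some group is satisfied.
func flagged(groups []Group) bool {
	total := 0
	for _, g := range groups {
		total += len(g.Conditions)
	}
	if total == 0 {
		return true
	}
	for _, g := range groups {
		if groupSatisfied(g) {
			return true
		}
	}
	return false
}

// optionalHits counts matched optional conditions in g. A caller could use
// this count to adjust confidence without changing the flagged decision.
func optionalHits(g Group) int {
	n := 0
	for _, c := range g.Conditions {
		if !c.Mandatory && c.Matched {
			n++
		}
	}
	return n
}
```

Keeping the flag decision and the confidence input in separate functions mirrors the stated rule that optional conditions only affect confidence.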
Pull request overview
Adds a backend-authored, data-driven malicious-file detection rules engine to the enterprise agent, executes it during telemetry collection, and reports findings via a new additive rule_scan field in the existing telemetry payload.
Changes:
- Introduces internal/detector/rules with RuleSet validation/compilation, glob→RE2 matching, filesystem walk + caps/budgets, and backend rule fetch/load helpers.
- Wires rules fetching + scanning into telemetry.Run (enterprise-only) and adds a per-phase deadline for malicious_file_scan.
- Extends telemetry/model wire contract (model.RuleScan et al.) and adds dev-only flags (--rules-file, --telemetry-out) plus tests.
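The fail-safe fetch described above (any failure yields an empty RuleSet, so the run continues) might look roughly like this. It is a sketch under stated assumptions: the `fetchFunc` transport seam, the JSON shape of `RuleSet`, and `errBadStatus` are hypothetical stand-ins. Only the name FetchOrEmpty and the fail-safe behavior come from the PR.

```go
package main

import (
	"encoding/json"
	"errors"
)

// Rule and RuleSet use a hypothetical, minimal JSON shape for illustration.
type Rule struct {
	ID string `json:"id"`
}

type RuleSet struct {
	Rules []Rule `json:"rules"`
}

// errBadStatus is a hypothetical error for any non-200 response.
var errBadStatus = errors.New("rules API returned a non-200 status")

// fetchFunc is a hypothetical transport seam: it returns the HTTP status code
// and response body, or a transport-level error.
type fetchFunc func() (status int, body []byte, err error)

// fetchRules is the strict variant: every failure mode surfaces as an error.
func fetchRules(fetch fetchFunc) (RuleSet, error) {
	status, body, err := fetch()
	if err != nil {
		return RuleSet{}, err
	}
	if status != 200 {
		return RuleSet{}, errBadStatus
	}
	var rs RuleSet
	if err := json.Unmarshal(body, &rs); err != nil {
		return RuleSet{}, err
	}
	return rs, nil
}

// fetchOrEmpty is the fail-safe wrapper: on any error it returns an empty
// RuleSet, so the caller scans with zero rules instead of failing the run.
func fetchOrEmpty(fetch fetchFunc) RuleSet {
	rs, err := fetchRules(fetch)
	if err != nil {
		return RuleSet{}
	}
	return rs
}
```

Splitting the strict and fail-safe variants keeps error details available for logging while the telemetry path only ever sees a usable RuleSet.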
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| internal/telemetry/telemetry.go | Adds rule_scan to payload; fetches rules and runs scan phase; adds --telemetry-out payload dump helper. |
| internal/telemetry/telemetry_out_test.go | Tests JSON dump/round-trip and omission of rule_scan when nil. |
| internal/telemetry/phase_deadline.go | Adds phase budget for malicious_file_scan. |
| internal/model/model.go | Adds RuleScan/result types and FileAttrs to telemetry wire contract. |
| internal/detector/rules/doc.go | Documents rules engine trust/privacy model and operational constraints. |
| internal/detector/rules/ruleset.go | Defines rule schema + Prepare() validation/compilation and max file-size clamping. |
| internal/detector/rules/ruleset_test.go | Unit tests for Prepare() validation, clamping, and glob matching. |
| internal/detector/rules/glob.go | Implements glob validation, absolute/relative handling, and glob→anchored RE2 conversion. |
| internal/detector/rules/match.go | Evaluates regex/SHA256 conditions and group satisfaction semantics. |
| internal/detector/rules/fileattrs.go | Produces FileAttrs from os.FileInfo. |
| internal/detector/rules/fileattrs_darwin.go | macOS birth/ctime extraction. |
| internal/detector/rules/fileattrs_linux.go | Linux ctime extraction (no birth time). |
| internal/detector/rules/fileattrs_windows.go | Windows creation time extraction. |
| internal/detector/rules/fileattrs_other.go | Fallback for platforms without portable birth/ctime. |
| internal/detector/rules/fetch.go | Adds HTTP fetcher + fail-safe FetchOrEmpty and dev LoadFileOrEmpty. |
| internal/detector/rules/fetch_test.go | Tests fetch success/failure modes and dev file loading. |
| internal/detector/rules/roots.go | Implements absolute glob resolution and root walks with TCC-aware skipping. |
| internal/detector/rules/engine.go | Implements scan engine, caps, completeness signals, and file read/hash caching. |
| internal/detector/rules/engine_test.go | Extensive engine tests: regex/sha/negate, caps, completeness, absolute globs, privacy. |
| internal/cli/cli.go | Adds dev-only flags + env var fallbacks for rules file and telemetry dump. |
| internal/cli/cli_devflags_test.go | Tests dev flag parsing and env-var fallback behavior. |
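
The table lists glob.go as converting globs to anchored RE2 patterns. One way such a conversion can work is sketched below. The supported syntax (`*`, `?`, `**`, and a leading-directory `**/`) and the function name `globToRE2` are assumptions made for illustration. The actual glob dialect and validation rules in the PR are not shown on this page.

```go
package main

import (
	"regexp"
	"strings"
)

// globToRE2 converts a slash-separated glob into an anchored regular
// expression (Go's regexp package uses RE2 syntax).
// Assumed dialect, for illustration only:
//   *    matches any run of characters within one path segment
//   ?    matches exactly one character within a path segment
//   **   matches any run of characters, including '/'
//   **/  matches zero or more leading directories
func globToRE2(glob string) (*regexp.Regexp, error) {
	rs := []rune(glob)
	var b strings.Builder
	b.WriteString("^")
	for i := 0; i < len(rs); i++ {
		switch rs[i] {
		case '*':
			if i+1 < len(rs) && rs[i+1] == '*' {
				i++ // consume the second '*'
				if i+1 < len(rs) && rs[i+1] == '/' {
					i++ // consume the '/' so "**/x" also matches "x"
					b.WriteString("(?:.*/)?")
				} else {
					b.WriteString(".*")
				}
			} else {
				b.WriteString("[^/]*")
			}
		case '?':
			b.WriteString("[^/]")
		default:
			// Escape everything else so regex metacharacters stay literal.
			b.WriteString(regexp.QuoteMeta(string(rs[i])))
		}
	}
	b.WriteString("$")
	return regexp.Compile(b.String())
}
```

Anchoring with `^` and `$` means the pattern must match the whole relative path, so `node_modules/*.js` does not accidentally match a longer path that merely contains it.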
Comment on lines +77 to +84

    id := strings.TrimSpace(r.ID)
    if id == "" {
        return fmt.Errorf("rule[%d]: empty id", i)
    }
    if seenRules[id] {
        return fmt.Errorf("rule %q: duplicate id", id)
    }
    seenRules[id] = true
Comment on lines +106 to +113

    gid := strings.TrimSpace(grp.ID)
    if gid == "" {
        return fmt.Errorf("rule %q: group[%d]: empty id", id, gi)
    }
    if seenGroups[gid] {
        return fmt.Errorf("rule %q: group %q: duplicate id", id, gid)
    }
    seenGroups[gid] = true
Comment on lines +121 to +128

    cid := strings.TrimSpace(c.ID)
    if cid == "" {
        return fmt.Errorf("rule %q group %q: condition[%d]: empty id", id, gid, ci)
    }
    if seenConds[cid] {
        return fmt.Errorf("rule %q group %q: condition %q: duplicate id", id, gid, cid)
    }
    seenConds[cid] = true
Comment on lines +101 to +107

    for _, m := range matchers {
        if m.cg.re.MatchString(relSlashed) {
            if e.evaluate(st, m.rstate, path, m.cg.raw) {
                return errWalkStop
            }
        }
    }
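
The hunk above returns a sentinel error, errWalkStop, from the walk callback. A common Go pattern for ending a directory walk early without reporting a failure is sketched here. It assumes, without confirmation from this page, that the sentinel signals a cap or budget being hit. `collectCapped` and `demoTree` are hypothetical names, and the in-memory filesystem only stands in for the real directories.

```go
package main

import (
	"errors"
	"io/fs"
	"testing/fstest"
)

// errWalkStop is a sentinel returned from the walk callback to end the walk
// early. It is filtered out afterwards so it never reaches the caller.
var errWalkStop = errors.New("walk stopped")

// collectCapped walks fsys and gathers at most max file paths. truncated is
// true only when the cap cut the walk short, which gives the caller a
// completeness signal alongside the results.
func collectCapped(fsys fs.FS, max int) (paths []string, truncated bool, err error) {
	err = fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return nil // skip unreadable entries instead of failing the walk
		}
		if d.IsDir() {
			return nil
		}
		if len(paths) >= max {
			truncated = true
			return errWalkStop
		}
		paths = append(paths, p)
		return nil
	})
	if errors.Is(err, errWalkStop) {
		err = nil // an early stop is an expected outcome, not a failure
	}
	return paths, truncated, err
}

// demoTree builds a small in-memory filesystem for trying the walk.
func demoTree() fstest.MapFS {
	return fstest.MapFS{
		"a/1.txt": &fstest.MapFile{Data: []byte("one")},
		"a/2.txt": &fstest.MapFile{Data: []byte("two")},
		"b/3.txt": &fstest.MapFile{Data: []byte("three")},
	}
}
```

Calling `collectCapped(demoTree(), 2)` stops at the third file and reports `truncated` as true, while a cap of 10 visits all three files and leaves it false.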
Comment on lines +906 to +910

    // Dev-only offline harness: dump the assembled Payload to a local
    // file and skip the upload + run-status notify entirely. This is exactly
    // the inner JSON process-uploaded sees after gunzip, so it doubles as a
    // backend ingestion fixture. Never set in production (zero impact when
    // the flag/env var is unset).
Comment on lines +216 to +242

    // fileCache reads each candidate file at most once per scan (a file matched by
    // several rules is read once), caching its bytes and whole-file SHA-256.
    type fileCache struct {
        m map[string]cachedFile
    }

    type cachedFile struct {
        data []byte
        hash string
        ok   bool
    }

    func newFileCache() *fileCache { return &fileCache{m: make(map[string]cachedFile)} }

    func (fc *fileCache) read(exec executor.Executor, path string) (data []byte, hash string, ok bool) {
        if c, found := fc.m[path]; found {
            return c.data, c.hash, c.ok
        }
        b, err := exec.ReadFile(path)
        if err != nil {
            fc.m[path] = cachedFile{}
            return nil, "", false
        }
        sum := sha256.Sum256(b)
        h := hex.EncodeToString(sum[:])
        fc.m[path] = cachedFile{data: b, hash: h, ok: true}
        return b, h, true
Resolve conflict in internal/telemetry/telemetry.go: keep the malicious-file detection RuleScan payload field/assignment alongside the new pnpm/bun/yarn audit fields added upstream. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Test binaries and MSIs are ready.
fileCache previously memoized the full bytes + SHA-256 of every glob-matched file for the entire scan with no eviction, so a broad glob over a large tree could retain up to MaxFiles (50k) x MaxFileSize (8 MiB) of bytes and spike memory / OOM.

Every rule whose globs match a given path is already evaluated consecutively (one WalkDir callback, or one resolveAbsolute path), so the only real benefit of the cache (reading + hashing a file once when several rules match it) is local to a single file.

Shrink fileCache to a single-slot cache of the current file: the multi-rule read/hash dedup is preserved, but each file's bytes are released as soon as the walk moves to the next path, bounding peak memory to one file (<= MaxFileSize).

No behavior change (same matches, same caps, same per-rule size guard / seen / truncation / file-budget semantics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
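
A single-slot cache of the kind this commit describes could be shaped as follows. It is a sketch, not the merged code: `singleSlotCache` and `readFunc` are hypothetical names, and `readFunc` replaces the executor.Executor dependency so the example stands alone. For scale, the worst case quoted above, 50,000 files at 8 MiB each, comes to 400,000 MiB, roughly 390 GiB, whereas one slot holds at most a single file.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// readFunc is a hypothetical stand-in for the file-reading dependency.
type readFunc func(path string) ([]byte, error)

// singleSlotCache remembers only the most recently read file. Consecutive
// reads of the same path (several rules matching one file) hit the slot;
// reading a different path overwrites it, so earlier bytes can be collected.
type singleSlotCache struct {
	path string
	data []byte
	hash string
	ok   bool
	set  bool // distinguishes an empty cache from a cached failed read
}

// read returns the file's bytes and hex-encoded SHA-256, reading and hashing
// at most once for any run of consecutive calls with the same path.
func (c *singleSlotCache) read(rf readFunc, path string) ([]byte, string, bool) {
	if c.set && c.path == path {
		return c.data, c.hash, c.ok
	}
	// Replace the slot first: this drops the reference to the previous
	// file's bytes and also caches a failed read as ok == false.
	*c = singleSlotCache{path: path, set: true}
	b, err := rf(path)
	if err != nil {
		return nil, "", false
	}
	sum := sha256.Sum256(b)
	c.data, c.hash, c.ok = b, hex.EncodeToString(sum[:]), true
	return c.data, c.hash, true
}
```

The design relies on the ordering the commit points out: all rules for one path run back to back, so a slot of size one loses no deduplication. If that ordering changed, the cache would still be correct but would re-read files.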
varunsh-coder approved these changes on Jun 9, 2026
What does this PR do?
Type of change
Testing
- ./stepsecurity-dev-machine-guard --verbose
- ./stepsecurity-dev-machine-guard --json | python3 -m json.tool
- make lint
- make test
Related Issues