perf(learning/user_profile): Aho-Corasick DFA for preference extraction + pattern coverage#2878
Conversation
Refactor UserProfileHook::extract_preferences from a per-pattern str::contains sweep over re-lowercased sentences into a single case-insensitive Aho-Corasick DFA pass per sentence. ## What changes - Build one AhoCorasick DFA over all preference opening phrases at first use (LazyLock<AhoCorasick> with ascii_case_insensitive(true) + MatchKind::LeftmostFirst). - Drop the previous `let lower = message.to_lowercase()` plus per- sentence trimmed.to_lowercase() + per-pattern sentence_lower .contains(pattern) chain (6 String allocs + 5xN substring scans per 5-sentence turn) for a zero-alloc byte-level scan. - Add a sentence_has_preference() word-boundary check on the match end: the byte immediately after the match must be non-ASCII- alphanumeric AND at least one byte of trailing content must exist. This rejects both "I preferred X" (alphanumeric continuation = false match against "i prefer") and degenerate empty-tail residue like "I prefer" (the leftover from splitting "I prefer.") which carry no preference target. - Drop the post-loop lower.starts_with(...) fallback — the previous implementation needed it to rescue "I prefer:X" shapes that the primary sentence-split path couldn't catch. With proper word boundaries the main path handles it directly. ## Pattern coverage Expand PREFERENCE_PATTERNS from 20 → 28 entries (no breakage of existing matches): - Direct preference: i'd prefer, i would prefer, i'd rather, i dislike - Habit/instruction: please use - Identity/context: my pronouns, my preferred, call me, address me as Trailing spaces removed from every pattern (the word-boundary check now does that job). Expand sentence-delimiter set from ['.', '!', '\n'] to ['.', '!', '?', ';', '\n'] so "What's your view? I prefer Rust." and "OK; I prefer Rust." are decomposed correctly. ':' is intentionally NOT a delimiter so "My role: engineer" stays a single sentence that the "my role" pattern can match. ## Tests 15 unit tests total (was 5): - happy path (finds_patterns, handles_single_sentence_message) - negative path (handles_no_matches, ignores_short_sentences, rejects_bare_pattern_with_no_content_after) - cap enforcement (caps_at_max_per_turn) - word-boundary correctness (alphanumeric reject / non-alphanumeric accept) - expanded delimiters (? and ;) - expanded patterns (catches_extended_patterns) - Unicode safety (non_ascii_does_not_panic_or_falsely_match) - DFA smoke test (preference_dfa_compiles_and_has_expected_pattern_count) - existing mocked storage + on-turn behaviour tests retained Adds aho-corasick = "1.1" as a direct dependency (already pulled in transitively via regex).
📝 WalkthroughWalkthroughThis PR introduces the ChangesUser Preference Extraction via Aho-Corasick
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/openhuman/learning/user_profile.rs (1)
551-559: ⚡ Quick winMake the pattern-count assertion meaningful (pin expected length)
dfa.patterns_len()will always equalPREFERENCE_PATTERNS.len()because the DFA is built directly from that slice andaho-corasick’spatterns_lencounts total compiled patterns (duplicates included). This won’t catch accidental add/drop.PREFERENCE_PATTERNScurrently contains 29 entries, so pinning the literal makes the guard real; forcingLazyLockinit remains a useful compile/panic smoke check.♻️ Make the count check meaningful
fn preference_dfa_compiles_and_has_expected_pattern_count() { let dfa = &*PREFERENCE_DFA; - assert_eq!(dfa.patterns_len(), PREFERENCE_PATTERNS.len()); + // Pin the literal so an accidental add/drop of an entry is loud + // at CI time; `patterns_len()` alone is always == len() here. + assert_eq!(PREFERENCE_PATTERNS.len(), 29); + assert_eq!(dfa.patterns_len(), PREFERENCE_PATTERNS.len()); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/learning/user_profile.rs` around lines 551 - 559, The test preference_dfa_compiles_and_has_expected_pattern_count currently compares dfa.patterns_len() to PREFERENCE_PATTERNS.len(), which is tautological; change the assertion to pin the expected literal count (replace the second operand with 29, the current number of entries) so the test fails if entries are accidentally added/removed, while still forcing LazyLock init via referencing PREFERENCE_DFA.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/openhuman/learning/user_profile.rs`:
- Around line 551-559: The test
preference_dfa_compiles_and_has_expected_pattern_count currently compares
dfa.patterns_len() to PREFERENCE_PATTERNS.len(), which is tautological; change
the assertion to pin the expected literal count (replace the second operand with
29, the current number of entries) so the test fails if entries are accidentally
added/removed, while still forcing LazyLock init via referencing PREFERENCE_DFA.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6bc3a6c9-944a-44bb-8f17-e6fff39f1d58
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (2)
Cargo.tomlsrc/openhuman/learning/user_profile.rs
graycyrus
left a comment
There was a problem hiding this comment.
@mysma-9403 hey! the code looks good to me, but CI is still pending on several checks (Rust Core Tests, coverage runs, E2E Appium suites, frontend unit tests). once those are green, I'll come back and approve this. let me know if you need any help!
For reference, what I reviewed: the Aho-Corasick DFA swap is a clean, well-justified optimization — the word-boundary logic is correct, the LazyLock initialization is safe, LeftmostFirst + find_iter + .any() short-circuits correctly, and the 15-test suite covers the important edge cases (alphanumeric continuation rejection, bare-pattern empty-tail rejection, expanded delimiters, Unicode safety, DFA pattern count guard). The direct aho-corasick = "1.1" dep is the right call — it's already in the transitive tree via regex, BurntSushi's package, MIT/Unlicense, actively maintained.
95e6f6a
Summary
Replaces the
str::contains-per-pattern preference scan inUserProfileHookwith a single Aho-Corasick DFA pass per sentence, eliminates the previous per-turnStringallocations from re-lowercasing, and expands both the pattern coverage and the sentence delimiter set so more real-world preference statements are captured.This is the same class of optimization as #2842 (prompt-injection RegexSet) and #2870 (routing/quality AC) — moving a hot-path "scan input against N curated patterns" loop from O(N) substring sweeps over a re-allocated lowercase buffer to a single byte-level DFA pass.
What changes
src/openhuman/learning/user_profile.rs:PREFERENCE_DFA: LazyLock<AhoCorasick>compiled withascii_case_insensitive(true)+MatchKind::LeftmostFirst. Noto_lowercase()of the message or per-sentence trim/lowercase pass any more."I preferred X"(alphanumeric continuation — false positive against"i prefer")"I prefer"(the residue of splitting"I prefer.") or dangling"I prefer:"— they carry no preference target and would just pollute theuser_profilememory namespace with uselesspref/i_preferslugs.lower.starts_with(...)post-loop rescue existed only because the substring sweep couldn't catch the"I prefer:X"shape. Proper word boundaries handle it in the main path.PREFERENCE_PATTERNS: 20 → 28 entries:i'd prefer,i would prefer,i'd rather,i dislikeplease usemy pronouns,my preferred,call me,address me asTrailing whitespace removed from every pattern (the word-boundary check now does that job).
SENTENCE_DELIMITERS: extended from['.', '!', '\n']to['.', '!', '?', ';', '\n']so questions and semicolon-joined clauses decompose correctly ("What's your view? I prefer Rust.","OK; I prefer Rust.").:is intentionally not a delimiter so"My role: engineer"stays one sentence the"my role"pattern can match.Cargo.toml: addsaho-corasick = "1.1"as a direct dependency (already pulled in transitively viaregex).Why this matters
UserProfileHookruns on every user turn as aPostTurnHook. The old code shape was:For a 5-sentence user message that's 6
Stringallocations + 100 substring scans, every turn, on the agent hot path — plus you can't even add patterns cheaply because each one extends the inner loop. The new shape is one DFA, one byte-level pass per sentence, no per-call allocation until a match is emitted. Pattern count is no longer a cost dial.The pattern expansion is the user-visible win:
"I'd prefer concise responses","Call me Alex","My pronouns are they/them","My preferred editor is Helix","Please use snake_case"— all previously slipped through because they didn't match any of the 20 hard-coded openings. That's the kind of preference statement the memory system exists to capture.The expanded delimiters (
?,;) close another silent gap: a leading question like"What's the timezone situation? My timezone is PST."previously bled the preamble into the preference sentence, which then either failed the length filter or stored noise alongside the actual preference.Test plan
cargo fmtcargo check --manifest-path Cargo.tomlcargo test -p openhuman --lib learning::user_profile— 14 passedpnpm rust:check,compile,lint,lint:commands-tokens) — passed15 unit tests total (was 5):
finds_patterns,handles_single_sentence_message)handles_no_matches,ignores_short_sentences,rejects_bare_pattern_with_no_content_after)caps_at_max_per_turn)?and;)catches_extended_patterns— one assertion per new pattern category so any future drop fails loudly)non_ascii_does_not_panic_or_falsely_match— Cyrillic, Polish diacritics, emoji prefix)preference_dfa_compiles_and_has_expected_pattern_count— guards against silently dropping a pattern from the slice)Summary by CodeRabbit