Skip to content

perf(learning/user_profile): Aho-Corasick DFA for preference extraction + pattern coverage#2878

Merged
senamakel merged 2 commits into
tinyhumansai:mainfrom
mysma-9403:perf/user-profile-aho-corasick
May 29, 2026
Merged

perf(learning/user_profile): Aho-Corasick DFA for preference extraction + pattern coverage#2878
senamakel merged 2 commits into
tinyhumansai:mainfrom
mysma-9403:perf/user-profile-aho-corasick

Conversation

@mysma-9403
Copy link
Copy Markdown
Contributor

@mysma-9403 mysma-9403 commented May 28, 2026

Summary

Replaces the str::contains-per-pattern preference scan in UserProfileHook with a single Aho-Corasick DFA pass per sentence, eliminates the previous per-turn String allocations from re-lowercasing, and expands both the pattern coverage and the sentence delimiter set so more real-world preference statements are captured.

This is the same class of optimization as #2842 (prompt-injection RegexSet) and #2870 (routing/quality AC) — moving a hot-path "scan input against N curated patterns" loop from O(N) substring sweeps over a re-allocated lowercase buffer to a single byte-level DFA pass.

What changes

src/openhuman/learning/user_profile.rs:

  • DFA built once. PREFERENCE_DFA: LazyLock<AhoCorasick> compiled with ascii_case_insensitive(true) + MatchKind::LeftmostFirst. No to_lowercase() of the message or per-sentence trim/lowercase pass any more.
  • Word-boundary check. The byte immediately after each match must be non-ASCII-alphanumeric and at least one byte of trailing content must exist. This rejects:
    • "I preferred X" (alphanumeric continuation — false positive against "i prefer")
    • degenerate empty-tail fragments like "I prefer" (the residue of splitting "I prefer.") or dangling "I prefer:" — they carry no preference target and would just pollute the user_profile memory namespace with useless pref/i_prefer slugs.
  • Fallback path dropped. The previous lower.starts_with(...) post-loop rescue existed only because the substring sweep couldn't catch the "I prefer:X" shape. Proper word boundaries handle it in the main path.

PREFERENCE_PATTERNS: 20 → 28 entries:

  • Direct preference: i'd prefer, i would prefer, i'd rather, i dislike
  • Habit/instruction: please use
  • Identity/context: my pronouns, my preferred, call me, address me as

Trailing whitespace removed from every pattern (the word-boundary check now does that job).

SENTENCE_DELIMITERS: extended from ['.', '!', '\n'] to ['.', '!', '?', ';', '\n'] so questions and semicolon-joined clauses decompose correctly ("What's your view? I prefer Rust.", "OK; I prefer Rust."). : is intentionally not a delimiter so "My role: engineer" stays one sentence the "my role" pattern can match.

Cargo.toml: adds aho-corasick = "1.1" as a direct dependency (already pulled in transitively via regex).

Why this matters

UserProfileHook runs on every user turn as a PostTurnHook. The old code shape was:

let lower = message.to_lowercase();          // 1 String alloc
for sentence in message.split(...) {
    let sentence_lower = trimmed.to_lowercase(); // N String allocs
    for pattern in PREFERENCE_PATTERNS {          // 20 substring scans per sentence
        if sentence_lower.contains(pattern) { ... }
    }
}

For a 5-sentence user message that's 6 String allocations + 100 substring scans, every turn, on the agent hot path — plus you can't even add patterns cheaply because each one extends the inner loop. The new shape is one DFA, one byte-level pass per sentence, no per-call allocation until a match is emitted. Pattern count is no longer a cost dial.

The pattern expansion is the user-visible win: "I'd prefer concise responses", "Call me Alex", "My pronouns are they/them", "My preferred editor is Helix", "Please use snake_case" — all previously slipped through because they didn't match any of the 20 hard-coded openings. That's the kind of preference statement the memory system exists to capture.

The expanded delimiters (?, ;) close another silent gap: a leading question like "What's the timezone situation? My timezone is PST." previously bled the preamble into the preference sentence, which then either failed the length filter or stored noise alongside the actual preference.

Test plan

  • cargo fmt
  • cargo check --manifest-path Cargo.toml
  • cargo test -p openhuman --lib learning::user_profile14 passed
  • Pre-push hook (pnpm rust:check, compile, lint, lint:commands-tokens) — passed

15 unit tests total (was 5):

  • happy path (finds_patterns, handles_single_sentence_message)
  • negative path (handles_no_matches, ignores_short_sentences, rejects_bare_pattern_with_no_content_after)
  • cap enforcement (caps_at_max_per_turn)
  • word-boundary correctness — alphanumeric reject and non-alphanumeric accept
  • expanded delimiters (? and ;)
  • expanded patterns (catches_extended_patterns — one assertion per new pattern category so any future drop fails loudly)
  • Unicode safety (non_ascii_does_not_panic_or_falsely_match — Cyrillic, Polish diacritics, emoji prefix)
  • DFA smoke test (preference_dfa_compiles_and_has_expected_pattern_count — guards against silently dropping a pattern from the slice)
  • existing mocked-storage and on-turn-behaviour tests retained unchanged

Summary by CodeRabbit

  • Improvements
    • Enhanced user preference extraction accuracy with stricter boundary detection to reduce false matches.
    • Improved handling of edge cases and punctuation in preference recognition.
    • Optimized performance for preference matching and extraction.

Review Change Stack

Refactor UserProfileHook::extract_preferences from a per-pattern
str::contains sweep over re-lowercased sentences into a single
case-insensitive Aho-Corasick DFA pass per sentence.

## What changes

- Build one AhoCorasick DFA over all preference opening phrases at
  first use (LazyLock<AhoCorasick> with ascii_case_insensitive(true)
  + MatchKind::LeftmostFirst).
- Drop the previous `let lower = message.to_lowercase()` plus per-
  sentence trimmed.to_lowercase() + per-pattern sentence_lower
  .contains(pattern) chain (6 String allocs + 5xN substring scans per
  5-sentence turn) for a zero-alloc byte-level scan.
- Add a sentence_has_preference() word-boundary check on the match
  end: the byte immediately after the match must be non-ASCII-
  alphanumeric AND at least one byte of trailing content must
  exist. This rejects both "I preferred X" (alphanumeric
  continuation = false match against "i prefer") and degenerate
  empty-tail residue like "I prefer" (the leftover from splitting
  "I prefer.") which carry no preference target.
- Drop the post-loop lower.starts_with(...) fallback — the previous
  implementation needed it to rescue "I prefer:X" shapes that the
  primary sentence-split path couldn't catch. With proper word
  boundaries the main path handles it directly.

## Pattern coverage

Expand PREFERENCE_PATTERNS from 20 → 28 entries (no breakage of
existing matches):

- Direct preference: i'd prefer, i would prefer, i'd rather, i dislike
- Habit/instruction: please use
- Identity/context: my pronouns, my preferred, call me, address me as

Trailing spaces removed from every pattern (the word-boundary check
now does that job).

Expand sentence-delimiter set from ['.', '!', '\n'] to
['.', '!', '?', ';', '\n'] so "What's your view? I prefer Rust."
and "OK; I prefer Rust." are decomposed correctly. ':' is
intentionally NOT a delimiter so "My role: engineer" stays a
single sentence that the "my role" pattern can match.

## Tests

15 unit tests total (was 5):
- happy path (finds_patterns, handles_single_sentence_message)
- negative path (handles_no_matches, ignores_short_sentences,
  rejects_bare_pattern_with_no_content_after)
- cap enforcement (caps_at_max_per_turn)
- word-boundary correctness (alphanumeric reject /
  non-alphanumeric accept)
- expanded delimiters (? and ;)
- expanded patterns (catches_extended_patterns)
- Unicode safety (non_ascii_does_not_panic_or_falsely_match)
- DFA smoke test (preference_dfa_compiles_and_has_expected_pattern_count)
- existing mocked storage + on-turn behaviour tests retained

Adds aho-corasick = "1.1" as a direct dependency (already pulled in
transitively via regex).
@mysma-9403 mysma-9403 requested a review from a team May 28, 2026 22:33
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

This PR introduces the aho-corasick crate and refactors user preference extraction to use an allocation-free Aho-Corasick DFA with explicit sentence splitting, word-boundary validation, and per-turn preference caps, replacing a simpler substring-matching approach.

Changes

User Preference Extraction via Aho-Corasick

Layer / File(s) Summary
Aho-Corasick dependency and pattern setup
Cargo.toml, src/openhuman/learning/user_profile.rs (constants, patterns, DFA, boundary helper)
Adds aho-corasick = "1.1" dependency. Introduces SENTENCE_DELIMITERS (. ! ? ; \n excluding :), MIN_SENTENCE_BYTES, and MAX_PREFERENCES_PER_TURN constants. Expands PREFERENCE_PATTERNS and builds a lazy, ASCII case-insensitive AhoCorasick DFA. Adds sentence_has_preference helper that validates word boundaries at the byte level, rejecting matches at sentence end and rejecting alphanumeric continuations to prevent false positives.
Sentence-based extraction with word boundaries
src/openhuman/learning/user_profile.rs (extract_preferences)
Reimplements extract_preferences to split input by SENTENCE_DELIMITERS, filter sentences by MIN_SENTENCE_BYTES, run DFA matching within each valid sentence, apply sentence_has_preference boundary checks, collect matches, and enforce MAX_PREFERENCES_PER_TURN cap.
Comprehensive test coverage
src/openhuman/learning/user_profile.rs (test module)
Replaces and expands tests with cases for single-sentence matching, max-per-turn behavior, word-boundary correctness (alphanumeric rejection vs punctuation acceptance), end-of-sentence and bare-pattern fragment rejection, delimiter semantics (?/; inclusion, : exclusion), extended pattern coverage, Unicode safety without panics, and DFA pattern count validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A pattern match with Aho-Corasick's flair,
Sentence by sentence, with boundary care,
No false positives lurking in the text,
Word-safe extraction—what's next?
Preferences leap from input's delicate snare.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: replacing substring matching with Aho-Corasick DFA for preference extraction and expanding pattern coverage. It is specific, clear, and reflects the primary performance optimization and feature enhancement in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/openhuman/learning/user_profile.rs (1)

551-559: ⚡ Quick win

Make the pattern-count assertion meaningful (pin expected length)

dfa.patterns_len() will always equal PREFERENCE_PATTERNS.len() because the DFA is built directly from that slice and aho-corasick’s patterns_len counts total compiled patterns (duplicates included). This won’t catch accidental add/drop. PREFERENCE_PATTERNS currently contains 29 entries, so pinning the literal makes the guard real; forcing LazyLock init remains a useful compile/panic smoke check.

♻️ Make the count check meaningful
     fn preference_dfa_compiles_and_has_expected_pattern_count() {
         let dfa = &*PREFERENCE_DFA;
-        assert_eq!(dfa.patterns_len(), PREFERENCE_PATTERNS.len());
+        // Pin the literal so an accidental add/drop of an entry is loud
+        // at CI time; `patterns_len()` alone is always == len() here.
+        assert_eq!(PREFERENCE_PATTERNS.len(), 29);
+        assert_eq!(dfa.patterns_len(), PREFERENCE_PATTERNS.len());
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/learning/user_profile.rs` around lines 551 - 559, The test
preference_dfa_compiles_and_has_expected_pattern_count currently compares
dfa.patterns_len() to PREFERENCE_PATTERNS.len(), which is tautological; change
the assertion to pin the expected literal count (replace the second operand with
29, the current number of entries) so the test fails if entries are accidentally
added/removed, while still forcing LazyLock init via referencing PREFERENCE_DFA.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/learning/user_profile.rs`:
- Around line 551-559: The test
preference_dfa_compiles_and_has_expected_pattern_count currently compares
dfa.patterns_len() to PREFERENCE_PATTERNS.len(), which is tautological; change
the assertion to pin the expected literal count (replace the second operand with
29, the current number of entries) so the test fails if entries are accidentally
added/removed, while still forcing LazyLock init via referencing PREFERENCE_DFA.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6bc3a6c9-944a-44bb-8f17-e6fff39f1d58

📥 Commits

Reviewing files that changed from the base of the PR and between 972e327 and 8c486b4.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • Cargo.toml
  • src/openhuman/learning/user_profile.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 28, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mysma-9403 hey! the code looks good to me, but CI is still pending on several checks (Rust Core Tests, coverage runs, E2E Appium suites, frontend unit tests). once those are green, I'll come back and approve this. let me know if you need any help!

For reference, what I reviewed: the Aho-Corasick DFA swap is a clean, well-justified optimization — the word-boundary logic is correct, the LazyLock initialization is safe, LeftmostFirst + find_iter + .any() short-circuits correctly, and the 15-test suite covers the important edge cases (alphanumeric continuation rejection, bare-pattern empty-tail rejection, expanded delimiters, Unicode safety, DFA pattern count guard). The direct aho-corasick = "1.1" dep is the right call — it's already in the transitive tree via regex, BurntSushi's package, MIT/Unlicense, actively maintained.

senamakel
senamakel previously approved these changes May 29, 2026
@senamakel senamakel dismissed stale reviews from coderabbitai[bot] and themself via 95e6f6a May 29, 2026 03:53
@senamakel senamakel merged commit dae1ef3 into tinyhumansai:main May 29, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants