perf(learning/user_profile): Aho-Corasick DFA for preference extraction + pattern coverage by mysma-9403 · Pull Request #2878 · tinyhumansai/openhuman

mysma-9403 · 2026-05-28T22:33:53Z

Summary

Replaces the str::contains-per-pattern preference scan in UserProfileHook with a single Aho-Corasick DFA pass per sentence, eliminates the previous per-turn String allocations from re-lowercasing, and expands both the pattern coverage and the sentence delimiter set so more real-world preference statements are captured.

This is the same class of optimization as #2842 (prompt-injection RegexSet) and #2870 (routing/quality AC) — moving a hot-path "scan input against N curated patterns" loop from O(N) substring sweeps over a re-allocated lowercase buffer to a single byte-level DFA pass.

What changes

src/openhuman/learning/user_profile.rs:

DFA built once. PREFERENCE_DFA: LazyLock<AhoCorasick> compiled with ascii_case_insensitive(true) + MatchKind::LeftmostFirst. No to_lowercase() of the message or per-sentence trim/lowercase pass any more.
Word-boundary check. The byte immediately after each match must be non-ASCII-alphanumeric and at least one byte of trailing content must exist. This rejects:
- "I preferred X" (alphanumeric continuation — false positive against "i prefer")
- degenerate empty-tail fragments like "I prefer" (the residue of splitting "I prefer.") or dangling "I prefer:" — they carry no preference target and would just pollute the user_profile memory namespace with useless pref/i_prefer slugs.
Fallback path dropped. The previous lower.starts_with(...) post-loop rescue existed only because the substring sweep couldn't catch the "I prefer:X" shape. Proper word boundaries handle it in the main path.
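The word-boundary rule above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: `boundary_ok` and `matches_with_boundary` are hypothetical names, and a plain ASCII-case-insensitive window search stands in for the Aho-Corasick DFA so the snippet needs no external crate. Only the literal rule from the description is implemented (a byte must follow the match, and it must not be ASCII alphanumeric); any further filtering of the tail is not shown.

```rust
/// Boundary rule as stated in the description: a byte must exist right after
/// the match, and that byte must not be ASCII alphanumeric.
fn boundary_ok(haystack: &[u8], match_end: usize) -> bool {
    haystack
        .get(match_end)
        .map_or(false, |b| !b.is_ascii_alphanumeric())
}

/// Hypothetical stand-in for the DFA scan: checks every ASCII-case-insensitive
/// occurrence of `pattern` and accepts the sentence if any occurrence passes
/// the boundary rule. Works on bytes, so non-ASCII input cannot cause a
/// char-boundary panic.
fn matches_with_boundary(sentence: &str, pattern: &str) -> bool {
    let (hay, pat) = (sentence.as_bytes(), pattern.as_bytes());
    if pat.is_empty() {
        return false; // `windows(0)` would panic
    }
    hay.windows(pat.len())
        .enumerate()
        .filter(|(_, w)| w.eq_ignore_ascii_case(pat))
        .any(|(start, _)| boundary_ok(hay, start + pat.len()))
}
```

With this sketch, `matches_with_boundary("I preferred tabs", "i prefer")` is rejected because the match is followed by `r`, while `"I prefer: tabs"` is accepted because `:` is not alphanumeric, which is the shape the old fallback path existed to rescue.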

PREFERENCE_PATTERNS: 20 → 28 entries:

Direct preference: i'd prefer, i would prefer, i'd rather, i dislike
Habit/instruction: please use
Identity/context: my pronouns, my preferred, call me, address me as

Trailing whitespace removed from every pattern (the word-boundary check now does that job).

SENTENCE_DELIMITERS: extended from ['.', '!', '\n'] to ['.', '!', '?', ';', '\n'] so questions and semicolon-joined clauses decompose correctly ("What's your view? I prefer Rust.", "OK; I prefer Rust."). The colon (':') is intentionally not a delimiter, so "My role: engineer" stays one sentence that the "my role" pattern can match.
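A minimal sketch of the delimiter behaviour described above, assuming the constant is a `&[char]` slice passed to `str::split`. The helper name `split_sentences` and the trim-and-drop-empty step are assumptions made for illustration; the PR's actual filtering may differ.

```rust
/// Delimiter set quoted in the PR description; ':' is deliberately absent.
const SENTENCE_DELIMITERS: &[char] = &['.', '!', '?', ';', '\n'];

/// Split a message into trimmed, non-empty sentence slices. The slices borrow
/// from `message`; the only allocation is the Vec that holds them.
fn split_sentences(message: &str) -> Vec<&str> {
    message
        .split(SENTENCE_DELIMITERS)
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}
```

Under these assumptions, `split_sentences("What's your view? I prefer Rust.")` yields `"What's your view"` and `"I prefer Rust"` as separate sentences, while `"My role: engineer"` comes back as a single sentence.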

Cargo.toml: adds aho-corasick = "1.1" as a direct dependency (already pulled in transitively via regex).

Why this matters

UserProfileHook runs on every user turn as a PostTurnHook. The old code shape was:

let lower = message.to_lowercase();              // 1 String alloc per message
for sentence in message.split(...) {
    let trimmed = sentence.trim();
    let sentence_lower = trimmed.to_lowercase(); // 1 String alloc per sentence
    for pattern in PREFERENCE_PATTERNS {         // 20 substring scans per sentence
        if sentence_lower.contains(pattern) { ... }
    }
}

For a 5-sentence user message that's 6 String allocations + 100 substring scans, every turn, on the agent hot path — plus you can't even add patterns cheaply because each one extends the inner loop. The new shape is one DFA, one byte-level pass per sentence, no per-call allocation until a match is emitted. Pattern count is no longer a cost dial.
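The arithmetic behind those figures, written out as a small sketch. The function names `old_cost` and `new_passes` are hypothetical and exist only to make the counting explicit; they model the loop shapes described above, not measured performance.

```rust
/// Work per turn in the old shape: one lowercase copy of the whole message
/// plus one per sentence, and one substring scan per (sentence, pattern) pair.
/// Returns (String allocations, substring scans).
fn old_cost(sentences: usize, patterns: usize) -> (usize, usize) {
    let allocations = 1 + sentences;
    let substring_scans = sentences * patterns;
    (allocations, substring_scans)
}

/// Automaton passes per turn in the new shape: one per sentence, independent
/// of how many patterns are compiled into the automaton.
fn new_passes(sentences: usize, _patterns: usize) -> usize {
    sentences
}
```

For 5 sentences and 20 patterns, `old_cost(5, 20)` gives 6 allocations and 100 scans, matching the numbers above, while `new_passes` stays at 5 whether the pattern slice holds 20 entries or more.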

The pattern expansion is the user-visible win: "I'd prefer concise responses", "Call me Alex", "My pronouns are they/them", "My preferred editor is Helix", "Please use snake_case" — all previously slipped through because they didn't match any of the 20 hard-coded openings. That's the kind of preference statement the memory system exists to capture.

The expanded delimiters (?, ;) close another silent gap: a leading question like "What's the timezone situation? My timezone is PST." previously bled the preamble into the preference sentence, which then either failed the length filter or stored noise alongside the actual preference.

Test plan

cargo fmt
cargo check --manifest-path Cargo.toml
cargo test -p openhuman --lib learning::user_profile — 14 passed
Pre-push hook (pnpm rust:check, compile, lint, lint:commands-tokens) — passed

15 unit tests total (was 5):

happy path (finds_patterns, handles_single_sentence_message)
negative path (handles_no_matches, ignores_short_sentences, rejects_bare_pattern_with_no_content_after)
cap enforcement (caps_at_max_per_turn)
word-boundary correctness — alphanumeric reject and non-alphanumeric accept
expanded delimiters (? and ;)
expanded patterns (catches_extended_patterns — one assertion per new pattern category so any future drop fails loudly)
Unicode safety (non_ascii_does_not_panic_or_falsely_match — Cyrillic, Polish diacritics, emoji prefix)
DFA smoke test (preference_dfa_compiles_and_has_expected_pattern_count — guards against silently dropping a pattern from the slice)
existing mocked-storage and on-turn-behaviour tests retained unchanged

Summary by CodeRabbit

Improvements
- Enhanced user preference extraction accuracy with stricter boundary detection to reduce false matches.
- Improved handling of edge cases and punctuation in preference recognition.
- Optimized performance for preference matching and extraction.

Refactor UserProfileHook::extract_preferences from a per-pattern str::contains sweep over re-lowercased sentences into a single case-insensitive Aho-Corasick DFA pass per sentence.

## What changes

- Build one AhoCorasick DFA over all preference opening phrases at first use (LazyLock<AhoCorasick> with ascii_case_insensitive(true) + MatchKind::LeftmostFirst).
- Drop the previous `let lower = message.to_lowercase()` plus per-sentence trimmed.to_lowercase() + per-pattern sentence_lower.contains(pattern) chain (6 String allocs + 5xN substring scans per 5-sentence turn) for a zero-alloc byte-level scan.
- Add a sentence_has_preference() word-boundary check on the match end: the byte immediately after the match must be non-ASCII-alphanumeric AND at least one byte of trailing content must exist. This rejects both "I preferred X" (alphanumeric continuation = false match against "i prefer") and degenerate empty-tail residue like "I prefer" (the leftover from splitting "I prefer.") which carry no preference target.
- Drop the post-loop lower.starts_with(...) fallback: the previous implementation needed it to rescue "I prefer:X" shapes that the primary sentence-split path couldn't catch. With proper word boundaries the main path handles it directly.

## Pattern coverage

Expand PREFERENCE_PATTERNS from 20 → 28 entries (no breakage of existing matches):

- Direct preference: i'd prefer, i would prefer, i'd rather, i dislike
- Habit/instruction: please use
- Identity/context: my pronouns, my preferred, call me, address me as

Trailing spaces removed from every pattern (the word-boundary check now does that job).

Expand sentence-delimiter set from ['.', '!', '\n'] to ['.', '!', '?', ';', '\n'] so "What's your view? I prefer Rust." and "OK; I prefer Rust." are decomposed correctly. ':' is intentionally NOT a delimiter so "My role: engineer" stays a single sentence that the "my role" pattern can match.

## Tests

15 unit tests total (was 5):

- happy path (finds_patterns, handles_single_sentence_message)
- negative path (handles_no_matches, ignores_short_sentences, rejects_bare_pattern_with_no_content_after)
- cap enforcement (caps_at_max_per_turn)
- word-boundary correctness (alphanumeric reject / non-alphanumeric accept)
- expanded delimiters (? and ;)
- expanded patterns (catches_extended_patterns)
- Unicode safety (non_ascii_does_not_panic_or_falsely_match)
- DFA smoke test (preference_dfa_compiles_and_has_expected_pattern_count)
- existing mocked storage + on-turn behaviour tests retained

Adds aho-corasick = "1.1" as a direct dependency (already pulled in transitively via regex).

coderabbitai · 2026-05-28T22:34:07Z

📝 Walkthrough

This PR introduces the aho-corasick crate and refactors user preference extraction to use an allocation-free Aho-Corasick DFA with explicit sentence splitting, word-boundary validation, and per-turn preference caps, replacing a simpler substring-matching approach.

Changes

User Preference Extraction via Aho-Corasick

| Layer / File(s) | Summary |
| --- | --- |
| Aho-Corasick dependency and pattern setup `Cargo.toml`, `src/openhuman/learning/user_profile.rs` (constants, patterns, DFA, boundary helper) | Adds `aho-corasick = "1.1"` dependency. Introduces `SENTENCE_DELIMITERS` (`. ! ? ; \n` excluding `:`), `MIN_SENTENCE_BYTES`, and `MAX_PREFERENCES_PER_TURN` constants. Expands `PREFERENCE_PATTERNS` and builds a lazy, ASCII case-insensitive `AhoCorasick` DFA. Adds `sentence_has_preference` helper that validates word boundaries at the byte level, rejecting matches at sentence end and rejecting alphanumeric continuations to prevent false positives. |
| Sentence-based extraction with word boundaries `src/openhuman/learning/user_profile.rs` (extract_preferences) | Reimplements `extract_preferences` to split input by `SENTENCE_DELIMITERS`, filter sentences by `MIN_SENTENCE_BYTES`, run DFA matching within each valid sentence, apply `sentence_has_preference` boundary checks, collect matches, and enforce `MAX_PREFERENCES_PER_TURN` cap. |
| Comprehensive test coverage `src/openhuman/learning/user_profile.rs` (test module) | Replaces and expands tests with cases for single-sentence matching, max-per-turn behavior, word-boundary correctness (alphanumeric rejection vs punctuation acceptance), end-of-sentence and bare-pattern fragment rejection, delimiter semantics (`?`/`;` inclusion, `:` exclusion), extended pattern coverage, Unicode safety without panics, and DFA pattern count validation. |
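The pipeline summarised in the table (split, length filter, match, boundary check, cap) can be sketched end to end as follows. This is a standard-library approximation, not the merged implementation: the constant values, the three-entry pattern subset, and the function signatures are assumptions, and the per-pattern loop inside `sentence_has_preference` is exactly what the real code replaces with a single Aho-Corasick pass.

```rust
const SENTENCE_DELIMITERS: &[char] = &['.', '!', '?', ';', '\n'];
const MIN_SENTENCE_BYTES: usize = 10; // hypothetical value
const MAX_PREFERENCES_PER_TURN: usize = 3; // hypothetical value
// Small illustrative subset; every entry must be non-empty.
const PATTERNS: &[&str] = &["i prefer", "call me", "my role"];

/// Stand-in for the DFA scan plus boundary check: true if any pattern occurs
/// (ASCII case-insensitively) and is followed by an existing byte that is not
/// ASCII alphanumeric.
fn sentence_has_preference(sentence: &str) -> bool {
    let hay = sentence.as_bytes();
    PATTERNS.iter().any(|p| {
        let pat = p.as_bytes();
        hay.windows(pat.len())
            .enumerate()
            .filter(|(_, w)| w.eq_ignore_ascii_case(pat))
            .any(|(start, _)| {
                hay.get(start + pat.len())
                    .map_or(false, |b| !b.is_ascii_alphanumeric())
            })
    })
}

/// Split, filter by length, match, and cap. A String is allocated only for
/// sentences that are actually emitted.
fn extract_preferences(message: &str) -> Vec<String> {
    message
        .split(SENTENCE_DELIMITERS)
        .map(str::trim)
        .filter(|s| s.len() >= MIN_SENTENCE_BYTES)
        .filter(|s| sentence_has_preference(s))
        .take(MAX_PREFERENCES_PER_TURN)
        .map(str::to_owned)
        .collect()
}
```

Because `take` sits before `to_owned`, sentences beyond the cap are never copied, which mirrors the "no allocation until a match is emitted" goal stated in the PR description.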

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A pattern match with Aho-Corasick's flair,
Sentence by sentence, with boundary care,
No false positives lurking in the text,
Word-safe extraction—what's next?
Preferences leap from input's delicate snare.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: replacing substring matching with Aho-Corasick DFA for preference extraction and expanding pattern coverage. It is specific, clear, and reflects the primary performance optimization and feature enhancement in the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/openhuman/learning/user_profile.rs (1)
551-559: ⚡ Quick win

Make the pattern-count assertion meaningful (pin expected length)

dfa.patterns_len() will always equal PREFERENCE_PATTERNS.len() because the DFA is built directly from that slice and aho-corasick’s patterns_len counts total compiled patterns (duplicates included). This won’t catch accidental add/drop. PREFERENCE_PATTERNS currently contains 29 entries, so pinning the literal makes the guard real; forcing LazyLock init remains a useful compile/panic smoke check.
♻️ Make the count check meaningful
     fn preference_dfa_compiles_and_has_expected_pattern_count() {
         let dfa = &*PREFERENCE_DFA;
-        assert_eq!(dfa.patterns_len(), PREFERENCE_PATTERNS.len());
+        // Pin the literal so an accidental add/drop of an entry is loud
+        // at CI time; `patterns_len()` alone is always == len() here.
+        assert_eq!(PREFERENCE_PATTERNS.len(), 29);
+        assert_eq!(dfa.patterns_len(), PREFERENCE_PATTERNS.len());
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/learning/user_profile.rs` around lines 551 - 559, The test
preference_dfa_compiles_and_has_expected_pattern_count currently compares
dfa.patterns_len() to PREFERENCE_PATTERNS.len(), which is tautological; change
the assertion to pin the expected literal count (replace the second operand with
29, the current number of entries) so the test fails if entries are accidentally
added/removed, while still forcing LazyLock init via referencing PREFERENCE_DFA.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6bc3a6c9-944a-44bb-8f17-e6fff39f1d58

📥 Commits

Reviewing files that changed from the base of the PR and between 972e327 and 8c486b4.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

Cargo.toml
src/openhuman/learning/user_profile.rs

graycyrus

@mysma-9403 hey! the code looks good to me, but CI is still pending on several checks (Rust Core Tests, coverage runs, E2E Appium suites, frontend unit tests). once those are green, I'll come back and approve this. let me know if you need any help!

For reference, what I reviewed: the Aho-Corasick DFA swap is a clean, well-justified optimization — the word-boundary logic is correct, the LazyLock initialization is safe, LeftmostFirst + find_iter + .any() short-circuits correctly, and the 15-test suite covers the important edge cases (alphanumeric continuation rejection, bare-pattern empty-tail rejection, expanded delimiters, Unicode safety, DFA pattern count guard). The direct aho-corasick = "1.1" dep is the right call — it's already in the transitive tree via regex, BurntSushi's package, MIT/Unlicense, actively maintained.

mysma-9403 requested a review from a team May 28, 2026 22:33

coderabbitai Bot reviewed May 28, 2026

coderabbitai Bot previously approved these changes May 28, 2026

graycyrus reviewed May 28, 2026

senamakel previously approved these changes May 29, 2026

Merge branch 'main' into perf/user-profile-aho-corasick

95e6f6a

senamakel dismissed stale reviews from coderabbitai[bot] and themself via 95e6f6a May 29, 2026 03:53

senamakel merged commit dae1ef3 into tinyhumansai:main May 29, 2026
28 checks passed

senamakel merged 2 commits into tinyhumansai:main from mysma-9403:perf/user-profile-aho-corasick

3 participants
