feat(copilot): ground Ask Ontos in concept docs corpus (#280) by mvkonchits-db · Pull Request #472 · databrickslabs/ontos

mvkonchits-db · 2026-05-29T14:40:18Z

Summary

This is PR 1 of the Ask Ontos uplift (#280): a grounding & system-prompt foundation for the in-product copilot. PR 2 (Smart Copilot Insights, the original #280 scope) will land on top of this.

New docs/handbook/ corpus (14 files, ~3900 lines) — reference docs the LLM grounds in for "what is X?" / "how does Y work?" / "what's the difference between A and B?" questions. Covers roles & RBAC, ODPS/ODCS lifecycles, agreements, ontology + KG, data quality (incl. DQX flow), delivery modes vs methods, MCP, asset model, install + troubleshooting. Anonymized; aligned with the pitch deck + CUJ doc. Every major section has an explicit {#kebab-anchor} for stable citation.
New search_ontos_handbook tool (src/backend/src/tools/handbook.py) — grep-based search returning file.md#anchor citation URIs. Layout-agnostic path resolution (env-var override + walk-up handles both local-dev and deployed app layouts; degrades gracefully if corpus is absent).
System prompt (src/backend/src/tools/system_prompts.py) — tool-first policy for conceptual questions, refusal template, three-tier confidence labels (internal, stripped server-side), hidden citation discipline, vocabulary primer matched to the pitch deck + CUJ. Wires the LLM_SYSTEM_PROMPT env override that was previously dead code. get_system_prompt(settings, *, role, page_name, selected_entity, adoption_mode) accepts personalization slots for Phase 2/3 to fill.
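
A minimal sketch of how the override and the personalization slots could fit together. The `Settings` dataclass and `DEFAULT_PROMPT` text here are hypothetical stand-ins, not the actual contents of system_prompts.py; only the `get_system_prompt` signature and the `LLM_SYSTEM_PROMPT` name come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder text (assumption); the real default prompt is far longer.
DEFAULT_PROMPT = "You are Ask Ontos. Use search_ontos_handbook for conceptual questions."


@dataclass
class Settings:
    """Hypothetical stand-in for the app's Settings object."""
    LLM_SYSTEM_PROMPT: Optional[str] = None


def get_system_prompt(
    settings: Settings,
    *,
    role: Optional[str] = None,
    page_name: Optional[str] = None,
    selected_entity: Optional[str] = None,
    adoption_mode: Optional[str] = None,
) -> str:
    """Return the configured override if present, else the built-in default.

    The keyword-only slots are accepted but ignored here, mirroring the
    v1 behaviour where later phases are expected to fill them.
    """
    override = (settings.LLM_SYSTEM_PROMPT or "").strip()
    return override if override else DEFAULT_PROMPT
```

Keeping the slots keyword-only means later phases can start passing `role=` or `page_name=` without changing existing call sites.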

What changed in `llm_search_manager.py`

Removed hardcoded ~140-line SYSTEM_PROMPT constant; now calls get_system_prompt(settings=...).
Strips internal grounding markers from the user-visible response: hidden citation comments and [Confirmed]/[Documented]/[Inferred] confidence labels. The model still emits both (so it stratifies its grounding), but they're filtered before reaching the chat UI. Both are captured in debug_info["internal_citations"] and debug_info["confidence_labels"] for audit.

Response to review (updates 2026-06-02 / 2026-06-03)

Four follow-up commits address @larsgeorge-db's review:

refactor(copilot): move system_prompts to tools/ — file was in controller/ but it's not a controller (no state, no business logic, just a function returning a string). Moved next to the MCP/LLM tools registry.
refactor(copilot): rename concepts → handbook to free up the "Concept" namespace — "Concept" is already a noun in Ontos for RDF/SKOS ontology terms (URIs in the knowledge graph). Overloading the same noun for the LLM grounding corpus caused confusion. Renamed: docs/concepts/ → docs/handbook/; tools/concepts.py → tools/handbook.py; search_ontos_concepts → search_ontos_handbook; ONTOS_CONCEPTS_DIR → ONTOS_HANDBOOK_DIR.
feat(copilot): instruct the LLM on multi-language handling — added a ## Language section to the system prompt: answer in the user's language, keep Ontos UI labels and concept names in English exactly as they appear in the app.
fix(copilot): replace undefined _DEFAULT_HANDBOOK_DIR in resolve-failure log — latent NameError on the corpus-not-found unhappy path (the log f-string referenced a symbol that was never defined). Replaced with the env-var constant we actually have.

Deferred to follow-up issue #489:

Encoding the corpus as RDF / SPARQL-queryable knowledge-graph extension instead of markdown (Lars' "rather encoded in the Ontos Ontology" suggestion). Interesting long-term direction; markdown is the right tool for the corpus's current size and authoring profile.
Embedding-based retrieval to replace grep (shared infrastructure with [PRD]: Ontology Term Mapping (Bulk Suggest & Apply) #469 ontology-term-mapping work).
Full multi-locale translation of the handbook corpus (deferred; the prompt instruction above is the minimum-viable handling).

Closes #280

This pull request and its description were written by Isaac.

Add 13 markdown files under docs/concepts/ that serve as the grounding corpus for the Ask Ontos copilot. Covers:
- roles & RBAC + permission model
- data product / data contract lifecycles
- agreement workflow (workflow vs execution vs agreement)
- ontology and knowledge-graph model, semantic linking (three-tier)
- data quality + DQX integration end-to-end
- delivery modes vs delivery methods (disambiguated)
- MCP and Ask Ontos surfaces
- asset model
- personas quick-reference
- end-to-end flows (bottom-up UC -> catalog, top-down ontology -> assets)

Every major section carries an explicit {#kebab-anchor} so the copilot can cite via search_ontos_concepts in a follow-up commit. Citations are hidden from end-users in v1; the corpus is LLM grounding, not a user-facing docs site. Vocabulary aligned with the pitch deck + CUJ doc (ODPS v1.0.0, ODCS v3.1.0). Forward-compatibility softening applied for several in-flight PRs (versioning, Ontos admin decoupling, approver-role filter, etc.) without naming them.

Co-authored-by: Isaac

Make the in-product copilot citation-anchor conceptual answers to the new docs/concepts/ corpus.
- Add SearchOntosConceptsTool that walks docs/concepts/, parses sections by heading and {#anchor}, returns top-K excerpts ranked by title > anchor > body keyword frequency. Each match returns file, anchor, title, excerpt, source_uri (file.md#anchor).
- Add 'concepts' query-classifier category in DEFAULT_CATEGORIES and ALWAYS_INCLUDED_CATEGORIES so the tool is offered on every conceptual question.
- Extract hardcoded SYSTEM_PROMPT into a new controller/system_prompts.py module exposing get_system_prompt() with personalization slots (role, page_name, selected_entity, adoption_mode) for Phase 2/3 to fill. v1 ignores the slots.
- Honor LLM_SYSTEM_PROMPT env override (previously defined in Settings but never consumed).
- New default system prompt: vocabulary primer aligned with the pitch deck + CUJ doc, tool-first policy for conceptual questions, three-tier confidence labels ([Confirmed]/[Documented]/[Inferred]), hidden citation discipline, strict refusal template, out-of-scope deflection.

Tests:
- 13 unit tests for SearchOntosConceptsTool (empty query, known concept, multi-doc concept, no-match, anchor extraction)
- 6 integration tests for /api/llm-search/chat with the new tool + system prompt
- Full unit suite passes (1011/1011); no regressions

Co-authored-by: Isaac
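
The section parsing and ranking described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual SearchOntosConceptsTool: the function names, the numeric weights, and the 200-character excerpt length are all invented here; only the returned keys (file, anchor, title, excerpt, source_uri) and the title > anchor > body ordering come from the commit message.

```python
import re
from typing import Dict, List

# Matches markdown headings such as "## Teams {#teams}"; the anchor is optional.
HEADING_RE = re.compile(
    r"^#{1,6}\s+(?P<title>.*?)\s*(?:\{#(?P<anchor>[a-z0-9-]+)\})?\s*$"
)

# Assumed weights: a title hit outranks an anchor hit, which outranks a body hit.
TITLE_W, ANCHOR_W, BODY_W = 5, 3, 1


def parse_sections(filename: str, text: str) -> List[Dict[str, str]]:
    """Split one markdown document into sections keyed by heading."""
    sections: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        m = HEADING_RE.match(line)
        if m:
            current = {"file": filename, "title": m.group("title"),
                       "anchor": m.group("anchor") or "", "body": ""}
            sections.append(current)
        elif current is not None:
            current["body"] += line + "\n"
    return sections


def score(section: Dict[str, str], tokens: List[str]) -> int:
    """Weighted keyword frequency across title, anchor and body."""
    title = section["title"].lower()
    anchor = section["anchor"].lower()
    body = section["body"].lower()
    return sum(
        TITLE_W * title.count(t) + ANCHOR_W * anchor.count(t) + BODY_W * body.count(t)
        for t in tokens
    )


def search(sections: List[Dict[str, str]], query: str, top_k: int = 3) -> List[Dict[str, str]]:
    """Return up to top_k matching sections, best score first."""
    # Drop very short tokens ("is", "a") so they do not dominate the score.
    tokens = [t for t in re.findall(r"[a-z0-9]+", query.lower()) if len(t) > 2]
    ranked = sorted(((score(s, tokens), s) for s in sections),
                    key=lambda pair: pair[0], reverse=True)
    return [
        {"file": s["file"], "anchor": s["anchor"], "title": s["title"],
         "excerpt": s["body"].strip()[:200],
         "source_uri": f'{s["file"]}#{s["anchor"]}'}
        for sc, s in ranked if sc > 0
    ][:top_k]
```

An empty query produces no tokens, so every section scores zero and the result is an empty list, which matches the "empty query" and "no-match" cases listed in the tests.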

The previous resolution walked exactly 5 parents above concepts.py to find docs/concepts/. That assumed the local-dev layout (with src/ as a wrapper) and silently broke in deployed Databricks Apps where src/ is stripped (so the corpus lives 4 parents up, not 5).

Replace with:
- ONTOS_CONCEPTS_DIR env var override (explicit, takes precedence)
- Walk-up search across parents 2..6 looking for docs/concepts/
- Graceful None on miss (tool still returns success=True, empty matches)

Verified for both layouts:
- Local: <ontos>/src/backend/src/tools/concepts.py -> finds at parents[4]
- Deployed: <approot>/backend/src/tools/concepts.py -> finds at parents[3]

Co-authored-by: Isaac
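
A rough sketch of the resolution order described in this commit, using the names as they stood at this point in the PR (ONTOS_CONCEPTS_DIR and docs/concepts/, later renamed to the handbook equivalents). The function name `resolve_corpus_dir`, the injectable `env` parameter, and the choice to fall through to the walk-up when the override does not point at a directory are assumptions made for illustration.

```python
import logging
import os
from pathlib import Path
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

ENV_VAR = "ONTOS_CONCEPTS_DIR"
CORPUS_SUBPATH = Path("docs") / "concepts"


def resolve_corpus_dir(start: Path, env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Locate the corpus directory, or return None if it cannot be found.

    `start` is the path of the tool module (e.g. Path(__file__)).
    """
    # 1. An explicit override takes precedence.
    override = env.get(ENV_VAR)
    if override:
        candidate = Path(override)
        if candidate.is_dir():
            return candidate
        # Assumption: an invalid override falls through to the walk-up.

    # 2. Walk up parents[2]..parents[6], covering both repo layouts.
    parents = start.resolve().parents
    for depth in range(2, 7):
        if depth >= len(parents):
            break
        candidate = parents[depth] / CORPUS_SUBPATH
        if candidate.is_dir():
            return candidate

    # 3. Degrade gracefully: callers treat None as "no matches".
    logger.warning("Corpus not found; tried %s and parents 2..6 of %s", ENV_VAR, start)
    return None
```

With the local layout the corpus is found at `parents[4]` and with the deployed layout at `parents[3]`, so a fixed range covers both without the module needing to know which layout it is running in.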

…onse

The system prompt asks the model to anchor conceptual answers with hidden HTML-comment citations (e.g. ``) so reviewers can audit grounding without exposing them to end users. Most markdown renderers drop HTML comments on render, but the chat UI surfaces them as visible text, which is what live E2E confirmed.

Add a server-side strip in LlmSearchManager:
- `_CITATION_COMMENT_RE` matches `` (non-greedy)
- `_strip_internal_citations` returns (cleaned_text, [refs]) so debug_info retains the citations for audit while the user-facing response is clean
- Applied at the inner-loop final return; collapses any 3+ newlines created by the strip back to double

Citations remain accessible via `debug_info["internal_citations"]` when the client sets `debug=True`.

Co-authored-by: Isaac

Add a 14th file covering install (Marketplace vs Git), update procedures, maintenance, and common UI errors. 37 anchors so any specific error can be cited.

Topics:
- Distribution channels: Marketplace vs GitHub repo, when to choose which
- First install: prerequisites, first-admin bootstrap, demo presets
- Updates: Marketplace path, Git path, migration discipline (append-only, ≤32-char revision IDs), DB state vs code state
- Maintenance: alembic at startup, role re-seeding (first-start-only), workspace sync direction (from src/), OAuth scope-change cookie gotcha, customer fork hygiene
- UI errors users actually see:
  * Identity: Request role prompt, unexpected 403s, UC scope missing
  * Workflows: Cannot approve, grant_permissions failed (MANAGE required)
  * Database: Alembic version too long, Lakebase autoscale stuck, stale data after git revert
  * Deploy: Process did not start in 10 min, corpus not found

6 customer-voice "Common questions". Cross-references to roles-and-rbac, agreement-workflow, delivery-and-propagation, mcp-and-ask-ontos. No customer names, no internal ticket IDs. README.md updated to 14 files; verification footer bumped to 2026-05-29.

Co-authored-by: Isaac

…ponse

The labels [Confirmed]/[Documented]/[Inferred] were emitted user-visible per the v1 system prompt, but they expose grounding mechanics that don't belong in the surfaced answer. Treat them the same way as citation comments: emit them so the model still stratifies confidence and so reviewers can audit grounding, but strip server-side before returning.
- Add `_CONFIDENCE_LABEL_RE` and extend `_strip_internal_citations` to a 3-tuple return (cleaned_text, citations, confidence_labels)
- Surface both into debug_info (`internal_citations`, `confidence_labels`) so audit consumers can still see them via `debug=true`
- Update system prompt to declare the labels internal/stripped (so the model knows the act of stratifying matters even though they're hidden)

Co-authored-by: Isaac

The model was opening conceptual answers with a bolded restatement of the user's question (e.g. **What is a Team?** followed by the answer). That's redundant in the chat thread where the user already sees their own question above, and reads as noise.

Update the Response format section to explicitly forbid:
- restating, echoing, or rephrasing the question
- bolded-question headers as openers
- "Great question!" / "Let me explain..." fillers

Begin with the answer directly.

Co-authored-by: Isaac

larsgeorge-db · 2026-05-30T11:20:42Z

@@ -0,0 +1,97 @@
+# Ontos Concept Corpus (LLM Grounding)


What about the other supported languages? Is grounding in English (for fact gathering) good enough to answer in another language?

Added a ## Language section in 3c26d22: LLM answers in user's language, keeps Ontos UI labels in English. Full corpus translation deferred to #489.

larsgeorge-db · 2026-05-30T11:23:47Z

@@ -0,0 +1,201 @@
+"""
+System-prompt assembly for the Ask Ontos copilot.


If this is possibly shared with other parts of the app, should this not live under tools (or similar, like common or a more suitable module), that is, outside of controller? The MCP/LLM tools registry also lives there already.

Moved to src/backend/src/tools/system_prompts.py in c687f11.

larsgeorge-db · 2026-05-30T11:26:57Z

+# Ontos Concept Corpus (LLM Grounding)
+
+These documents are **internal grounding material for the Ask Ontos copilot**,
+not user-facing product documentation. They define the canonical vocabulary,


I wonder if this should rather be encoded in the actual Ontos Ontology, or rather, in an RDF-based extension that adds this information to the knowledge graph and makes it available via SPARQL as well.

What irritates me is calling this "concepts", since we already use this for ontology terms.

Naming: renamed concepts → handbook across docs, tool, env var (8c07adf) — frees up Concept for the ontology-term meaning. RDF/SPARQL encoding of the corpus is a longer-term direction; tracked in #489 with the rationale for keeping markdown at current corpus size.

larsgeorge-db · 2026-05-30T11:29:25Z

+
+The corpus is treated as read-only at runtime; the tool walks the
+directory on every call (it's small — 13 files, ~100KB) and tokenizes
+the query for a simple title/anchor/body-frequency match. Intentionally


We may want to add this as a follow-up task. We need embeddings sooner or later anyway, for example for #469.

Filed as follow-up in #489 — will free-ride on the embedding infra from #469.

larsgeorge-db · 2026-05-30T11:29:56Z

+    # The matching tool — search_ontos_concepts — grounds the LLM in
+    # the curated docs/concepts/ corpus.
+    "concepts": [
+        "what is", "what's", "what are", "how does", "how do",


Same note as earlier... what about other user languages?

Same as the README thread: cheap fix is the ## Language prompt rule (3c26d22). Multi-locale classifier keywords would unlock better routing for non-English queries — tracked in #489.

system_prompts.py is not a controller — it's a pure templating function with no state, no business logic, no DB or manager dependencies. Move it next to the MCP/LLM tools registry it actually feeds, where the file structure already groups everything the copilot reaches for. Per Lars' review on PR #472. Co-authored-by: Isaac

…" namespace (#280) Ontos already uses "Concept" for an RDF/SKOS ontology entity — a node in the knowledge graph identified by an IRI, surfaced as an ontology class or glossary term. Overloading the same noun for the LLM grounding corpus caused real confusion: `tools/concepts.py` was unrelated to the `Concept` entity, `search_ontos_concepts` could mean either, and "concept corpus" / "concept docs" prose blurred the distinction throughout the codebase. This commit renames the LLM grounding corpus to "handbook" everywhere — `docs/concepts/` → `docs/handbook/`, `tools/concepts.py` → `tools/handbook.py`, the LLM-callable tool `search_ontos_concepts` → `search_ontos_handbook`, the `ONTOS_CONCEPTS_DIR` env var → `ONTOS_HANDBOOK_DIR`, and the `concepts` query-classifier category → `handbook`. "Concept" is now reserved for the ontology entity. Per Lars' review on PR #472. Co-authored-by: Isaac

The handbook corpus is English-only, but the Ontos UI ships in seven locales (English, German, Spanish, French, Italian, Japanese, Dutch). Today the system prompt gives the LLM no guidance on what to translate vs. keep in English — so non-English answers end up translating product nouns ("Datenprodukt", "Lieferbar") that don't exist anywhere in the UI. Add a `## Language` section that tells the model: answer in the user's language, but keep Ontos product nouns and UI labels in English exactly as they appear in the app. Per Lars' review on PR #472. Co-authored-by: Isaac

…ure log The corpus-not-found warning referenced a symbol that was never defined (carried over from the original concepts.py implementation; the constant was removed but the log f-string still referenced it). This was a latent NameError on the unhappy path — if the resolver ever returned None the log emit itself would crash before the empty-matches ToolResult could be returned. Replace with the env-var constant we actually have, so the warning tells operators exactly which knobs were tried. Co-authored-by: Isaac

mvkonchits-db added 7 commits May 29, 2026 14:02

mvkonchits-db requested a review from a team as a code owner May 29, 2026 14:40

larsgeorge-db reviewed May 30, 2026

View reviewed changes

mvkonchits-db added 4 commits June 2, 2026 11:07

This was referenced Jun 3, 2026

feat(copilot): Ask Ontos uplift — personalization, smart prompts, audience contract (#280) #488

Draft

[Improvement]: Ask Ontos handbook — deferred follow-ups from PR #472 review #489

Open

mvkonchits-db wants to merge 11 commits into main from feature/ask-ontos-uplift-pr1


2 participants
