
Improve skills based on CAO benchmark run 2 findings #89

Merged
pkosiec merged 3 commits into main from pkosiec/benchmark-improvements on May 26, 2026

pkosiec (Member) commented May 25, 2026

Summary

Addresses 4 friction clusters identified in the CAO (Coding Agent Optimization) benchmark run 2 (May 21-22, 2026; Opus 4.7 + Codex 5.5, 25 tasks).

Changes

P2: OAuth auth guidance (Cluster A -- 16/25 tasks affected)

  • databricks-core/SKILL.md: Add troubleshooting row for "OAuth Token not supported for current auth type pat" directing to databricks auth login
  • databricks-apps/SKILL.md: Note that databricks apps logs requires OAuth in Post-Deploy Verification
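For illustration only, the troubleshooting row could be mirrored by a small check like the one below. It is a minimal sketch, not part of this PR: the helper name `remediation_for` is hypothetical, and the only facts taken from the description are the error text and the `databricks auth login` command it points to.

```python
from typing import Optional

# Error text quoted from the new troubleshooting row.
PAT_OAUTH_ERROR = "OAuth Token not supported for current auth type pat"


def remediation_for(cli_output: str) -> Optional[str]:
    """Return the suggested command if the output contains the PAT/OAuth error.

    Hypothetical helper: it only matches the error string and does not
    run or inspect the Databricks CLI itself.
    """
    if PAT_OAUTH_ERROR.lower() in cli_output.lower():
        return "databricks auth login"
    return None
```

A caller would pass the captured output of a failing command (for example `databricks apps logs`) and surface the returned command to the user instead of retrying with the same PAT.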

P4: Model Serving endpoint readiness (Cluster F -- regression on both models)

  • databricks-model-serving/SKILL.md: Add "Endpoint Readiness" subsection with state transitions, provisioning times, and poll-before-query guidance. Enhance troubleshooting for NOT_READY state.
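The poll-before-query guidance can be sketched roughly as follows. This is an assumption-laden illustration rather than the content of the skill: `wait_until_ready` and `get_state` are hypothetical names, the `"READY"` value is assumed as the counterpart of the `NOT_READY` state mentioned above, and the timeout and interval defaults are placeholders, not documented provisioning times.

```python
import time
from typing import Callable


def wait_until_ready(
    get_state: Callable[[], str],
    timeout_s: float = 1200.0,   # placeholder, not a documented provisioning time
    interval_s: float = 15.0,    # placeholder polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it returns "READY" or the timeout elapses.

    get_state is supplied by the caller (e.g. a wrapper around whatever
    client fetches the endpoint's readiness). Returns True if the endpoint
    became ready, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "READY":  # assumed ready value
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Querying the endpoint only after this returns `True` avoids sending requests while it is still in the `NOT_READY` state; injecting `sleep` and `clock` keeps the loop testable without real waiting.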

P5: Off-platform anti-pattern (Drizzle task -- Codex scores 35)

  • databricks-lakebase/references/off-platform.md: Add callout that off-platform apps are NOT Databricks Apps. Add "Running Locally" section with standard Node.js commands.
  • databricks-lakebase/SKILL.md: Strengthen off-platform reference description.

P6: Lakehouse Sync UI-only limitation (Cluster G)

  • databricks-lakebase/references/lakehouse-sync.md: Strengthen warning: "Do NOT attempt to automate this step."
  • databricks-lakebase/SKILL.md: Add "(UI-only)" annotation to Lakehouse Sync reference entry.

Context

The benchmark also identified these clusters that are addressed elsewhere:

Test plan

  • python3 scripts/skills.py validate passes

This pull request and its description were written by Isaac.

pkosiec added 2 commits May 25, 2026 18:24
Address 4 friction clusters from the May 2026 benchmark:
- OAuth auth: add PAT limitation troubleshooting and note on apps logs
- Model Serving: add endpoint readiness section with state transitions
- Off-platform: strengthen anti-pattern warning (not Databricks Apps)
- Lakehouse Sync: make UI-only limitation more prominent

Co-authored-by: Isaac
@pkosiec pkosiec force-pushed the pkosiec/benchmark-improvements branch from 7454094 to 861875f Compare May 25, 2026 17:11
@pkosiec pkosiec marked this pull request as ready for review May 25, 2026 17:13
@pkosiec pkosiec requested review from a team, lennartkats-db and simonfaltum as code owners May 25, 2026 17:13
@pkosiec pkosiec enabled auto-merge (squash) May 25, 2026 17:15
keugenek (Contributor) commented

Latest dev apps_mcp_nightly runs (googfood) — clean baseline ahead of this bump:

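| Date   | Generate | Edit   | Aggregate | Run             |
|--------|----------|--------|-----------|-----------------|
| May 25 | 5/5 ✅   | 5/5 ✅ | SUCCESS   | 858189680096134 |
| May 24 | 5/5 ✅   | 5/5 ✅ | SUCCESS   | 323882092553585 |
| May 23 | 5/5 ✅   | 5/5 ✅ | SUCCESS   | 356025760945950 |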

(Prod and dogfood nightlies have unrelated infra issues that pre-date this PR — pointed those out separately.)

@pkosiec pkosiec requested a review from fjakobs as a code owner May 26, 2026 10:45
@pkosiec pkosiec merged commit fd56b87 into main May 26, 2026
@lennartkats-db lennartkats-db deleted the pkosiec/benchmark-improvements branch May 26, 2026 11:10

3 participants