
ci: run e2e quick-mode suite on every PR #8

Open
zpzjzj wants to merge 13 commits into main from ci/add-e2e-job

Conversation


@zpzjzj zpzjzj commented May 14, 2026

Summary

  • Add an E2E (quick) job to .github/workflows/ci.yml that runs go test -tags e2e -timeout 600s ./e2e.
  • Slow LLM-dependent cases stay gated behind SKILL_UP_FULL_E2E; the rest cover the CLI, runner, agent wiring (incl. the fake-codex workspace-diff pipeline), and contract tests.
  • No production code touched — CI surface only.

Test plan

  • go test -tags e2e -timeout 300s ./e2e locally with a clean HOME and PATH: pass.
  • GitHub Actions E2E (quick) job green on this PR.

🤖 Generated with Claude Code

zpzjzj added 7 commits May 14, 2026 17:08
Run the build-tag-gated e2e suite (./e2e with -tags e2e) on every PR.
Slow LLM-dependent tests stay gated behind SKILL_UP_FULL_E2E and are
skipped in CI; the remaining cases exercise the CLI, runner, agent
wiring, and workspace-diff pipeline against fake/local fixtures.
Previously the e2e job only ran the build-tag-gated tests with no engine
CLIs on PATH, so every claude / codex / qodercli case short-circuited
via skipIf* and the suite finished in ~3s — not actually exercising the
none-runtime pipeline.

Install the three engine CLIs explicitly on the runner so installation
checks, --version probes, and none-runtime wiring tests run against
real binaries. Tests that need a live model stay gated behind
SKILL_UP_FULL_E2E (unchanged) since CI has no secrets.
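
For context, the skipIf* guards presumably reduce to a PATH probe along these lines (helper name and skip message are assumptions, not the repo's actual code):

```go
package e2e

import (
	"os/exec"
	"testing"
)

// skipIfNoCLI sketches the skipIf* guards mentioned above: a case is
// skipped whenever the engine CLI is absent from PATH, which is why the
// whole suite used to finish in ~3s on a bare runner.
func skipIfNoCLI(t *testing.T, bin string) {
	t.Helper()
	if _, err := exec.LookPath(bin); err != nil {
		t.Skipf("%s not found on PATH; skipping", bin)
	}
}
```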
Wire DASHSCOPE_API_KEY and QODER_ACCESS_TOKEN into the e2e job so the
LLM-dependent tests (TestAgent_ClaudeCode_*, CodeStatsBenchmark, etc.)
actually run instead of skipping.

- DASHSCOPE_BASE_URL / ANTHROPIC_BASE_URL → DashScope Anthropic-compat
  endpoint, since claude-code is the engine under test for full-mode cases.
- OPENAI_BASE_URL → DashScope OpenAI-compat endpoint for codex.
- QODER_PERSONAL_ACCESS_TOKEN maps from the repo secret QODER_ACCESS_TOKEN.

Bumped the test timeout to 30m to accommodate the real-model paths.
examples/code-stats ships both eval.yaml and evals.json. Auto-detection
resolves eval.yaml first, so the previous "Auto-detected Anthropic
evals.json" assertion could only pass when evals.json was the sole eval
descriptor — which has not been true in this repo. The test was
effectively dead because SKILL_UP_FULL_E2E was never set in CI; enabling
full mode surfaced the stale assertion.
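
A sketch of the detection order at issue, assuming auto-detection is a simple first-match probe (the function name is illustrative):

```go
package e2e

import (
	"os"
	"path/filepath"
)

// detectEvalDescriptor mirrors the resolution order described above:
// eval.yaml wins over evals.json when both are present, which is why the
// old "Auto-detected Anthropic evals.json" assertion could never fire
// against examples/code-stats.
func detectEvalDescriptor(dir string) (string, bool) {
	for _, name := range []string{"eval.yaml", "evals.json"} {
		p := filepath.Join(dir, name)
		if _, err := os.Stat(p); err == nil {
			return p, true
		}
	}
	return "", false
}
```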
TestAgent_Codex_NoneRuntime hardcodes model gpt-5.4 in its eval.yaml.
Against the DashScope OpenAI-compat endpoint that model is unknown, so
codex retries until the case timeout and the test only "passes" because
its assertions are lenient (artifacts exist) — codex itself never
returns a real response.

The provider-env override path (lookupProviderEnv → setResolvedValue)
treats OPENAI_MODEL as authoritative when present, so setting it here
swaps the runtime model to qwen3.6-plus without touching the test.
DASHSCOPE_MODEL is added for symmetry with the dashscope-provider tests.
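
A minimal sketch of that precedence, assuming lookupProviderEnv → setResolvedValue reduces to a per-provider env-var lookup (the helper below is illustrative, not the repo's actual signature):

```go
package provider

import "os"

// resolveModel sketches the override path described above: an explicit
// *_MODEL env var is authoritative and replaces the resolved model
// without touching the test's eval.yaml.
func resolveModel(provider, fallback string) string {
	envKey := map[string]string{
		"openai":    "OPENAI_MODEL",
		"anthropic": "ANTHROPIC_MODEL",
		"dashscope": "DASHSCOPE_MODEL",
	}[provider]
	if v := os.Getenv(envKey); v != "" {
		return v // e.g. qwen3.6-plus swapped in by CI
	}
	return fallback // e.g. the hardcoded gpt-5.4 from eval.yaml
}
```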
internal/cli/import.go pins the judge to "anthropic/claude-sonnet-4-6",
which would route to real Anthropic Claude even with the DashScope base
URL set. ANTHROPIC_MODEL takes precedence through lookupProviderEnv, so
every anthropic-provider hop in e2e now lands on qwen3.6-plus instead.

Combined with OPENAI_MODEL and DASHSCOPE_MODEL, all model-bearing paths
in the e2e suite use qwen3.6-plus.
Previously codex's runProviderConfig short-circuited on provider=="openai"
and only handed buildCodexRunCmd a bare BaseURL, which downstream
translated into "-c openai_base_url=...". Codex's bundled "openai"
model_provider config wins over that key, so the env-supplied base URL
was silently ignored and codex always dialed api.openai.com — surfacing
in CI as TestAgent_Codex_NoneRuntime hitting context deadline against
DashScope.

Treat provider=openai with a non-empty BaseURL the same way we already
treat custom providers: emit a full model_provider="openai" +
model_providers.openai.{name,base_url,env_key,wire_api} override so
codex routes traffic to the configured endpoint. When BaseURL is unset
the legacy built-in openai provider is used unchanged.
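
A minimal sketch of the override, assuming buildCodexRunCmd ultimately flattens it into -c key=value flags; the env_key and wire_api values here are assumptions, not confirmed from the source:

```go
package agent

import "fmt"

// codexProviderOverride emits the model_provider + model_providers.*
// keys named above for provider=openai with a custom BaseURL. With no
// BaseURL, codex's builtin openai provider is left untouched.
func codexProviderOverride(baseURL string) []string {
	if baseURL == "" {
		return nil // legacy path: builtin openai provider
	}
	return []string{
		"-c", `model_provider="openai"`,
		"-c", `model_providers.openai.name="openai"`,
		"-c", fmt.Sprintf("model_providers.openai.base_url=%q", baseURL),
		"-c", `model_providers.openai.env_key="OPENAI_API_KEY"`,
		"-c", `model_providers.openai.wire_api="chat"`,
	}
}
```

Note that two later commits in this PR move this off the literal "openai" key (first to "openai-override", then "skill-up-openai") precisely because codex 0.80.x's builtin provider won't reliably merge these overrides.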
@zpzjzj zpzjzj requested review from lbfsc and lijunfeng722 May 14, 2026 10:12
zpzjzj added 6 commits May 14, 2026 18:17
…der key

The previous attempt reused the literal "openai" provider name when
emitting -c flags. Codex 0.80.x ships its own builtin openai provider
config under that key and won't reliably merge overrides onto it — it
keeps dialing api.openai.com regardless. Use a separate
"openai-override" key so codex constructs a fresh provider definition
from the -c flags alone and routes traffic to the configured BaseURL.
…t naming

- ci.yml: gate every e2e step on a "Check secret availability" probe so fork
  PRs (where secrets.DASHSCOPE_API_KEY is empty) skip cleanly without burning
  CI minutes on doomed installs. Also relax the ANTHROPIC_MODEL comment so it
  no longer name-checks a specific file path that might rot.
- codex.go: rename codexOpenAIOverrideProvider to "skill-up-openai" so the
  string that lands in real codex command lines is self-attributing.
- codex_test.go: add an integration-style assertion on buildCodexRunCmd output
  to lock in the exact -c flag set emitted for provider=openai + custom
  BaseURL (the existing BaseURL test only covered the legacy fallback path);
  a sketch follows this list.
- cli_test.go: rename TestCLI_RunAutoWithExamples → TestCLI_RunAutoPrefersEvalYAML
  to reflect what it actually verifies; keep both eval.yaml and evals.json in
  the example fixture so each detector path stays exercised.
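
For illustration, the shape that golden assertion might take, with a minimal stand-in for the flag builder (the real assembly lives in buildCodexRunCmd; the names, env_key value, and wire_api choice are assumptions):

```go
package agent

import (
	"fmt"
	"strings"
	"testing"
)

// skillUpOpenAIFlags is a stand-in for the override after the rename.
func skillUpOpenAIFlags(baseURL string) []string {
	return []string{
		"-c", `model_provider="skill-up-openai"`,
		"-c", `model_providers.skill-up-openai.name="skill-up-openai"`,
		"-c", fmt.Sprintf("model_providers.skill-up-openai.base_url=%q", baseURL),
		"-c", `model_providers.skill-up-openai.env_key="OPENAI_API_KEY"`,
		"-c", `model_providers.skill-up-openai.wire_api="chat"`,
	}
}

// TestCodexFlags_SkillUpOpenAI locks in the self-attributing provider
// key and the configured base URL in the emitted -c flag set.
func TestCodexFlags_SkillUpOpenAI(t *testing.T) {
	got := strings.Join(skillUpOpenAIFlags("https://example.test/v1"), " ")
	for _, want := range []string{
		`model_provider="skill-up-openai"`,
		`base_url="https://example.test/v1"`,
	} {
		if !strings.Contains(got, want) {
			t.Fatalf("emitted flags missing %q:\n%s", want, got)
		}
	}
}
```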
…unparam

golangci-lint's unparam linter flagged buildCodexRunCmd's instruction
parameter because all three existing tests pass the literal "say hi".
The freshly added skill-up-openai golden test reused the same string,
so the warning remained valid. Use a different instruction so unparam
observes variance and the lint passes.
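
A toy reproduction of the unparam finding (identifiers here are illustrative):

```go
package agent

// When every call site in the package passes the same literal, unparam
// reports the parameter as effectively constant.
func run(instruction string) string { return "codex: " + instruction }

var (
	_ = run("say hi")      // existing tests all used this literal
	_ = run("say hi")      // ...as did the new golden test at first
	_ = run("tell a joke") // varying one call site silences unparam
)
```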
TestAgent_Codex_NoneRuntime only asserts workspace plumbing exists, so
codex failing internally (401, context deadline, model unknown) shows
up as a still-PASS test with one cryptic log line. When this happens
in CI we currently have no way to diagnose without re-running locally.

Surface the run artifacts — stdout.json, response.md, transcript,
result.json, grading.json, stderr.txt — into the test log (each capped
at 8KB) whenever codex exits non-zero. Test still passes; failure logs
now contain enough to tell apart 'codex CLI broke', 'auth rejected',
and 'upstream LLM timed out'.
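
A sketch of that dump, assuming the artifacts sit at known names under the run directory (the helper name and exact log format are illustrative):

```go
package e2e

import (
	"os"
	"path/filepath"
	"testing"
)

// dumpRunArtifacts surfaces each run artifact into the test log on a
// non-zero codex exit, capped at 8KB apiece so a runaway transcript
// cannot flood CI output.
func dumpRunArtifacts(t *testing.T, dir string) {
	t.Helper()
	const capBytes = 8 << 10
	for _, name := range []string{
		"stdout.json", "response.md", "transcript",
		"result.json", "grading.json", "stderr.txt",
	} {
		data, err := os.ReadFile(filepath.Join(dir, name))
		if err != nil {
			t.Logf("%s: unreadable: %v", name, err)
			continue
		}
		if len(data) > capBytes {
			data = data[:capBytes]
		}
		t.Logf("--- %s (%d bytes shown) ---\n%s", name, len(data), data)
	}
}
```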
…d from CI

When a real-LLM e2e test (TestAgent_Codex_NoneRuntime,
TestAgent_QoderCLI_NoneRuntime_FullRun) sees the agent return non-zero
the test still PASSes — its assertions only cover workspace plumbing,
not LLM success. We had no way to inspect the actual codex/qoder
artifacts (stdout.json, response.md, transcript, result.json) after a
flaky CI run because everything lives under t.TempDir().

Add a preserveWorkspaceArtifacts(t, workspaceDir) helper that, when
SKILL_UP_E2E_ARTIFACT_DIR is set, registers a t.Cleanup ordered before
t.TempDir's own cleanup to copy the workspace to a stable path. Hook
the two lenient real-LLM tests and have CI:

  - set SKILL_UP_E2E_ARTIFACT_DIR=$GITHUB_WORKSPACE/e2e-artifacts
  - upload-artifact@v5 the directory on always() so failing or merely
    suspicious runs surface their session output without re-running.
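
A sketch of the helper under those constraints; since t.Cleanup callbacks run last-in-first-out, registering this after t.TempDir is what orders the copy before the directory is removed (the cp -r shortcut is an assumption; the real helper may copy file-by-file):

```go
package e2e

import (
	"os"
	"os/exec"
	"path/filepath"
	"testing"
)

// preserveWorkspaceArtifacts copies the workspace to a stable path when
// SKILL_UP_E2E_ARTIFACT_DIR is set, before t.TempDir deletes it.
func preserveWorkspaceArtifacts(t *testing.T, workspaceDir string) {
	t.Helper()
	dest := os.Getenv("SKILL_UP_E2E_ARTIFACT_DIR")
	if dest == "" {
		return // opt-in: only CI sets the env var
	}
	t.Cleanup(func() {
		target := filepath.Join(dest, t.Name())
		if err := os.MkdirAll(target, 0o755); err != nil {
			t.Logf("preserve artifacts: %v", err)
			return
		}
		if out, err := exec.Command("cp", "-r", workspaceDir+"/.", target).CombinedOutput(); err != nil {
			t.Logf("preserve artifacts: %v: %s", err, out)
		}
	})
}
```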

jwx0925 commented May 14, 2026

@codex review
