
Conversation

Ayush7614
Contributor

Summary

This PR adds a new E2B + Promptfoo guide (e2b-code-eval/) that demonstrates secure execution of LLM-generated Python code inside the E2B sandbox.

Contributor

coderabbitai bot commented Sep 11, 2025

📝 Walkthrough

Walkthrough

  • Adds examples/grok-4-vs-gpt-5/README.md describing a Promptfoo example that benchmarks and red-teams xAI Grok-4 against OpenAI GPT-5 via OpenRouter, including setup, environment variables, commands, and evaluation scope.
  • Adds examples/grok-4-vs-gpt-5/promptfooconfig.yaml configuring two providers, 20 prompts (15 benchmark, 5 red-team), scorers (latency, output length, semantic quality, robustness with grading), and output options.
  • Adds site/docs/guides/evaluate-e2b.md with an end-to-end guide and supporting code: validate_and_run_code_e2b.py, metrics.py, report.py, swe_runner.py, promptfooconfig.yaml, and prompt templates. Introduces functions for code safety checks, sandboxed execution, metrics, reporting, and SWE-style workflows.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "feat(e2b-example): add e2b-code-eval guide (promptfoo + e2b sandbox)" concisely and accurately summarizes the primary change — adding an e2b-code-eval guide integrating Promptfoo with the E2B sandbox — and aligns with the provided raw_summary and PR objectives. It is specific, on-topic, and suitable for commit history.
Description Check ✅ Passed The PR description states that it adds a new E2B + Promptfoo guide at e2b-code-eval/ demonstrating secure execution of LLM-generated Python code in the E2B sandbox, which directly matches the changes summarized in the raw_summary and the PR objectives. The description is brief but clearly related to the changeset.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (16)
examples/grok-4-vs-gpt-5/promptfooconfig.yaml (3)

16-100: Update prompt text to match chosen models

Several tasks reference “GPT-5”. If switching to o3-mini/4o-mini, align task names/questions accordingly to avoid confusion in outputs and scorers.

Example edits:

-    Compare Grok-4 and GPT-5 in 3 bullet points based on speed, accuracy, and reasoning depth.
+    Compare Grok-4 and OpenAI o3-mini in 3 bullet points based on speed, accuracy, and reasoning depth.

112-121: Use a stable, cost-effective judge model

If keeping model-graded scoring, pick a verified judge (e.g., openai:gpt-4o-mini or anthropic:claude-3-7-sonnet-20250219). Avoid unverified “gpt-5”.

Apply:

-    provider: openrouter:openai/gpt-5
+    provider: openai:gpt-4o-mini

1-127: Optional: Add a minimal tests block for basic sanity metrics

Even if you keep the single mega-prompt, adding a trivial test ensures consistent runs and enables future references.

Example:

+tests:
+  - vars: {}
examples/grok-4-vs-gpt-5/README.md (3)

1-1: PR title must use the docs: prefix for changes that only touch site/ or examples/

Project rule: PRs touching only site/ or examples/ should be titled with docs:. Current: “feat(e2b-example)”.

Want me to retitle to “docs(e2b-example): add e2b-code-eval guide (promptfoo + e2b sandbox)”?


18-26: Document env vars with purpose and acquisition; include .env sample

Add brief explanations and a .env example to satisfy example README requirements.

Proposed insertion after prerequisites:

+### Environment variables
+- OPENROUTER_API_KEY: OpenRouter access token used to call both models via OpenRouter.
+  - Get one from your OpenRouter account dashboard.
+
+Example .env:
+```env
+OPENROUTER_API_KEY=sk-or-xxxxxxxx
+```

16-17: Optional: Note pricing/usage limits for OpenRouter

Briefly mention rate limits/costs and link to provider docs.

site/docs/guides/evaluate-e2b.md (10)

57-75: Headings, not emphasis, for subsections; remove trailing punctuation in headings

Fix markdownlint MD036/MD026.

-**Python installed**
+### Python installed

-**Node.js + npm installed**
+### Node.js + npm installed

794-799: Avoid bare URLs; make it a link

Fix MD034.

-This launches a local server (for example, at http://localhost:15500) and asks:
+This launches a local server (for example, at [http://localhost:15500](http://localhost:15500)) and asks:

801-801: Remove trailing colon in heading

Fix MD026.

-## What you see in the Promptfoo Web Viewer:
+## What you see in the Promptfoo Web Viewer

904-910: Remove trailing colons in headings

Fix MD026.

-### Option 1 (CLI):
+### Option 1 (CLI)
-### Option 2 (Web UI):
+### Option 2 (Web UI)

696-717: These are prompt instructions; use text, not bash, and avoid nested triple-backtick breakage

Switch the block to text and use a quadruple-backtick outer fence so the inner python block renders safely.

-```bash
+````text
 You are a careful Python engineer. Produce only the Python function definition (no explanation) wrapped in triple backticks.
 
 Requirements:
 - Use the exact function name: {{function_name}}
 ...
 ```python
 def sum_integers(nums):
     return sum(nums)

-```
+````


722-726: Not bash; use text for prompt template

-```bash
+```text
 You are a Python test author. Given the function below, write 3 short `assert` tests (not using pytest) that exercise edge cases and typical cases. Return only the tests (no explanations). Function:

 {{function_code}}

737-751: Not bash; use text for prompt template

-```bash
+```text
 You are a careful engineer. Given a failing test log and repository file contents, produce a minimal unified diff patch that fixes the failing test.
 ...
 {{relevant_files}}

2-5: Front matter description: prefer “eval” phrasing

Minor consistency tweak with style guide.

-description: Hands-on tutorial (September 2025) on evaluating LLM code generation using Promptfoo with the e2b sandbox—includes setup, YAML tests, execution in a secure environment, and metrics reporting.
+description: Hands-on tutorial (September 2025) on LLM codegen eval using Promptfoo with the e2b sandbox—includes setup, YAML tests, secure execution, and metrics reporting.

765-771: Update narrative to current model name

The text says GPT-4.1; after changing the YAML to o3-mini, align the prose.

-- It calls OpenAI GPT-4.1 (or your chosen provider) with the problem prompts from code_generation_prompt_fs.txt.
+- It calls OpenAI o3-mini (or your chosen provider) with the problem prompts from code_generation_prompt_fs.txt.

939-961: Add a See Also section (required)

Docs pages should end sections with See Also/Related Concepts.

-## Conclusion
+## Conclusion
 ...
 - Automate daily red team runs with GitHub Actions.
+
+## See Also
+- Promptfoo eval CLI: https://www.promptfoo.dev/docs/usage/command-line/
+- Promptfoo red team: https://www.promptfoo.dev/docs/red-team/overview/
+- Providers and models: https://www.promptfoo.dev/docs/providers/overview/
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc99240 and bfcc52a.

📒 Files selected for processing (3)
  • examples/grok-4-vs-gpt-5/README.md (1 hunks)
  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml (1 hunks)
  • site/docs/guides/evaluate-e2b.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
examples/*/promptfooconfig.yaml

📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)

examples/*/promptfooconfig.yaml: Include a working promptfooconfig.yaml (or equivalent) file in each example directory
Always include the YAML schema reference at the top of configuration files: '# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json'
Follow the specified field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests
Ensure all configuration files pass YAML lint validation
When referencing external files in configuration, always use the 'file://' prefix
Always use the latest model versions available in 2025 in configuration files
For OpenAI, prefer models like 'openai:o3-mini' and 'openai:gpt-4o-mini' in configuration files
For Anthropic, prefer models like 'anthropic:claude-3-7-sonnet-20250219' in configuration files
For open-source models, use the latest versions available (e.g., latest Llama) in configuration files
Include a mix of providers when comparing model performance in configuration files
When demonstrating specialized capabilities (vision, audio, etc.), use models that support those features in configuration files
Format configuration files consistently
When creating examples for specific providers, always use the latest available model versions for that provider in configuration files
Update model versions when new ones become available in configuration files

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
examples/*/{README.md,promptfooconfig.yaml}

📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)

Include placeholder values for secrets/credentials in the README or configuration files

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
  • examples/grok-4-vs-gpt-5/README.md
{site/**,examples/**}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
  • examples/grok-4-vs-gpt-5/README.md
  • site/docs/guides/evaluate-e2b.md
examples/**

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

When modifying examples, update existing files in 'examples/' instead of adding new ones (e.g., replace outdated model IDs rather than introducing new example files)

Place examples under examples/

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
  • examples/grok-4-vs-gpt-5/README.md
examples/*/README.md

📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)

examples/*/README.md: The README.md must begin with the folder name as an H1 heading
Every example README must include instructions on how to run it with 'npx promptfoo@latest init --example example-name'
Include a comprehensive README.md that explains the purpose, prerequisites, instructions, and expected outputs for the example
Document any model-specific capabilities or limitations in examples
Clearly list all required environment variables at the beginning of the README
For each environment variable, explain its purpose, how to obtain it, and any default values or constraints in the README
Include a sample .env file or instructions when multiple environment variables are needed in the README
Document any required API keys or credentials in the README
Provide instructions for cleaning up resources after running the example in the README
When creating examples for specific providers, explain any provider-specific configuration in the README
When creating examples for specific providers, document required environment variables in the README
When creating examples for specific providers, include information about pricing or usage limits in the README
When creating examples for specific providers, highlight unique features or capabilities in the README
When creating examples for specific providers, compare to similar providers where appropriate in the README

Files:

  • examples/grok-4-vs-gpt-5/README.md
examples/**/README.md

📄 CodeRabbit inference engine (CLAUDE.md)

Each example should include a clear README.md

Files:

  • examples/grok-4-vs-gpt-5/README.md
site/docs/**/*.md

📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)

site/docs/**/*.md: Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...

Files:

  • site/docs/guides/evaluate-e2b.md
site/**

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

If the change is a feature, update the relevant documentation under 'site/'

Files:

  • site/docs/guides/evaluate-e2b.md
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a mix of providers when comparing model performance in configuration files
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a mix of providers when comparing model performance in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Include a mix of providers when comparing model performance in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a working promptfooconfig.yaml (or equivalent) file in each example

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Include a working promptfooconfig.yaml (or equivalent) file in each example directory

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Format configuration files consistently

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Format configuration files consistently

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Follow the specific field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : When demonstrating specialized capabilities (vision, audio, etc.), use models that support those features in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Update model versions when new ones become available in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
🪛 markdownlint-cli2 (0.17.2)
site/docs/guides/evaluate-e2b.md

57-57: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


64-64: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


794-794: Bare URL used

(MD034, no-bare-urls)


801-801: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


904-904: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


910-910: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: Test on Node 22.x and windows-latest
  • GitHub Check: Test on Node 24.x and windows-latest
  • GitHub Check: Build on Node 24.x
  • GitHub Check: Build on Node 22.x
  • GitHub Check: Build on Node 20.x
  • GitHub Check: Run Integration Tests
  • GitHub Check: Test on Node 24.x and macOS-latest
  • GitHub Check: Redteam
  • GitHub Check: Test on Node 24.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and windows-latest
  • GitHub Check: Test on Node 20.x and macOS-latest
  • GitHub Check: Test on Node 22.x and macOS-latest
  • GitHub Check: Generate Assets
  • GitHub Check: Build Docs
  • GitHub Check: webui tests
  • GitHub Check: Test on Node 20.x and ubuntu-latest
  • GitHub Check: Style Check
🔇 Additional comments (1)
examples/grok-4-vs-gpt-5/promptfooconfig.yaml (1)

11-15: openrouter:openai/gpt-5 is valid — don't replace it unless you mean to switch providers

  • openrouter:openai/gpt-5 is a valid OpenRouter model slug (as of Sep 11, 2025); keep it if you intend to route via OpenRouter.
  • If you want Promptfoo's OpenAI provider instead, use openai:o3-mini (or openai:responses:o3-mini) or openai:chat:gpt-4.1-mini — gpt-4o-mini was replaced by GPT‑4.1‑mini on May 14, 2025.
    File: examples/grok-4-vs-gpt-5/promptfooconfig.yaml lines 11–15 — no change required unless switching providers.

Likely an incorrect or invalid review comment.

Comment on lines 5 to 16
providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000

  - id: openrouter:openai/gpt-5
    config:
      temperature: 0.7
      max_tokens: 4000

prompts:

🛠️ Refactor suggestion

Reorder fields to match required schema order

Required: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests. Prompts must come before providers.

Apply:

 description: 'Ultimate Promptfoo Benchmark + Red Teaming: Grok-4 vs GPT-5'

-providers:
-  - id: openrouter:x-ai/grok-4
-    config:
-      temperature: 0.7
-      max_tokens: 4000
-
-  - id: openrouter:openai/gpt-5
-    config:
-      temperature: 0.7
-      max_tokens: 4000
-
 prompts:
   - |
     You are an advanced AI assistant. Complete the following 20 tasks carefully.
     ...
+
+providers:
+  - id: openrouter:x-ai/grok-4
+    config:
+      temperature: 0.7
+      max_tokens: 4000
+
+  - id: openai:o3-mini
+    config:
+      temperature: 0.7
+      max_tokens: 4000
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000
  - id: openrouter:openai/gpt-5
    config:
      temperature: 0.7
      max_tokens: 4000
prompts:

description: 'Ultimate Promptfoo Benchmark + Red Teaming: Grok-4 vs GPT-5'

prompts:
  - |
    You are an advanced AI assistant. Complete the following 20 tasks carefully.
    ...

providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000
  - id: openai:o3-mini
    config:
      temperature: 0.7
      max_tokens: 4000
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/promptfooconfig.yaml around lines 5 to 16, the YAML
fields are out of the required schema order (description, env, prompts,
providers, defaultTest, scenarios, tests); specifically prompts must appear
before providers. Update the file to place a description (if missing) and/or env
at the top as needed, move the prompts section so it appears before the
providers block, then keep providers after prompts and add empty placeholders
for defaultTest, scenarios, and tests if your schema requires them to exist (or
omit optional ones), ensuring the final field order matches the required schema
exactly.

Comment on lines 101 to 111
scorers:
  - name: 'Latency'
    metric: latency

  - name: 'Output Length'
    metric: output_length

  - name: 'Semantic Quality'
    metric: semantic_similarity
    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'


🛠️ Refactor suggestion

Semantic similarity scorer is misconfigured (no per-task references)

Providing a generic sentence as “reference” yields meaningless scores across a 20-task mega-prompt. Either add proper tests with per-task references or drop this scorer.

Apply:

-  - name: 'Semantic Quality'
-    metric: semantic_similarity
-    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'

Or add a tests section with references per task and keep the scorer.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
scorers:
  - name: 'Latency'
    metric: latency
  - name: 'Output Length'
    metric: output_length
  - name: 'Semantic Quality'
    metric: semantic_similarity
    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'

scorers:
  - name: 'Latency'
    metric: latency
  - name: 'Output Length'
    metric: output_length
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/promptfooconfig.yaml around lines 101-111, the
Semantic Quality scorer is using a single generic “reference” sentence which
produces meaningless scores across the multi-task prompt; either remove the
scorer entry entirely or replace the single reference with a proper tests
section containing per-task reference outputs (one reference per task id/name)
so semantic_similarity can compare each task against its correct reference;
implement by adding a tests: mapping listing each task identifier to its
expected reference string (or delete the Semantic Quality scorer block if you
opt to drop it).

Comment on lines 30 to 32
```bash
promptfoo eval
```

🛠️ Refactor suggestion

Use npx for commands to avoid requiring global installs

Prefer npx promptfoo eval per guidelines.

Apply:

-```bash
-promptfoo eval
-```
+```bash
+npx promptfoo eval
+```
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/README.md around lines 30 to 32, the example command
uses "promptfoo eval" which assumes a global install; update the fenced bash
code block to show "npx promptfoo eval" instead and ensure the code fence markup
stays as a proper ```bash block containing the single line "npx promptfoo eval".

Comment on lines 36 to 38
```bash
promptfoo view
```

🛠️ Refactor suggestion

Use npx for viewer command

Align with CLI guidance.

Apply:

-```bash
-promptfoo view
-```
+```bash
+npx promptfoo view
+```
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/README.md around lines 36 to 38, the README shows
the viewer command as "promptfoo view" which should be invoked via npx to align
with the CLI guidance; update the fenced bash block to use "npx promptfoo view"
instead of "promptfoo view" so the example runs without requiring a global
install.

Comment on lines 20 to 27
With **Promptfoo**, you can integrate e2b into your evaluation pipeline:
- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).

By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.

🛠️ Refactor suggestion

Use “eval” instead of “evaluation” per docs style

Replace “evaluation pipeline” with “eval pipeline”.

-With **Promptfoo**, you can integrate e2b into your evaluation pipeline:  
+With **Promptfoo**, you can integrate e2b into your eval pipeline:  
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
With **Promptfoo**, you can integrate e2b into your evaluation pipeline:
- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).
By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.
With **Promptfoo**, you can integrate e2b into your eval pipeline:
- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).
By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 20 to 26, replace the phrase
"evaluation pipeline" with "eval pipeline" to conform to docs style; update the
sentence so it reads "With Promptfoo, you can integrate e2b into your eval
pipeline:" and ensure any nearby occurrences in the same paragraph (e.g.,
"evaluation reports") are updated to "eval reports" for consistency.

Comment on lines +339 to +368
def _run_code_in_sandbox(sbx, code: str):
    # Run code inside the E2B sandbox with multiple fallbacks

    # Preferred: explicit args with resource limits and no network
    try:
        return sbx.run_code(code=code, language="python", limits={"cputime": 1, "wall_time": 5, "memory": 128}, allow_network=False)
    except TypeError:
        pass
    except Exception:
        pass

    # Alternate signature
    try:
        return sbx.run_code(code, "python", {"cputime": 1, "wall_time": 5, "memory": 128})
    except Exception:
        pass

    # Last resort: just run code with defaults
    return sbx.run_code(code)


💡 Verification agent

🧩 Analysis chain

Verify e2b SDK signatures used (Sandbox.create, run_code args)

SDKs change; ensure these calls match latest e2b package.


🌐 Web query:

What are the current (Sept 2025) Python SDK method signatures for e2b_code_interpreter.Sandbox.create() and Sandbox.run_code()? Are limits and allow_network accepted kwargs?

💡 Result:

  • Sandbox.create (sync) — signature in docs (v2.0.2):
    def create(cls, template: Optional[str] = None, timeout: Optional[int] = None, metadata: Optional[Dict[str, str]] = None, envs: Optional[Dict[str, str]] = None, secure: bool = True, allow_internet_access: bool = True, **opts: Unpack[ApiParams]) -> Self. [1]

  • Sandbox.run_code (Python SDK docs examples / overloads across 1.x → 2.x):
    def run_code(code: str, language: Optional[Literal["python"]] = None, on_stdout: Optional[OutputHandler[OutputMessage]] = None, on_stderr: Optional[OutputHandler[OutputMessage]] = None, on_result: Optional[OutputHandler[Result]] = None, on_error: Optional[OutputHandler[ExecutionError]] = None, envs: Optional[Dict[str, str]] = None, timeout: Optional[float] = None, request_timeout: Optional[float] = None) -> Execution. [2][3]

  • Are limits and allow_network accepted kwargs?
    No — the documented create() accepts allow_internet_access (not allow_network) and does not document a limits kwarg. run_code does not list allow_network or limits. If you need a different option name or resource limits, use allow_internet_access when creating the sandbox or check the API params (**opts) for provider-specific options. [1][2]

Sources:
[1] e2b docs — Python SDK sandbox (sync) v2.0.2 (create signature).
[2] e2b docs — code-interpreter Python SDK sandbox (run_code overload example, v1.x/v1.5.x).
[3] e2b GitHub / README and SDK examples.


Fix e2b SDK call signatures — remove unsupported limits / allow_network

File: site/docs/guides/evaluate-e2b.md Lines: 339–358

  • Disable network at sandbox creation: use Sandbox.create(..., allow_internet_access=False, **opts) (run_code does not accept allow_network).
  • Remove the limits kwarg from sbx.run_code — run_code supports handlers, envs, timeout/request_timeout but not CPU/memory limits; set CPU/memory limits in the sandbox template or via provider-specific **opts when creating the sandbox.
  • Drop the fallback that passes a dict as a third positional arg (sbx.run_code(code, "python", { ... })) — that will be treated as a handler arg, not options.
  • Update calls to use the documented run_code signature, e.g. sbx.run_code(code, language="python", timeout=5) and move resource/network options to Sandbox.create as needed.
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 339–358, update the sandbox run
calls to match the e2b SDK: remove the unsupported limits and allow_network
kwargs from sbx.run_code, drop the fallback that passes a dict as a third
positional arg, and instead call sbx.run_code with explicit language and
timeout/request_timeout (e.g., language="python", timeout=5). Move network and
resource configuration to sandbox creation (use Sandbox.create(...,
allow_internet_access=False, **provider_opts) and set CPU/memory limits in the
sandbox template or provider-specific opts). Ensure no calls to sbx.run_code
include limits or allow_network and that the final fallback uses
sbx.run_code(code, language="python", timeout=5) only.
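For illustration, a minimal sketch of the corrected call pattern, assuming the v2.x `e2b_code_interpreter` sync SDK signatures quoted above (network access is configured on the sandbox, and `run_code` takes a `timeout` rather than resource limits); verify the exact kwargs against the installed SDK version:

```python
from e2b_code_interpreter import Sandbox

code = "print(sum([1, 2, 3]))"  # hypothetical generated snippet for demonstration

# Disable internet at sandbox creation; CPU/memory limits belong to the sandbox
# template or provider-specific opts, not to run_code (per the signatures quoted above).
sbx = Sandbox.create(allow_internet_access=False)
try:
    # run_code accepts language and timeout/request_timeout, not limits/allow_network.
    execution = sbx.run_code(code, language="python", timeout=5)
    # .logs and kill() are assumed from the e2b code-interpreter SDK surface.
    print(execution.logs)
finally:
    sbx.kill()
```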

Comment on lines +651 to +671
```bash
# The prompt file that contains instructions for the LLM to generate code
prompts: file://code_generation_prompt_fs.txt

# The LLM provider used for code generation (here: OpenAI GPT-4.1)
providers:
  - openai:gpt-4.1
```


🛠️ Refactor suggestion

Use YAML and update model to preferred 2025 ID

The code block is YAML (not bash). Also update OpenAI model per guidelines.

-```bash
+```yaml
 # The prompt file that contains instructions for the LLM to generate code
 prompts: file://code_generation_prompt_fs.txt

 # The LLM provider used for code generation (here: OpenAI GPT-4.1)
 providers:
-  - openai:gpt-4.1
+  - openai:o3-mini
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 651 to 658, the fenced code
block is incorrectly labeled as bash and uses an outdated model id; change the
code block language to yaml, update the provider entry to use the preferred 2025
OpenAI model id (openai:o3-mini), and correct the providers list
indentation/format so it is valid YAML (single providers sequence item with the
key/value pair). Ensure the fenced block starts with ```yaml and the providers
section reads a single list item with "- openai:o3-mini".

@Ayush7614
Contributor Author

cc: @typpo @mldangelo

@mldangelo mldangelo assigned addelong and unassigned addelong Sep 11, 2025
@mldangelo mldangelo requested a review from addelong September 11, 2025 14:20