
Conversation

Ayush7614
Contributor

Summary

This PR adds a new E2B + Promptfoo guide (e2b-code-eval/) that demonstrates secure execution of LLM-generated Python code inside the E2B sandbox.

Contributor

coderabbitai bot commented Sep 11, 2025

📝 Walkthrough

Walkthrough

  • Adds examples/grok-4-vs-gpt-5/README.md describing a Promptfoo example that benchmarks and red-teams xAI Grok-4 against OpenAI GPT-5 via OpenRouter, including setup, environment variables, commands, and evaluation scope.
  • Adds examples/grok-4-vs-gpt-5/promptfooconfig.yaml configuring two providers, 20 prompts (15 benchmark, 5 red-team), scorers (latency, output length, semantic quality, robustness with grading), and output options.
  • Adds site/docs/guides/evaluate-e2b.md with an end-to-end guide and supporting code: validate_and_run_code_e2b.py, metrics.py, report.py, swe_runner.py, promptfooconfig.yaml, and prompt templates. Introduces functions for code safety checks, sandboxed execution, metrics, reporting, and SWE-style workflows.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "feat(e2b-example): add e2b-code-eval guide (promptfoo + e2b sandbox)" concisely and accurately summarizes the primary change — adding an e2b-code-eval guide integrating Promptfoo with the E2B sandbox — and aligns with the provided raw_summary and PR objectives. It is specific, on-topic, and suitable for commit history.
Description Check ✅ Passed The PR description states that it adds a new E2B + Promptfoo guide at e2b-code-eval/ demonstrating secure execution of LLM-generated Python code in the E2B sandbox, which directly matches the changes summarized in the raw_summary and the PR objectives. The description is brief but clearly related to the changeset.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (16)
examples/grok-4-vs-gpt-5/promptfooconfig.yaml (3)

16-100: Update prompt text to match chosen models

Several tasks reference “GPT-5”. If switching to o3-mini/4o-mini, align task names/questions accordingly to avoid confusion in outputs and scorers.

Example edits:

-    Compare Grok-4 and GPT-5 in 3 bullet points based on speed, accuracy, and reasoning depth.
+    Compare Grok-4 and OpenAI o3-mini in 3 bullet points based on speed, accuracy, and reasoning depth.

112-121: Use a stable, cost-effective judge model

If keeping model-graded scoring, pick a verified judge (e.g., openai:gpt-4o-mini or anthropic:claude-3-7-sonnet-20250219). Avoid unverified “gpt-5”.

Apply:

-    provider: openrouter:openai/gpt-5
+    provider: openai:gpt-4o-mini

1-127: Optional: Add a minimal tests block for basic sanity metrics

Even if you keep the single mega-prompt, adding a trivial test ensures consistent runs and enables future references.

Example:

+tests:
+  - vars: {}
examples/grok-4-vs-gpt-5/README.md (3)

1-1: PR title must use the docs: prefix for changes that only touch site/ or examples/

Project rule: PRs touching only site/ or examples/ should be titled with docs:. Current: “feat(e2b-example)”.

Want me to retitle to “docs(e2b-example): add e2b-code-eval guide (promptfoo + e2b sandbox)”?


18-26: Document env vars with purpose and acquisition; include .env sample

Add brief explanations and a .env example to satisfy example README requirements.

Proposed insertion after prerequisites:

+### Environment variables
+- OPENROUTER_API_KEY: OpenRouter access token used to call both models via OpenRouter.
+  - Get one from your OpenRouter account dashboard.
+
+Example .env:
+```env
+OPENROUTER_API_KEY=sk-or-xxxxxxxx
+```

16-17: Optional: Note pricing/usage limits for OpenRouter

Briefly mention rate limits/costs and link to provider docs.

site/docs/guides/evaluate-e2b.md (10)

57-75: Headings, not emphasis, for subsections; remove trailing punctuation in headings

Fix markdownlint MD036/MD026.

-**Python installed**
+### Python installed

-**Node.js + npm installed**
+### Node.js + npm installed

794-799: Avoid bare URLs; make it a link

Fix MD034.

-This launches a local server (for example, at http://localhost:15500) and asks:
+This launches a local server (for example, at [http://localhost:15500](http://localhost:15500)) and asks:

801-801: Remove trailing colon in heading

Fix MD026.

-## What you see in the Promptfoo Web Viewer:
+## What you see in the Promptfoo Web Viewer

904-910: Remove trailing colons in headings

Fix MD026.

-### Option 1 (CLI):
+### Option 1 (CLI)
-### Option 2 (Web UI):
+### Option 2 (Web UI)

696-717: These are prompt instructions; use text, not bash, and avoid nested triple-backtick breakage

Switch the block to text and use a quadruple-backtick outer fence so the inner python block renders safely.

-```bash
+````text
 You are a careful Python engineer. Produce only the Python function definition (no explanation) wrapped in triple backticks.
 
 Requirements:
 - Use the exact function name: {{function_name}}
 ...
 ```python
 def sum_integers(nums):
     return sum(nums)

-```
+````


722-726: Not bash; use text for prompt template

-```bash
+```text
 You are a Python test author. Given the function below, write 3 short `assert` tests (not using pytest) that exercise edge cases and typical cases. Return only the tests (no explanations). Function:

 {{function_code}}

737-751: Not bash; use text for prompt template

-```bash
+```text
 You are a careful engineer. Given a failing test log and repository file contents, produce a minimal unified diff patch that fixes the failing test.
 ...
 {{relevant_files}}

2-5: Front matter description: prefer “eval” phrasing

Minor consistency tweak with style guide.

-description: Hands-on tutorial (September 2025) on evaluating LLM code generation using Promptfoo with the e2b sandbox—includes setup, YAML tests, execution in a secure environment, and metrics reporting.
+description: Hands-on tutorial (September 2025) on LLM codegen eval using Promptfoo with the e2b sandbox—includes setup, YAML tests, secure execution, and metrics reporting.

765-771: Update narrative to current model name

The text says GPT-4.1; after changing the YAML to o3-mini, align the prose.

-- It calls OpenAI GPT-4.1 (or your chosen provider) with the problem prompts from code_generation_prompt_fs.txt.
+- It calls OpenAI o3-mini (or your chosen provider) with the problem prompts from code_generation_prompt_fs.txt.

939-961: Add a See Also section (required)

Docs pages should end sections with See Also/Related Concepts.

-## Conclusion
+## Conclusion
 ...
 - Automate daily red team runs with GitHub Actions.
+
+## See Also
+- Promptfoo eval CLI: https://www.promptfoo.dev/docs/usage/command-line/
+- Promptfoo red team: https://www.promptfoo.dev/docs/red-team/overview/
+- Providers and models: https://www.promptfoo.dev/docs/providers/overview/
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc99240 and bfcc52a.

📒 Files selected for processing (3)
  • examples/grok-4-vs-gpt-5/README.md (1 hunks)
  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml (1 hunks)
  • site/docs/guides/evaluate-e2b.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
examples/*/promptfooconfig.yaml

📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)

examples/*/promptfooconfig.yaml: Include a working promptfooconfig.yaml (or equivalent) file in each example directory
Always include the YAML schema reference at the top of configuration files: '# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json'
Follow the specified field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests
Ensure all configuration files pass YAML lint validation
When referencing external files in configuration, always use the 'file://' prefix
Always use the latest model versions available in 2025 in configuration files
For OpenAI, prefer models like 'openai:o3-mini' and 'openai:gpt-4o-mini' in configuration files
For Anthropic, prefer models like 'anthropic:claude-3-7-sonnet-20250219' in configuration files
For open-source models, use the latest versions available (e.g., latest Llama) in configuration files
Include a mix of providers when comparing model performance in configuration files
When demonstrating specialized capabilities (vision, audio, etc.), use models that support those features in configuration files
Format configuration files consistently
When creating examples for specific providers, always use the latest available model versions for that provider in configuration files
Update model versions when new ones become available in configuration files

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
examples/*/{README.md,promptfooconfig.yaml}

📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)

Include placeholder values for secrets/credentials in the README or configuration files

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
  • examples/grok-4-vs-gpt-5/README.md
{site/**,examples/**}

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
  • examples/grok-4-vs-gpt-5/README.md
  • site/docs/guides/evaluate-e2b.md
examples/**

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

When modifying examples, update existing files in 'examples/' instead of adding new ones (e.g., replace outdated model IDs rather than introducing new example files)

Place examples under examples/

Files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
  • examples/grok-4-vs-gpt-5/README.md
examples/*/README.md

📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)

examples/*/README.md: The README.md must begin with the folder name as an H1 heading
Every example README must include instructions on how to run it with 'npx promptfoo@latest init --example example-name'
Include a comprehensive README.md that explains the purpose, prerequisites, instructions, and expected outputs for the example
Document any model-specific capabilities or limitations in examples
Clearly list all required environment variables at the beginning of the README
For each environment variable, explain its purpose, how to obtain it, and any default values or constraints in the README
Include a sample .env file or instructions when multiple environment variables are needed in the README
Document any required API keys or credentials in the README
Provide instructions for cleaning up resources after running the example in the README
When creating examples for specific providers, explain any provider-specific configuration in the README
When creating examples for specific providers, document required environment variables in the README
When creating examples for specific providers, include information about pricing or usage limits in the README
When creating examples for specific providers, highlight unique features or capabilities in the README
When creating examples for specific providers, compare to similar providers where appropriate in the README

Files:

  • examples/grok-4-vs-gpt-5/README.md
examples/**/README.md

📄 CodeRabbit inference engine (CLAUDE.md)

Each example should include a clear README.md

Files:

  • examples/grok-4-vs-gpt-5/README.md
site/docs/**/*.md

📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)

site/docs/**/*.md: Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...

Files:

  • site/docs/guides/evaluate-e2b.md
site/**

📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)

If the change is a feature, update the relevant documentation under 'site/'

Files:

  • site/docs/guides/evaluate-e2b.md
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a mix of providers when comparing model performance in configuration files
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a mix of providers when comparing model performance in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Include a mix of providers when comparing model performance in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a working promptfooconfig.yaml (or equivalent) file in each example

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Include a working promptfooconfig.yaml (or equivalent) file in each example directory

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Format configuration files consistently

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Format configuration files consistently

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Follow the specific field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : When demonstrating specialized capabilities (vision, audio, etc.), use models that support those features in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Update model versions when new ones become available in configuration files

Applied to files:

  • examples/grok-4-vs-gpt-5/promptfooconfig.yaml
🪛 markdownlint-cli2 (0.17.2)
site/docs/guides/evaluate-e2b.md

57-57: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


64-64: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


794-794: Bare URL used

(MD034, no-bare-urls)


801-801: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


904-904: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


910-910: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: Test on Node 22.x and windows-latest
  • GitHub Check: Test on Node 24.x and windows-latest
  • GitHub Check: Build on Node 24.x
  • GitHub Check: Build on Node 22.x
  • GitHub Check: Build on Node 20.x
  • GitHub Check: Run Integration Tests
  • GitHub Check: Test on Node 24.x and macOS-latest
  • GitHub Check: Redteam
  • GitHub Check: Test on Node 24.x and ubuntu-latest
  • GitHub Check: Test on Node 20.x and windows-latest
  • GitHub Check: Test on Node 20.x and macOS-latest
  • GitHub Check: Test on Node 22.x and macOS-latest
  • GitHub Check: Generate Assets
  • GitHub Check: Build Docs
  • GitHub Check: webui tests
  • GitHub Check: Test on Node 20.x and ubuntu-latest
  • GitHub Check: Style Check
🔇 Additional comments (1)
examples/grok-4-vs-gpt-5/promptfooconfig.yaml (1)

11-15: openrouter:openai/gpt-5 is valid — don't replace it unless you mean to switch providers

  • openrouter:openai/gpt-5 is a valid OpenRouter model slug (as of Sep 11, 2025); keep it if you intend to route via OpenRouter.
  • If you want Promptfoo's OpenAI provider instead, use openai:o3-mini (or openai:responses:o3-mini) or openai:chat:gpt-4.1-mini — gpt-4o-mini was replaced by GPT‑4.1‑mini on May 14, 2025.
    File: examples/grok-4-vs-gpt-5/promptfooconfig.yaml lines 11–15 — no change required unless switching providers.

Likely an incorrect or invalid review comment.

Comment on lines 5 to 16
providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000

  - id: openrouter:openai/gpt-5
    config:
      temperature: 0.7
      max_tokens: 4000

prompts:

🛠️ Refactor suggestion

Reorder fields to match required schema order

Required: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests. Prompts must come before providers.

Apply:

 description: 'Ultimate Promptfoo Benchmark + Red Teaming: Grok-4 vs GPT-5'

-providers:
-  - id: openrouter:x-ai/grok-4
-    config:
-      temperature: 0.7
-      max_tokens: 4000
-
-  - id: openrouter:openai/gpt-5
-    config:
-      temperature: 0.7
-      max_tokens: 4000
-
 prompts:
   - |
     You are an advanced AI assistant. Complete the following 20 tasks carefully.
     ...
+
+providers:
+  - id: openrouter:x-ai/grok-4
+    config:
+      temperature: 0.7
+      max_tokens: 4000
+
+  - id: openai:o3-mini
+    config:
+      temperature: 0.7
+      max_tokens: 4000
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000
  - id: openrouter:openai/gpt-5
    config:
      temperature: 0.7
      max_tokens: 4000
prompts:

description: 'Ultimate Promptfoo Benchmark + Red Teaming: Grok-4 vs GPT-5'

prompts:
  - |
    You are an advanced AI assistant. Complete the following 20 tasks carefully.
    ...

providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000
  - id: openai:o3-mini
    config:
      temperature: 0.7
      max_tokens: 4000
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/promptfooconfig.yaml around lines 5 to 16, the YAML
fields are out of the required schema order (description, env, prompts,
providers, defaultTest, scenarios, tests); specifically prompts must appear
before providers. Update the file to place a description (if missing) and/or env
at the top as needed, move the prompts section so it appears before the
providers block, then keep providers after prompts and add empty placeholders
for defaultTest, scenarios, and tests if your schema requires them to exist (or
omit optional ones), ensuring the final field order matches the required schema
exactly.

Comment on lines 101 to 111
scorers:
  - name: 'Latency'
    metric: latency

  - name: 'Output Length'
    metric: output_length

  - name: 'Semantic Quality'
    metric: semantic_similarity
    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'


🛠️ Refactor suggestion

Semantic similarity scorer is misconfigured (no per-task references)

Providing a generic sentence as “reference” yields meaningless scores across a 20-task mega-prompt. Either add proper tests with per-task references or drop this scorer.

Apply:

-  - name: 'Semantic Quality'
-    metric: semantic_similarity
-    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'

Or add a tests section with references per task and keep the scorer.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
scorers:
  - name: 'Latency'
    metric: latency
  - name: 'Output Length'
    metric: output_length
  - name: 'Semantic Quality'
    metric: semantic_similarity
    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'

scorers:
  - name: 'Latency'
    metric: latency
  - name: 'Output Length'
    metric: output_length
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/promptfooconfig.yaml around lines 101-111, the
Semantic Quality scorer is using a single generic “reference” sentence which
produces meaningless scores across the multi-task prompt; either remove the
scorer entry entirely or replace the single reference with a proper tests
section containing per-task reference outputs (one reference per task id/name)
so semantic_similarity can compare each task against its correct reference;
implement by adding a tests: mapping listing each task identifier to its
expected reference string (or delete the Semantic Quality scorer block if you
opt to drop it).

Comment on lines 30 to 32
```bash
promptfoo eval
```

🛠️ Refactor suggestion

Use npx for commands to avoid requiring global installs

Prefer npx promptfoo eval per guidelines.

Apply:

-```bash
-promptfoo eval
-```
+```bash
+npx promptfoo eval
+```
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/README.md around lines 30 to 32, the example command
uses "promptfoo eval" which assumes a global install; update the fenced bash
code block to show "npx promptfoo eval" instead and ensure the code fence markup
stays as a proper ```bash block containing the single line "npx promptfoo eval".

Comment on lines 36 to 38
```bash
promptfoo view
```

🛠️ Refactor suggestion

Use npx for viewer command

Align with CLI guidance.

Apply:

-```bash
-promptfoo view
-```
+```bash
+npx promptfoo view
+```
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/README.md around lines 36 to 38, the README shows
the viewer command as "promptfoo view" which should be invoked via npx to align
with the CLI guidance; update the fenced bash block to use "npx promptfoo view"
instead of "promptfoo view" so the example runs without requiring a global
install.

Comment on lines 20 to 27
With **Promptfoo**, you can integrate e2b into your evaluation pipeline:
- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).

By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.

🛠️ Refactor suggestion

Use “eval” instead of “evaluation” per docs style

Replace “evaluation pipeline” with “eval pipeline”.

-With **Promptfoo**, you can integrate e2b into your evaluation pipeline:  
+With **Promptfoo**, you can integrate e2b into your eval pipeline:  
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
With **Promptfoo**, you can integrate e2b into your evaluation pipeline:
- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).
By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.
With **Promptfoo**, you can integrate e2b into your eval pipeline:
- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).
By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 20 to 26, replace the phrase
"evaluation pipeline" with "eval pipeline" to conform to docs style; update the
sentence so it reads "With Promptfoo, you can integrate e2b into your eval
pipeline:" and ensure any nearby occurrences in the same paragraph (e.g.,
"evaluation reports") are updated to "eval reports" for consistency.

Comment on lines +339 to +368
def _run_code_in_sandbox(sbx, code: str):
    # Run code inside the E2B sandbox with multiple fallbacks

    # Preferred: explicit args with resource limits and no network
    try:
        return sbx.run_code(code=code, language="python", limits={"cputime": 1, "wall_time": 5, "memory": 128}, allow_network=False)
    except TypeError:
        pass
    except Exception:
        pass

    # Alternate signature
    try:
        return sbx.run_code(code, "python", {"cputime": 1, "wall_time": 5, "memory": 128})
    except Exception:
        pass

    # Last resort: just run code with defaults
    return sbx.run_code(code)


💡 Verification agent

🧩 Analysis chain

Verify e2b SDK signatures used (Sandbox.create, run_code args)

SDKs change; ensure these calls match latest e2b package.


🌐 Web query:

What are the current (Sept 2025) Python SDK method signatures for e2b_code_interpreter.Sandbox.create() and Sandbox.run_code()? Are limits and allow_network accepted kwargs?

💡 Result:

  • Sandbox.create (sync) — signature in docs (v2.0.2):
    def create(cls, template: Optional[str] = None, timeout: Optional[int] = None, metadata: Optional[Dict[str, str]] = None, envs: Optional[Dict[str, str]] = None, secure: bool = True, allow_internet_access: bool = True, **opts: Unpack[ApiParams]) -> Self. [1]

  • Sandbox.run_code (Python SDK docs examples / overloads across 1.x → 2.x):
    def run_code(code: str, language: Optional[Literal["python"]] = None, on_stdout: Optional[OutputHandler[OutputMessage]] = None, on_stderr: Optional[OutputHandler[OutputMessage]] = None, on_result: Optional[OutputHandler[Result]] = None, on_error: Optional[OutputHandler[ExecutionError]] = None, envs: Optional[Dict[str, str]] = None, timeout: Optional[float] = None, request_timeout: Optional[float] = None) -> Execution. [2][3]

  • Are limits and allow_network accepted kwargs?
    No — the documented create() accepts allow_internet_access (not allow_network) and does not document a limits kwarg. run_code does not list allow_network or limits. If you need a different option name or resource limits, use allow_internet_access when creating the sandbox or check the API params (**opts) for provider-specific options. [1][2]

Sources:
[1] e2b docs — Python SDK sandbox (sync) v2.0.2 (create signature).
[2] e2b docs — code-interpreter Python SDK sandbox (run_code overload example, v1.x/v1.5.x).
[3] e2b GitHub / README and SDK examples.


Fix e2b SDK call signatures — remove unsupported limits / allow_network

File: site/docs/guides/evaluate-e2b.md Lines: 339–358

  • Disable network at sandbox creation: use Sandbox.create(..., allow_internet_access=False, **opts) (run_code does not accept allow_network).
  • Remove the limits kwarg from sbx.run_code — run_code supports handlers, envs, timeout/request_timeout but not CPU/memory limits; set CPU/memory limits in the sandbox template or via provider-specific **opts when creating the sandbox.
  • Drop the fallback that passes a dict as a third positional arg (sbx.run_code(code, "python", { ... })) — that will be treated as a handler arg, not options.
  • Update calls to use the documented run_code signature, e.g. sbx.run_code(code, language="python", timeout=5) and move resource/network options to Sandbox.create as needed.
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 339–358, update the sandbox run
calls to match the e2b SDK: remove the unsupported limits and allow_network
kwargs from sbx.run_code, drop the fallback that passes a dict as a third
positional arg, and instead call sbx.run_code with explicit language and
timeout/request_timeout (e.g., language="python", timeout=5). Move network and
resource configuration to sandbox creation (use Sandbox.create(...,
allow_internet_access=False, **provider_opts) and set CPU/memory limits in the
sandbox template or provider-specific opts). Ensure no calls to sbx.run_code
include limits or allow_network and that the final fallback uses
sbx.run_code(code, language="python", timeout=5) only.
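For illustration, a minimal sketch of the corrected call pattern, assuming the v2.x `e2b_code_interpreter` sync SDK signatures quoted above (network access is configured on the sandbox, and `run_code` takes a `timeout` rather than resource limits); verify the exact kwargs against the installed SDK version:

```python
from e2b_code_interpreter import Sandbox

code = "print(sum([1, 2, 3]))"  # hypothetical generated snippet for demonstration

# Disable internet at sandbox creation; CPU/memory limits belong to the sandbox
# template or provider-specific opts, not to run_code (per the signatures quoted above).
sbx = Sandbox.create(allow_internet_access=False)
try:
    # run_code accepts language and timeout/request_timeout, not limits/allow_network.
    execution = sbx.run_code(code, language="python", timeout=5)
    # .logs and kill() are assumed from the e2b code-interpreter SDK surface.
    print(execution.logs)
finally:
    sbx.kill()
```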

Comment on lines +651 to +671
```bash
# The prompt file that contains instructions for the LLM to generate code
prompts: file://code_generation_prompt_fs.txt

# The LLM provider used for code generation (here: OpenAI GPT-4.1)
providers:
  - openai:gpt-4.1
```


🛠️ Refactor suggestion

Use YAML and update model to preferred 2025 ID

The code block is YAML (not bash). Also update OpenAI model per guidelines.

-```bash
+```yaml
 # The prompt file that contains instructions for the LLM to generate code
 prompts: file://code_generation_prompt_fs.txt

 # The LLM provider used for code generation (here: OpenAI GPT-4.1)
 providers:
-  - openai:gpt-4.1
+  - openai:o3-mini
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 651 to 658, the fenced code
block is incorrectly labeled as bash and uses an outdated model id; change the
code block language to yaml, update the provider entry to use the preferred 2025
OpenAI model id (openai:o3-mini), and correct the providers list
indentation/format so it is valid YAML (single providers sequence item with the
key/value pair). Ensure the fenced block starts with ```yaml and the providers
section reads a single list item with "- openai:o3-mini".

@Ayush7614
Contributor Author

cc: @typpo @mldangelo

@mldangelo mldangelo assigned addelong and unassigned addelong Sep 11, 2025
@mldangelo mldangelo requested a review from addelong September 11, 2025 14:20