feat(e2b-example): add e2b-code-eval guide (promptfoo + e2b sandbox) #5555
base: main
Conversation
📝 Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 7
🧹 Nitpick comments (16)
examples/grok-4-vs-gpt-5/promptfooconfig.yaml (3)
`16-100`: **Update prompt text to match chosen models**

Several tasks reference “GPT-5”. If switching to o3-mini/4o-mini, align task names/questions accordingly to avoid confusion in outputs and scorers.

Example edits:

```diff
- Compare Grok-4 and GPT-5 in 3 bullet points based on speed, accuracy, and reasoning depth.
+ Compare Grok-4 and OpenAI o3-mini in 3 bullet points based on speed, accuracy, and reasoning depth.
```
`112-121`: **Use a stable, cost-effective judge model**

If keeping model-graded scoring, pick a verified judge (e.g., openai:gpt-4o-mini or anthropic:claude-3-7-sonnet-20250219). Avoid unverified “gpt-5”.

Apply:

```diff
- provider: openrouter:openai/gpt-5
+ provider: openai:gpt-4o-mini
```
`1-127`: **Optional: Add a minimal tests block for basic sanity metrics**

Even if you keep the single mega-prompt, adding a trivial test ensures consistent runs and enables future references.

Example:

```diff
+tests:
+  - vars: {}
```

examples/grok-4-vs-gpt-5/README.md (3)
`1-1`: **PR title must use docs: prefix for site/ and examples/ only changes**

Project rule: PRs touching only site/ or examples/ should be titled with docs:. Current: “feat(e2b-example)”.
Want me to retitle to “docs(e2b-example): add e2b-code-eval guide (promptfoo + e2b sandbox)”?
`18-26`: **Document env vars with purpose and acquisition; include .env sample**

Add brief explanations and a .env example to satisfy example README requirements.

Proposed insertion after prerequisites:

````diff
+### Environment variables
+- OPENROUTER_API_KEY: OpenRouter access token used to call both models via OpenRouter.
+  - Get one from your OpenRouter account dashboard.
+
+Example .env:
+```env
+OPENROUTER_API_KEY=sk-or-xxxxxxxx
+```
````
`16-17`: **Optional: Note pricing/usage limits for OpenRouter**

Briefly mention rate limits/costs and link to provider docs.
site/docs/guides/evaluate-e2b.md (10)
`57-75`: **Headings, not emphasis, for subsections; remove trailing punctuation in headings**

Fix markdownlint MD036/MD026.

```diff
-**Python installed**
+### Python installed

-**Node.js + npm installed**
+### Node.js + npm installed
```
`794-799`: **Avoid bare URLs; make it a link**

Fix MD034.

```diff
-This launches a local server (for example, at http://localhost:15500) and asks:
+This launches a local server (for example, at [http://localhost:15500](http://localhost:15500)) and asks:
```
`801-801`: **Remove trailing colon in heading**

Fix MD026.

```diff
-## What you see in the Promptfoo Web Viewer:
+## What you see in the Promptfoo Web Viewer
```
`904-910`: **Remove trailing colons in headings**

Fix MD026.

```diff
-### Option 1 (CLI):
+### Option 1 (CLI)

-### Option 2 (Web UI):
+### Option 2 (Web UI)
```
`696-717`: **These are prompt instructions; use text, not bash; avoid nested triple backticks breakage**

Switch to text and use quadruple backticks outside to safely include a ```python fence inside.

`````diff
-```bash
+````text
 You are a careful Python engineer. Produce only the Python function definition (no explanation) wrapped in triple backticks.
 Requirements:
 - Use the exact function name: {{function_name}}
 ...
 ```python
 def sum_integers(nums):
     return sum(nums)
-```
+````
`````

---

`722-726`: **Not bash; use text for prompt template**

````diff
-```bash
+```text
 You are a Python test author. Given the function below, write 3 short `assert` tests (not using pytest) that exercise edge cases and typical cases. Return only the tests (no explanations).
 Function:
 {{function_code}}
````

---

`737-751`: **Not bash; use text for prompt template**

````diff
-```bash
+```text
 You are a careful engineer. Given a failing test log and repository file contents, produce a minimal unified diff patch that fixes the failing test.
 ...
 {{relevant_files}}
````

---

`2-5`: **Front matter description: prefer “eval” phrasing**

Minor consistency tweak with style guide.

```diff
-description: Hands-on tutorial (September 2025) on evaluating LLM code generation using Promptfoo with the e2b sandbox—includes setup, YAML tests, execution in a secure environment, and metrics reporting.
+description: Hands-on tutorial (September 2025) on LLM codegen eval using Promptfoo with the e2b sandbox—includes setup, YAML tests, secure execution, and metrics reporting.
```
`765-771`: **Update narrative to current model name**

The text says GPT-4.1; after changing the YAML to o3-mini, align the prose.

```diff
-- It calls OpenAI GPT-4.1 (or your chosen provider) with the problem prompts from code_generation_prompt_fs.txt.
+- It calls OpenAI o3-mini (or your chosen provider) with the problem prompts from code_generation_prompt_fs.txt.
```
`939-961`: **Add a See Also section (required)**

Docs pages should end sections with See Also/Related Concepts.

```diff
 ## Conclusion
 ...
 - Automate daily red team runs with GitHub Actions.
+
+## See Also
+- Promptfoo eval CLI: https://www.promptfoo.dev/docs/usage/command-line/
+- Promptfoo red team: https://www.promptfoo.dev/docs/red-team/overview/
+- Providers and models: https://www.promptfoo.dev/docs/providers/overview/
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/grok-4-vs-gpt-5/README.md (1 hunks)
- examples/grok-4-vs-gpt-5/promptfooconfig.yaml (1 hunks)
- site/docs/guides/evaluate-e2b.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
examples/*/promptfooconfig.yaml
📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)
examples/*/promptfooconfig.yaml: Include a working promptfooconfig.yaml (or equivalent) file in each example directory
Always include the YAML schema reference at the top of configuration files: '# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json'
Follow the specified field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests
Ensure all configuration files pass YAML lint validation
When referencing external files in configuration, always use the 'file://' prefix
Always use the latest model versions available in 2025 in configuration files
For OpenAI, prefer models like 'openai:o3-mini' and 'openai:gpt-4o-mini' in configuration files
For Anthropic, prefer models like 'anthropic:claude-3-7-sonnet-20250219' in configuration files
For open-source models, use the latest versions available (e.g., latest Llama) in configuration files
Include a mix of providers when comparing model performance in configuration files
When demonstrating specialized capabilities (vision, audio, etc.), use models that support those features in configuration files
Format configuration files consistently
When creating examples for specific providers, always use the latest available model versions for that provider in configuration files
Update model versions when new ones become available in configuration files
Files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
examples/*/{README.md,promptfooconfig.yaml}
📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)
Include placeholder values for secrets/credentials in the README or configuration files
Files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
examples/grok-4-vs-gpt-5/README.md
{site/**,examples/**}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Any pull request that only touches files in 'site/' or 'examples/' directories must use the 'docs:' prefix in the PR title, not 'feat:' or 'fix:'
Files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
examples/grok-4-vs-gpt-5/README.md
site/docs/guides/evaluate-e2b.md
examples/**
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
When modifying examples, update existing files in 'examples/' instead of adding new ones (e.g., replace outdated model IDs rather than introducing new example files)
Place examples under examples/
Files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
examples/grok-4-vs-gpt-5/README.md
examples/*/README.md
📄 CodeRabbit inference engine (.cursor/rules/examples.mdc)
examples/*/README.md: The README.md must begin with the folder name as an H1 heading
Every example README must include instructions on how to run it with 'npx promptfoo@latest init --example example-name'
Include a comprehensive README.md that explains the purpose, prerequisites, instructions, and expected outputs for the example
Document any model-specific capabilities or limitations in examples
Clearly list all required environment variables at the beginning of the README
For each environment variable, explain its purpose, how to obtain it, and any default values or constraints in the README
Include a sample .env file or instructions when multiple environment variables are needed in the README
Document any required API keys or credentials in the README
Provide instructions for cleaning up resources after running the example in the README
When creating examples for specific providers, explain any provider-specific configuration in the README
When creating examples for specific providers, document required environment variables in the README
When creating examples for specific providers, include information about pricing or usage limits in the README
When creating examples for specific providers, highlight unique features or capabilities in the README
When creating examples for specific providers, compare to similar providers where appropriate in the README
Files:
examples/grok-4-vs-gpt-5/README.md
examples/**/README.md
📄 CodeRabbit inference engine (CLAUDE.md)
Each example should include a clear README.md
Files:
examples/grok-4-vs-gpt-5/README.md
site/docs/**/*.md
📄 CodeRabbit inference engine (.cursor/rules/docusaurus.mdc)
site/docs/**/*.md: Prioritize minimal edits when updating existing documentation; avoid creating entirely new sections or rewriting substantial portions; focus edits on improving grammar, spelling, clarity, fixing typos, and structural improvements where needed; do not modify existing headings (h1, h2, h3, etc.) as they are often linked externally.
Structure content to reveal information progressively: begin with essential actions and information, then provide deeper context as necessary; organize information from most important to least important.
Use action-oriented language: clearly outline actionable steps users should take, use concise and direct language, prefer active voice over passive voice, and use imperative mood for instructions.
Use 'eval' instead of 'evaluation' in all documentation; when referring to command line usage, use 'npx promptfoo eval' rather than 'npx promptfoo evaluation'; maintain consistency with this terminology across all examples, code blocks, and explanations.
The project name can be written as either 'Promptfoo' (capitalized) or 'promptfoo' (lowercase) depending on context: use 'Promptfoo' at the beginning of sentences or in headings, and 'promptfoo' in code examples, terminal commands, or when referring to the package name; be consistent with the chosen capitalization within each document or section.
Each markdown documentation file must include required front matter fields: 'title' (the page title shown in search results and browser tabs) and 'description' (a concise summary of the page content, ideally 150-160 characters).
Only add a title attribute to code blocks that represent complete, runnable files; do not add titles to code fragments, partial examples, or snippets that aren't meant to be used as standalone files; this applies to all code blocks regardless of language.
Use special comment directives to highlight specific lines in code blocks: 'highlight-next-line' highlights the line immediately after the comment, 'highligh...
Files:
site/docs/guides/evaluate-e2b.md
site/**
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
If the change is a feature, update the relevant documentation under 'site/'
Files:
site/docs/guides/evaluate-e2b.md
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a mix of providers when comparing model performance in configuration files
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a mix of providers when comparing model performance in configuration files
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Include a mix of providers when comparing model performance in configuration files
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Include a working promptfooconfig.yaml (or equivalent) file in each example
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Include a working promptfooconfig.yaml (or equivalent) file in each example directory
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:46.665Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:46.665Z
Learning: Applies to examples/*/promptfooconfig.yaml : Format configuration files consistently
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Format configuration files consistently
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Follow the specific field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : When demonstrating specialized capabilities (vision, audio, etc.), use models that support those features in configuration files
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Update model versions when new ones become available in configuration files
Applied to files:
examples/grok-4-vs-gpt-5/promptfooconfig.yaml
🪛 markdownlint-cli2 (0.17.2)
site/docs/guides/evaluate-e2b.md
57-57: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
64-64: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
794-794: Bare URL used
(MD034, no-bare-urls)
801-801: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
904-904: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
910-910: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Build on Node 24.x
- GitHub Check: Build on Node 22.x
- GitHub Check: Build on Node 20.x
- GitHub Check: Run Integration Tests
- GitHub Check: Test on Node 24.x and macOS-latest
- GitHub Check: Redteam
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Generate Assets
- GitHub Check: Build Docs
- GitHub Check: webui tests
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Style Check
🔇 Additional comments (1)
examples/grok-4-vs-gpt-5/promptfooconfig.yaml (1)
`11-15`: **openrouter:openai/gpt-5 is valid — don't replace it unless you mean to switch providers**

- openrouter:openai/gpt-5 is a valid OpenRouter model slug (as of Sep 11, 2025); keep it if you intend to route via OpenRouter.
- If you want Promptfoo's OpenAI provider instead, use openai:o3-mini (or openai:responses:o3-mini) or openai:chat:gpt-4.1-mini — gpt-4o-mini was replaced by GPT‑4.1‑mini on May 14, 2025.

File: examples/grok-4-vs-gpt-5/promptfooconfig.yaml lines 11–15 — no change required unless switching providers.

Likely an incorrect or invalid review comment.
```yaml
providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000

  - id: openrouter:openai/gpt-5
    config:
      temperature: 0.7
      max_tokens: 4000

prompts:
```
🛠️ Refactor suggestion
Reorder fields to match required schema order
Required: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests. Prompts must come before providers.
Apply:
```diff
 description: 'Ultimate Promptfoo Benchmark + Red Teaming: Grok-4 vs GPT-5'
-providers:
-  - id: openrouter:x-ai/grok-4
-    config:
-      temperature: 0.7
-      max_tokens: 4000
-
-  - id: openrouter:openai/gpt-5
-    config:
-      temperature: 0.7
-      max_tokens: 4000
-
 prompts:
   - |
     You are an advanced AI assistant. Complete the following 20 tasks carefully.
     ...
+
+providers:
+  - id: openrouter:x-ai/grok-4
+    config:
+      temperature: 0.7
+      max_tokens: 4000
+
+  - id: openai:o3-mini
+    config:
+      temperature: 0.7
+      max_tokens: 4000
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
description: 'Ultimate Promptfoo Benchmark + Red Teaming: Grok-4 vs GPT-5'

prompts:
  - |
    You are an advanced AI assistant. Complete the following 20 tasks carefully.
    ...

providers:
  - id: openrouter:x-ai/grok-4
    config:
      temperature: 0.7
      max_tokens: 4000

  - id: openai:o3-mini
    config:
      temperature: 0.7
      max_tokens: 4000
```
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/promptfooconfig.yaml around lines 5 to 16, the YAML
fields are out of the required schema order (description, env, prompts,
providers, defaultTest, scenarios, tests); specifically prompts must appear
before providers. Update the file to place a description (if missing) and/or env
at the top as needed, move the prompts section so it appears before the
providers block, then keep providers after prompts and add empty placeholders
for defaultTest, scenarios, and tests if your schema requires them to exist (or
omit optional ones), ensuring the final field order matches the required schema
exactly.
```yaml
scorers:
  - name: 'Latency'
    metric: latency

  - name: 'Output Length'
    metric: output_length

  - name: 'Semantic Quality'
    metric: semantic_similarity
    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'
```
🛠️ Refactor suggestion
Semantic similarity scorer is misconfigured (no per-task references)
Providing a generic sentence as “reference” yields meaningless scores across a 20-task mega-prompt. Either add proper tests with per-task references or drop this scorer.
Apply:
```diff
-  - name: 'Semantic Quality'
-    metric: semantic_similarity
-    reference: 'Evaluate correctness, reasoning, coding quality, refusal behavior, bias handling, and overall coherence'
```
Or add a tests section with references per task and keep the scorer.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
scorers:
  - name: 'Latency'
    metric: latency

  - name: 'Output Length'
    metric: output_length
```
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/promptfooconfig.yaml around lines 101-111, the
Semantic Quality scorer is using a single generic “reference” sentence which
produces meaningless scores across the multi-task prompt; either remove the
scorer entry entirely or replace the single reference with a proper tests
section containing per-task reference outputs (one reference per task id/name)
so semantic_similarity can compare each task against its correct reference;
implement by adding a tests: mapping listing each task identifier to its
expected reference string (or delete the Semantic Quality scorer block if you
opt to drop it).
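For illustration, here is a minimal sketch of what a per-task tests block could look like using promptfoo's standard `tests` schema with `similar` assertions (the task prompts, reference strings, and threshold values below are hypothetical and not taken from the PR):

```yaml
# Hypothetical sketch: one test case per task, each with its own reference,
# so similarity is scored against a task-specific expected answer rather than
# a single generic sentence.
tests:
  - description: 'Summarize self-attention'
    vars:
      task: 'Explain self-attention in two sentences.'
    assert:
      - type: similar
        value: 'Self-attention lets each token weigh every other token when building its representation, so context is captured regardless of distance.'
        threshold: 0.8

  - description: 'FizzBuzz in Python'
    vars:
      task: 'Write a Python function fizzbuzz(n) that returns the classic FizzBuzz string for n.'
    assert:
      - type: similar
        value: 'def fizzbuzz(n): return "FizzBuzz" if n % 15 == 0 else "Fizz" if n % 3 == 0 else "Buzz" if n % 5 == 0 else str(n)'
        threshold: 0.7
```

This keeps each comparison reference scoped to its own task instead of applying one catch-all sentence across the 20-task mega-prompt.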
examples/grok-4-vs-gpt-5/README.md
Outdated
```bash
promptfoo eval
```
🛠️ Refactor suggestion
Use npx for commands to avoid requiring global installs
Prefer npx promptfoo eval per guidelines.
Apply:
````diff
-```bash
-promptfoo eval
-```
+```bash
+npx promptfoo eval
+```
````
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/README.md around lines 30 to 32, the example command
uses "promptfoo eval" which assumes a global install; update the fenced bash
code block to show "npx promptfoo eval" instead and ensure the code fence markup
stays as a proper ```bash block containing the single line "npx promptfoo eval".
examples/grok-4-vs-gpt-5/README.md
Outdated
```bash
promptfoo view
```
🛠️ Refactor suggestion
Use npx for viewer command
Align with CLI guidance.
Apply:
````diff
-```bash
-promptfoo view
-```
+```bash
+npx promptfoo view
+```
````
🤖 Prompt for AI Agents
In examples/grok-4-vs-gpt-5/README.md around lines 36 to 38, the README shows
the viewer command as "promptfoo view" which should be invoked via npx to align
with the CLI guidance; update the fenced bash block to use "npx promptfoo view"
instead of "promptfoo view" so the example runs without requiring a global
install.
site/docs/guides/evaluate-e2b.md
Outdated
```md
With **Promptfoo**, you can integrate e2b into your evaluation pipeline:

- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).

By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.
```
🛠️ Refactor suggestion
Use “eval” instead of “evaluation” per docs style
Replace “evaluation pipeline” with “eval pipeline”.
```diff
-With **Promptfoo**, you can integrate e2b into your evaluation pipeline:
+With **Promptfoo**, you can integrate e2b into your eval pipeline:
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```md
With **Promptfoo**, you can integrate e2b into your eval pipeline:

- Define prompts for code generation.
- Automatically run the generated code in e2b.
- Validate outputs against test cases.
- Collect metrics (runtime, success/failure, logs).

By the end of this guide, you’ll have a working setup where Promptfoo calls e2b for sandboxed execution, runs automated tests, and produces reproducible evaluation reports.
```
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 20 to 26, replace the phrase
"evaluation pipeline" with "eval pipeline" to conform to docs style; update the
sentence so it reads "With Promptfoo, you can integrate e2b into your eval
pipeline:" and ensure any nearby occurrences in the same paragraph (e.g.,
"evaluation reports") are updated to "eval reports" for consistency.
```python
def _run_code_in_sandbox(sbx, code: str):
    # Run code inside the E2B sandbox with multiple fallbacks

    # Preferred: explicit args with resource limits and no network
    try:
        return sbx.run_code(code=code, language="python", limits={"cputime": 1, "wall_time": 5, "memory": 128}, allow_network=False)
    except TypeError:
        pass
    except Exception:
        pass

    # Alternate signature
    try:
        return sbx.run_code(code, "python", {"cputime": 1, "wall_time": 5, "memory": 128})
    except Exception:
        pass

    # Last resort: just run code with defaults
    return sbx.run_code(code)
```
💡 Verification agent
🧩 Analysis chain
Verify e2b SDK signatures used (Sandbox.create, run_code args)
SDKs change; ensure these calls match latest e2b package.
🌐 Web query:
What are the current (Sept 2025) Python SDK method signatures for e2b_code_interpreter.Sandbox.create() and Sandbox.run_code()? Are limits and allow_network accepted kwargs?
💡 Result:
- Sandbox.create (sync) — signature in docs (v2.0.2):
  def create(cls, template: Optional[str] = None, timeout: Optional[int] = None, metadata: Optional[Dict[str, str]] = None, envs: Optional[Dict[str, str]] = None, secure: bool = True, allow_internet_access: bool = True, **opts: Unpack[ApiParams]) -> Self. [1]
- Sandbox.run_code (Python SDK docs examples / overloads across 1.x → 2.x):
  def run_code(code: str, language: Optional[Literal["python"]] = None, on_stdout: Optional[OutputHandler[OutputMessage]] = None, on_stderr: Optional[OutputHandler[OutputMessage]] = None, on_result: Optional[OutputHandler[Result]] = None, on_error: Optional[OutputHandler[ExecutionError]] = None, envs: Optional[Dict[str, str]] = None, timeout: Optional[float] = None, request_timeout: Optional[float] = None) -> Execution. [2][3]
- Are limits and allow_network accepted kwargs?
  No — the documented create() accepts allow_internet_access (not allow_network) and does not document a limits kwarg. run_code does not list allow_network or limits. If you need a different option name or resource limits, use allow_internet_access when creating the sandbox or check the API params (**opts) for provider-specific options. [1][2]

Sources:
[1] e2b docs — Python SDK sandbox (sync) v2.0.2 (create signature).
[2] e2b docs — code-interpreter Python SDK sandbox (run_code overload example, v1.x/v1.5.x).
[3] e2b GitHub / README and SDK examples.
**Fix e2b SDK call signatures — remove unsupported `limits` / `allow_network`**

File: site/docs/guides/evaluate-e2b.md Lines: 339–358

- Disable network at sandbox creation: use Sandbox.create(..., allow_internet_access=False, **opts) (run_code does not accept `allow_network`).
- Remove the `limits` kwarg from sbx.run_code — run_code supports handlers, envs, `timeout`/`request_timeout` but not CPU/memory limits; set CPU/memory limits in the sandbox template or via provider-specific **opts when creating the sandbox.
- Drop the fallback that passes a dict as a third positional arg (sbx.run_code(code, "python", { ... })) — that will be treated as a handler arg, not options.
- Update calls to use the documented run_code signature, e.g. sbx.run_code(code, language="python", timeout=5), and move resource/network options to Sandbox.create as needed.
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 339–358, update the sandbox run
calls to match the e2b SDK: remove the unsupported limits and allow_network
kwargs from sbx.run_code, drop the fallback that passes a dict as a third
positional arg, and instead call sbx.run_code with explicit language and
timeout/request_timeout (e.g., language="python", timeout=5). Move network and
resource configuration to sandbox creation (use Sandbox.create(...,
allow_internet_access=False, **provider_opts) and set CPU/memory limits in the
sandbox template or provider-specific opts). Ensure no calls to sbx.run_code
include limits or allow_network and that the final fallback uses
sbx.run_code(code, language="python", timeout=5) only.
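Based on the signatures reported by the verification above, a minimal sketch of what the corrected sandbox usage could look like (the helper name, timeout value, and cleanup call are illustrative and should be checked against the installed e2b_code_interpreter version):

```python
from e2b_code_interpreter import Sandbox

def run_code_in_sandbox(code: str):
    # Network policy belongs on the sandbox, not on run_code:
    # allow_internet_access=False disables outbound network for this sandbox.
    # CPU/memory limits would go in the sandbox template or provider-specific opts.
    sbx = Sandbox.create(allow_internet_access=False)
    try:
        # run_code accepts language and timeout (seconds); it does not take
        # limits or allow_network kwargs.
        return sbx.run_code(code, language="python", timeout=5)
    finally:
        # Clean up the sandbox (method name may differ across SDK versions).
        sbx.kill()
```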
```bash
# The prompt file that contains instructions for the LLM to generate code
prompts: file://code_generation_prompt_fs.txt

# The LLM provider used for code generation (here: OpenAI GPT-4.1)
providers:
  - openai:gpt-4.1
```
🛠️ Refactor suggestion
Use YAML and update model to preferred 2025 ID
The code block is YAML (not bash). Also update OpenAI model per guidelines.
````diff
-```bash
+```yaml
 # The prompt file that contains instructions for the LLM to generate code
 prompts: file://code_generation_prompt_fs.txt

 # The LLM provider used for code generation (here: OpenAI GPT-4.1)
 providers:
-  - openai:gpt-4.1
+  - openai:o3-mini
````
🤖 Prompt for AI Agents
In site/docs/guides/evaluate-e2b.md around lines 651 to 658, the fenced code
block is incorrectly labeled as bash and uses an outdated model id; change the
code block language to yaml, update the provider entry to use the preferred 2025
OpenAI model id (openai:o3-mini), and correct the providers list
indentation/format so it is valid YAML (single providers sequence item with the
key/value pair). Ensure the fenced block starts with ```yaml and the providers
section reads a single list item with "- openai:o3-mini".
cc: @typpo @mldangelo
Summary
This PR adds a new E2B + Promptfoo guide (`e2b-code-eval/`) that demonstrates secure execution of LLM-generated Python code inside the [E2B sandbox].