feat - e2e tests + evals #22
Open

mclenhard wants to merge 1 commit into jerhadf:main from mclenhard:feat/add-e2e-test-and-evals
The diff adds a single new file, `evals.ts` (+59 lines):

```typescript
//evals.ts

import { EvalConfig } from 'mcp-evals';
import { openai } from "@ai-sdk/openai";
import { grade, EvalFunction } from "mcp-evals";

const linear_create_issueEval: EvalFunction = {
  name: "Linear Create Issue Tool Evaluation",
  description: "Evaluates the correctness and completeness of the linear_create_issue tool usage",
  run: async () => {
    const result = await grade(openai("gpt-4"), "Please create a new Linear issue using the linear_create_issue tool. The issue should have the title 'Fix login bug for Safari users', teamId 'team123', description 'Safari users are unable to log in properly', priority 2, and status 'Open'. Return the issue identifier and URL.");
    return JSON.parse(result);
  }
};

const linear_update_issueEval: EvalFunction = {
  name: "linear_update_issue Evaluation",
  description: "Evaluates the tool's ability to update existing issue details in a Linear project",
  run: async () => {
    const result = await grade(openai("gpt-4"), "Please update the existing Linear issue with ID 'ISS-786' to have the title 'New Title', description 'Revised description for clarity', priority 3, and status 'In Review'.");
    return JSON.parse(result);
  }
};

const linear_search_issuesEval: EvalFunction = {
  name: "linear_search_issues",
  description: "Evaluates the linear_search_issues tool for searching issues based on flexible criteria",
  run: async () => {
    const result = await grade(openai("gpt-4"), "Find any open issues assigned to user 'usr_123' that mention 'bug' in their title or description, have a priority of 2, and return no more than 5 results while ignoring archived issues.");
    return JSON.parse(result);
  }
};

const linear_get_user_issuesEval: EvalFunction = {
  name: "linear_get_user_issues Evaluation",
  description: "Evaluates the correctness of retrieving user issues including optional parameters",
  run: async () => {
    const result = await grade(openai("gpt-4"), "Retrieve the assigned issues for user 123, including archived issues, limited to 10.");
    return JSON.parse(result);
  }
};

const linear_add_commentEval: EvalFunction = {
  name: "linear_add_comment Evaluation",
  description: "Evaluates the correctness of the linear_add_comment functionality",
  run: async () => {
    const result = await grade(openai("gpt-4"), "Please add a comment to the Linear issue with ID ABC123. The comment should be in markdown format saying: 'Testing comment functionality!'. Use a custom user name 'TestUser' and avatar 'http://test.avatar.com'. Return the created comment's details including its URL.");
    return JSON.parse(result);
  }
};

const config: EvalConfig = {
  model: openai("gpt-4"),
  evals: [linear_create_issueEval, linear_update_issueEval, linear_search_issuesEval, linear_get_user_issuesEval, linear_add_commentEval]
};

export default config;

export const evals = [linear_create_issueEval, linear_update_issueEval, linear_search_issuesEval, linear_get_user_issuesEval, linear_add_commentEval];
```

Review comment (on the first `return JSON.parse(result);`): Consider adding error handling (e.g. try/catch) around the `grade` and `JSON.parse` calls.
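A minimal sketch of that suggestion, assuming the eval harness treats a thrown error as a failed eval (the `Guarded` variable name and the error-message format are illustrative, not part of the PR):

```typescript
import { openai } from "@ai-sdk/openai";
import { grade, EvalFunction } from "mcp-evals";

// Hypothetical guarded variant of linear_create_issueEval: both grade()
// (a network/model call) and JSON.parse() (the grader output may not be
// valid JSON) can throw, so both sit inside the try block.
const linear_create_issueEvalGuarded: EvalFunction = {
  name: "Linear Create Issue Tool Evaluation",
  description: "Evaluates the correctness and completeness of the linear_create_issue tool usage",
  run: async () => {
    try {
      const result = await grade(
        openai("gpt-4"),
        "Please create a new Linear issue using the linear_create_issue tool. ..." // prompt elided; same as in the diff above
      );
      return JSON.parse(result);
    } catch (error) {
      // Rethrow with context so the failing eval is identifiable in the run output.
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`linear_create_issue eval failed: ${message}`);
    }
  }
};
```

The same wrapper would apply to the other four `run` implementations; extracting a shared helper would avoid repeating the try/catch five times.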
Review comment: The evaluation functions use inconsistent naming formats for their `name` properties. For example, 'Linear Create Issue Tool Evaluation' (line 8) uses title case and includes 'Tool Evaluation', whereas 'linear_update_issue Evaluation' (line 17), 'linear_search_issues' (line 26), 'linear_get_user_issues Evaluation' (line 35), and 'linear_add_comment Evaluation' (line 44) vary in capitalization and in whether they include the word 'Evaluation'. For better readability and consistency, consider standardizing these names.
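One possible standardization, shown as a simple before/after map rather than a full diff (the chosen pattern, '<tool_name> Evaluation', is just the convention three of the five names already follow; any single consistent scheme would do):

```typescript
// Hypothetical renames pinning every eval's `name` to "<tool_name> Evaluation".
// Only the first and third entries from the diff would actually change.
const standardizedNames: Record<string, string> = {
  linear_create_issue: "linear_create_issue Evaluation",       // was "Linear Create Issue Tool Evaluation"
  linear_update_issue: "linear_update_issue Evaluation",       // unchanged
  linear_search_issues: "linear_search_issues Evaluation",     // was "linear_search_issues"
  linear_get_user_issues: "linear_get_user_issues Evaluation", // unchanged
  linear_add_comment: "linear_add_comment Evaluation",         // unchanged
};
```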