Commit 17f427e

Evaluation: Fetch Scores (#455)
* first stab at export eval score
* formatting fixes
* formatting fixes
* updated score format
* round off scores
* cleanup
* refactor
* updated readme for endpoint
* cleanup session
* cleanup
* using APIresponse
* returning error when eval not complete
* added testcases
1 parent f3b8f4d commit 17f427e

File tree

5 files changed (+678, -57 lines changed)

backend/app/api/docs/evaluation/get_evaluation.md

Lines changed: 58 additions & 1 deletion
@@ -6,6 +6,12 @@ Retrieves comprehensive information about an evaluation run including its curren
 
 - **evaluation_id**: ID of the evaluation run
 
+## Query Parameters
+
+- **get_trace_info** (optional, default: false): If true, fetch and include Langfuse trace scores with Q&A context. On first request, data is fetched from Langfuse and cached in the score column. Subsequent requests return cached data. Only available for completed evaluations.
+
+- **resync_score** (optional, default: false): If true, clear cached scores and re-fetch from Langfuse. Useful when new evaluators have been added or scores have been updated. Requires get_trace_info=true.
+
 ## Returns
 
 EvaluationRunPublic with current status and results:
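The caching semantics of `get_trace_info` and `resync_score` described above can be sketched roughly as follows. This is a hypothetical helper, not the actual backend code: the `run` dict shape and the `fetch_from_langfuse` callable are assumptions for illustration.

```python
def get_cached_scores(run, fetch_from_langfuse, get_trace_info=False, resync_score=False):
    """Sketch of the documented caching behavior: scores are fetched from
    Langfuse once and cached on the run; resync_score clears the cache first.
    `run` is a dict standing in for the evaluation row and `fetch_from_langfuse`
    is any callable returning the score payload (both are assumptions)."""
    if not get_trace_info:
        return None  # scores are only included when explicitly requested
    if run["status"] != "completed":
        raise ValueError("trace scores are only available for completed evaluations")
    if resync_score:
        run["score"] = None  # drop the cached copy so it is re-fetched below
    if run["score"] is None:
        run["score"] = fetch_from_langfuse(run["id"])  # first request: fetch and cache
    return run["score"]  # subsequent requests return the cached data
```

With this shape, adding a new evaluator in Langfuse only requires one call with `resync_score=true` to refresh the cached column.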
@@ -18,11 +24,62 @@ EvaluationRunPublic with current status and results:
 - status: Current status (pending, running, completed, failed)
 - total_items: Total number of items being evaluated
 - completed_items: Number of items completed so far
-- results: Evaluation results (when completed)
+- score: Evaluation scores (when get_trace_info=true and status=completed)
 - error_message: Error message if failed
 - created_at: Timestamp when the evaluation was created
 - updated_at: Timestamp when the evaluation was last updated
 
+## Score Format
+
+When `get_trace_info=true` and evaluation is completed, the `score` field contains:
+
+```json
+{
+  "summary_scores": [
+    {
+      "name": "cosine_similarity",
+      "avg": 0.87,
+      "std": 0.12,
+      "total_pairs": 50,
+      "data_type": "NUMERIC"
+    },
+    {
+      "name": "response_category",
+      "distribution": {"CORRECT": 10, "PARTIAL": 5, "INCORRECT": 2},
+      "total_pairs": 17,
+      "data_type": "CATEGORICAL"
+    }
+  ],
+  "traces": [
+    {
+      "trace_id": "uuid-123",
+      "question": "What is 2+2?",
+      "llm_answer": "4",
+      "ground_truth_answer": "4",
+      "scores": [
+        {
+          "name": "cosine_similarity",
+          "value": 0.95,
+          "data_type": "NUMERIC"
+        },
+        {
+          "name": "correctness",
+          "value": 1,
+          "data_type": "NUMERIC",
+          "comment": "Response is correct"
+        }
+      ]
+    }
+  ]
+}
+```
+
+**Notes:**
+
+- Only complete scores are included (scores where all traces have been rated)
+- Numeric values are rounded to 2 decimal places
+- NUMERIC scores show `avg` and `std` in summary
+- CATEGORICAL scores show `distribution` counts in summary
+
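The aggregation rules in the notes above (2-decimal rounding, `avg`/`std` for NUMERIC scores, `distribution` counts for CATEGORICAL scores) could be sketched as below. The helper names are illustrative, not the backend's actual functions.

```python
import statistics
from collections import Counter

def summarize_numeric(name, values):
    """Summary entry for a NUMERIC score: mean and standard deviation,
    rounded to 2 decimal places as the notes describe."""
    return {
        "name": name,
        "avg": round(statistics.mean(values), 2),
        "std": round(statistics.stdev(values), 2) if len(values) > 1 else 0.0,
        "total_pairs": len(values),
        "data_type": "NUMERIC",
    }

def summarize_categorical(name, labels):
    """Summary entry for a CATEGORICAL score: counts per label value."""
    return {
        "name": name,
        "distribution": dict(Counter(labels)),
        "total_pairs": len(labels),
        "data_type": "CATEGORICAL",
    }
```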
 ## Usage
 
 Use this endpoint to poll for evaluation progress. The evaluation is processed asynchronously by Celery Beat (every 60s), so you should poll periodically to check if the status has changed to "completed" or "failed".
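A polling loop along those lines might look like this. The transport is abstracted behind a `fetch_run` callable (e.g. an HTTP GET against this endpoint), since the exact route and client library are not shown here; `max_polls` and the callable are assumptions.

```python
import time

def wait_for_evaluation(fetch_run, interval=60, max_polls=30):
    """Poll until the evaluation run reaches a terminal status.
    `fetch_run` is any zero-argument callable returning the
    EvaluationRunPublic dict; `interval` matches the 60s Celery Beat cadence."""
    for _ in range(max_polls):
        run = fetch_run()
        if run["status"] in ("completed", "failed"):
            return run
        time.sleep(interval)
    raise TimeoutError("evaluation did not reach a terminal status in time")
```

Once the status is "completed", a follow-up request with `get_trace_info=true` returns the score payload described above.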

0 commit comments

Comments
 (0)