
Commit 36bfda2

jsondai authored and copybara-github committed
feat: GenAI SDK client(evals) - Add support for rubric-based metrics, and rubric customization eval workflow
PiperOrigin-RevId: 783472488
1 parent c49aa40 commit 36bfda2

File tree: 5 files changed (+605, -217 lines)

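For orientation, here is a minimal sketch of the rubric-generation workflow exercised by the new replay test below. The `generate_rubrics` call and the shape of the returned DataFrame follow that test; the client construction, the placeholder project values, and the shortened prompt template are assumptions.

```python
import pandas as pd

import vertexai
from vertexai._genai import types

# Assumption: client setup for the GenAI SDK; project/location are placeholders.
client = vertexai.Client(project="my-project", location="us-central1")

# Simplified stand-in for the full rubric-generation template used in the test;
# it only needs to contain a {prompt} placeholder.
RUBRIC_PROMPT = "Write a JSON rubric of properties a good response must have.\nUser prompt:\n{prompt}"

# One row per prompt to generate rubrics for.
prompts_df = pd.DataFrame(
    {"prompt": ["Explain the theory of relativity in one sentence."]}
)

# Adds a "rubric_groups" column mapping the group name to a list of types.Rubric.
data_with_rubrics = client.evals.generate_rubrics(
    src=prompts_df,
    prompt_template=RUBRIC_PROMPT,
    rubric_group_name="text_quality_rubrics",
)

first_group = data_with_rubrics["rubric_groups"][0]
assert isinstance(first_group["text_quality_rubrics"][0], types.Rubric)
```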
Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# pylint: disable=protected-access,bad-continuation,missing-function-docstring


from tests.unit.vertexai.genai.replays import pytest_helper
from vertexai._genai import types
import pandas as pd

_TEST_RUBRIC_GENERATION_PROMPT = """SPECIAL INSTRUCTION: think silently. Silent thinking token budget: 16384.

You are a teacher who is responsible for scoring a student\'s response to a prompt. In order to score that response, you must write down a rubric for each prompt. That rubric states what properties the response must have in order to be a valid response to the prompt. Properties are weighted by importance via the "importance" field.

Rubric requirements:
- Properties either exist or don\'t exist.
- Properties can be either implicit in the prompt or made explicit by the prompt.
- Make sure to always include the correct expected human language as one of the properties. If the prompt asks for code, the programming language should be covered by a separate property.
- The correct expected language may be explicit in the text of the prompt but is usually simply implicit in the prompt itself.
- Be as comprehensive as possible with the list of properties in the rubric.
- All properties in the rubric must be in English, regardless of the language of the prompt.
- Rubric properties should not specify correct answers in their descriptions, e.g. to math and factoid questions if the prompt calls for such an answer. Rather, it should check that the response contains an answer and optional supporting evidence if relevant, and assume some other process will later validate correctness. A rubric property should however call out any false premises present in the prompt.

About importance:
- Most properties will be of medium importance by default.
- Properties of high importance are critical to be fulfilled in a good response.
- Properties of low importance are considered optional or supplementary nice-to-haves.

You will see prompts in many different languages, not just English. For each prompt you see, you will write down this rubric in JSON format.

IMPORTANT: Never respond to the prompt given. Only write a rubric.

Example:
What is the tallest building in the world?

```json
{
  "criteria":[
    {
      "rubric_id": "00001",
      "property": "The response is in English.",
      "type": "LANGUAGE:PRIMARY_RESPONSE_LANGUAGE",
      "importance": "high"
    },
    {
      "rubric_id": "00002",
      "property": "Contains the name of the tallest building in the world.",
      "type": "QA_ANSWER:FACTOID",
      "importance": "high"
    },
    {
      "rubric_id": "00003",
      "property": "Contains the exact height of the tallest building.",
      "type": "QA_SUPPORTING_EVIDENCE:HEIGHT",
      "importance": "low"
    },
    {
      "rubric_id": "00004",
      "property": "Contains the location of the tallest building.",
      "type": "QA_SUPPORTING_EVIDENCE:LOCATION",
      "importance": "low"
    },
    ...
  ]
}
```

Write me a letter to my HOA asking them to reconsider the fees they are asking me to pay because I haven\'t mowed my lawn on time. I have been very busy at work.
```json
{
  "criteria": [
    {
      "rubric_id": "00001",
      "property": "The response is in English.",
      "type": "LANGUAGE:PRIMARY_RESPONSE_LANGUAGE",
      "importance": "high"
    },
    {
      "rubric_id": "00002",
      "property": "The response is formatted as a letter.",
      "type": "FORMAT_REQUIREMENT:FORMAL_LETTER",
      "importance": "medium"
    },
    {
      "rubric_id": "00003",
      "property": "The letter is addressed to the Homeowners Association (HOA).",
      "type": "CONTENT_REQUIREMENT:ADDRESSEE",
      "importance": "medium"
    },
    {
      "rubric_id": "00004",
      "property": "The letter explains that the sender has not mowed their lawn on time.",
      "type": "CONTENT_REQUIREMENT:BACKGROUND_CONTEXT:TARDINESS",
      "importance": "medium"
    },
    {
      "rubric_id": "00005",
      "property": "The letter provides a reason for not mowing the lawn, specifically being busy at work.",
      "type": "CONTENT_REQUIREMENT:EXPLANATION:EXCUSE:BUSY",
      "importance": "medium"
    },
    {
      "rubric_id": "00006",
      "property": "The letter discusses that the sender has been in compliance until now.",
      "type": "OPTIONAL_CONTENT:SUPPORTING_EVIDENCE:COMPLIANCE",
      "importance": "low"
    },
    {
      "rubric_id": "00007",
      "property": "The letter requests that the HOA reconsider the fees associated with not mowing the lawn on time.",
      "type": "CONTENT_REQUIREMENT:REQUEST:FEE_WAIVER",
      "importance": "high"
    },
    {
      "rubric_id": "00008",
      "property": "The letter maintains a polite and respectful tone.",
      "type": "CONTENT_REQUIREMENT:FORMALITY:FORMAL",
      "importance": "high"
    },
    {
      "rubric_id": "00009",
      "property": "The letter includes a closing (e.g., \'Sincerely\') and the sender\'s name.",
      "type": "CONTENT_REQUIREMENT:SIGNATURE",
      "importance": "medium"
    }
  ]
}
```

Now write a rubric for the following user prompt. Remember to write only the rubric, NOT response to the prompt.

User prompt:
{prompt}"""


def test_public_method_generate_rubrics(client):
    """Tests the public generate_rubrics method."""
    prompts_df = pd.DataFrame(
        {
            "prompt": [
                "Explain the theory of relativity in one sentence.",
                "Write a short poem about a cat.",
            ]
        }
    )
    data_with_rubrics = client.evals.generate_rubrics(
        src=prompts_df,
        prompt_template=_TEST_RUBRIC_GENERATION_PROMPT,
        rubric_group_name="text_quality_rubrics",
    )

    # Assertions focus on the returned DataFrame
    assert isinstance(data_with_rubrics, pd.DataFrame)
    assert "rubric_groups" in data_with_rubrics.columns
    assert len(data_with_rubrics) == 2

    # Check the structure of the first row's rubric_groups
    first_rubric_group = data_with_rubrics["rubric_groups"][0]
    assert isinstance(first_rubric_group, dict)
    assert "text_quality_rubrics" in first_rubric_group
    assert isinstance(first_rubric_group["text_quality_rubrics"], list)
    assert first_rubric_group["text_quality_rubrics"]
    assert isinstance(first_rubric_group["text_quality_rubrics"][0], types.Rubric)


pytestmark = pytest_helper.setup(
    file=__file__,
    globals_for_file=globals(),
    test_method="evals.generate_rubrics",
)

tests/unit/vertexai/genai/test_evals.py

Lines changed: 58 additions & 66 deletions
@@ -17,9 +17,9 @@
 import os
 import statistics
 from unittest import mock
-import google.auth.credentials
 import warnings

+import google.auth.credentials
 from google.cloud import aiplatform
 import vertexai
 from google.cloud.aiplatform import initializer as aiplatform_initializer
@@ -45,6 +45,16 @@
 pytestmark = pytest.mark.usefixtures("google_auth_mock")


+def _create_content_dump(text: str) -> dict[str, list[genai_types.Content]]:
+    return {
+        "contents": [
+            genai_types.Content(parts=[genai_types.Part(text=text)]).model_dump(
+                mode="json", exclude_none=True
+            )
+        ]
+    }
+
+
 @pytest.fixture
 def mock_api_client_fixture():
     mock_client = mock.Mock(spec=client.Client)
@@ -2709,15 +2719,11 @@ def setup_method(self):
     def test_build_request_payload_basic_filtering_and_fields(self):
         metric = vertexai_genai_types.LLMMetric(
             name="test_quality",
-            prompt_template=(
-                "Eval: {prompt} with {response}. Context: "
-                "{custom_context}. Ref: {reference}"
-            ),
+            prompt_template="Eval: {prompt} with {response}. Context: {custom_context}. Ref: {reference}",
         )
         handler = _evals_metric_handlers.LLMMetricHandler(
             module=self.mock_evals_module, metric=metric
         )
-
         eval_case = vertexai_genai_types.EvalCase(
             prompt=genai_types.Content(
                 parts=[genai_types.Part(text="User prompt text")]
@@ -2734,52 +2740,35 @@ def test_build_request_payload_basic_filtering_and_fields(self):
                     parts=[genai_types.Part(text="Ground truth text")]
                 )
             ),
-            custom_context="Custom context value.",  # pylint: disable=unexpected-keyword-arg
-            extra_field_not_in_template="This should be excluded.",  # pylint: disable=unexpected-keyword-arg
+            custom_context="Custom context value.",
+            extra_field_not_in_template="This should be excluded.",
             eval_case_id="case-123",
         )

         payload = handler._build_request_payload(eval_case=eval_case, response_index=0)

-        expected_json_instance_dict = {
-            "prompt": "User prompt text",
-            "response": "Model response text",
-            "custom_context": "Custom context value.",
-            "reference": "Ground truth text",
+        expected_content_map = {
+            "prompt": _create_content_dump("User prompt text"),
+            "response": _create_content_dump("Model response text"),
+            "custom_context": _create_content_dump("Custom context value."),
+            "reference": _create_content_dump("Ground truth text"),
         }
+        actual_content_map_dict = payload["pointwise_metric_input"]["instance"][
+            "content_map_instance"
+        ]["values"]

-        actual_json_instance_str = payload["pointwise_metric_input"]["instance"][
-            "json_instance"
-        ]
-        actual_json_instance_dict = json.loads(actual_json_instance_str)
-
-        assert actual_json_instance_dict == expected_json_instance_dict
-        assert "extra_field_not_in_template" not in actual_json_instance_dict
-        assert "eval_case_id" not in actual_json_instance_dict
-
-        assert (
-            "custom_output_format_config"
-            not in payload["pointwise_metric_input"]["metric_spec"]
-        )
-        assert (
-            "system_instruction" not in payload["pointwise_metric_input"]["metric_spec"]
-        )
-        assert "autorater_config" not in payload
+        assert actual_content_map_dict == expected_content_map
+        assert "extra_field_not_in_template" not in actual_content_map_dict
+        assert "eval_case_id" not in actual_content_map_dict

     def test_build_request_payload_various_field_types(self):
         metric = vertexai_genai_types.LLMMetric(
-            name="complex_eval",
-            prompt_template=(
-                "P: {prompt}, R: {response}, Hist: {conversation_history}, "
-                "SysInstruct: {system_instruction}, "
-                "DictField: {dict_field}, ListField: {list_field}, "
-                "IntField: {int_field}, BoolField: {bool_field}"
-            ),
+            name="test_various_fields",
+            prompt_template="{prompt}{response}{conversation_history}{system_instruction}{dict_field}{list_field}{int_field}{bool_field}",
         )
         handler = _evals_metric_handlers.LLMMetricHandler(
             module=self.mock_evals_module, metric=metric
         )
-
         eval_case = vertexai_genai_types.EvalCase(
             prompt=genai_types.Content(parts=[genai_types.Part(text="The Prompt")]),
             responses=[
@@ -2804,21 +2793,18 @@ def test_build_request_payload_various_field_types(self):
             system_instruction=genai_types.Content(
                 parts=[genai_types.Part(text="System instructions here.")]
             ),
-            dict_field={  # pylint: disable=unexpected-keyword-arg
-                "key1": "val1",
-                "key2": [1, 2],
-            },
-            list_field=["a", "b", {"c": 3}],  # pylint: disable=unexpected-keyword-arg
-            int_field=42,  # pylint: disable=unexpected-keyword-arg
-            bool_field=True,  # pylint: disable=unexpected-keyword-arg
+            dict_field={"key1": "val1", "key2": [1, 2]},
+            list_field=["a", "b", {"c": 3}],
+            int_field=42,
+            bool_field=True,
         )

         payload = handler._build_request_payload(eval_case=eval_case, response_index=0)
-        actual_json_instance_dict = json.loads(
-            payload["pointwise_metric_input"]["instance"]["json_instance"]
-        )
+        actual_content_map_dict = payload["pointwise_metric_input"]["instance"][
+            "content_map_instance"
+        ]["values"]

-        expected_json_instance_dict = {
+        expected_texts = {
             "prompt": "The Prompt",
             "response": "The Response",
             "conversation_history": "user: Turn 1 user\nmodel: Turn 1 model",
@@ -2828,16 +2814,20 @@ def test_build_request_payload_various_field_types(self):
             "int_field": "42",
             "bool_field": "True",
         }
-        assert actual_json_instance_dict == expected_json_instance_dict
+        expected_content_map = {
+            key: _create_content_dump(text) for key, text in expected_texts.items()
+        }
+
+        assert actual_content_map_dict == expected_content_map

     def test_build_request_payload_optional_metric_configs_set(self):
         metric = vertexai_genai_types.LLMMetric(
-            name="configured_metric",
-            prompt_template="P: {prompt}, R: {response}",
+            name="test_optional_configs",
+            prompt_template="{prompt}{response}",
+            judge_model="gemini-1.5-pro",
+            judge_model_sampling_count=5,
+            judge_model_system_instruction="You are a fair judge.",
             return_raw_output=True,
-            judge_model_system_instruction="Be a fair judge.",
-            judge_model="gemini-pro",
-            judge_model_sampling_count=10,
         )
         handler = _evals_metric_handlers.LLMMetricHandler(
             module=self.mock_evals_module, metric=metric
@@ -2853,23 +2843,25 @@ def test_build_request_payload_optional_metric_configs_set(self):

         payload = handler._build_request_payload(eval_case=eval_case, response_index=0)

-        expected_json_instance = {"prompt": "p", "response": "r"}
-        actual_json_instance = json.loads(
-            payload["pointwise_metric_input"]["instance"]["json_instance"]
-        )
-        assert actual_json_instance == expected_json_instance
+        expected_content_map = {
+            "prompt": _create_content_dump("p"),
+            "response": _create_content_dump("r"),
+        }
+        actual_content_map_dict = payload["pointwise_metric_input"]["instance"][
+            "content_map_instance"
+        ]["values"]
+        assert actual_content_map_dict == expected_content_map

         metric_spec_payload = payload["pointwise_metric_input"]["metric_spec"]
         assert (
-            metric_spec_payload["metric_prompt_template"]
-            == "P: {prompt}, R: {response}"
+            metric_spec_payload["custom_output_format_config"]["return_raw_output"]
+            is True
         )
-        assert metric_spec_payload["custom_output_format_config"]["return_raw_output"]
-        assert metric_spec_payload["system_instruction"] == "Be a fair judge."
+        assert metric_spec_payload["system_instruction"] == "You are a fair judge."

         autorater_config_payload = payload["autorater_config"]
-        assert autorater_config_payload["autorater_model"] == "gemini-pro"
-        assert autorater_config_payload["sampling_count"] == 10
+        assert autorater_config_payload["autorater_model"] == "gemini-1.5-pro"
+        assert autorater_config_payload["sampling_count"] == 5

     def test_merge_with_invalid_prompt_type(self):
         raw_dataset_1 = [
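The assertion changes above track a change in how `LLMMetricHandler._build_request_payload` serializes prompt-template fields: instead of a JSON string under `instance["json_instance"]`, each field is now sent as a `Content` dump under `instance["content_map_instance"]["values"]`. A rough sketch of the payload shape the updated tests assert on; only the keys the tests actually check are shown, everything else is omitted.

```python
# Shape implied by the updated assertions; the literal values mirror the test data.
# Each content-map entry is a genai Content dumped with exclude_none=True,
# wrapped under a "contents" key (see _create_content_dump above).
expected_payload_shape = {
    "pointwise_metric_input": {
        "instance": {
            "content_map_instance": {
                "values": {
                    "prompt": {"contents": [{"parts": [{"text": "User prompt text"}]}]},
                    "response": {"contents": [{"parts": [{"text": "Model response text"}]}]},
                }
            }
        },
        "metric_spec": {
            "system_instruction": "You are a fair judge.",
            "custom_output_format_config": {"return_raw_output": True},
        },
    },
    "autorater_config": {
        "autorater_model": "gemini-1.5-pro",
        "sampling_count": 5,
    },
}
```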

0 commit comments