Why does format reward equal to zero? #124

XavierCHEN34 · 2025-02-21T14:16:12Z

Thank you for your great work.

It appears that even in the training log you gave, the format reward is zero. Why?

Syazvinski · 2025-02-22T23:56:19Z

I fixed the issue by updating the "format_reward" function in grpo.py with:

def format_reward(completions, **kwargs):
    """
    Checks if the assistant text (after "assistant\n") contains a <think> block
    followed by an <answer> block, ignoring the user prompt.
    """
    # os is needed for the os.getenv calls in the logging step below
    import os, re, html
    from datetime import datetime

    # [\s\S] already matches newlines, so no re.DOTALL flag is needed
    pattern = re.compile(
        r"<think>[\s\S]*?</think>\s*<answer>[\s\S]*?</answer>"
    )

    rewards = []
    current_time = datetime.now().strftime("%d-%H-%M-%S-%f")

    for completion in completions:
        raw = completion[0]["content"]
        # 1) Separate out assistant portion
        parts = raw.split("\nassistant\n", maxsplit=1)
        assistant_str = parts[1] if len(parts) > 1 else raw

        # 2) Unescape
        assistant_str = html.unescape(assistant_str)

        # 3) Check if it matches the <think> + <answer> pattern in that assistant text
        match_found = bool(pattern.search(assistant_str))
        reward = 1.0 if match_found else 0.0

        # 4) Logging
        if os.getenv("DEBUG_MODE") == "true":
            log_path = os.getenv("LOG_PATH")
            if log_path:
                with open(log_path, "a") as f:
                    f.write(f"------------- {current_time} Format reward: {reward} -------------\n")
                    f.write(f"RAW: {repr(raw)}\n\n")
                    f.write(f"ASSISTANT_STR: {repr(assistant_str)}\n\n")
                    f.write("Pattern found? " + str(match_found) + "\n\n")

        rewards.append(reward)

    return rewards
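
As a quick sanity check, the matching steps above can be exercised on a few hand-written strings. This is only a minimal sketch: the sample strings are made up for illustration, and `has_format` is a hypothetical helper name, not something that exists in grpo.py. It repeats the same split, unescape, and search steps for a single string.

```python
import html
import re

# Behaves the same as the pattern in format_reward: <think> block, then <answer> block.
PATTERN = re.compile(r"<think>[\s\S]*?</think>\s*<answer>[\s\S]*?</answer>")


def has_format(raw: str) -> bool:
    """Hypothetical helper: split off the assistant part, unescape, then search."""
    parts = raw.split("\nassistant\n", maxsplit=1)
    assistant_str = parts[1] if len(parts) > 1 else raw
    return bool(PATTERN.search(html.unescape(assistant_str)))


# Made-up inputs, one per case the function is meant to handle.
samples = {
    "plain": "<think>2 + 2</think>\n<answer>4</answer>",
    "escaped": "&lt;think&gt;2 + 2&lt;/think&gt; &lt;answer&gt;4&lt;/answer&gt;",
    "with_prompt": "user\nWhat is 2 + 2?\nassistant\n<think>sum</think><answer>4</answer>",
    "answer_first": "<answer>4</answer><think>2 + 2</think>",
    "no_think": "<answer>4</answer>",
}

results = {name: has_format(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

Because the check uses `search` rather than `fullmatch`, extra text before or after the two blocks does not prevent a match; only the order (think first, then answer) matters.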
