Why does format reward equal to zero? #124

Open
XavierCHEN34 opened this issue Feb 21, 2025 · 1 comment

Comments

@XavierCHEN34

Thank you for your great work.

Even in the training log you provided, the format reward stays at zero. Why is that?

@Syazvinski

I fixed the issue by updating the "format_reward" function in grpo.py with:

def format_reward(completions, **kwargs):
    """
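    Checks whether the assistant text (everything after the "assistant" marker line)
    contains a <think> block followed by an <answer> block, in that order,
    ignoring the user prompt.
    """
    import html
    import os  # needed for the os.getenv calls in the logging step
    import re
    from datetime import datetime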

    pattern = re.compile(
        r"<think>[\s\S]*?</think>[\s\n\r]*<answer>[\s\S]*?</answer>", 
        re.DOTALL
    )

    rewards = []
    current_time = datetime.now().strftime("%d-%H-%M-%S-%f")

    for completion in completions:
        raw = completion[0]["content"]
        # 1) Separate out assistant portion
        parts = raw.split("\nassistant\n", maxsplit=1)
        assistant_str = parts[1] if len(parts) > 1 else raw

        # 2) Unescape
        assistant_str = html.unescape(assistant_str)

        # 3) Check if it matches the <think> + <answer> pattern in that assistant text
        match_found = bool(pattern.search(assistant_str))
        reward = 1.0 if match_found else 0.0

        # 4) Logging
        if os.getenv("DEBUG_MODE") == "true":
            log_path = os.getenv("LOG_PATH")
            if log_path:
                with open(log_path, "a") as f:
                    f.write(f"------------- {current_time} Format reward: {reward} -------------\n")
                    f.write(f"RAW: {repr(raw)}\n\n")
                    f.write(f"ASSISTANT_STR: {repr(assistant_str)}\n\n")
                    f.write("Pattern found? " + str(match_found) + "\n\n")

        rewards.append(reward)

    return rewards
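
To see which completions this check rewards, here is a minimal sketch that applies steps 1-3 of the function above to single strings. The helper name `check_format` and the sample strings are made up for illustration; only the regex, the `"\nassistant\n"` split, and the `html.unescape` call come from the function above.

```python
import html
import re

# Same pattern as in format_reward above. re.DOTALL is left out because
# [\s\S] already matches newlines, so the flag makes no difference here.
PATTERN = re.compile(r"<think>[\s\S]*?</think>[\s\n\r]*<answer>[\s\S]*?</answer>")


def check_format(raw: str) -> float:
    """Hypothetical helper: steps 1-3 of format_reward for one raw string."""
    # 1) Keep only the text after the assistant marker, if the marker is present.
    parts = raw.split("\nassistant\n", maxsplit=1)
    assistant_str = parts[1] if len(parts) > 1 else raw
    # 2) Turn escaped tags such as &lt;think&gt; back into <think>.
    assistant_str = html.unescape(assistant_str)
    # 3) Reward only a <think> block followed by an <answer> block.
    return 1.0 if PATTERN.search(assistant_str) else 0.0


# Illustrative inputs (assumptions, not taken from a real training log).
samples = {
    "plain": "<think>2+2</think>\n<answer>4</answer>",
    "escaped": "&lt;think&gt;2+2&lt;/think&gt;&lt;answer&gt;4&lt;/answer&gt;",
    "with_prompt": "user\nWhat is 2+2?\nassistant\n<think>2+2</think> <answer>4</answer>",
    "tags_only_in_prompt": "user\nUse <think></think><answer></answer>\nassistant\n4",
    "reversed": "<answer>4</answer><think>2+2</think>",
}
results = {name: check_format(text) for name, text in samples.items()}
print(results)
```

The first three samples get a reward of 1.0. The last two get 0.0: tags that appear only in the prompt are ignored, and an `<answer>` block that comes before `<think>` does not match.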
