zero accuracy on mmlu_generative #2279

Open
Luodian opened this issue Sep 5, 2024 · 12 comments
Labels
bug Something isn't working.

Comments

Luodian commented Sep 5, 2024

Hi, thanks for providing such a wonderful evaluation toolkit.

I was wondering why evaluation on mmlu_generative returns 0 accuracy for every model I try (pythia, qwen).

As I understand it, this is a generative version of mmlu that can be used to evaluate base/instruct models by matching the model's output against a formatted target answer: "{{['(A)', '(B)', '(C)', '(D)'][answer]}}"
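For reference, the `{{ ... }}` target template quoted above amounts to a list lookup on the document's `answer` index. A plain-Python equivalent is sketched below; the helper name `render_target` and the example document are illustrative assumptions, not harness code.

```python
# Labels copied from the target template quoted in the post above.
LABELS = ["(A)", "(B)", "(C)", "(D)"]

def render_target(doc):
    """Map the integer `answer` field of a document to its target label."""
    return LABELS[doc["answer"]]

# Hypothetical document whose correct choice is the second option.
doc = {"answer": 1}
target = render_target(doc)
print(target)

# An exact-match metric compares strings literally, so a bare letter
# does not equal a parenthesized label.
print("B" == target)
```

Under this template a model would have to emit the parenthesized form for an exact match to succeed.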

My command:

python3 -m accelerate.commands.launch --num_processes 8 --main_process_port 12399 lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
    --tasks mmlu_generative \
    --batch_size 32 \
    --log_samples \
    --output_path ./logs/

Results:

hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 32
|                 Tasks                 |Version|Filter|n-shot|  Metric   |   |Value|   |Stderr|
|---------------------------------------|------:|------|-----:|-----------|---|----:|---|-----:|
|mmlu (generative)                      |      2|none  |      |exact_match|↑  |    0|±  |     0|
|  - formal_logic                       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_european_history       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_us_history             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_world_history          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - international_law                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - jurisprudence                      |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - logical_fallacies                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - moral_disputes                     |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - moral_scenarios                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - philosophy                         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - prehistory                         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_law                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - world_religions                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - business_ethics                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - clinical_knowledge                 |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_medicine                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - global_facts                       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - human_aging                        |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - management                         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - marketing                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - medical_genetics                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - miscellaneous                      |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - nutrition                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_accounting            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_medicine              |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - virology                           |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - econometrics                       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_geography              |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_government_and_politics|      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_macroeconomics         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_microeconomics         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_psychology             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - human_sexuality                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_psychology            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - public_relations                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - security_studies                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - sociology                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - us_foreign_policy                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - abstract_algebra                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - anatomy                            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - astronomy                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_biology                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_chemistry                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_computer_science           |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_mathematics                |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_physics                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - computer_security                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - conceptual_physics                 |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - electrical_engineering             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - elementary_mathematics             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_biology                |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_chemistry              |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_computer_science       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_mathematics            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_physics                |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_statistics             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - machine_learning                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|

|     Groups      |Version|Filter|n-shot|  Metric   |   |Value|   |Stderr|
|-----------------|------:|------|------|-----------|---|----:|---|-----:|
|mmlu (generative)|      2|none  |      |exact_match|↑  |    0|±  |     0|
baberabb added the "asking questions" label (For asking for clarification / support on library usage.) Sep 5, 2024
Contributor

baberabb commented Sep 5, 2024

I would look at the generations in the samples file and also add some few-shot examples to the context (say --num_fewshot 5) to prompt the model with the desired format. You might have a bit more luck, but pythia-160m is probably too small to produce coherent generations.
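One way to follow this suggestion is to scan the per-sample records written by --log_samples. The sketch below assumes each line is a JSON object with the `target`, `filtered_resps` and `exact_match` fields that appear in the sample records posted later in this thread, and that `filtered_resps` holds plain strings. The helper name `find_mismatches` and the inline demo records are made up for illustration.

```python
import json

def find_mismatches(lines):
    """Return (target, filtered_response) pairs for samples scored 0."""
    mismatches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("exact_match", 0.0) == 0.0:
            mismatches.append((record["target"], record["filtered_resps"][0]))
    return mismatches

# Made-up records using the same field names as the logs in this thread.
demo = [
    '{"target": "B", "filtered_resps": [" B"], "exact_match": 0.0}',
    '{"target": "C", "filtered_resps": ["C"], "exact_match": 1.0}',
]

for target, resp in find_mismatches(demo):
    # repr() makes leading or trailing whitespace visible
    print(f"target={target!r} response={resp!r}")
```

For a real run, the lines would come from the samples file under the --output_path directory; the exact file name depends on the harness version.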

Author

Luodian commented Sep 5, 2024

I think it's pretty weird, and it may not be related to in-context learning. I also evaluated Qwen/Qwen2-0.5B, and it also gets 0 accuracy on mmlu_generative.

I also tested mmlu_pro, which is also a generative task, and it gets normal accuracy.

hf (pretrained=Qwen/Qwen2-0.5B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|       Tasks       |Version|    Filter    |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_main_zeroshot |      1|none          |     0|acc        |↑  |0.2857|±  |0.0214|
|                   |       |none          |     0|acc_norm   |↑  |0.2857|±  |0.0214|
|mmlu_pro           |      1|custom-extract|      |exact_match|↑  |0.1444|±  |0.0032|
| - biology         |      0|custom-extract|     5|exact_match|↑  |0.2483|±  |0.0161|
| - business        |      0|custom-extract|     5|exact_match|↑  |0.1166|±  |0.0114|
| - chemistry       |      0|custom-extract|     5|exact_match|↑  |0.1025|±  |0.0090|
| - computer_science|      0|custom-extract|     5|exact_match|↑  |0.1195|±  |0.0160|
| - economics       |      0|custom-extract|     5|exact_match|↑  |0.1979|±  |0.0137|
| - engineering     |      0|custom-extract|     5|exact_match|↑  |0.0918|±  |0.0093|
| - health          |      0|custom-extract|     5|exact_match|↑  |0.1467|±  |0.0124|
| - history         |      0|custom-extract|     5|exact_match|↑  |0.1706|±  |0.0193|
| - law             |      0|custom-extract|     5|exact_match|↑  |0.1317|±  |0.0102|
| - math            |      0|custom-extract|     5|exact_match|↑  |0.1288|±  |0.0091|
| - other           |      0|custom-extract|     5|exact_match|↑  |0.1591|±  |0.0120|
| - philosophy      |      0|custom-extract|     5|exact_match|↑  |0.1423|±  |0.0157|
| - physics         |      0|custom-extract|     5|exact_match|↑  |0.1101|±  |0.0087|
| - psychology      |      0|custom-extract|     5|exact_match|↑  |0.2268|±  |0.0148|

| Groups |Version|    Filter    |n-shot|  Metric   |   |Value |   |Stderr|
|--------|------:|--------------|------|-----------|---|-----:|---|-----:|
|mmlu_pro|      1|custom-extract|      |exact_match|↑  |0.1444|±  |0.0032|

Author

Luodian commented Sep 5, 2024

Qwen2-0.5B-Instruct on mmlu_generative.

hf (pretrained=Qwen/Qwen2-0.5B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|                 Tasks                 |Version|Filter|n-shot|  Metric   |   |Value|   |Stderr|
|---------------------------------------|------:|------|-----:|-----------|---|----:|---|-----:|
|mmlu (generative)                      |      2|none  |      |exact_match|↑  |    0|±  |     0|
|  - formal_logic                       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_european_history       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_us_history             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_world_history          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - international_law                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - jurisprudence                      |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - logical_fallacies                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - moral_disputes                     |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - moral_scenarios                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - philosophy                         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - prehistory                         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_law                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - world_religions                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - business_ethics                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - clinical_knowledge                 |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_medicine                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - global_facts                       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - human_aging                        |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - management                         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - marketing                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - medical_genetics                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - miscellaneous                      |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - nutrition                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_accounting            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_medicine              |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - virology                           |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - econometrics                       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_geography              |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_government_and_politics|      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_macroeconomics         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_microeconomics         |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_psychology             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - human_sexuality                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - professional_psychology            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - public_relations                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - security_studies                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - sociology                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - us_foreign_policy                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - abstract_algebra                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - anatomy                            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - astronomy                          |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_biology                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_chemistry                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_computer_science           |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_mathematics                |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - college_physics                    |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - computer_security                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - conceptual_physics                 |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - electrical_engineering             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - elementary_mathematics             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_biology                |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_chemistry              |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_computer_science       |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_mathematics            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_physics                |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - high_school_statistics             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|  - machine_learning                   |      2|none  |     0|exact_match|↑  |    0|±  |     0|

|     Groups      |Version|Filter|n-shot|  Metric   |   |Value|   |Stderr|
|-----------------|------:|------|------|-----------|---|----:|---|-----:|
|mmlu (generative)|      2|none  |      |exact_match|↑  |    0|±  |     0|

Contributor

baberabb commented Sep 6, 2024

I'll take a look! My guess is a bug in the answer extraction.

baberabb added the "bug" label (Something isn't working.) and removed the "asking questions" label (For asking for clarification / support on library usage.) Sep 6, 2024
@AishaAlaagib

Hello, I am getting a similar result (0 for all subtasks) and am wondering whether you have figured it out.

@1436033631

Hello, I also have this error while using the mmlu_generative task to benchmark the llama3 model.

Command:

python3 main.py \
	--model hf \
	--model_args pretrained=model-path \
	--tasks mmlu_humanities_generative \
	--limit 3 \
	--output_path output/ \
	--write_out

Result:

|           Tasks            |Version|Filter|n-shot|  Metric   |   |Value|   |Stderr|
|----------------------------|------:|------|-----:|-----------|---|----:|---|-----:|
|formal_logic                |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|high_school_european_history|      2|none  |     0|exact_match|↑  |    0|±  |     0|
|high_school_us_history      |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|high_school_world_history   |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|international_law           |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|jurisprudence               |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|logical_fallacies           |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|moral_disputes              |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|moral_scenarios             |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|philosophy                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|prehistory                  |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|professional_law            |      2|none  |     0|exact_match|↑  |    0|±  |     0|
|world_religions             |      2|none  |     0|exact_match|↑  |    0|±  |     0|

I also tried to dump some intermediate results after adding some logging:

a) The prompt input text (from a print added to the generate_until API in lm_eval/models/huggingface.py):

The following are multiple choice questions (with answers) about world religions.

Which of the following plays the most significant role in forming a child's political views?
A. The geographical area in which the child grows up
B. The child's family
C. The media to which the child is exposed
D. The child's religion
Answer:

b) LLM response from self._model_generate:

The child's religion

The response looks normal, but the value of exact_match in the final result table is always 0.

Could you please help take a look? Thanks


AishaAlaagib commented Oct 31, 2024 via email

Contributor

RawthiL commented Oct 31, 2024

It is a bug in the extraction filtering. Take a look at this log:

{"doc_id": 9, "doc": {"question": "According to Kant, morality requires us to:", "subject": "philosophy", "choices": ["perform the action that leads to the greatest total happiness.", "act only on maxims that we can will to become universal laws.", "behave only in such a way as a perfectly virtuous person would behave.", "place the interests of others above the interests of ourselves."], "answer": 1}, "target": "B", "arguments": {"gen_args_0": {"arg_0": "The following are multiple choice questions (with answers) about philosophy.\n\nPsychological egoism is:\nA. an ethical theory about how we ought to behave.\nB. a generalization concerning the way people tend to behave.\nC. a claim about human nature and the ways people are capable of behaving.\nD. none of the above.\nAnswer: C\n\nAccording to Moore’s “ideal utilitarianism,” the right action is the one that brings about the greatest amount of:\nA. pleasure.\nB. happiness.\nC. good.\nD. virtue.\nAnswer: C\n\nAccording to d'Holbach, people always act according to _____.\nA. free choices\nB. dictates of the soul\nC. necessary natural laws\nD. undetermined will\nAnswer: C\n\nAccording to Kant, morality requires us to:\nA. perform the action that leads to the greatest total happiness.\nB. act only on maxims that we can will to become universal laws.\nC. behave only in such a way as a perfectly virtuous person would behave.\nD. place the interests of others above the interests of ourselves.\nAnswer:", "arg_1": {"until": ["</s>", "\n"]}}}, "resps": [[" B"]], "filtered_resps": [" B"], "doc_hash": "c5177394044574b9c8f03867fc2e5db56e8e8904af717f33f6701af2f62c4b17", "prompt_hash": "18cd89493222e9a9fe80fd0b2beaf39dffc9abe61ff3abeb1ad50d9d33ac731c", "target_hash": "df7e70e5021544f4834bbee64a9e3789febc4be81470df629cad6ddb03320a5c", "exact_match": 0.0}

It returns "exact_match": 0.0 because "filtered_resps": [" B"] is not equal to "target": "B". Note the leading space in the filtered answer. This is a common issue, and I also observed it in BBH.
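The mismatch can be reproduced outside the harness with a literal string comparison. This is a minimal sketch: the `exact_match` function here is a stand-in written for illustration, not the harness's metric implementation.

```python
def exact_match(prediction: str, target: str) -> float:
    """Stand-in metric: 1.0 only when the two strings are identical."""
    return 1.0 if prediction == target else 0.0

raw_response = " B"   # value of "filtered_resps" in the log above
target = "B"          # value of "target" in the log above

before = exact_match(raw_response, target)          # leading space kept
after = exact_match(raw_response.strip(), target)   # whitespace removed

print(before)  # 0.0
print(after)   # 1.0
```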

If we modify the task and templates like this:

Files and changes:
  • _mmlu.yaml
group: mmlu_generative
group_alias: mmlu (generative)
task:
  - group: stem
    task:
      - mmlu_stem_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
  - group: other
    task:
      - mmlu_other_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
  - group: social sciences
    task:
      - mmlu_social_sciences_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
  - group: humanities
    task:
      - mmlu_humanities_generative
    aggregate_metric_list:
      - metric: exact_match
        weight_by_size: True
        filter_list: get_response
aggregate_metric_list:
  - aggregation: mean
    metric: exact_match
    weight_by_size: True
    filter_list: get_response
metadata:
  version: 2

  • _default_template_yaml
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: generate_until
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_target: "{{['A', 'B', 'C', 'D'][answer]}}"
generation_kwargs:
  until:
    - "</s>"
    - "\n"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
filter_list:
  - name: get_response
    filter:
      # Keep only the text before the first line break
      - function: "regex"
        regex_pattern: "^(.*?)(?=\\n|$)"
      # Remove leading whitespace
      - function: remove_whitespace
      # Drop trailing whitespace and line breaks
      - function: "regex"
        regex_pattern: "^(.*?)\\s*$"
      - function: take_first
metadata:
  version: 2.0
dataset_kwargs:
  trust_remote_code: true
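The effect of the `get_response` filter chain above can be approximated in plain Python. This is a sketch for illustration only, not the harness's filter implementation: the function name mirrors the filter name, the two regexes are copied from the YAML, `lstrip()` stands in for `remove_whitespace` (described in the config as removing leading white space), and returning the first element stands in for `take_first`.

```python
import re

# Patterns copied from the filter_list in the YAML above.
FIRST_LINE = re.compile(r"^(.*?)(?=\n|$)")
TRAILING_WS = re.compile(r"^(.*?)\s*$")

def get_response(resps):
    """Clean each raw response, then keep only the first one."""
    cleaned = []
    for resp in resps:
        # 1. keep only the text before the first line break
        match = FIRST_LINE.search(resp)
        text = match.group(1) if match else resp
        # 2. remove leading whitespace
        text = text.lstrip()
        # 3. drop trailing whitespace
        match = TRAILING_WS.search(text)
        text = match.group(1) if match else text
        cleaned.append(text)
    # 4. take the first response
    return cleaned[0]

print(repr(get_response([" B"])))
print(repr(get_response([" B \nSome extra text"])))
```

Both calls return the bare letter, which is the form the `doc_to_target` template in this config produces.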

We will get the expected result:

{"doc_id": 9, "doc": {"question": "According to Kant, morality requires us to:", "subject": "philosophy", "choices": ["perform the action that leads to the greatest total happiness.", "act only on maxims that we can will to become universal laws.", "behave only in such a way as a perfectly virtuous person would behave.", "place the interests of others above the interests of ourselves."], "answer": 1}, "target": "B", "arguments": {"gen_args_0": {"arg_0": "The following are multiple choice questions (with answers) about philosophy.\n\nPsychological egoism is:\nA. an ethical theory about how we ought to behave.\nB. a generalization concerning the way people tend to behave.\nC. a claim about human nature and the ways people are capable of behaving.\nD. none of the above.\nAnswer: C\n\nAccording to Moore’s “ideal utilitarianism,” the right action is the one that brings about the greatest amount of:\nA. pleasure.\nB. happiness.\nC. good.\nD. virtue.\nAnswer: C\n\nAccording to d'Holbach, people always act according to _____.\nA. free choices\nB. dictates of the soul\nC. necessary natural laws\nD. undetermined will\nAnswer: C\n\nAccording to Kant, morality requires us to:\nA. perform the action that leads to the greatest total happiness.\nB. act only on maxims that we can will to become universal laws.\nC. behave only in such a way as a perfectly virtuous person would behave.\nD. place the interests of others above the interests of ourselves.\nAnswer:", "arg_1": {"until": ["</s>", "\n"]}}}, "resps": [[" B"]], "filtered_resps": ["B"], "doc_hash": "c5177394044574b9c8f03867fc2e5db56e8e8904af717f33f6701af2f62c4b17", "prompt_hash": "18cd89493222e9a9fe80fd0b2beaf39dffc9abe61ff3abeb1ad50d9d33ac731c", "target_hash": "df7e70e5021544f4834bbee64a9e3789febc4be81470df629cad6ddb03320a5c", "exact_match": 1.0}

See "exact_match": 1.0 at the end of the line.

I tested this on Qwen2.5-32B-Instruct-AWQ (only 50 samples). The accuracy changed from all zeros to:

|      Groups      |Version|   Filter   |n-shot|  Metric   |   |Value |   |Stderr|
|------------------|-------|------------|------|-----------|---|-----:|---|-----:|
|mmlu (generative) |      2|get_response|      |exact_match|↑  |0.8351|±  |0.0067|
| - humanities     |    N/A|get_response|      |exact_match|↑  |0.8523|±  |0.0136|
| - other          |    N/A|get_response|      |exact_match|↑  |0.8231|±  |0.0144|
| - social sciences|    N/A|get_response|      |exact_match|↑  |0.8700|±  |0.0132|
| - stem           |    N/A|get_response|      |exact_match|↑  |0.8095|±  |0.0122|

This is the same problem I observed in BBH. I'm planning on creating a PR later.

Edit: Added 'take_first' to the filter. It changes nothing here in terms of results, but it breaks exact match if multiple words are going to be matched.

@1436033631

Hi RawthiL,
Thanks for pointing out the missing config in the YAML file. However, our model's output looks different: after applying the filter config patch above, the response is "State-sponsored prayer during school hours is prohibited, but voluntary prayer by student groups before school is allowed".

We can see that the output matches the text of choice D, but exact_match is 0 because the filtered response is not equal to "D". Do you have any experience handling this kind of response in the filter?

Thanks

{"doc_id": 0, "doc": {"question": "Which of the following best describes the balance the Supreme Court has struck between the establishment clause and the free-exercise clause?", "subject": "high_school_government_and_politics", "choices": ["Freedom of speech is protected except in certain situations, such as yelling \"fire\" in a crowded theater.", "Once a church has been recognized by the federal government, its tax-exempt status can never be revoked.", "Once Congress has created an administrative agency, that agency can be dissolved only by a constitutional amendment.", "State-sponsored prayer during school hours is prohibited, but voluntary prayer by student groups before school is allowed."], "answer": 3}, "target": "D", "arguments": {"gen_args_0": {"arg_0": "The following are multiple choice questions (with answers) about human aging.\n\nWhich of the following best describes the balance the Supreme Court has struck between the establishment clause and the free-exercise clause?\nA. Freedom of speech is protected except in certain situations, such as yelling \"fire\" in a crowded theater.\nB. Once a church has been recognized by the federal government, its tax-exempt status can never be revoked.\nC. Once Congress has created an administrative agency, that agency can be dissolved only by a constitutional amendment.\nD. State-sponsored prayer during school hours is prohibited, but voluntary prayer by student groups before school is allowed.\nAnswer:", "arg_1": {"until": ["</s>", "\n"]}}}, "resps": [[" State-sponsored prayer during school hours is prohibited, but voluntary prayer by student groups before school is allowed."]], "filtered_resps": ["State-sponsored prayer during school hours is prohibited, but voluntary prayer by student groups before school is allowed."], "doc_hash": "8f63cebd5269df80a7f6386afb6ea7266a908ffe6b72f431cf962d8dc3948358", "prompt_hash": "f63bb19b3a6c11a40c8939643328509dfd97d1b172f25a68894559a9689ba51d", "target_hash": "3f39d5c348e5b79d06e842c114e6cc571583bbf44e4b0ebfda1a01ec05745d43", "exact_match": 0.0}

Contributor

RawthiL commented Nov 1, 2024

It looks like you are doing zero-shot (presenting no examples prior to asking the question). As a result, the model is not conditioned to respond with a letter (it gives an explicit response instead), and hence the exact match fails.
There is no way to solve that with an exact match; you will need to create a new test definition for zero-shot and probably code a different metric (like a quasi-exact-match).
If there is no important reason for you to use zero-shot, I would suggest adding --num_fewshot 3.
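The "quasi-exact-match" idea mentioned above is not spelled out in this thread. One hypothetical way to sketch it is to accept either the target letter or the full text of the target choice. The function name, the normalization rules and the `choices` argument are all assumptions made for illustration; none of this is an existing harness metric.

```python
def quasi_exact_match(response, choices, target_letter):
    """Hypothetical metric: 1.0 if the response is the target letter
    or the text of the target choice, ignoring case, surrounding
    whitespace and a trailing period."""
    letters = "ABCD"
    normalized = response.strip().rstrip(".").lower()
    if normalized == target_letter.lower():
        return 1.0
    for letter, choice in zip(letters, choices):
        if normalized == choice.strip().rstrip(".").lower():
            # The response repeats one of the choices verbatim.
            return 1.0 if letter == target_letter else 0.0
    return 0.0

# Choices and response taken from the sample record posted above.
choices = [
    "Freedom of speech is protected except in certain situations, such as yelling \"fire\" in a crowded theater.",
    "Once a church has been recognized by the federal government, its tax-exempt status can never be revoked.",
    "Once Congress has created an administrative agency, that agency can be dissolved only by a constitutional amendment.",
    "State-sponsored prayer during school hours is prohibited, but voluntary prayer by student groups before school is allowed.",
]
response = " State-sponsored prayer during school hours is prohibited, but voluntary prayer by student groups before school is allowed."

print(quasi_exact_match(response, choices, "D"))
print(quasi_exact_match(" D", choices, "D"))
print(quasi_exact_match(" B", choices, "D"))
```

A real metric would also need to decide how to treat responses such as "D. State-sponsored prayer ...", which this sketch scores as 0.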

@1436033631

Got it, many thanks for your help.

Contributor

RawthiL commented Nov 18, 2024

Opened a PR to fix this: #2503
