Support long context dataset accuracy measurement. #230

Lumosis · 2025-03-21T17:44:22Z

The result should look like this:


Results

{'rougeL': 6.165264881957181, 'exact_match': 0.0, 'gen_len': 59242, 'gen_num': 50}

vipannalla · 2025-03-21T18:01:44Z

The results you pasted -- are they from an actual benchmark run on 405b? Can you paste the full results (as a screenshot or paste link)?

Lumosis · 2025-03-21T18:33:48Z

> The results you pasted -- are they from an actual benchmark run on 405b? Can you paste the full results (as a screenshot or paste link)?

No, this is from a mock run. I am working on the actual benchmarking.

vipannalla · 2025-03-21T18:38:11Z

Sounds good.

mailvijayasingh · 2025-03-21T21:25:01Z

benchmarks/eval_accuracy_longcontext.py

+  return {"exact_match": round(score, 2)}
+
+
+def qa_em(label, pred):


Please add a docstring to each function.
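For illustration, here is a minimal sketch of what documented exact-match helpers could look like. Only the `qa_em(label, pred)` signature and the `{"exact_match": round(score, 2)}` return shape come from the diff above; the function bodies, the `normalize_answer` and `exact_match_score` names, and the SQuAD-style normalization are assumptions, not the code in this PR.

```python
import re
import string


def normalize_answer(text):
  """Lowercases text, strips punctuation and articles, and collapses whitespace.

  Hypothetical helper: the normalization used in the PR is not shown.
  """
  text = text.lower()
  text = "".join(ch for ch in text if ch not in string.punctuation)
  text = re.sub(r"\b(a|an|the)\b", " ", text)
  return " ".join(text.split())


def qa_em(label, pred):
  """Scores one prediction against one reference answer.

  Args:
    label: Reference answer string.
    pred: Model-generated answer string.

  Returns:
    1.0 if the normalized strings are identical, otherwise 0.0.
  """
  return float(normalize_answer(label) == normalize_answer(pred))


def exact_match_score(labels, preds):
  """Computes the exact-match percentage over paired labels and predictions.

  Hypothetical wrapper: assumes the score is reported on a 0-100 scale.
  """
  if not labels:
    return {"exact_match": 0.0}
  matches = sum(qa_em(label, pred) for label, pred in zip(labels, preds))
  score = 100.0 * matches / len(labels)
  return {"exact_match": round(score, 2)}
```

With docstrings in this form, a reader can tell the argument order and the score scale without reading the function bodies.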

benchmarks/benchmark_serving.py

vipannalla

Looks good

Lumosis requested a review from mailvijayasingh March 21, 2025 17:44

Lumosis requested a review from vipannalla as a code owner March 21, 2025 17:44

Lumosis force-pushed the lihao/accuracy branch from fb8bc6b to 353a7c5 Compare March 21, 2025 17:45

Lumosis force-pushed the lihao/accuracy branch from 353a7c5 to 07a443b Compare March 21, 2025 18:32

Lumosis force-pushed the lihao/accuracy branch 5 times, most recently from 91fbb46 to 877f8b2 Compare March 21, 2025 20:44

mailvijayasingh reviewed Mar 21, 2025

Lumosis force-pushed the lihao/accuracy branch 2 times, most recently from 4a56d13 to d0ef346 Compare March 24, 2025 23:43

Support long context dataset accuracy measurement

0f6b21d

Lumosis force-pushed the lihao/accuracy branch from d0ef346 to 0f6b21d Compare March 24, 2025 23:48

vipannalla approved these changes Mar 25, 2025

Lumosis merged commit 351462e into main Mar 25, 2025
2 of 3 checks passed

Lumosis deleted the lihao/accuracy branch March 25, 2025 21:33