- Task: Run a test on CQA (CommonsenseQA) to measure how accurately GPT-4 answers, how well it generates supporting evidence, and whether the generated evidence improves its answer accuracy (see the sketch below).
- Dataset: https://huggingface.co/datasets/tau/commonsense_qa
- Reference: https://www.notion.so/huaxiulab/Evidence-ranking-by-uncertainty-a9ff5ccc4f274adea7bccd9366f30560
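A minimal evaluation sketch, assuming the `validation` split (the test split has no answer keys), the `openai` Python client with `OPENAI_API_KEY` set, and a one-sentence self-generated evidence step; the helper names (`format_question`, `ask`, `accuracy`) and prompt wording are placeholders, not taken from the reference.

```python
import re
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def format_question(ex):
    # CommonsenseQA stores choices as parallel label/text lists
    choices = "\n".join(
        f"{label}. {text}"
        for label, text in zip(ex["choices"]["label"], ex["choices"]["text"])
    )
    return f"Question: {ex['question']}\n{choices}"

def ask(prompt, model="gpt-4"):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def accuracy(examples, with_evidence=False):
    correct = 0
    for ex in examples:
        q = format_question(ex)
        if with_evidence:
            # Ask the model for evidence first, then answer conditioned on it
            evidence = ask(q + "\nState one sentence of evidence that helps answer this question.")
            q += f"\nEvidence: {evidence}"
        out = ask(q + "\nReply with the letter of the correct choice only.")
        match = re.search(r"[A-E]", out or "")
        correct += bool(match) and match.group(0) == ex["answerKey"]
    return correct / len(examples)

val = load_dataset("tau/commonsense_qa", split="validation").select(range(100))
print("direct answers:   ", accuracy(val))
print("evidence-assisted:", accuracy(val, with_evidence=True))
```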
- Task: Finetune GPT-2
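A minimal finetuning sketch, assuming the CommonsenseQA train split serialized as plain question/choices/answer text and trained with the Hugging Face `Trainer`; the hyperparameters are illustrative, not tuned.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def to_text(ex):
    # Serialize each example as one training string
    choices = " ".join(
        f"{l}. {t}" for l, t in zip(ex["choices"]["label"], ex["choices"]["text"])
    )
    return {"text": f"Question: {ex['question']} Choices: {choices} Answer: {ex['answerKey']}"}

ds = load_dataset("tau/commonsense_qa", split="train").map(to_text)
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-cqa",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```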
- Task: Use Qwen2-7B instead
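Swapping to Qwen2-7B is mostly a checkpoint change, but a 7B model typically needs bf16 plus a parameter-efficient method such as LoRA to finetune on a single GPU; the LoRA settings below are assumptions, not tuned values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B", torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative LoRA config: train small adapters instead of all 7B weights
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```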
- Task: Adjust the dataset
- Dataset: https://huggingface.co/datasets/allenai/openbookqa
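The main schema adjustment is the question field name: OpenBookQA's `main` config uses `question_stem` where CommonsenseQA uses `question`; the choices and `answerKey` keep the same parallel-list shape.

```python
from datasets import load_dataset

obqa = load_dataset("allenai/openbookqa", "main", split="validation")

def format_question(ex):
    # OpenBookQA: "question_stem" replaces CommonsenseQA's "question" field
    choices = "\n".join(
        f"{label}. {text}"
        for label, text in zip(ex["choices"]["label"], ex["choices"]["text"])
    )
    return f"Question: {ex['question_stem']}\n{choices}"

print(format_question(obqa[0]))
print("gold:", obqa[0]["answerKey"])
```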
- Task: Switch back to GPT-4o
- Task: Use the previous dataset
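Reusing the `ask` and `format_question` helpers from the first sketch, the model switch is a single-argument change:

```python
# Reuse the earlier helpers; only the model identifier changes
answer = ask(format_question(val[0]), model="gpt-4o")
```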
- Task: List the answer alone (answer-only prompting; templates for all three conditions are sketched after the next item)
- Task: Employ CoT and CoT-with-evidence prompting
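Hypothetical prompt templates for the three conditions (answer-only, CoT, CoT with evidence); the exact wording is an assumption, and the usage line reuses `format_question` and `val` from the first sketch.

```python
ANSWER_ONLY = (
    "{question}\n"
    "Reply with the letter of the correct choice only."
)

COT = (
    "{question}\n"
    "Think step by step, then give the letter of the correct choice "
    "on the final line as 'Answer: <letter>'."
)

COT_WITH_EVIDENCE = (
    "{question}\n"
    "First state one or two sentences of evidence relevant to the question, "
    "then think step by step, and finish with 'Answer: <letter>'."
)

prompt = COT.format(question=format_question(val[0]))
```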
- Task: Calculate the length of the model outputs
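A small sketch for the length measurement, assuming "length" means output tokens counted with a recent `tiktoken` (word counts would work the same way):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def length_stats(outputs):
    # Token counts per response, plus summary stats for comparing conditions
    tokens = [len(enc.encode(o)) for o in outputs]
    return {"mean": sum(tokens) / len(tokens),
            "min": min(tokens), "max": max(tokens)}
```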
- Task: Apply `temperature=0`
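Setting `temperature=0` in the chat API makes decoding greedy, so outputs are (near-)deterministic and reruns are comparable; `client` and `prompt` are from the earlier sketches:

```python
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # greedy decoding: pick the most likely token each step
)
print(resp.choices[0].message.content)
```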
- Task: Add two more baselines