- Task: Run a test on CQA (CommonsenseQA) to measure how accurately GPT-4 answers, how well it generates supporting evidence, and whether the generated evidence improves its answer accuracy (see the sketch below).
- Dataset: https://huggingface.co/datasets/tau/commonsense_qa
- Reference: https://www.notion.so/huaxiulab/Evidence-ranking-by-uncertainty-a9ff5ccc4f274adea7bccd9366f30560
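A minimal evaluation sketch, assuming the `validation` split (the test split has no answer keys), the `openai` Python client with `OPENAI_API_KEY` set, and a one-sentence self-generated evidence step; the helper names (`format_question`, `ask`, `accuracy`) and prompt wording are placeholders, not taken from the reference.

```python
import re
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def format_question(ex):
    # CommonsenseQA stores choices as parallel label/text lists
    choices = "\n".join(
        f"{label}. {text}"
        for label, text in zip(ex["choices"]["label"], ex["choices"]["text"])
    )
    return f"Question: {ex['question']}\n{choices}"

def ask(prompt, model="gpt-4"):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def accuracy(examples, with_evidence=False):
    correct = 0
    for ex in examples:
        q = format_question(ex)
        if with_evidence:
            # Ask the model for evidence first, then answer conditioned on it
            evidence = ask(q + "\nState one sentence of evidence that helps answer this question.")
            q += f"\nEvidence: {evidence}"
        out = ask(q + "\nReply with the letter of the correct choice only.")
        match = re.search(r"[A-E]", out or "")
        correct += bool(match) and match.group(0) == ex["answerKey"]
    return correct / len(examples)

val = load_dataset("tau/commonsense_qa", split="validation").select(range(100))
print("direct answers:   ", accuracy(val))
print("evidence-assisted:", accuracy(val, with_evidence=True))
```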
- Task: Finetune GPT-2
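A minimal finetuning sketch, assuming the CommonsenseQA train split serialized as plain question/choices/answer text and trained with the Hugging Face `Trainer`; the hyperparameters are illustrative, not tuned.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def to_text(ex):
    # Serialize each example as one training string
    choices = " ".join(
        f"{l}. {t}" for l, t in zip(ex["choices"]["label"], ex["choices"]["text"])
    )
    return {"text": f"Question: {ex['question']} Choices: {choices} Answer: {ex['answerKey']}"}

ds = load_dataset("tau/commonsense_qa", split="train").map(to_text)
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-cqa",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```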
- Task: Use Qwen2-7B instead
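Swapping to Qwen2-7B is mostly a checkpoint change, but a 7B model typically needs bf16 plus a parameter-efficient method such as LoRA to finetune on a single GPU; the LoRA settings below are assumptions, not tuned values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B", torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative LoRA config: train small adapters instead of all 7B weights
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```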
- Task: Adjust the dataset
- Dataset: https://huggingface.co/datasets/allenai/openbookqa
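The main schema adjustment is the question field name: OpenBookQA's `main` config uses `question_stem` where CommonsenseQA uses `question`; the choices and `answerKey` keep the same parallel-list shape.

```python
from datasets import load_dataset

obqa = load_dataset("allenai/openbookqa", "main", split="validation")

def format_question(ex):
    # OpenBookQA: "question_stem" replaces CommonsenseQA's "question" field
    choices = "\n".join(
        f"{label}. {text}"
        for label, text in zip(ex["choices"]["label"], ex["choices"]["text"])
    )
    return f"Question: {ex['question_stem']}\n{choices}"

print(format_question(obqa[0]))
print("gold:", obqa[0]["answerKey"])
```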
- Task: Switch back to GPT-4o
- Task: Use the previous dataset
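Reusing the `ask` and `format_question` helpers from the first sketch, the model switch is a single-argument change:

```python
# Reuse the earlier helpers; only the model identifier changes
answer = ask(format_question(val[0]), model="gpt-4o")
```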
- Task: List the answer alone (answer-only prompting; templates for all three conditions are sketched after the next item)
- Task: Employ CoT and CoT-with-evidence prompting
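Hypothetical prompt templates for the three conditions (answer-only, CoT, CoT with evidence); the exact wording is an assumption, and the usage line reuses `format_question` and `val` from the first sketch.

```python
ANSWER_ONLY = (
    "{question}\n"
    "Reply with the letter of the correct choice only."
)

COT = (
    "{question}\n"
    "Think step by step, then give the letter of the correct choice "
    "on the final line as 'Answer: <letter>'."
)

COT_WITH_EVIDENCE = (
    "{question}\n"
    "First state one or two sentences of evidence relevant to the question, "
    "then think step by step, and finish with 'Answer: <letter>'."
)

prompt = COT.format(question=format_question(val[0]))
```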
- Task: Calculate the length of the model outputs
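A small sketch for the length measurement, assuming "length" means output tokens counted with a recent `tiktoken` (word counts would work the same way):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def length_stats(outputs):
    # Token counts per response, plus summary stats for comparing conditions
    tokens = [len(enc.encode(o)) for o in outputs]
    return {"mean": sum(tokens) / len(tokens),
            "min": min(tokens), "max": max(tokens)}
```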
- Task: Apply `temperature=0`
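Setting `temperature=0` in the chat API makes decoding greedy, so outputs are (near-)deterministic and reruns are comparable; `client` and `prompt` are from the earlier sketches:

```python
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # greedy decoding: pick the most likely token each step
)
print(resp.choices[0].message.content)
```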
- Task: Add two more baselines