Commit 04e06ce

Add more dataset
1 parent 91eaa6c commit 04e06ce

2 files changed (+42, -0 lines)

.vscode/settings.json

+1
@@ -9,6 +9,7 @@
 "Aozora",
 "aozorabunko",
 "Aruno",
+"belebele",
 "bigcode",
 "binarized",
 "burkelibbey",

@@ -0,0 +1,41 @@
+type: huggingface
+id: facebook/belebele
+url: https://huggingface.co/datasets/facebook/belebele
+converted_size: 44.1MB
+license: CC-BY-SA-4.0
+lang: multiple
+description: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
+structure:
+  - id: question_number
+    type: int64
+    description: Number of the question
+  - id: link
+    type: string
+    description: URL of the passage
+  - id: flores_passage
+    type: string
+    description: Passage from the FLORES-200 dataset
+  - id: question
+    type: string
+    description: Question
+  - id: mc_answer1
+    type: string
+    description: Answer
+  - id: mc_answer2
+    type: string
+    description: Answer
+  - id: mc_answer3
+    type: string
+    description: Answer
+  - id: mc_answer4
+    type: string
+    description: Answer
+  - id: correct_answer_num
+    type: string
+    description: Correct answer
+  - id: dialect
+    type: string
+    description: Dialect
+  - id: ds
+    type: string
+    description: Timestamp
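
For readers consuming records that follow the structure listed in this descriptor, a minimal Python sketch of turning one row into a multiple-choice prompt might look like the following. The prompt layout, the helper names `format_prompt` and `correct_answer_text`, the sample row, and the reading of `correct_answer_num` as a 1-based index into the four answers are all assumptions for illustration and are not part of this commit.

```python
from typing import Dict, List

# Field names follow the `structure` section of the descriptor.
ANSWER_FIELDS: List[str] = ["mc_answer1", "mc_answer2", "mc_answer3", "mc_answer4"]


def format_prompt(row: Dict[str, object]) -> str:
    """Render one record as a plain-text prompt (hypothetical layout)."""
    lines = [
        f"Passage: {row['flores_passage']}",
        f"Question: {row['question']}",
    ]
    # Number the four options 1..4 in the order the descriptor lists them.
    for number, field in enumerate(ANSWER_FIELDS, start=1):
        lines.append(f"{number}. {row[field]}")
    return "\n".join(lines)


def correct_answer_text(row: Dict[str, object]) -> str:
    """Return the text of the correct option.

    The descriptor types `correct_answer_num` as a string, so it is converted
    to an int here. Treating it as a 1-based index is an assumption.
    """
    index = int(str(row["correct_answer_num"]))
    if not 1 <= index <= len(ANSWER_FIELDS):
        raise ValueError(f"correct_answer_num out of range: {index}")
    return str(row[ANSWER_FIELDS[index - 1]])


# Made-up record for illustration only; it is not taken from the dataset.
example_row = {
    "question_number": 1,
    "link": "https://example.org/passage",
    "flores_passage": "The sky was blue all day.",
    "question": "What colour was the sky?",
    "mc_answer1": "Red",
    "mc_answer2": "Blue",
    "mc_answer3": "Green",
    "mc_answer4": "Grey",
    "correct_answer_num": "2",
    "dialect": "eng_Latn",
    "ds": "2023-01-01",
}

if __name__ == "__main__":
    print(format_prompt(example_row))
    print("Expected:", correct_answer_text(example_row))
```

Keeping the answer lookup separate from the prompt rendering means the same row can be scored without the correct option ever appearing in the prompt text.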
