Data-Selection-for-ZeroIR

This is the official code for the SDS method for data selection in ZeroIR.

SDS-ranker

We fine-tune the SDS-ranker on the subset of source data selected by our SDS method. The SDS-ranker is initialized from the monoT5 large model (see https://huggingface.co/castorini/monot5-large-msmarco).
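
For readers unfamiliar with monoT5-style rerankers, the following is a minimal sketch of how such a model scores one query-document pair. It is not taken from this repository. It assumes the standard monoT5 input template ("Query: ... Document: ... Relevant:") and the "true"/"false" target tokens; the function names are hypothetical, and `score_pair` assumes the model and inputs both live on the CPU.

```python
import math


def build_monot5_input(query: str, document: str) -> str:
    """Format a query-document pair with the monoT5 input template."""
    return f"Query: {query} Document: {document} Relevant:"


def relevance_probability(false_logit: float, true_logit: float) -> float:
    """Softmax over the 'false'/'true' logits; returns P(true)."""
    m = max(false_logit, true_logit)  # subtract the max for numerical stability
    e_false = math.exp(false_logit - m)
    e_true = math.exp(true_logit - m)
    return e_true / (e_false + e_true)


def score_pair(query, document, model, tokenizer):
    """Score one pair with a T5 model (requires torch and transformers).

    Assumption: `model` is a T5ForConditionalGeneration on the CPU and
    `tokenizer` is a T5 tokenizer whose vocabulary has '▁true' and '▁false'.
    """
    import torch  # imported lazily so the helpers above work without torch

    inputs = tokenizer(
        build_monot5_input(query, document),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Run a single decoder step and read the logits of the first output token.
    start = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long
    )
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    true_id = tokenizer.convert_tokens_to_ids("▁true")
    false_id = tokenizer.convert_tokens_to_ids("▁false")
    return relevance_probability(logits[false_id].item(), logits[true_id].item())
```

The model can be loaded with `T5ForConditionalGeneration.from_pretrained("castorini/monot5-large-msmarco")`. Which tokenizer checkpoint to pair with it is left to the reader; a stock T5 tokenizer is a reasonable guess, but that is an assumption rather than something this repository states.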

We use BERRI, a collection of retrieval datasets from various tasks, as source datasets. The official BERRI can be found at https://github.com/facebookresearch/tart?tab=readme-ov-file#dataset-berri.
Because the SDS-ranker is a cross-encoder, we use the TART-full training data of BERRI.
We assign each instance to its original dataset based on its instruction text.
This recovers 28 correctly classified datasets, including AGNews, Altlex, and CNN Daily Mail (see https://drive.google.com/drive/folders/1x0XhCIH8tanoz9WORgDLnELpjZWVXBfg?usp=sharing).
The MS MARCO dataset is removed because it was already used to train monoT5, from which our model is initialized.
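
The grouping step described above can be sketched as follows. This is an illustration only, not the repository's preprocessing code: the `"instruction"` field name, the instruction strings, and the function name are all hypothetical, and the real instruction texts must be taken from the BERRI data itself.

```python
from collections import defaultdict

# Hypothetical mapping from instruction text to source dataset name.
# The real instruction strings come from the BERRI training data.
INSTRUCTION_TO_DATASET = {
    "example instruction for a news task": "AGNews",
    "example instruction for a passage task": "MS MARCO",
}

# Datasets dropped because the base model has already seen them.
EXCLUDED_DATASETS = {"MS MARCO"}


def group_by_source_dataset(instances, instruction_to_dataset, excluded):
    """Group instances by source dataset using their instruction text.

    Returns (grouped, unmatched): a dict of dataset name -> instances,
    and the list of instances whose instruction was not recognized.
    """
    grouped = defaultdict(list)
    unmatched = []
    for instance in instances:
        # Assumed field name: each instance stores its instruction text here.
        name = instruction_to_dataset.get(instance["instruction"].strip())
        if name is None:
            unmatched.append(instance)
        elif name not in excluded:
            grouped[name].append(instance)
    return dict(grouped), unmatched


toy_instances = [
    {"instruction": "example instruction for a news task", "query": "q1"},
    {"instruction": "example instruction for a passage task", "query": "q2"},
    {"instruction": "an unknown instruction", "query": "q3"},
]
grouped, unmatched = group_by_source_dataset(
    toy_instances, INSTRUCTION_TO_DATASET, EXCLUDED_DATASETS
)
print(sorted(grouped), len(unmatched))
```

Keeping unmatched instances in a separate list, rather than silently dropping them, makes it easy to check how many instances could not be assigned to a dataset.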

We use 9 publicly available BEIR datasets. BEIR can be found at https://github.com/beir-cellar/beir.
