Framework for training state-of-the-art embedding models using contrastive learning at large batch sizes.
- Create and activate a fresh conda environment.
- Install the required packages:

      pip install -r requirements.txt
Prepare your datasets as jsonl files with the following columns (an example record is shown below):

- `query`: str
- `positive_doc`: str
- `negative_docs`: List[str] (not needed for pretraining)
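
For illustration, here is a minimal Python sketch of what one line of such a jsonl file might contain. The texts and the file name `my_dataset.jsonl` are made-up placeholders, not part of the repository:

```python
import json

# Hypothetical record following the columns listed above.
record = {
    "query": "how does contrastive learning work?",
    "positive_doc": "Contrastive learning trains an encoder to pull matching pairs together.",
    "negative_docs": [  # omit this field for pretraining data
        "A guide to watering houseplants.",
        "Quarterly earnings report for 2021.",
    ],
}

# Each line of the .jsonl file is one JSON object like this.
with open("my_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```
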
Sample datasets:

- Pretraining: `resources/pretraining_data/*.jsonl`
- Fine-tuning: `resources/finetuning_data/*.jsonl`
Training requires pretokenized datasets stored as binary files. To tokenize your data:

    # For pretraining data
    python corgee/data/create_tokbins.py \
        --tokenizer intfloat/multilingual-e5-base \
        --input_dir resources/pretraining_data/ \
        --output_dir resources/pretraining_data_tokenized/

    # For fine-tuning data
    python corgee/data/create_tokbins.py \
        --tokenizer intfloat/multilingual-e5-base \
        --input_dir resources/finetuning_data/ \
        --output_dir resources/finetuning_data_tokenized/
- Create a `config.yaml` file with relevant parameters.
  - Sample pretraining and finetuning configs are provided in the `configs/` directory.
- Start training:

  For running on a single node:

      source run.sh config.yaml

  For running on multiple nodes (e.g., 4 nodes):

      DIST_NUM_NODES=4 source run.sh config.yaml

  Adjust the `DIST_NUM_NODES` value according to your setup.
- Parameter configuration:
  - Set parameters in `config.yaml`.
  - Override important parameters via the command line as needed.

Sample configs are provided in `configs/`.

| Parameter | Description |
| --- | --- |
| `output_dir` | Directory for logs and saved models |
| `batch_size` | Training batch size |
| `max_forward_batch_size` | Maximum batch size for a single forward pass on the GPU |
| `files` | Dictionary of dataset configurations |
Each dataset in the `files` dictionary requires:

- `num_steps`: Number of training batches to sample
- `maxlen1`: Maximum tokens in the query
- `maxlen2`: Maximum tokens in positive/negative documents
- `file_pattern`: Regex pattern for tokbin files
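
To make the structure concrete, here is a hedged sketch that assembles such a config in Python and dumps it to `config.yaml`. The key names mirror the table and list above, but the nesting, dataset names, file patterns, and values are illustrative assumptions; consult the sample configs in `configs/` for the authoritative schema.

```python
import yaml  # pip install pyyaml

# Illustrative skeleton only; keys follow the README, values are placeholders.
config = {
    "output_dir": "outputs/my_run",        # logs and saved models
    "batch_size": 2048,                    # overall contrastive batch size
    "max_forward_batch_size": 256,         # cap on a single GPU forward pass
    "files": {
        # One entry per dataset; for language-wise sampling, use one per language.
        "en_pretrain": {
            "num_steps": 1000,             # batches sampled from this dataset
            "maxlen1": 64,                 # max query tokens
            "maxlen2": 256,                # max document tokens
            "file_pattern": r"resources/pretraining_data_tokenized/en_.*\.tokbin",
        },
        "de_pretrain": {
            "num_steps": 500,
            "maxlen1": 64,
            "maxlen2": 256,
            "file_pattern": r"resources/pretraining_data_tokenized/de_.*\.tokbin",
        },
    },
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```

The two entries under `files` illustrate the note below: to sample language-wise, give each language its own dataset entry so every batch is drawn from a single language.
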
Note: Batches are sampled from one dataset at a time. For language-wise sampling, make each language a separate dataset.