Speeding up MTEB

**This is an overview issue on how to speed up MTEB:** 

I see the following options for speeding up MTEB:

- Implement an encode cache: as suggested in #354, some datasets contain repeated texts, so a cache would avoid re-embedding duplicates.
- Change how multilingual datasets are loaded: there seems to be a large overhead in loading multiple datasets of different languages (https://github.com/huggingface/datasets/issues/6800).
- Downsample datasets: most datasets could probably work with notably fewer samples.
- Load only the splits we use: at the moment we download all splits even though we only use some of them. A solution might be to pass the split to the `load_dataset` function. Note that this will lead to bugs if a `dataset_transform` assumes the full dataset (it probably shouldn't, but it might).
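
To make the encode-cache idea concrete, here is a minimal sketch. It is not the design proposed in #354. `CachedEncoder` and `encode_fn` are hypothetical names, and the sketch assumes the model can be wrapped as a function that maps a list of strings to a list of vectors.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class CachedEncoder:
    """Hypothetical wrapper that re-uses embeddings for texts seen before."""

    def __init__(self, encode_fn: Callable[[List[str]], List[Vector]]) -> None:
        # encode_fn is an assumption: any callable mapping texts to vectors.
        self._encode_fn = encode_fn
        self._cache: Dict[str, Vector] = {}

    def encode(self, sentences: Sequence[str]) -> List[Vector]:
        # Collect texts that are not cached yet, dropping duplicates
        # while keeping first-seen order.
        missing = list(dict.fromkeys(s for s in sentences if s not in self._cache))
        if missing:
            # Only the unseen, de-duplicated texts reach the model.
            for text, vector in zip(missing, self._encode_fn(missing)):
                self._cache[text] = vector
        # Return one vector per input sentence, in the original order.
        return [self._cache[s] for s in sentences]
```

An in-memory dict keyed on the raw text is the simplest choice. A real implementation would likely need to bound memory use and include the model name in the key.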

**Task-specific speed-ups:**
- Clustering: Clustering currently works by performing N clustering steps on M samples each (M can vary between steps), so roughly N x M samples are embedded. An alternative approach is to embed K samples once and then sample M of them N times. This would allow K << N x M, which should lead to a significant speed-up.
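
The embed-once, resample-many idea could look roughly like the sketch below. It is an illustration only, not the existing MTEB clustering code. `resampled_clustering_sets`, `encode_fn`, `n_sets`, and `set_size` are hypothetical names, and the actual clustering and scoring of each subset is left out.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def resampled_clustering_sets(
    texts: Sequence[str],
    labels: Sequence[str],
    encode_fn: Callable[[List[str]], List[Vector]],
    n_sets: int,
    set_size: int,
    seed: int = 42,
) -> List[Tuple[List[Vector], List[str]]]:
    """Embed the K texts once, then draw n_sets (N) subsets of set_size (M)."""
    # The only encoder call: K texts, regardless of N and M.
    embeddings = encode_fn(list(texts))
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    subsets = []
    for _ in range(n_sets):
        # Sample M distinct indices from the K embedded texts (requires M <= K).
        chosen = rng.sample(range(len(texts)), k=set_size)
        subsets.append(
            ([embeddings[i] for i in chosen], [labels[i] for i in chosen])
        )
    return subsets
```

Each returned `(embeddings, labels)` pair would then be clustered and scored as one clustering step. The encoder cost depends only on K, while N and M affect only the comparatively cheap clustering.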


**Overview of slowest segments:**
Based on existing results for `paraphrase-multilingual-MiniLM-L12-v2` (which may have been run on all sorts of systems, so the timings are only a rough guide).

```
{'Reranking': {'mean': 118.08999999999999,
  'n': 6,
  'total': 708.54,
  'median': 19.61,
  'min': 2.59,
  'max': 636.61,
  'name_of_max': 'MindSmallReranking',
  'name_of_min': 'AskUbuntuDupQuestions'},
 'STS': {'mean': 2.3514285714285714,
  'n': 14,
  'total': 32.92,
  'median': 2.44,
  'min': 0.74,
  'max': 4.46,
  'name_of_max': 'STS17',
  'name_of_min': 'STS22'},
 'PairClassification': {'mean': 3.89625,
  'n': 8,
  'total': 31.17,
  'median': 2.75,
  'min': 0.72,
  'max': 12.94,
  'name_of_max': 'TwitterURLCorpus',
  'name_of_min': 'CDSC-E'},
 'Clustering': {'mean': 118.72272727272725,
  'n': 22,
  'total': 2611.8999999999996,
  'median': 63.93,
  'min': 0.54,
  'max': 817.54,
  'name_of_max': 'ArxivClusteringP2P',
  'name_of_min': 'MasakhaNEWSClusteringS2S'},
 'Classification': {'mean': 38.81444444444443,
  'n': 18,
  'total': 698.6599999999999,
  'median': 13.51,
  'min': 1.51,
  'max': 340.71,
  'name_of_max': 'AmazonPolarityClassification',
  'name_of_min': 'PolEmo2.0-OUT'},
 'BitextMining': {'mean': 287.15999999999997,
  'n': 2,
  'total': 574.3199999999999,
  'median': 533.51,
  'min': 40.81,
  'max': 533.51,
  'name_of_max': 'BUCC',
  'name_of_min': 'Tatoeba'},
 None: {'mean': 17.4225,
  'n': 4,
  'total': 69.69,
  'median': 23.14,
  'min': 1.22,
  'max': 36.72,
  'name_of_max': 'CQADupstackRetrieval',
  'name_of_min': 'PPC'},
 'Summarization': {'mean': 9.53,
  'n': 2,
  'total': 19.06,
  'median': 15.84,
  'min': 3.22,
  'max': 15.84,
  'name_of_max': 'SummEval',
  'name_of_min': 'SummEvalFr'},
 'Retrieval': {'mean': 559.1546341463416,
  'n': 41,
  'total': 22925.340000000004,
  'median': 31.85,
  'min': 0.31,
  'max': 3808.37,
  'name_of_max': 'MSMARCO-PL',
  'name_of_min': 'SyntecRetrieval'}}
```

Speeding up MTEB #381
