Speeding up MTEB #381

@KennethEnevoldsen

This is an overview issue on how to speed up MTEB. I see the following general options:

  • Implementing an encode cache: As suggested in Aggregating MMTEB datasets #354, some datasets repeat the same texts, so a cache would let us avoid re-embedding duplicates.
  • Changing the loading of multilingual datasets: There seems to be a large overhead in loading multiple subsets of different languages from the same dataset (see "High overhead when loading lots of subsets from the same dataset", huggingface/datasets#6800).
  • Downsampling datasets: Most datasets could probably work with notably fewer samples.
  • Only downloading the splits we use: At the moment we download all splits even though we only use some of them. A solution might be to supply the split to the load_dataset function. Note that this will lead to bugs if a task's dataset_transform assumes the full dataset (this probably shouldn't happen, but it might).
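As a rough illustration of the encode-cache option, here is a minimal sketch. It assumes the model exposes a function that maps a list of strings to a list of vectors; `CachedEncoder` and `encode_fn` are hypothetical names, not part of MTEB, and the cache is a plain in-memory dict keyed by a hash of the text.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class CachedEncoder:
    """Hypothetical wrapper: re-uses embeddings for texts it has seen before."""

    def __init__(self, encode_fn: Callable[[List[str]], List[Vector]]):
        self.encode_fn = encode_fn          # assumed: list of texts -> list of vectors
        self.cache: Dict[str, Vector] = {}  # text hash -> embedding

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def encode(self, sentences: Sequence[str]) -> List[Vector]:
        keys = [self._key(s) for s in sentences]
        # Collect each unseen text once, preserving first-seen order.
        missing: Dict[str, str] = {}
        for key, text in zip(keys, sentences):
            if key not in self.cache and key not in missing:
                missing[key] = text
        if missing:
            # One model call for all new texts; duplicates are never re-embedded.
            new_vectors = self.encode_fn(list(missing.values()))
            for key, vector in zip(missing, new_vectors):
                self.cache[key] = vector
        # Return embeddings in the original input order.
        return [self.cache[key] for key in keys]
```

A persistent cache (on disk, keyed additionally by model name) would be needed to share embeddings across tasks or runs; that is left out of this sketch.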

Task-specific speed-ups:

  • Clustering: Clustering currently works by performing N clustering steps on M samples each (M may vary between steps), so roughly N x M samples are embedded. An alternative approach is to embed K samples once and then draw M of those K samples N times. This would allow K << N x M, which would lead to a significant speed-up.
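The clustering proposal could be sketched roughly as follows. This is an illustration under stated assumptions, not the MTEB implementation: `embed_fn` (texts to vectors) and `score_fn` (clusters a set of embeddings and scores them against the labels, for example k-means followed by V-measure) are hypothetical callbacks supplied by the caller.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def fast_clustering_eval(
    texts: Sequence[str],
    labels: Sequence[int],
    embed_fn: Callable[[List[str]], List[Vector]],        # hypothetical: texts -> vectors
    score_fn: Callable[[List[Vector], List[int]], float],  # hypothetical: cluster + score
    k: int,
    m: int,
    n_sets: int,
    seed: int = 42,
) -> List[float]:
    """Embed K samples once, then score N random subsets of size M drawn from them."""
    rng = random.Random(seed)
    k = min(k, len(texts))
    m = min(m, k)

    # 1) Embed only K samples, in a single pass through the model.
    pool = rng.sample(range(len(texts)), k)
    pool_embeddings = embed_fn([texts[i] for i in pool])
    pool_labels = [labels[i] for i in pool]

    # 2) Draw M of the K embedded samples N times; no further model calls.
    scores = []
    for _ in range(n_sets):
        idx = rng.sample(range(k), m)
        scores.append(
            score_fn([pool_embeddings[i] for i in idx], [pool_labels[i] for i in idx])
        )
    return scores
```

The model cost is then proportional to K rather than N x M. The trade-off is that the N subsets overlap, so the resulting scores are no longer computed on independent samples.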

Overview of slowest segments:
Evaluation times per task type, based on existing results for paraphrase-multilingual-MiniLM-L12-v2 (which might have been run on all sorts of systems, so the timings are only roughly comparable).

{'Reranking': {'mean': 118.08999999999999,
  'n': 6,
  'total': 708.54,
  'median': 19.61,
  'min': 2.59,
  'max': 636.61,
  'name_of_max': 'MindSmallReranking',
  'name_of_min': 'AskUbuntuDupQuestions'},
 'STS': {'mean': 2.3514285714285714,
  'n': 14,
  'total': 32.92,
  'median': 2.44,
  'min': 0.74,
  'max': 4.46,
  'name_of_max': 'STS17',
  'name_of_min': 'STS22'},
 'PairClassification': {'mean': 3.89625,
  'n': 8,
  'total': 31.17,
  'median': 2.75,
  'min': 0.72,
  'max': 12.94,
  'name_of_max': 'TwitterURLCorpus',
  'name_of_min': 'CDSC-E'},
 'Clustering': {'mean': 118.72272727272725,
  'n': 22,
  'total': 2611.8999999999996,
  'median': 63.93,
  'min': 0.54,
  'max': 817.54,
  'name_of_max': 'ArxivClusteringP2P',
  'name_of_min': 'MasakhaNEWSClusteringS2S'},
 'Classification': {'mean': 38.81444444444443,
  'n': 18,
  'total': 698.6599999999999,
  'median': 13.51,
  'min': 1.51,
  'max': 340.71,
  'name_of_max': 'AmazonPolarityClassification',
  'name_of_min': 'PolEmo2.0-OUT'},
 'BitextMining': {'mean': 287.15999999999997,
  'n': 2,
  'total': 574.3199999999999,
  'median': 533.51,
  'min': 40.81,
  'max': 533.51,
  'name_of_max': 'BUCC',
  'name_of_min': 'Tatoeba'},
 None: {'mean': 17.4225,
  'n': 4,
  'total': 69.69,
  'median': 23.14,
  'min': 1.22,
  'max': 36.72,
  'name_of_max': 'CQADupstackRetrieval',
  'name_of_min': 'PPC'},
 'Summarization': {'mean': 9.53,
  'n': 2,
  'total': 19.06,
  'median': 15.84,
  'min': 3.22,
  'max': 15.84,
  'name_of_max': 'SummEval',
  'name_of_min': 'SummEvalFr'},
 'Retrieval': {'mean': 559.1546341463416,
  'n': 41,
  'total': 22925.340000000004,
  'median': 31.85,
  'min': 0.31,
  'max': 3808.37,
  'name_of_max': 'MSMARCO-PL',
  'name_of_min': 'SyntecRetrieval'}}
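For reference, a summary of this shape could be computed along the following lines. This is a sketch, not the script that produced the numbers above: the `summarize` function and its input format (task name, task type, evaluation time) are assumptions. The reported medians for the groups with an even number of tasks match an upper-median convention rather than the average of the two middle values, so `statistics.median_high` is used here on that assumption.

```python
from statistics import mean, median_high
from typing import Dict, List, Optional, Tuple

# (task name, task type, evaluation time); the input format is hypothetical.
Row = Tuple[str, Optional[str], float]


def summarize(rows: List[Row]) -> Dict[Optional[str], dict]:
    """Group evaluation times by task type and compute summary statistics."""
    by_type: Dict[Optional[str], List[Tuple[float, str]]] = {}
    for name, task_type, seconds in rows:
        by_type.setdefault(task_type, []).append((seconds, name))

    summary = {}
    for task_type, timed in by_type.items():
        values = [t for t, _ in timed]
        t_max, name_max = max(timed)  # slowest task in this group
        t_min, name_min = min(timed)  # fastest task in this group
        summary[task_type] = {
            "mean": mean(values),
            "n": len(values),
            "total": sum(values),
            "median": median_high(values),  # assumed upper-median convention
            "min": t_min,
            "max": t_max,
            "name_of_max": name_max,
            "name_of_min": name_min,
        }
    return summary


# The two BitextMining timings listed in the overview above.
example = summarize([
    ("BUCC", "BitextMining", 533.51),
    ("Tatoeba", "BitextMining", 40.81),
])
print(example["BitextMining"])
```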
