This is an overview issue on how to speed up MTEB. I see the following options:
- Implementing an encode cache: as suggested in Aggregating MMTEB datasets #354, some datasets repeat samples, so a cache could avoid re-embedding duplicates.
- Changing how multilingual datasets are loaded: there seems to be a large overhead when loading many language subsets of the same dataset (High overhead when loading lots of subsets from the same dataset huggingface/datasets#6800).
- Downsampling datasets: most datasets could probably work with notably fewer samples.
- Downloading only the splits we use: at the moment we download all splits even though we only use some of them. A solution might be to supply the split to the `load_dataset` function. Note this will lead to bugs if the `dataset_transform` assumes the full dataset is present (that probably shouldn't happen, but it might).
Task-specific speed-ups:
- Clustering: clustering currently works by performing N clustering steps on M samples each (M can vary per step). An alternative approach is to embed K samples once and then sample M of them N times. This would allow K << N x M, which would lead to a significant speed-up.
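The clustering speed-up above can be sketched as follows. This is an illustrative sketch only, assuming a generator-based interface; the function name and parameters are hypothetical, not MTEB's implementation.

```python
import numpy as np


def downsampled_clustering_inputs(texts, encode_fn, k, m, n, seed=0):
    """Embed K samples once, then yield N sets of M embeddings each.

    Instead of embedding up to N * M samples, we embed a pool of K
    samples a single time and bootstrap the N evaluation sets from the
    cached pool, so the embedding cost is K rather than N * M.
    """
    rng = np.random.default_rng(seed)
    # Pick K distinct samples to embed (the only encode call made).
    pool_idx = rng.choice(len(texts), size=min(k, len(texts)), replace=False)
    pool_embs = encode_fn([texts[i] for i in pool_idx])
    for _ in range(n):
        # Draw M embeddings (with replacement) from the cached pool.
        take = rng.choice(len(pool_embs), size=m, replace=True)
        yield pool_embs[take]
```

Each yielded array would then be fed to the existing clustering step, so only the sampling strategy changes, not the metric computation.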
Overview of the slowest segments:
Based on existing results for paraphrase-multilingual-MiniLM-L12-v2 (which might have been run on all sorts of systems).
```py
{'Reranking': {'mean': 118.08999999999999,
'n': 6,
'total': 708.54,
'median': 19.61,
'min': 2.59,
'max': 636.61,
'name_of_max': 'MindSmallReranking',
'name_of_min': 'AskUbuntuDupQuestions'},
'STS': {'mean': 2.3514285714285714,
'n': 14,
'total': 32.92,
'median': 2.44,
'min': 0.74,
'max': 4.46,
'name_of_max': 'STS17',
'name_of_min': 'STS22'},
'PairClassification': {'mean': 3.89625,
'n': 8,
'total': 31.17,
'median': 2.75,
'min': 0.72,
'max': 12.94,
'name_of_max': 'TwitterURLCorpus',
'name_of_min': 'CDSC-E'},
'Clustering': {'mean': 118.72272727272725,
'n': 22,
'total': 2611.8999999999996,
'median': 63.93,
'min': 0.54,
'max': 817.54,
'name_of_max': 'ArxivClusteringP2P',
'name_of_min': 'MasakhaNEWSClusteringS2S'},
'Classification': {'mean': 38.81444444444443,
'n': 18,
'total': 698.6599999999999,
'median': 13.51,
'min': 1.51,
'max': 340.71,
'name_of_max': 'AmazonPolarityClassification',
'name_of_min': 'PolEmo2.0-OUT'},
'BitextMining': {'mean': 287.15999999999997,
'n': 2,
'total': 574.3199999999999,
'median': 533.51,
'min': 40.81,
'max': 533.51,
'name_of_max': 'BUCC',
'name_of_min': 'Tatoeba'},
None: {'mean': 17.4225,
'n': 4,
'total': 69.69,
'median': 23.14,
'min': 1.22,
'max': 36.72,
'name_of_max': 'CQADupstackRetrieval',
'name_of_min': 'PPC'},
'Summarization': {'mean': 9.53,
'n': 2,
'total': 19.06,
'median': 15.84,
'min': 3.22,
'max': 15.84,
'name_of_max': 'SummEval',
'name_of_min': 'SummEvalFr'},
'Retrieval': {'mean': 559.1546341463416,
'n': 41,
'total': 22925.340000000004,
'median': 31.85,
'min': 0.31,
'max': 3808.37,
'name_of_max': 'MSMARCO-PL',
               'name_of_min': 'SyntecRetrieval'}}
```