diff --git a/examples/doremi/README.md b/examples/doremi/README.md index 5a726bd1..dfc9ea40 100644 --- a/examples/doremi/README.md +++ b/examples/doremi/README.md @@ -87,3 +87,7 @@ For evaluation, we do uniform sampling on the test set to evaluate a 2.5B model - 2.5B llama trained using the optimized weights: https://huggingface.co/nanotron/doremi-llama-2.5b-optimized-weights and the dataset: https://huggingface.co/datasets/nanotron/the-pile-for-doremi + +#### Thoughts + +For DoReMi, it's useful if you don't initially have an idea of what would be a good distribution for your training data, or want a quick way to find a better baseline than the uniform distribution if you want to tune the data distribution by hand. In my previous experiments, DoReMi matched the pretraining performance of the distribution of mamba training but couldn't outperform it. I suspect it doesn't work well when there are nuances, meaning the difference between your known best distribution and a better distribution isn't significant.