From ed5a11c291e1988e3a86d74a3fba99be9ed6f57f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?X=CE=BBRI-U5?=
Date: Mon, 8 Jul 2024 17:05:47 +0700
Subject: [PATCH] Update README.md

---
 examples/doremi/README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/examples/doremi/README.md b/examples/doremi/README.md
index 5a726bd1..dfc9ea40 100644
--- a/examples/doremi/README.md
+++ b/examples/doremi/README.md
@@ -87,3 +87,7 @@ For evaluation, we do uniform sampling on the test set to evaluate a 2.5B model
 - 2.5B llama trained using the optimized weights: https://huggingface.co/nanotron/doremi-llama-2.5b-optimized-weights
 
 and the dataset: https://huggingface.co/datasets/nanotron/the-pile-for-doremi
+
+#### Thoughts
+
+DoReMi is useful when you don't initially know what a good distribution for your training data would be, or when you want a quick way to find a better starting point than the uniform distribution before tuning the data distribution by hand. In my previous experiments, DoReMi matched the pretraining performance of the data distribution used for Mamba training, but couldn't outperform it. I suspect it doesn't work well in nuanced cases, where the gap between your best known distribution and a truly better one is small.