
Commit 90d3fc0

ENH: Improve MetaMath training script runtime (#2894)
The training script of the MetaMathQA PEFT method comparison was calling cuda.empty_cache() and gc.collect() after each step. However, this is not really needed and it also slows down training considerably. It turns out that gc.collect() is not needed at all, so it has been removed, which yields a big improvement in runtime. As for empty_cache(), not calling it at all increases memory usage, but it does not need to be called every step; it is now called every 10th step.

Improvement (tested locally, 250 steps):

Removing gc.collect():
- runtime: 108 sec => 65 sec
- memory reserved max stays the same (19.3 GB)
- memory reserved 99th percentile stays the same (18.0 GB)
- memory reserved avg stays the same (12.0 GB)

Also calling empty_cache() only every 10 steps:
- runtime: 65 sec => 50 sec
- memory reserved max stays the same (19.3 GB)
- memory reserved 99th percentile: 18.0 GB => 19.3 GB
- memory reserved avg: 12.0 GB => 14.5 GB

Thus gc.collect() can be safely removed. And while calling empty_cache() only every 10th step does increase average memory usage, the peak is unaffected, which is what matters most in this benchmark, so it is a worthwhile tradeoff for the 23% speed improvement.

Note to maintainers: If this is merged, all MetaMathQA benchmarks should be re-run.
1 parent 3fc83e3 commit 90d3fc0
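
As a rough illustration of the cache-emptying schedule described in the commit message, the sketch below shows the pattern in isolation. It is not the actual run.py code: the training-loop wiring and the run_training_step helper are hypothetical, and the accelerator module is simply assumed to be torch.cuda (the real script imports infer_device from peft.utils to stay device-agnostic).

import torch

# Empty the device cache only every N steps; per the commit message, 10 is a
# good compromise between keeping memory down and lowering runtime overhead.
ACCELERATOR_EMPTY_CACHE_SCHEDULE = 10

# Assumed accelerator module; the real script stays device-agnostic.
torch_accelerator_module = torch.cuda

def train_loop(max_steps, run_training_step):
    for step in range(1, max_steps + 1):
        run_training_step(step)  # forward/backward/optimizer step (hypothetical callback)
        # Previously: empty_cache() plus gc.collect() on every step (slow).
        # Now: no gc.collect() at all, empty_cache() only every 10th step.
        if step % ACCELERATOR_EMPTY_CACHE_SCHEDULE == 0:
            torch_accelerator_module.empty_cache()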

File tree

  • method_comparison/MetaMathQA (1 file changed: +6, -10 lines)

method_comparison/MetaMathQA/run.py

Lines changed: 6 additions & 10 deletions
@@ -18,7 +18,6 @@
 
 import argparse
 import datetime as dt
-import gc
 import json
 import os
 import random
@@ -58,12 +57,10 @@
 from peft.utils import CONFIG_NAME, infer_device
 
 
-# # suppress all warnings
-# warnings.filterwarnings("ignore")  # FIXME?
-
-dtype_to_bytes_linear = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}
-# if lr scheduler with warmup is used, the ratio of warmup steps to total steps
-BUCKET_FACTOR = 20  # number of batches per bucket, increasing this further has diminishing returns
+# number of batches per bucket, increasing this further has diminishing returns
+BUCKET_FACTOR = 20
+# empty device cache every N steps; 10 is a good compromise between keeping memory down while lowering runtime overhead
+ACCELERATOR_EMPTY_CACHE_SCHEDULE = 10
 
 
 def get_generation_config(*, seq_len, generate_kwargs) -> GenerationConfig:
@@ -298,9 +295,8 @@ def train(
         }
         print_verbose(json.dumps(log_dict))
 
-        # # TODO is this needed?
-        torch_accelerator_module.empty_cache()
-        gc.collect()
+        if step % ACCELERATOR_EMPTY_CACHE_SCHEDULE == 0:
+            torch_accelerator_module.empty_cache()
 
     print_verbose(f"Training finished after {max_steps} steps, evaluation on test set follows.")
     # test set evaluation

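For context, the memory figures quoted in the commit message (max, 99th percentile, and average of reserved memory) can be gathered by sampling the allocator once per step. The snippet below is a minimal sketch of one way to collect such statistics, assuming CUDA; the function names are hypothetical and this is not the benchmark's actual instrumentation.

import statistics

import torch

reserved_samples: list[int] = []

def record_reserved_memory() -> None:
    # torch.cuda.memory_reserved() returns the number of bytes currently
    # reserved by the caching allocator on the current device; call this once
    # per training step.
    reserved_samples.append(torch.cuda.memory_reserved())

def summarize_reserved_memory() -> None:
    gb = 1024**3
    samples = sorted(reserved_samples)
    p99 = samples[int(0.99 * (len(samples) - 1))]
    print(f"memory reserved max: {samples[-1] / gb:.1f} GB")
    print(f"memory reserved 99th percentile: {p99 / gb:.1f} GB")
    print(f"memory reserved avg: {statistics.mean(samples) / gb:.1f} GB")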