
Conversation

BenjaminBossan (Member) commented on Nov 4, 2025:

The training script of the MetaMathQA PEFT method comparison was calling cuda.empty_cache() and gc.collect() after each step. However, these calls are not really needed and they slow down training considerably.

It turns out that gc.collect() is not needed at all when it comes to memory, so it has been removed, which results in a big runtime improvement. As for empty_cache(), not calling it at all leads to an increase in memory usage, but it does not have to be called every step either; it is now called only every 10th step, as sketched below.
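
Schematically, the change amounts to the following. This is a minimal sketch with a placeholder model and random data, not the actual benchmark script; only the cache-emptying interval reflects what this PR does:

```python
import torch
from torch import nn

# Placeholder model, optimizer, and data; the real script fine-tunes a
# PEFT model on MetaMathQA.
model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters())
empty_cache_interval = 10  # the interval chosen in this PR

for step in range(250):
    batch = torch.randn(8, 512, device="cuda")
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # gc.collect() used to run here on every step; it turned out to make
    # no difference to memory, so it is gone entirely.
    if step % empty_cache_interval == 0:
        # Occasionally returning cached blocks to the allocator keeps peak
        # reserved memory in check without paying the cost on every step.
        torch.cuda.empty_cache()
```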

Improvement (tested locally, 250 steps):

  • Removing gc.collect():
    • 108 sec => 65 sec
    • memory reserved max stays the same (19.3 GB)
    • memory reserved 99th percentile stays the same (18.0 GB)
    • memory reserved avg stays the same (12.0 GB)
  • Also calling empty_cache() only every 10th step:
    • 65 sec => 50 sec
    • memory reserved max stays the same (19.3 GB)
    • memory reserved 99th percentile: 18.0 GB => 19.3 GB
    • memory reserved avg: 12.0 GB => 14.5 GB
  • Not calling empty_cache() at all:
    • 65 sec => 45 sec
    • memory reserved max: 19.3 GB => 23.5 GB
    • memory reserved 99th percentile: 18.0 GB => 23.5 GB
    • memory reserved avg: 12.0 GB => 22.1 GB

Thus gc.collect() can be safely removed, but removing empty_cache() completely is not advisable. Calling empty_cache() only every 10th step does increase average memory usage, but the peak is unaffected, which is what's most important in this benchmark, so it is a worthwhile tradeoff for the 23% speed improvement we get.
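
For reference, reserved-memory statistics like the ones above can be collected along these lines. This is just a sketch; the benchmark's actual instrumentation may differ:

```python
import statistics

import torch

reserved_bytes = []  # sampled once per training step, e.g. after optimizer.step()

def record_reserved():
    reserved_bytes.append(torch.cuda.memory_reserved())

def summarize_reserved():
    gib = 1024**3
    print(f"max: {max(reserved_bytes) / gib:.1f} GB")
    # quantiles with n=100 yields 99 cut points; index 98 is the 99th percentile
    print(f"p99: {statistics.quantiles(reserved_bytes, n=100)[98] / gib:.1f} GB")
    print(f"avg: {statistics.mean(reserved_bytes) / gib:.1f} GB")
```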

I also tested whether manually deleting certain torch variables (batch, output, loss) after each step would make a difference, but could not observe any effect.

While working on this PR, I also removed an obsolete comment and an unused variable.

Note to maintainers: If this is merged, all MetaMathQA benchmarks should be re-run.

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

A collaborator commented inline on this snippet:

    }
    print_verbose(json.dumps(log_dict))

    # # TODO is this needed?

I think the comment can be removed if we don't get any more insights :)

githubnemo (Collaborator) left a comment:

LGTM. Thanks for investigating, this is a really nice find.

If there are no new discoveries I think this can be merged.

BenjaminBossan merged commit 90d3fc0 into huggingface:main on Nov 13, 2025
23 of 25 checks passed
BenjaminBossan deleted the enh-improve-metamath-training-runtime branch on Nov 13, 2025
