Accelerate's estimate-memory gives 64.73GB for meta-llama/Llama-3.1-70B-Instruct at fp16 precision instead of the anticipated ~140GB, i.e. num_params * num_bytes_in_dtype. The latter is the usual recommendation for how much memory capacity is needed to accommodate inference; see for example https://huggingface.co/blog/llama31#inference-memory-requirements.
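For reference, the ~140GB expectation is just the parameter count multiplied by two bytes per fp16 value. A quick sketch (the 70.6B parameter count is approximate):

```python
# Back-of-the-envelope check of the expected figure
num_params = 70.6e9      # Llama-3.1-70B has roughly 70.6B parameters
bytes_per_param = 2      # fp16 stores 2 bytes per parameter
print(f"{num_params * bytes_per_param / 1e9:.1f} GB")  # -> 141.2 GB, consistent with the ~140GB from the blog post
```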
Is this a mistake by estimate-memory, or is there some magic happening behind the scenes when Transformers loads models that allows memory to be saved?
If estimate-memory indeed makes a mistake, that's worrisome, since it might impact device_map=auto if that path reuses the estimation code to distribute layers across devices (a placement sketch below illustrates the concern). At least estimate-memory does call the regular accelerate utils method, see accelerate/src/accelerate/commands/estimate.py, lines 21 to 22 at f076495.
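Independently of the CLI, the number can be cross-checked by materializing the model on the meta device and counting parameters. A minimal sketch, assuming the public init_empty_weights and from_config APIs (not the exact code path estimate-memory takes):

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # gated repo, needs accepted access for the config download
config = AutoConfig.from_pretrained(model_id)

with init_empty_weights():
    # Parameters are created on the "meta" device, so no real memory is allocated.
    model = AutoModelForCausalLM.from_config(config)

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B params -> {num_params * 2 / 1e9:.2f} GB at 2 bytes/param (fp16)")
```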
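On the device_map=auto concern: placement is driven by accelerate's per-module size estimates at the requested dtype, so an undersized total would translate into over-packed devices. A hypothetical illustration using infer_auto_device_map (not a claim about the exact code path device_map="auto" takes; the two 80GiB GPUs are made up):

```python
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# The placement decision rests on the per-module size estimates; if those came out
# roughly half the true size, layers would be packed onto too few devices.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "80GiB", 1: "80GiB"},            # two made-up 80GiB GPUs
    no_split_module_classes=["LlamaDecoderLayer"],  # keep decoder blocks intact
    dtype=torch.float16,
)
print(list(device_map.items())[:5])
```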
CC: @muellerzr @ArthurZucker @SunMarc