.NET: How to free GPU memory after each inference #1131
Comments
Our current design keeps the OrtAllocator CUDA allocator alive until you exit, so the CUDA memory pool will not shrink to zero before that point. We could potentially add a way to release this allocator when no objects are allocated from it.
This currently causes problems: on devices with less GPU memory it is not possible to run inference efficiently multiple times, and inference gets slower and slower as GPU memory usage approaches 100%.
The memory shouldn't be growing every time; that might be a bug. I've marked this as both an enhancement and a bug to track it.
I look forward to the next update, although it may take a while.
Is this a C# issue, or a general ort-genai issue that affects all the APIs?
I am using Phi3.5mini-cuda-fp16 on an NVIDIA GPU (24 GB of memory).
After I load the model, 8490MiB of GPU memory is in use.
After I run an inference of about 3K tokens, GPU memory usage rises to 10580MiB.
If I continue the conversation after that, GPU memory usage keeps rising.
If I stop the conversation, the memory does not decrease, even if I leave it idle for an hour.
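To tell steady per-turn growth apart from a one-time pool expansion, it can help to log GPU memory after every turn. Below is a minimal sketch in Python, assuming `nvidia-smi` is on the PATH. The helper names (`parse_memory_used`, `query_memory_used_mib`, `growth_per_step`) are hypothetical and not part of onnxruntime-genai. It reports memory as the driver sees it, not what the allocator holds internally.

```python
import subprocess
from typing import List


def parse_memory_used(smi_output: str) -> List[int]:
    """Parse the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`,
    which prints one integer (MiB) per GPU, one GPU per line."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_memory_used_mib(gpu_index: int = 0) -> int:
    """Return the memory currently in use on one GPU, in MiB.
    Assumes nvidia-smi is installed and on the PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)[gpu_index]


def growth_per_step(samples: List[int]) -> List[int]:
    """Difference between consecutive readings, in MiB.
    Positive values on every turn suggest unbounded growth."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Typical use: call query_memory_used_mib() after loading the model and after
# each conversation turn, append the readings to a list, then inspect
# growth_per_step(readings).
```

With the two readings reported here, `growth_per_step([8490, 10580])` gives `[2090]`, i.e. about 2 GiB added by the first 3K-token inference. Collecting one reading per later turn would show whether that increase repeats or levels off.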
I don't know if this is a bug; this behavior seems to have existed since 0.4, and it is the same in 0.5.2.
Or did I miss something?
This is my code. I did not forget to release any object; the Model object is of course not released, because we need to reuse it.