Server creates a CPU buffer no matter the VRAM usage for 72B models #11012
DrVonSinistro started this conversation in General
            Replies: 1 comment 1 reply
Reply: This is CUDA waiting for the device to finish work (see https://forums.developer.nvidia.com/t/100-cpu-usage-when-running-cuda-code/35920). It's normal.
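The pegged core described in the reply is a spin-wait: by default, the CUDA runtime polls the GPU in a tight loop rather than sleeping, which looks like 100% usage on one core even though no useful CPU work is happening. The following is a language-agnostic Python analogy (not llama.cpp or CUDA code) contrasting a spin-wait with a blocking wait; `device_work` stands in for a GPU kernel:

```python
import threading
import time

def device_work(done: threading.Event, duration: float = 0.05) -> None:
    """Stand-in for a GPU kernel: finishes after `duration` seconds."""
    time.sleep(duration)
    done.set()

def spin_sync(done: threading.Event) -> int:
    """Poll in a tight loop, like CUDA's default spin-wait sync.
    Pegs one CPU core at ~100% until the 'kernel' completes."""
    spins = 0
    while not done.is_set():
        spins += 1  # no sleep: pure busy-waiting
    return spins

def blocking_sync(done: threading.Event) -> bool:
    """Sleep until signalled (analogous to CUDA's blocking-sync
    scheduling mode): near-zero CPU while waiting."""
    return done.wait(timeout=5.0)

if __name__ == "__main__":
    ev = threading.Event()
    t = threading.Thread(target=device_work, args=(ev,))
    t.start()
    print(f"spin-wait iterations: {spin_sync(ev)}")
    t.join()

    ev2 = threading.Event()
    t2 = threading.Thread(target=device_work, args=(ev2,))
    t2.start()
    print(f"blocking wait completed: {blocking_sync(ev2)}")
    t2.join()
```

The spin variant burns cycles on purpose because polling gives the lowest latency when the kernel finishes; the blocking variant trades a little wake-up latency for an idle CPU, which is the same trade-off CUDA exposes through its device scheduling flags.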
Original post (DrVonSinistro):
QWEN2.5 32B Q8 loads entirely into the GPU and creates something called a CUDA_Host buffer holding a few MB. There is no significant CPU usage during prompt processing or inference.
QWEN2.5 72B (Q6/Q5/Q4/Q2) also loads fully into the GPU, but even when VRAM is only half full, the server always creates a CPU buffer and fills it with 600-800 MB of something.
Then one CPU core works like hell on that buffer during prompt processing and inference. It's very annoying. I've tried everything. Please send help.
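If the pinned host-side buffer itself is the concern, some llama.cpp builds have let you disable pinned (page-locked) host memory via an environment variable. The variable and flag names below are assumptions that vary by version, so check them against your build's `--help` output before relying on this sketch:

```shell
# Sketch, assuming a llama.cpp build that honors GGML_CUDA_NO_PINNED
# (present in some versions; verify against your build). The model
# filename is a hypothetical placeholder. Disabling pinned host memory
# may shrink the CUDA_Host buffer at the cost of slower host<->device
# transfers; -ngl 99 requests offloading all layers to the GPU.
GGML_CUDA_NO_PINNED=1 ./llama-server -m qwen2.5-72b-q4_k_m.gguf -ngl 99
```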