[Invitation for a discussion] Much improved CPU memory management #11748
ifilipis wants to merge 20 commits into Comfy-Org:master from
Conversation
Looks like exactly what I've wanted for a long time, but unfortunately I can't get it to work. Running a basic SDXL workflow, I'm seeing a message indicating some kind of read failure coming from clip_text_transformers_convert. I guess that might be a fastsafetensors problem.
I figured it out. All the pop implementations add the key to self._deleted before calling self.get_tensor(key), so they always throw a KeyError. After fixing that, I can at least run SDXL and bf16 Chroma without errors. It still won't work with quantized models, though, which is unfortunate. The disk loader does seem to keep RAM usage lower than stock Comfy, but it isn't completely problem-free: at least for workflows where everything fits into RAM, it slows things down, which is quite noticeable in workflows that run the text encoder many times (i.e. when doing prompt scheduling).
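The ordering bug is easy to reproduce in isolation. Here's a minimal sketch, assuming a dict-backed cache; the class name DiskTensorCache and its internals are hypothetical, modelled only on the self._deleted and self.get_tensor(key) identifiers mentioned above, not on the PR's actual code:

```python
class DiskTensorCache:
    """Hypothetical stand-in for the PR's tensor cache."""

    def __init__(self, tensors):
        self._tensors = dict(tensors)
        self._deleted = set()

    def get_tensor(self, key):
        # Deleted keys are treated as gone, which is what makes the
        # buggy pop ordering below fail every time.
        if key in self._deleted or key not in self._tensors:
            raise KeyError(key)
        return self._tensors[key]

    def pop_buggy(self, key):
        # Marks the key deleted *before* fetching it, so get_tensor
        # always raises KeyError -- the bug described above.
        self._deleted.add(key)
        return self.get_tensor(key)

    def pop(self, key):
        # Fixed ordering: fetch first, then mark deleted.
        value = self.get_tensor(key)
        self._deleted.add(key)
        return value
```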
One of the simpler fixes would be an accurate VAE decode/encode memory estimate. Currently, before the VAE decode occurs, ComfyUI just removes the whole model from VRAM, even though tiled decoding stays within about 4 GB of VRAM.
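As a sketch of what such an estimate could look like (everything here is an illustrative assumption -- the function names, the tile sizes, and the overhead factor are made up for this example, not ComfyUI's actual accounting):

```python
def tiled_decode_vram_bytes(tile_h, tile_w, channels, bytes_per_elem=2, overhead=3.0):
    # Rough upper bound: activation memory for one tile, scaled by an
    # overhead factor standing in for the decoder's intermediate feature
    # maps. The factor 3.0 is purely illustrative.
    return int(tile_h * tile_w * channels * bytes_per_elem * overhead)

def should_offload_model(free_vram, tile_h=512, tile_w=512, channels=4):
    # Only evict the diffusion model if the tiled decode genuinely
    # won't fit alongside it in the remaining free VRAM.
    return free_vram < tiled_decode_vram_bytes(tile_h, tile_w, channels)
```

The point of the sketch is the decision, not the numbers: with any remotely accurate per-tile estimate, the model would stay resident whenever free VRAM covers the decode.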
The goal here is to be able to make generations while filling RAM and VRAM to the brim, not just during VAE.
Yeah, still trying to figure it out. I tried running FP8 Flux and fixed weight loading in a few places, but something is wrong with the dtypes, and it's proving quite difficult to debug without knowing the backend.
Hi there,
I've been painfully trying to run LTX-2 in Colab on an L4 - it could barely fit, and only produced a 20 s 720p video in 18 minutes. A horrible result.
The dumbest part was that the model and the video would fit just fine in its 24+53 GB of memory, but because Comfy cannot partially unload models from RAM, it spent 10 of those 18 minutes unloading and reloading the text encoder and UNet with --cache-none or pressure cache.
This doesn't make any sense whatsoever, especially given that with 53 GB of RAM you're only missing a couple of GB. Unloading 27 GB of weights to save 2 is insane.
So I went on to research what it would take to implement RAM memory management, and came up with this. Not much, as it turns out.
What it does:
Benchmarks:
I know nothing about your architecture, but this exercise tells me that proper RAM memory management is entirely possible. And you probably won't even have to rely on fastsafetensors, since the regular safetensors format is also designed to allow partial weight loading from disk.
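That partial-loading claim follows from the safetensors file layout itself: an 8-byte little-endian header length, a JSON header mapping each tensor name to its dtype, shape, and byte offsets, then a flat data buffer. A toy parser using only the stdlib (the helper names here are mine, not from any library) shows you can slice out one tensor without reading the others:

```python
import json
import struct

def build_safetensors(tensors):
    # Build a minimal safetensors-format blob from name -> raw bytes
    # (pretending each buffer is flat F32 data, 4 bytes per element).
    header, data, off = {}, b"", 0
    for name, raw in tensors.items():
        header[name] = {"dtype": "F32", "shape": [len(raw) // 4],
                        "data_offsets": [off, off + len(raw)]}
        data += raw
        off += len(raw)
    hjson = json.dumps(header).encode()
    return struct.pack("<Q", len(hjson)) + hjson + data

def load_tensor(blob, name):
    # Read only the header, then slice out just the requested tensor's
    # bytes; on a real file this would be two small reads plus a seek.
    (hlen,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + hlen])
    begin, end = header[name]["data_offsets"]
    start = 8 + hlen
    return blob[start + begin:start + end]
```

In practice you wouldn't hand-roll this: the safetensors library's safe_open exposes per-tensor lazy loading over the same layout. The sketch just shows why no full-file read is ever required.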
Y'all are very welcome to clone it and try it yourself.