Add RAM Pressure cache mode #10454
base: master
Conversation
Currently the UI cache runs parallel to the output cache and is expected to be a content superset of it. At the same time, the two caches are maintained completely separately, which makes it awkward to free output cache content without changing the behaviour of the UI cache.

There are two actual users (getters) of the UI cache. The first is a direct content hit on the output cache when executing a node; this case is handled very naturally by merging the UI and output caches. The second is the history JSON generation at the end of the prompt, which currently works by asking the cache for all_node_ids and then pulling the cache contents for those nodes, where all_node_ids is the set of nodes in the dynamic prompt.

So fold the UI cache into the output cache. The current UI cache setter now writes to a prompt-scope dict. When the output cache is set, the value is fetched from that dict and tupled up with the outputs. When generating the history, simply iterate the prompt-scope dict.

This prepares support for more complex caching strategies (like RAM pressure caching) where less than one full workflow will be cached and it is desirable to keep the UI cache and output cache in sync.
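A minimal sketch of the shape of this change, assuming illustrative names (`PromptExecution`, `set_ui`, `set_output`, `ui_outputs`) rather than the actual ComfyUI identifiers:

```python
# Sketch only: illustrative names, not the actual ComfyUI implementation.

class PromptExecution:
    def __init__(self, output_cache):
        self.output_cache = output_cache
        # Prompt-scope dict: node_id -> UI payload produced while this prompt runs.
        self.ui_outputs = {}

    def set_ui(self, node_id, ui_value):
        # The old UI-cache setter now just records into the prompt-scope dict.
        self.ui_outputs[node_id] = ui_value

    def set_output(self, node_id, outputs):
        # When the output cache is written, pick up the UI value (if any)
        # and store it alongside the outputs as one tuple.
        ui_value = self.ui_outputs.get(node_id)
        self.output_cache.set(node_id, (outputs, ui_value))

    def build_history(self):
        # History JSON no longer walks all_node_ids against a separate UI cache;
        # it just iterates the prompt-scope dict.
        return dict(self.ui_outputs)
```

Because the UI value travels with the outputs in a single cache entry, evicting an output cache entry can no longer leave a stale UI cache entry behind.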
Implement a cache sensitive to RAM pressure. When RAM headroom drops below a certain threshold, evict RAM-expensive nodes from the cache. Models and tensors are measured directly for RAM usage, and an OOM score is computed from the RAM usage of each node. Note that due to indirection through shared objects (like a model patcher), multiple nodes can account the same RAM as their individual usage. The intent is that this frees chains of nodes, particularly model loaders and their associated LoRAs, since they all score similarly and therefore sort close to each other. The result is a bias towards unloading model nodes mid-flow while still being able to keep results like text encodings and the VAE.
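A rough sketch of the eviction idea, assuming hypothetical helper names and using psutil to measure headroom (the PR's actual measurement and scoring code will differ):

```python
# Sketch only: hypothetical helpers, not the PR's actual implementation.
import psutil
import torch

def node_ram_bytes(outputs):
    """Roughly sum the CPU RAM held by tensors reachable from a node's cached outputs."""
    seen, total = set(), 0

    def walk(obj):
        nonlocal total
        if isinstance(obj, torch.Tensor):
            if obj.device.type == "cpu":
                storage = obj.untyped_storage()
                if storage.data_ptr() not in seen:
                    seen.add(storage.data_ptr())
                    total += storage.nbytes()
        elif isinstance(obj, (list, tuple)):
            for item in obj:
                walk(item)
        elif isinstance(obj, dict):
            for item in obj.values():
                walk(item)

    walk(outputs)
    return total

def evict_under_pressure(cache, headroom_bytes):
    """When available RAM drops below the headroom target, evict the most
    RAM-expensive cached nodes first until headroom is restored."""
    if psutil.virtual_memory().available >= headroom_bytes:
        return
    # Nodes sharing objects (e.g. a model patcher) account the same RAM, so a
    # loader and its LoRAs score similarly and end up adjacent in this ordering.
    scored = sorted(cache.items(), key=lambda kv: node_ram_bytes(kv[1]), reverse=True)
    for node_id, outputs in list(scored):
        del cache[node_id]
        if psutil.virtual_memory().available >= headroom_bytes:
            break
```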
@guill hey, would you be able to take a look and see whether some of the changes (like the UI cache being removed) seem all good?
I tried this and it seems to have some problems with either graph expansion or subgraphs. After adding some debug logging and a dumb workaround to force the execution to continue, it logs this: Node 322:166 is my PCLazyLoRALoader node that dynamically expands into a LoRALoader. With the workaround it fails later with a NoneType error because the output of the node becomes None. I'll try to see if I can get a simpler workflow to fail.
No subgraphs involved; it fails even with this simple workflow ( Just using LazyLoRALoader with a Preview Any node for the model output doesn't seem to be enough to trigger it, so I'm not sure what exactly the problem is.
Implement RAM Pressure cache
Example:
Linux, 64GB RAM system, RTX3060
WAN I2V template with FP16 Models
python main.py --novram --cache-ram 32.0
NOTE: You want to set the headroom significantly greater than your largest model (a sketch of how this headroom check works is shown below the example).
At this point in time, it is running the low-noise model of WAN I2V after having evicted the high-noise one.
After running another trivial workflow and then returning to the WAN workflow, it recommences at the model loading and is able to use the cached TE and VAE results. The first nodes run are the UNETLoader (for high noise) -> Lora -> ModelSampling -> KSampler.
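For context on the `--cache-ram 32.0` flag used above, a small sketch of how a GiB headroom value might translate into the pressure check (function names are illustrative, not the PR's actual code):

```python
# Sketch only: converting a GiB-valued headroom flag into the pressure check.
import psutil

def headroom_bytes_from_gb(cache_ram_gb: float) -> int:
    """Translate e.g. --cache-ram 32.0 into a byte threshold."""
    return int(cache_ram_gb * (1024 ** 3))

def under_ram_pressure(headroom: int) -> bool:
    """True when available system RAM has fallen below the headroom target,
    i.e. when the cache should start evicting RAM-expensive nodes."""
    return psutil.virtual_memory().available < headroom

threshold = headroom_bytes_from_gb(32.0)
print("evict now?", under_ram_pressure(threshold))
```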