
Conversation

@rattus128
Contributor

@rattus128 rattus128 commented Oct 23, 2025

Implement RAM Pressure cache

Implement a cache that is sensitive to RAM pressure. When RAM headroom
drops below a configured threshold, evict RAM-expensive nodes from the
cache.

Models and tensors are measured directly for RAM usage. An OOM score
is then computed based on the RAM usage of the node and workflow staleness.

Note that due to indirection through shared objects (like a model
patcher), multiple nodes can account for the same RAM in their individual
usage. The intent is that this will free chains of nodes, particularly
model loaders and their associated LoRAs, since they all score similarly
and sort close to each other.

This has a helpful bias towards unloading model nodes mid-flow while
keeping results like text encodings and VAE outputs.
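
A rough sketch of the eviction idea follows; psutil is assumed for the headroom check, and names like `evict_under_pressure` and the exact score weighting are illustrative assumptions, not the PR's actual API:

```python
import psutil

def evict_under_pressure(cache_entries, headroom_bytes):
    """Evict RAM-expensive cache entries until available RAM exceeds headroom_bytes.

    cache_entries: illustrative dict of node_id -> (ram_usage_bytes, staleness).
    """
    # Larger and staler entries get a higher OOM score and are evicted first.
    scores = {
        node_id: ram_usage * (1 + staleness)
        for node_id, (ram_usage, staleness) in cache_entries.items()
    }
    for node_id in sorted(scores, key=scores.get, reverse=True):
        if psutil.virtual_memory().available >= headroom_bytes:
            break
        # Dropping a cached entry only frees shared objects (e.g. a model
        # patcher) once no other cached node still references them.
        del cache_entries[node_id]
```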

Example:

Linux, 64GB RAM system, RTX3060
WAN I2V template with FP16 Models
python main.py --novram --cache-ram 32.0

NOTE: Set the headroom (--cache-ram) significantly higher than the size of your largest model.

[Screenshot from 2025-10-24 00-39-19]

At this point, it is running the low-noise model of WAN I2V after evicting the high-noise model.

After running another trivial workflow and then returning to the WAN workflow, it recommences at model loading and is able to use the cached TE and VAE results. The first nodes to run are UNETLoader (for high noise) -> Lora -> ModelSampling -> KSampler.

Currently the UI cache sits parallel to the output cache and is expected
to be a content superset of it. At the same time, the UI and output
caches are maintained completely separately, making it awkward to free
output cache content without changing the behaviour of the UI cache.

There are two actual users (getters) of the UI cache. The first is a
direct content hit on the output cache when executing a node. This case
is handled very naturally by merging the UI and output caches.

The second case is the history JSON generation at the end of the prompt.
This currently works by asking the cache for all_node_ids and then
pulling the cache contents for those nodes. all_node_ids is the set of
nodes in the dynamic prompt.

So fold the UI cache into the output cache. The current UI cache setter
now writes to a prompt-scope dict. When the output cache is set, the UI
value is simply fetched from that dict and tupled up with the outputs.

When generating the history, simply iterate the prompt-scope dict.
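
A minimal sketch of that shape, under assumed names (`PromptScopedUI`, `output_cache.set`, `build_history` are placeholders, not the PR's identifiers):

```python
class PromptScopedUI:
    """Illustrative stand-in for the prompt-scope dict described above."""

    def __init__(self):
        self.ui_outputs = {}  # node_id -> UI dict produced during this prompt

    def set_ui(self, node_id, ui):
        self.ui_outputs[node_id] = ui


def cache_node_result(output_cache, ui_scope, node_id, outputs):
    # When the output cache is set, pick up the UI value the node produced
    # and store it alongside the outputs as a single entry.
    ui = ui_scope.ui_outputs.get(node_id)
    output_cache.set(node_id, (outputs, ui))


def build_history(ui_scope):
    # History JSON no longer asks the cache for all_node_ids; it just
    # walks the prompt-scope dict.
    return dict(ui_scope.ui_outputs)
```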

This prepares support for more complex caching strategies (like RAM
pressure caching) where less than one full workflow will be cached and
it will be desirable to keep the UI cache and output cache in sync.
@Kosinkadink Kosinkadink marked this pull request as ready for review October 24, 2025 23:47
@Kosinkadink Kosinkadink self-requested a review as a code owner October 24, 2025 23:47
@Kosinkadink Kosinkadink added the Core Core team dependency label Oct 24, 2025
@Kosinkadink
Collaborator

@guill hey, would you be able to take a look to see if some of the changes (like the UI cache being removed) seem all good?

@asagi4
Contributor

asagi4 commented Oct 26, 2025

I tried this and it seems to have some problems with either graph expansion or subgraphs.
I have a workflow where I get this:

!!! Exception during processing !!! 'NoneType' object is not subscriptable
Traceback (most recent call last):
  File "/home/sd/git/ComfyUI/execution.py", line 445, in execute
    node_output = execution_list.get_output_cache(source_node, unique_id)[source_output]
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable

Adding some debug logging and a dumb workaround to force execution to continue, it logs this:

Output cache is none for source_node='322:166.0.0.1' unique_id='322:166'
Output cache is none for source_node='322:166.0.0.1' unique_id='322:166'

Node 322:166 is my PCLazyLoRALoader node that dynamically expands into a LoRALoader. With the workaround it fails later with a NoneType error because the output of the node becomes None.

I'll try to see if I can get a simpler workflow to fail.

@asagi4
Contributor

asagi4 commented Oct 26, 2025

No subgraphs involved; it fails even with this simple workflow (--ram-cache 28 with 32GB of RAM on the host and 24GB of VRAM), and it works with the default cache.
fail_cache.json
In that workflow I get the following debug print:

Output cache is none for source_node='3.0.0.1' unique_id='3'

Just using LazyLoRALoader with a Preview Any node for the model output doesn't seem to be enough to trigger it, so I'm not sure what exactly the problem is.
