@comfyanonymous
Owner

I honestly have no idea why this improves things but it does.

comfyanonymous merged commit a125cd8 into master Oct 12, 2025
12 checks passed
comfyanonymous deleted the comfyanonymous-patch-2 branch October 12, 2025 04:28
@RandomGitUser321
Contributor

It's related to MIOpen: ROCm/TheRock#1542

I'm not 100% positive, but I think torch.backends.cudnn.enabled = False just disables it or something along those lines. Last I remember, I think they are looking into some of the issues that cause things like slow VAE encode/decodes, so this bandaid fix might only be needed temporarily.

@daniandtheweb

daniandtheweb commented Oct 12, 2025

When using flash-attention (built from the ROCm repo, main_perf branch) this commit slows inference by a lot on my 7800XT:

  • Previously: flash-attn sdxl 1024x1024 ~2 it/s
  • Now: flash-attn sdxl 1024x1024 ~1.5 it/s

Maybe this new behavior can be disabled in case flash-attention is used?

@ThGrSoRu

This causes an OOM crash on my Linux 6700XT setup when VAE decoding starts.
I've just tested it by undoing only this commit, after which image generation was successful again.

@sfinktah

sfinktah commented Oct 13, 2025

@comfyanonymous

Please reverse this commit. We have done extensive testing over a period of months, and I can assure you that your assertion is incorrect.

Disabling cudnn only accelerates VAE decoding and VAE encoding, and has detrimental side effects on other operations, including slowing down initial ZLUDA compiles by a literal order of magnitude (we're talking almost an hour here). Though I realise ZLUDA will not be affected by this commit, the detrimental effects of blanket disabling of cudnn are equally applicable to native AMD pytorch installations.

image

Please see the many highly detailed graphs showing timing attached to the PR below, where we briefly toyed with the identical patch that you have just implemented.

patientx#272

We have developed many solutions for dealing with slow VAE decodes, the latest being this custom extension that dynamically toggles off cudnn during VAE encode and decoding only when using an AMD gpu.

https://github.com/sfinktah/ovum-cudnn-wrapper

I am also the developer of a timing node, https://github.com/sfinktah/comfy-ovum which allows me to compare the performance of every element of a workflow with and without cudnn (cudnn off is red, cudnn on is green).

image

That example (which I plucked at random) shows how different nodes react either positively or negatively to having cudnn disabled.

I can prepare detailed graphs showing timing and memory usage for any workflow you care to nominate, on any platform you care to nominate, if that proves necessary.

A more useful AMD helper would optionally replicate the functionality of my aforementioned cudnn-wrapper within your python core code.

@ExtraCubedPotato

RX7900 XT user here, just adding on to the pile that I'm also experiencing issues after updating to this commit. Gens that would normally take ~30 seconds to finish are taking ~43 seconds. (It was also causing OOM on VAE decode and falling back to tiled.)

Reverting the file completely to what it was 3 weeks ago brought back expected performance.

@RandomGitUser321
Contributor

RandomGitUser321 commented Oct 13, 2025

I think it heavily depends on which wheels you are using and which OS you're on. Like if you're on Linux, you might be using TheRock or the official PyTorch ones. On Windows, you have a lot of people using different methods like ZLUDA, the older Scott wheels or TheRock wheels, or people might be using WSL with various Linux wheels, etc etc.

In my testing, I found using the following as my run.bat script to work most reliably (Windows 11 and TheRock nightly wheels with a 7900xt):

set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set MIOPEN_FIND_MODE=FAST
set MIOPEN_LOG_LEVEL=3
.\python_embeded\python.exe -s ComfyUI/main.py --disable-smart-memory --use-pytorch-cross-attention
pause

You may or may not need --fp32-vae depending on the model you're working with, or you might get all black decodes. You can also use a node to switch the VAE, like Kijai's VAE loader node. As far as I know though, you have to make sure the set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 line is present and that --use-pytorch-cross-attention is there as well. I found setting MIOpen to fast there to work best, though I'm not 100% positive of the impact it might have on things like VRAM usage. The log level is just there to hush any potential console spam.

Using the above config:
With torch.backends.cudnn.enabled = False: 17.6GB VRAM usage while diffusing = 84s.
Commenting it out: 14.7GB VRAM usage while diffusing = 83s.
The images are slightly different, due to the way these kinds of optimizations work.

So it seems like torch.backends.cudnn.enabled = False is causing it to consume a lot more VRAM than before. This might be what's causing issues for some.

Oh and this is with:

pytorch version: 2.10.0a0+rocm7.10.0a20251013
AMD arch: gfx1100
ROCm version: (7, 1)

@comfyanonymous
Owner Author

This makes AMD experience on par with Nvidia and is a massive user experience improvement so I'm not reverting this. I also don't see any slowdowns or memory usage increase for SDXL or any other models I have tried on my RDNA 3 and RDNA 4 setups.

@alexheretic
Contributor

I noticed upscale image with model performance regressed with this patch: ~5s/it -> ~32s/it which is quite significant when upscaling videos.

I actually noticed a similar regression when I tried rocm 7, which is why I'm still using 6.4.

Total VRAM 16368 MB, total RAM 64217 MB
pytorch version: 2.9.0.dev20250827+rocm6.4
AMD arch: gfx1100
ROCm version: (6, 4)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 GRE : native
Using Flash Attention
Python version: 3.12.11 (main, Jun  4 2025, 10:32:37) [GCC 15.1.1 20250425]
ComfyUI version: 0.3.65
ComfyUI frontend version: 1.27.10

@RandomGitUser321
Contributor

> ~5s/it -> ~32s/it which is quite significant when upscaling videos.

You probably ran out of VRAM, which might be related to my earlier post. It's also possible that something went from being fp16 to bf16 or fp32, which would make it take up more VRAM.

@alexheretic
Contributor

It seems related to torch.backends.cudnn.enabled, though I'm not sure what effect this has exactly. Previously I had upscale image with model perf ~5s/it.

Now:

  • torch.backends.cudnn.enabled = False -> ~32s/it
  • torch.backends.cudnn.enabled = True -> ~5s/it

@xzuyn

xzuyn commented Oct 14, 2025

> This makes AMD experience on par with Nvidia and is a massive user experience improvement so I'm not reverting this.

There should be a flag to re-enable it though.

@sfinktah

sfinktah commented Oct 15, 2025

@comfyanonymous

> This makes AMD experience on par with Nvidia and is a massive user experience improvement so I'm not reverting this. I also don't see any slowdowns or memory usage increase for SDXL or any other models I have tried on my RDNA 3 and RDNA 4 setups.

I trust you have nothing against some healthy peer review? Would you care to send me your WAN 2.2 workflow, GPU specs (for the RDNA 3), and time taken? WAN 2.2 seems the most relevant model these days.

Did any of your tests include an "Upscale with Model" with ESRGAN as the model, or (even better) one of these:
image

I'm not sure that one (FILM VFI -- film_net_fp32.pt) will even actually complete without cuDNN, but if you have issues, the RIFE version of that node is more forgiving about such things.

That said, those timings and observations were made with ZLUDA, and I would want to do more testing on a native ROCm system before committing myself. And since your workflows appear to be the stick by which we must measure these things, I will await those before commencing.

@sfinktah

> It seems related to torch.backends.cudnn.enabled, though I'm not sure what effect this has exactly. Previously I had upscale image with model perf ~5s/it.
>
> Now:
>
>   • torch.backends.cudnn.enabled = False -> ~32s/it
>   • torch.backends.cudnn.enabled = True -> ~5s/it

Within the ZLUDA userbase, we have noted that resizing and upscaling are the two things that really slow down without cuDNN. Though as I just said to comfyanonymous, I haven't tested that in a native ROCm environment. At least with ZLUDA, one can reason that it has something to do with the cudnn emulation. Its similar effect on native AMD systems is a little harder to explain (assuming you are running native AMD, of course).

@comfyanonymous
Owner Author

I have tested SDXL (standard 1024x1024 workflow), lumina 2.0 (neta yume 3.5 workflow) and flux-dev (standard workflow).

I tried the wan models but on an MI300X where setting cudnn to False also improved the first run and didn't seem to slow down anything.

All my tests were on nightly pytorch rocm 7.0 from the pytorch website on Linux.

For windows I tried the nightly AMD wheel for the strix halo (you can find the way to install it in the comfyui readme).

@alexheretic
Contributor

alexheretic commented Oct 16, 2025

I can try re-testing with the latest rocm/pytorch. I do use MIOPEN_FIND_MODE=FAST, and it was suggested earlier in the thread that disabling cudnn disables MIOpen. If so, setting MIOPEN_FIND_MODE is possibly a better default than disabling cudnn. See #10302 (comment)

@alexheretic
Contributor

alexheretic commented Oct 16, 2025

I did some tests using latest https://rocm.nightlies.amd.com/v2/gfx110X-dgpu pytorch. I think this can help explain the difference of opinion in this thread.

It seems newer pytorch+rocm versions have a significant regression in performance that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect as the performance is much better with default cudnn.

Note: I used a smaller test video so the numbers are slightly different than those I mentioned earlier.

pytorch version: 2.10.0a0+rocm7.10.0a20251015

  • cudnn default: ImageUpscaleWithModel 81.14s/it
  • cudnn = False: ImageUpscaleWithModel 12.64s/it

pytorch version: 2.9.0.dev20250827+rocm6.4

  • cudnn default: ImageUpscaleWithModel 1.84s/it
  • cudnn = False: ImageUpscaleWithModel 11.95s/it

So it may make sense to disable cudnn only for newer pytorch/rocm versions and to ask upstream to find the root cause of the regression so cudnn doesn't need to be disabled anywhere.
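
A minimal sketch of what such a version gate could look like, assuming the ROCm version is available as a (major, minor) tuple like the `ROCm version: (6, 4)` line in the startup logs quoted in this thread. The function name and the (7, 0) cutoff are assumptions drawn only from the two measurements above (the rocm7.10 nightly reports itself as (7, 1) in an earlier log); it is not a tested boundary.

```python
# Hypothetical helper: decide from the ROCm version whether to disable cuDNN (MIOpen).
# The (7, 0) cutoff is an assumption based only on the two measurements in this comment.
ASSUMED_REGRESSION_START = (7, 0)


def should_disable_cudnn(rocm_version: tuple) -> bool:
    """Return True for ROCm versions where disabling cuDNN was the faster option."""
    return tuple(rocm_version) >= ASSUMED_REGRESSION_START


# ImageUpscaleWithModel timings from this comment, in seconds per iteration.
timings = {
    (6, 4): {"default": 1.84, "disabled": 11.95},
    (7, 1): {"default": 81.14, "disabled": 12.64},
}

for version, t in timings.items():
    measured_faster = "disabled" if t["disabled"] < t["default"] else "default"
    print(version, "measured faster with cudnn", measured_faster,
          "| gate disables cudnn:", should_disable_cudnn(version))
```

For both measured versions the gate agrees with the faster setting, but two data points from one RDNA3 card are not enough to fix the cutoff for other hardware.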

@RandomGitUser321
Contributor

> It seems newer pytorch+rocm versions have a significant regression in performance that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect as the performance is much better with default cudnn.

If you're using Windows, that's probably due to the older versions not having AOTriton enabled. It only recently got merged into things for Windows by default. Before, you had to manually build for it with specific args. MIOpen is a part of that mix, as far as I know.

@sfinktah

> I have tested SDXL (standard 1024x1024 workflow), lumina 2.0 (neta yume 3.5 workflow) and flux-dev (standard workflow).
>
> I tried the wan models but on an MI300X where setting cudnn to False also improved the first run and didn't seem to slow down anything.
>
> All my tests were on nightly pytorch rocm 7.0 from the pytorch website on Linux.
>
> For windows I tried the nightly AMD wheel for the strix halo (you can find the way to install it in the comfyui readme).

Totally aside from the discussion about cudnn, for which I bow to @alexheretic, I'd be genuinely interested in comparing the performance of WAN2.2 between the two extremes of nightly rocm on Linux (you), and pytorch 2.7/hip 6.2/windows/zluda (me), provided you have an RDNA 3 card that is roughly equal to my 7900 XTX. Or perhaps someone else reading this can oblige me?

Also, if you have achieved good RDNA 4 performance on any platform, there are always people raising issues re: ZLUDA about gfx1200 and gfx1201, even gfx1151 (strix). As of about 2 months ago, their reports of Linux support were not inspiring. Since people only complain when things don't work, it would be nice to know the current state of play wrt Linux ROCm.

@alexheretic
Contributor

> If you're using Windows

I'm using Linux.

@sfinktah

> > It seems newer pytorch+rocm versions have a significant regression in performance that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect as the performance is much better with default cudnn.
>
> If you're using Windows, that's probably due to the older versions not having AOTriton enabled. It only recently got merged into things for Windows by default. Before, you had to manually build for it with specific args. MIOpen is a part of that mix, as far as I know.

That's interesting to know. Honestly I have avoided trying any of the pytorch 7 based builds for Windows because there's no MIOpen.dll in AMD's ROCm SDK, ergo no triton and no torch.induction. That holds true for 6.4 unfortunately, unless there have been recent developments.

@sfinktah

> > This makes AMD experience on par with Nvidia and is a massive user experience improvement so I'm not reverting this.
>
> There should be a flag to re-enable it though.

If you care to search the ComfyUI repo for ovum or grab it from https://github.com/sfinktah/comfy-ovum there is a node specifically for enabling and disabling cudnn. It has the added bonus of being tied into my Timer node, which will record performance in red or green for cudnn disabled and enabled respectively.

image

Or you can just deploy a single node at the start of your workflow so that cudnn is in a known state. There is also an AMD/NVIDIA switching node (haven't really tested that one) if you want to get conditional about it.

image

@xzuyn

xzuyn commented Oct 16, 2025

If only the VAE encode/decode is the problem, maybe it would be best to run the encode/decode stuff inside a "with torch.backends.cudnn.flags(enabled=False):" block instead of completely disabling cudnn?
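
A rough sketch of that scoped approach. PyTorch ships `torch.backends.cudnn.flags(enabled=False)` as a context manager; the version below is a handwritten equivalent that only saves and restores `torch.backends.cudnn.enabled`, so that it also runs without PyTorch installed (a stand-in object is used in that case). `cudnn_disabled` and `vae_decode` are hypothetical names, not ComfyUI functions.

```python
import contextlib
import types

try:
    import torch
    cudnn = torch.backends.cudnn
except ImportError:
    # Stand-in so the sketch runs without PyTorch; it only mimics the `enabled` flag.
    cudnn = types.SimpleNamespace(enabled=True)


@contextlib.contextmanager
def cudnn_disabled():
    """Turn cuDNN off for the duration of the block, then restore the previous state."""
    previous = cudnn.enabled
    cudnn.enabled = False
    try:
        yield
    finally:
        cudnn.enabled = previous


def vae_decode(latent):
    # Hypothetical placeholder for the real VAE decode call.
    return latent


before = cudnn.enabled
with cudnn_disabled():
    inside = cudnn.enabled      # cuDNN is off only while the VAE work runs
    image = vae_decode([0.0])
after = cudnn.enabled           # restored, so sampling and upscaling are unaffected

print("before:", before, "inside:", inside, "after:", after)
```

The `try`/`finally` matters here: if the decode raises (for example on OOM before a tiled retry), the flag is still restored for the rest of the workflow.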

@sfinktah

sfinktah commented Oct 18, 2025

@comfyanonymous we've heard from a number of people who have pointed out that this patch is only a net benefit for specific versions of rocm on specific platforms. will you not consider

  1. altering the patch to either specifically target those cases; or
  2. printing a notification to console that cudnn has been disabled and this behavior can be altered by an ENV var; or
  3. anything that doesn't silently make a substantial number of AMD users even more frustrated with owning an AMD?

@lostdisc

On my end, I can confirm this does eliminate SDXL's really slow first run for a given image resolution (30-60 mins of mostly VAE, depending on image size); see also #5759 and ROCm issue 5754. Fixing that is a big win for first impressions. But I can sympathize with the desire for a more-targeted fix, if it's affecting other setups and use cases.

I see bf16-vae is also automatically enabling for me now, which is handy; I was using the launch flag before.

I'm on a Ryzen AI HX 370 (Strix Point) on Windows, with ROCm 6.4.4 PyTorch from AMD repos, following these instructions.

Image-generation time has become shorter for large pics, even though VAE tiling kicks in at a smaller image size. Previously, if I assigned 24GB RAM (out of 32 total) to the GPU, I could often do 1600x1280 and 1920x1088 (similar areas) without tiling, and a 30-step gen took about 300s (5 min). Now it always resorts to tiling for those sizes, but the time is down to about 220s (<4 min).

Interestingly, if I change the CPU/GPU RAM split to 16/16 instead, I can do 1600x1280 without tiling, and it shaves off a bit more time. However, attempting 1920x1280 with a 16/16 split hits the RAM ceiling and starts freezing Windows, whereas an 8/24 split is able to safely switch to tiled VAE and finish in about 300s. Before the update, 8/24 could do 1920x1280 without tiling, but it took something like 360-380s, if I recall correctly.

1024x1024 remained pretty much the same speed before/after the update, at about 1 min for 20 steps.

I notice one of the other commits for this release improved the memory estimation for VAE on AMD by adding a 2.73x multiplier when AMD is detected. Does that correspond to the increased VAE memory demands from this change?

@comfyanonymous
Owner Author

> I notice one of the other commits for this release (#10334) improved the memory estimation for VAE on AMD by adding a 2.73x multiplier when AMD is detected. Does that correspond to the increased VAE memory demands from this change?

No, the VAE memory estimation has always been off on AMD.

> @comfyanonymous we've heard from a number of people who have pointed out that this patch is only a net benefit for specific versions of rocm on specific platforms. will you not consider

With the cudnn flag enabled the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that it's staying off.

@sfinktah

sfinktah commented Oct 19, 2025

@comfyanonymous

> With the cudnn flag enabled the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that it's staying off.

Well, it's like... your repo... but could you add a logger output line so that people know it's happening, and perhaps an override? This is what I added to zluda.py in the end, you can default it the other way around obviously.

    # This needs to be up here, so it can disable cudnn before anything can even think about using it
    torch.backends.cudnn.enabled = os.environ.get("TORCH_BACKENDS_CUDNN_ENABLED", "1").strip().lower() not in {"0", "off", "false", "disable", "disabled", "no"}
    if torch.backends.cudnn.enabled:
        print("  ::  Enabled cuDNN")
    else:
        print("  ::  Disabled cuDNN")

I can adapt and submit that as a PR if you are willing? Just lmk what to name the environment variable.
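
The parsing rule in the snippet above can be isolated into a small helper, which makes the accepted spellings easy to check. This is only a sketch: `env_flag` and its `environ` parameter are hypothetical names introduced here, and the set of "off" values is copied from the snippet.

```python
import os

# Spellings treated as "off"; copied from the snippet above.
OFF_VALUES = {"0", "off", "false", "disable", "disabled", "no"}


def env_flag(name, default="1", environ=os.environ):
    """Return False only when the variable is set to one of the "off" spellings.

    Unset variables fall back to `default`; surrounding whitespace and case are ignored.
    """
    return environ.get(name, default).strip().lower() not in OFF_VALUES


# Example with plain dicts standing in for the process environment.
print(env_flag("TORCH_BACKENDS_CUDNN_ENABLED", environ={}))
print(env_flag("TORCH_BACKENDS_CUDNN_ENABLED", environ={"TORCH_BACKENDS_CUDNN_ENABLED": " Off "}))
```

Flipping `default` to "0" gives the opposite default (cuDNN off unless explicitly enabled), which is the variation mentioned above.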

@Only8Bits

> With the cudnn flag enabled the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that it's staying off.

There is nothing to fix here. It's how MIOpen works on AMD. Reference: https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/find-and-immediate.html

Short version: AMD has 2 card types, pro and consumer. On all cards the default behaviour is to search for the optimal compute solution to any given matrix problem the first time it is encountered, which takes both time and VRAM. After that it is stored in a local database and looked up, so it's fast. The problem is that the MIOpen database is per-version: once the version is bumped, the old database is ignored and a new one is started, and each new major ROCm release usually has a new MIOpen version. There is a database of optimal compute solutions for pro cards but not consumer ones (maybe it'll change someday but it hasn't for years), so Radeons need to run a lot of computations to figure out what is what. Even for pro cards, though, the AMD database does not have 100% coverage for all possible matrix sizes and data types.

Rather than breaking the whole app you can set MIOPEN_FIND_MODE="FAST" (or 2) on AMD to just skip the solution search on each new problem. On pro cards you won't even feel it much since these are pretty well profiled by AMD, so any long-term performance loss from suboptimal solution picks is minimal. On Radeons that search is quite time consuming (esp. for training and for the bigger matrix sizes in SDXL; nobody has the VRAM to run WAN in 720p mode locally on Radeons, so those problems tend to be smaller and faster to search) but it usually results in 5% or so reduced compute times long term once an optimal solution is found.

TLDR: Use MIOPEN_FIND_MODE or just accept that (sometimes massive) slowdown on each first run in exchange for a few % better performance long term. It's not a ComfyUI issue, it's how MIOpen works. Consumer cards suffer the most since there is no global database from AMD, but they also tend to find better solutions than the default ones, so it's usually worth it.
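
To put that trade-off in numbers, here is a back-of-the-envelope sketch. It assumes the roughly 5% long-term gain mentioned above applies uniformly to every later run; the 30-minute search and 60-second generation are illustrative values only, and `break_even_runs` is a hypothetical helper.

```python
def break_even_runs(search_seconds, run_seconds, gain=0.05):
    """Number of later runs needed before the one-time solution search has paid for itself.

    gain is the assumed fraction of run time saved once the optimal solution is cached.
    """
    saved_per_run = run_seconds * gain
    return search_seconds / saved_per_run


# Illustrative numbers only: a 30-minute first-run search and a 60-second generation.
runs = break_even_runs(search_seconds=30 * 60, run_seconds=60)
print(f"search pays off after about {runs:.0f} runs at the same problem size")
```

The result scales linearly with the search time, so a short search on a frequently used resolution pays off quickly, while a long search on a rarely used one may never do so before the next MIOpen version resets the database.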

@derfasthirnlosenick

> @comfyanonymous
>
> > With the cudnn flag enabled the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that it's staying off.
>
> Well, it's like... your repo... but could you add a logger output line so that people know it's happening, and perhaps an override? This is what I added to zluda.py in the end, you can default it the other way around obviously.
>
>     # This needs to be up here, so it can disable cudnn before anything can even think about using it
>     torch.backends.cudnn.enabled = os.environ.get("TORCH_BACKENDS_CUDNN_ENABLED", "1").strip().lower() not in {"0", "off", "false", "disable", "disabled", "no"}
>     if torch.backends.cudnn.enabled:
>         print("  ::  Enabled cuDNN")
>     else:
>         print("  ::  Disabled cuDNN")
>
> I can adapt and submit that as a PR if you are willing? Just lmk what to name the environment variable.

Yeah I think that's the most sensible solution. Reenabling cudnn gave my simple sdxl workflow a boost from 1.15it/s to 1.6... (6800xt)

@lostdisc

Looks like comfyanon added a log message for v0.3.66, at least.

@sfinktah

sfinktah commented Oct 23, 2025

Well, that's something. Unfortunately it seems I can't make a PR as I already have an actively developed ZLUDA fork of ComfyUI, and github won't let me make another. I have submitted a PR under another account. #10463

try:
    arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
    if not (any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH)):
        # cuDNN stays off on these GPUs unless TORCH_AMD_CUDNN_ENABLED is set to something other than an "off" value.
        torch.backends.cudnn.enabled = os.environ.get("TORCH_AMD_CUDNN_ENABLED", "0").strip().lower() not in {
            "0", "off", "false", "disable", "disabled", "no"}
        if not torch.backends.cudnn.enabled:
            logging.info(
                "ComfyUI has set torch.backends.cudnn.enabled to False for better AMD performance. Set environment var TORCH_AMD_CUDNN_ENABLED=1 to enable it again.")
except Exception:
    # Handler assumed here; the excerpt posted in the thread ends before it.
    pass
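
For readers unfamiliar with the architecture test in that excerpt, the sketch below shows how the substring match behaves. The list contents here are an illustrative assumption (the real `AMD_RDNA2_AND_OLDER_ARCH` is defined in ComfyUI and may differ), and `is_rdna2_or_older` is a hypothetical helper name.

```python
# Illustrative subset only; the real list is defined in ComfyUI and may differ.
AMD_RDNA2_AND_OLDER_ARCH = ["gfx1030", "gfx1031", "gfx906", "gfx900"]


def is_rdna2_or_older(gcn_arch_name):
    """Substring match, mirroring the `any((a in arch) ...)` test in the excerpt above."""
    return any(a in gcn_arch_name for a in AMD_RDNA2_AND_OLDER_ARCH)


# gcnArchName may carry feature suffixes, which is why a substring match is used.
for arch in ("gfx1100", "gfx1030", "gfx906:sramecc+:xnack-"):
    action = "leave cudnn untouched" if is_rdna2_or_older(arch) else "apply the cudnn override"
    print(arch, "->", action)
```

Under this check an RDNA3 card such as gfx1100 gets the override, while RDNA2 and older cards keep the PyTorch default.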

@alexheretic
Contributor

I've proposed #10448 which has the benefit of being easy to override. However, this requires some testing to verify that setting MIOPEN_FIND_MODE=FAST is a good enough alternative. On my Linux 7900gre rocm6.4 + flash-attn setup, at least, it is.
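
One way an overridable default of this kind could look. This is a sketch of the general idea, not the contents of #10448; `apply_miopen_default` is a hypothetical name, and in a real launcher it would have to run before the ROCm build of PyTorch is imported so that MIOpen sees the value.

```python
import os


def apply_miopen_default(environ=os.environ):
    """Set MIOPEN_FIND_MODE=FAST only if the user has not already chosen a value."""
    return environ.setdefault("MIOPEN_FIND_MODE", "FAST")


# Plain dicts stand in for the process environment in this example.
print(apply_miopen_default({}))                              # nothing set: the default applies
print(apply_miopen_default({"MIOPEN_FIND_MODE": "NORMAL"}))  # the user's choice is kept
```

Because `setdefault` never overwrites an existing value, exporting MIOPEN_FIND_MODE in the shell or a launch script is enough to opt out.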

@Only8Bits

Only8Bits commented Oct 24, 2025

MIOPEN_FIND_MODE=FAST only "fixes" the very first run on new MIOpen version install. Nothing else. Sadly it does not fix the VAE issue in ComfyUI - I've run some tests with WanWrapper:

System: Kubuntu 24.04 + ROCm 7.0.2 + AMD DKMS driver + Pytorch 2.8 from ROCm
Hardware: 7900XT + 5800X3D, 32G RAM

  • --cache-none, Pytorch SDPA used

    • model Wan2.2-I2V-A14B-Q8_0 GGUF (10/40), steps 10+10, Triton JIT, len 81, 832x480
      • 19m:44s (118.47s/it) + 19m:58s (119.83s/it), total 1h:08m:33s
    • model Wan2.2-I2V-A14B-Q8_0 GGUF (25/40), steps 10+10, Triton JIT, len 81, 960x720
      • 1h:04m:49s (388.91s/it) + 1h:05m:24s (392.42s/it), total 3h:03m:51s
  • --cache-none, SageAttention used

    • model I2V-A14B-Q6_K GGUF + LoRA, steps 12+12, JIT, len 81, 608x480
      • 10m:38s (53.20s/it) + 10m:19s (51.61s/it), total 22m:41s [VAE=10.80s+16.43s]
    • model I2V-A14B-Q6_K GGUF (10/40) + LoRA, steps 12+12, JIT, len 81, 832x480
      • 17m:41s (88.45s/it) + 17m:44s (88.73s/it), total 1h:05m:05s [VAE=651.28s+1048.36s]
  • --cache-none, SageAttention with torch.backends.cudnn.enabled = False

    • model I2V-A14B-Q6_K GGUF (8/40) + LoRA, steps 12+12, JIT, len 81, 832x480
      • 18m:50s (94.18s/it) + 17m:53s (89.49s/it), total 38m:22s [VAE=8.74s+13.81s]

EDIT: Small explanation, (xx/40) means that xx blocks were RAM swapped to fit the project into VRAM. This affects the inference times somewhat.

This used WanWrapper, so all custom nodes; it's not an issue specific to the Comfy VAE node. But it is tied to VAE, as the inference times seem unaffected by the torch.backends.cudnn.enabled state. The difference is rather massive: long VAE times can make the run 2x longer, though mostly for bigger resolutions. For 608x480 and 480x480, VAE still seems reasonable even with cudnn enabled.

So it could be a ROCm issue or a PyTorch issue, but I would guess it has to do with the mixed precision used in VAE? I've tried VAE in both BF16 and FP16 (cast from FP32 model) and there doesn't seem to be much difference.

TLDR: On RDNA3/Linux at least, torch.backends.cudnn.enabled must be False if you want reasonable VAE times. Seems like a bug deeper in the libraries rather than a Comfy issue.
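
The share of each run spent in VAE can be worked out from the two 832x480 SageAttention entries listed above. This is plain arithmetic on the posted numbers; note that the two runs differ slightly in swapped blocks (10/40 versus 8/40), so the comparison is approximate.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an h:m:s wall time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# (total wall time, VAE encode + decode time), both in seconds, from the list above.
runs = {
    "cudnn enabled": (to_seconds(1, 5, 5), 651.28 + 1048.36),
    "cudnn disabled": (to_seconds(0, 38, 22), 8.74 + 13.81),
}

for label, (total, vae) in runs.items():
    print(f"{label}: VAE {vae:.0f}s of {total:.0f}s total ({vae / total:.1%})")
```

Sampling time is close in both runs, so nearly all of the difference in total time comes from the VAE stages.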

@sfinktah

@Only8Bits we (the ZLUDA community) use a cuDNN-disabling wrapper or node pair to disable cuDNN during VAE decoding. It's distributed with the install procedure, but I replicated a stand-alone version that works quite handily. See #10302 (comment)

We have also had MIOPEN_FIND_MODE=2 as part of the standard launch scripts for a while, and you are very correct in what you say: VAE decoding is slower with cuDNN enabled. I can't offer any opinions on the "why" of it, though I have often thought Flux/Chroma's fp32 VAE works surprisingly fast (about 1.5 seconds).

We do have a node that automatically disables cuDNN only for VAE operations (and only on AMD), ovum-cudnn-wrapper, but it's new and doesn't always work. Wrapping other nodes is a tricky business, and could probably be done much better from the Python side. It does however look pretty and comes preset with the smarts to work out what nodes it needs to wrap.

image

It's currently pending its tick of wonderfulness in the ComfyUI repo, I can only assume because it modifies other nodes.

@alexheretic
Contributor

I do tend to use tiled VAE everywhere, including for wan encodes as untiled VAE perf can indeed be bad. My tiled vae perf seems ok with cudnn on. If I have time I'll try to test cudnn-off's effect on untiled VAE.

@sfinktah

sfinktah commented Oct 25, 2025

@alexheretic I have a lovely node timer that will mark things red or green based on whether they were running with cuDNN on or off, as long as you are using the included cuDNN toggler anyway. The difference is probably about a 50% speed increase, if it exists. It's comfy-ovum in the registry, and the Timer node. Incredibly handy, as it keeps a history of the last 100 runs, lets you add notes, and has a JSON RESTful API for retrieving that data to use in fancy graphs.

And oh yes, TILED VAE always... it's on my list to build an automatic "convert to tiled vae" doover, because sometimes I forget and don't notice that it's taking 217 seconds to do VAE decoding. Talking of which, I need to go fix a workflow!

image

What nobody has brought up in this conversation is the other cudnn field, cudnn.benchmark (set to False by default, so not generally used by AMD or NVIDIA). I guess it does something similar to the MIOPEN flag, so possibly it has no effect on AMD.

@alexheretic
Contributor

VAE benches for wan & sdxl

System info

Using #7764, #10238 on 426cde3

Total VRAM 16368 MB, total RAM 64217 MB
pytorch version: 2.9.0+rocm6.4
AMD arch: gfx1100
ROCm version: (6, 4)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 GRE : native
Using Flash Attention
Python version: 3.13.7 (main, Aug 15 2025, 12:34:02) [GCC 15.2.1 20250813]
ComfyUI version: 0.3.66
ComfyUI frontend version: 1.28.7

env vars

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE

pytorch version: 2.9.0+rocm6.4

With cudnn disabled, wan untiled encode/decode hits OOM and falls back to tiled. However, this is slower
than explicitly using 256 tiles (e.g. using #10238 for encodes + VAEDecodeTiled).
256-tile performance is about the same with cudnn on or off.

On sdxl performance is about the same too, except for untiled decode. Here cudnn-on is slow ~27s
and cudnn-off OOMs and falls back to tiled. However in either case it is faster to explicitly use
tiled with the default 512-tile decode ~3s on or off.

Conclusion:

  • Wan: cudnn can be left enabled, users should use tiling vae ✔️
  • sdxl: cudnn can be left enabled, users should use tiling vae decode ✔️

Results

wan untiled cudnn off

Note: Hits oom fast and falls back to tiled (note: fallback tiled is a bit slower than 256 tiled).

Warning: Ran out of memory when regular VAE encoding, retrying with tiled VAE encoding.
Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.

wan tiled 256 cudnn off

[WanImageToVideo]: 16.43s
[WanImageToVideo]: 22.69s

[VAEDecodeTiled]: 25.92s
[VAEDecodeTiled]: 39.56s

wan tiled 256 cudnn on

[WanImageToVideo]: 16.37s
[WanImageToVideo]: 20.11s

[VAEDecodeTiled]: 28.21s
[VAEDecodeTiled]: 35.36s

sdxl 1280x1832 vae cudnn off

[VAEEncode]: 0.45s
[VAEEncode]: 0.45s
[VAEEncode]: 0.45s

[VAEEncodeTiled]: 1.96s
[VAEEncodeTiled]: 1.62s
[VAEEncodeTiled]: 1.66s

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
[VAEDecode]: 5.41s
[VAEDecode]: 5.48s
[VAEDecode]: 5.51s

[VAEDecodeTiled]: 3.21s
[VAEDecodeTiled]: 3.26s
[VAEDecodeTiled]: 3.24s

sdxl 1280x1832 vae cudnn on

[VAEEncode]: 0.61s
[VAEEncode]: 0.45s
[VAEEncode]: 0.45s

[VAEEncodeTiled]: 2.05s
[VAEEncodeTiled]: 1.55s
[VAEEncodeTiled]: 1.56s

[VAEDecode]: 26.84s
[VAEDecode]: 26.49s
[VAEDecode]: 26.67s

[VAEDecodeTiled]: 3.41s
[VAEDecodeTiled]: 3.14s
[VAEDecodeTiled]: 3.15s

pytorch version: 2.10.0a0+rocm7.10.0a2025101

For wan it's a similar story to rocm6.4. However, for sdxl encode performance is generally
worse with cudnn-on.

Conclusion:

  • Wan: cudnn can be left enabled, users should use tiling vae ✔️
  • sdxl: cudnn should be disabled

Results

wan untiled cudnn off

Note: Hits oom fast and falls back to tiled (note: fallback tiled is a bit slower than 256 tiled).

Warning: Ran out of memory when regular VAE encoding, retrying with tiled VAE encoding.
Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.

wan tiled 256 cudnn off

[WanImageToVideo]: 16.31s
[WanImageToVideo]: 19.36s

[VAEDecodeTiled]: 25.67s
[VAEDecodeTiled]: 38.41s

wan tiled 256 cudnn on

[WanImageToVideo]: 17.26s
[WanImageToVideo]: 20.49s

[VAEDecodeTiled]: 34.65s
[VAEDecodeTiled]: 34.42s

sdxl 1280x1832 vae cudnn off

[VAEEncode]: 0.67s
[VAEEncode]: 0.53s
[VAEEncode]: 0.59s

[VAEEncodeTiled]: 1.92s
[VAEEncodeTiled]: 1.52s
[VAEEncodeTiled]: 1.54s

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
[VAEDecode]: 5.30s
[VAEDecode]: 5.34s
[VAEDecode]: 5.41s

[VAEDecodeTiled]: 3.18s
[VAEDecodeTiled]: 3.17s
[VAEDecodeTiled]: 3.22s

sdxl 1280x1832 vae cudnn on

[VAEEncode]: 10.21s
[VAEEncode]: 9.84s
[VAEEncode]: 9.77s

[VAEEncodeTiled]: 22.60s
[VAEEncodeTiled]: 22.68s
[VAEEncodeTiled]: 22.76s

[VAEDecode]: 26.76s
[VAEDecode]: 26.61s
[VAEDecode]: 26.59s

[VAEDecodeTiled]: 3.19s
[VAEDecodeTiled]: 3.15s
[VAEDecodeTiled]: 3.14s

An interesting difference with cudnn off is that untiled VAE tends to OOM quickly (logging "Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.") and then tries the tiled version automatically, whereas with cudnn on it doesn't OOM, it just takes a long time. To me this doesn't seem like a good reason to disable cudnn, though. Instead, perhaps RDNA3 should default to tiled VAE generally, rather than forcing users to always configure it in their workflows.
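
A schematic sketch of the two behaviours being compared: the automatic fallback that fires after an out-of-memory error, and a "tiled by default" policy that skips the failed untiled attempt. All function names are hypothetical stand-ins, and `MemoryError` stands in for the GPU out-of-memory exception; this is not ComfyUI's implementation.

```python
def decode_untiled(latent):
    # Hypothetical stand-in that always runs out of memory, to exercise the fallback path.
    raise MemoryError("simulated out-of-memory")


def decode_tiled(latent, tile_size=512):
    # Hypothetical stand-in for a tiled decode.
    return f"decoded {latent} in {tile_size}px tiles"


def decode(latent, prefer_tiled=False):
    """Decode a latent, either tiled from the start or untiled with a tiled retry on OOM."""
    if prefer_tiled:
        return decode_tiled(latent)
    try:
        return decode_untiled(latent)
    except MemoryError:
        print("Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.")
        return decode_tiled(latent)


fallback_result = decode("latent")
tiled_first_result = decode("latent", prefer_tiled=True)
print(fallback_result == tiled_first_result)
```

Both paths end with the same tiled result; the difference is the time spent on the untiled attempt first, which matches the gap between the fallback and explicit-tiled timings in the results above.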

@lostdisc

lostdisc commented Oct 25, 2025

Interesting results, thanks! For comparison, I did some SDXL VAE testing on my APU, all with bf16 vae and 1280x1600 dimensions:

Untiled VAE decode, cudnn enabled: ~100s, ~10GB VRAM
Tiled VAE decode, cudnn enabled: ~15s, ~3GB VRAM

Untiled VAE decode, cudnn disabled: ~5s, ~19GB VRAM
Tiled VAE decode, cudnn disabled: ~15s, ~3GB RAM

So yeah, enabling cudnn halves the RAM for untiled VAE, but takes MUCH longer to run. And that's not even the extra-long first run for this resolution, which was more like 30 minutes.

Tiled VAE uses similar time and RAM with/without cudnn, but still has the slow-first-run issue with cudnn enabled. (IIRC it's not as bad as fullsize though, presumably since the tile size stays the same for different image dimensions, except maybe at the picture edges.)

My system is a Ryzen AI 9 HX 370 with a Radeon 890M iGPU (gfx1150). 32GB RAM with 16GB assigned to the GPU (and another 8GB shareable). Oddly, assigning 24GB makes it fall back to tiled decoding for the cudnn-disabled case.
ComfyUI 0.3.66, Python 3.12.11, ROCm 6.4.4, and Pytorch 2.8.0 on Windows 11.

@alexheretic
Contributor

> the extra-long first run for this resolution, which was more like 30 minutes

This sounds like maybe the key issue and reason for disabling cudnn. I didn't reproduce it in my setup though. For me the downside of disabling cudnn is #10447, so I was hoping for a better solution than this.

As suggested earlier by others, maybe change it so cudnn is disabled only during VAE? And/or maybe add some arg/env var to control this.
