Improve AMD performance. #10302
Conversation
I honestly have no idea why this improves things but it does.
|
It's related to MIOpen: ROCm/TheRock#1542. I'm not 100% positive, but I think |
> I honestly have no idea why this improves things but it does.
|
When using flash-attention (built from the ROCm repo, main_perf branch) this commit slows inference by a lot on my 7800XT:
Maybe this new behavior can be disabled in case flash-attention is used? |
|
This causes an OOM crash on my Linux 6700XT setup when VAE decoding starts. |
|
Please revert this commit. We have done extensive testing over a period of months, and I can assure you that your assertion is incorrect. Disabling
Please see the many highly detailed graphs showing timing attached to the PR below, where we briefly toyed with the identical patch that you have just implemented. We have developed many solutions for dealing with slow VAE decodes, the latest being a custom extension that dynamically toggles cudnn off during VAE encoding and decoding, and only when an AMD GPU is in use: https://github.com/sfinktah/ovum-cudnn-wrapper I am also the developer of a timing node, https://github.com/sfinktah/comfy-ovum, which lets me compare the performance of every element of a workflow with and without cudnn (cudnn off is red, cudnn on is green).
That example (which I plucked at random) shows how different nodes react either positively or negatively to having cudnn disabled. I can prepare detailed graphs showing timing and memory usage for any workflow you care to nominate, on any platform you care to nominate, if that proves necessary. A more useful AMD helper would optionally replicate the functionality of my aforementioned cudnn-wrapper within your Python core code. |
|
RX 7900 XT user here, just adding to the pile that I'm also experiencing issues after updating to this commit. Gens that would normally take ~30 seconds to finish are taking ~43 seconds. (It was also causing OOM on VAE decode and falling back to tiled.) Reverting the file completely to what it was 3 weeks ago brought back the expected performance. |
|
I think it heavily depends on which wheels you are using and which OS you're on. If you're on Linux, you might be using TheRock or the official PyTorch wheels. On Windows there are a lot of people using different methods: ZLUDA, the older Scott wheels or TheRock wheels, or WSL with various Linux wheels, etc. In my testing, I found using the following as my run.bat script to work most reliably (Windows 11 and TheRock nightly wheels with a 7900 XT):

You may or may not need --fp32-vae depending on the model you're working with, or you might get all-black decodes. You can also use a node to switch the VAE, like Kijai's VAE loader node. As far as I know, though, you have to make sure set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 is enabled and that --use-pytorch-cross-attention is there as well. I found setting MIOpen to fast there to work best, though I'm not 100% sure of the impact it might have on things like VRAM usage. The log level is just there to hush any potential console spam.

Using the above config: it seems like torch.backends.cudnn.enabled = False is causing it to consume a lot more VRAM than before. This might be what's causing issues for some. Oh, and this is with: |
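For readers who want to reproduce something like this setup, here is a minimal sketch of an equivalent launcher written in Python rather than the commenter's actual run.bat (which isn't reproduced above). TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL, MIOPEN_FIND_MODE, --use-pytorch-cross-attention and --fp32-vae are taken from the comment; the use of MIOPEN_LOG_LEVEL for the "log level" it mentions is an assumption.

```python
# Hypothetical launcher sketch, not the commenter's run.bat.
import os
import subprocess
import sys

env = os.environ.copy()
env["TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL"] = "1"  # named in the comment above
env["MIOPEN_FIND_MODE"] = "FAST"                      # "MIOpen set to fast"
env["MIOPEN_LOG_LEVEL"] = "3"                         # assumed variable for quieting MIOpen console spam

cmd = [
    sys.executable, "main.py",            # ComfyUI entry point
    "--use-pytorch-cross-attention",
    "--fp32-vae",                         # only if the model otherwise produces black decodes
]
subprocess.run(cmd, env=env, check=False)
```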
|
This makes the AMD experience on par with Nvidia and is a massive user-experience improvement, so I'm not reverting this. I also don't see any slowdowns or memory-usage increase for SDXL or any other models I have tried on my RDNA 3 and RDNA 4 setups. |
|
I noticed that "upscale image with model" performance regressed with this patch: ~5s/it -> ~32s/it, which is quite significant when upscaling videos. I actually noticed a similar regression when I tried rocm 7, which is why I'm still using 6.4. |
You probably ran out of VRAM, which might be related to my earlier post. It's also possible that something went from being fp16 to bf16 or fp32, which would make it take up more VRAM. |
> I honestly have no idea why this improves things but it does.
|
It seems related to Now:
|
There should be a flag to re-enable it though. |
Within the ZLUDA userbase, we have noted that resizing and upscaling are the two things that really slow down without cuDNN. Though as I just said to comfyanonymous, I haven't tested that in a native ROCm environment. At least with ZLUDA, one can reason that it has something to do with the cudnn emulation; why it has a similar effect on native AMD systems is a little harder to explain (assuming you are running native AMD, of course). |
|
I have tested SDXL (standard 1024x1024 workflow), lumina 2.0 (neta yume 3.5 workflow) and flux-dev (standard workflow). I tried the wan models, but on an MI300X, where setting cudnn to False also improved the first run and didn't seem to slow down anything. All my tests were on nightly pytorch rocm 7.0 from the pytorch website on Linux. On Windows I tried the nightly AMD wheel for Strix Halo (you can find how to install it in the ComfyUI readme). |
> I honestly have no idea why this improves things but it does.
|
I can try re-testing with latest rocm/pytorch. I do use |
|
I did some tests using the latest https://rocm.nightlies.amd.com/v2/gfx110X-dgpu pytorch. I think this can help explain the difference of opinion in this thread. It seems newer pytorch+rocm versions have a significant performance regression that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect, as performance is much better with default cudnn. Note: I used a smaller test video, so the numbers are slightly different from those I mentioned earlier. pytorch version: 2.10.0a0+rocm7.10.0a20251015
pytorch version: 2.9.0.dev20250827+rocm6.4
So it may make sense to disable cudnn only for newer pytorch/rocm versions and to ask upstream to find the root cause of the regression so cudnn doesn't need to be disabled anywhere. |
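A minimal sketch of that suggestion, purely as an illustration (not the actual ComfyUI patch), assuming the ROCm major version can be read from torch.version.hip on ROCm builds of PyTorch:

```python
import torch

def maybe_disable_cudnn_for_new_rocm(min_major: int = 7) -> None:
    # torch.version.hip is a version string on ROCm builds, None otherwise.
    hip = getattr(torch.version, "hip", None)
    if hip is None:
        return  # not a ROCm build, leave cudnn alone
    try:
        major = int(hip.split(".")[0])
    except ValueError:
        return
    if major >= min_major:
        # mitigate the regression observed on newer pytorch+rocm
        torch.backends.cudnn.enabled = False

maybe_disable_cudnn_for_new_rocm()
```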
If you're using Windows, that's probably due to the older versions not having AOTriton enabled. It only recently got merged into things for Windows by default. Before, you had to manually build for it with specific args. MIOpen is a part of that mix, as far as I know. |
Totally aside from the discussion about cudnn, for which I bow to @alexheretic, I'd be genuinely interested in comparing the performance of WAN2.2 between the two extremes of nightly rocm on Linux (you) and pytorch 2.7/hip 6.2/windows/zluda (me), provided you have an RDNA 3 card that is roughly equal to my 7900 XTX. Or perhaps someone else reading this can oblige me? Also, if you have achieved good RDNA 4 performance on any platform: there are always people raising issues re: ZLUDA about gfx1200 and gfx1201, even gfx1151 (Strix). As of about 2 months ago, their reports of Linux support were not inspiring. Since people only complain when things don't work, it would be nice to know the current state of play wrt Linux ROCm. |
I'm using Linux. |
That's interesting to know. Honestly, I have avoided trying any of the ROCm 7 based pytorch builds for Windows because there's no MIOpen.dll in AMD's ROCm SDK, ergo no Triton and no TorchInductor. That holds true for 6.4 unfortunately, unless there have been recent developments. |
If you care to search the ComfyUI repo for
Or you can just deploy a single node at the start of your workflow so that cudnn is in a known state. There is also an AMD/NVIDIA switching node (haven't really tested that one) if you want to get conditional about it.
|
|
If only the VAE encode/decode is the problem, maybe it would be best to run the encode/decode stuff |
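A minimal sketch of that idea, assuming a plain context manager around whatever VAE call a given setup uses (vae.decode below is a placeholder, not ComfyUI's actual API):

```python
import contextlib
import torch

@contextlib.contextmanager
def cudnn_disabled():
    # Temporarily turn cudnn off, then restore the previous setting.
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = False
    try:
        yield
    finally:
        torch.backends.cudnn.enabled = previous

# Hypothetical usage around a VAE decode only:
# with cudnn_disabled():
#     images = vae.decode(latents)
```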
|
@comfyanonymous We've heard from a number of people who have pointed out that this patch is only a net benefit for specific versions of rocm on specific platforms. Will you not consider
|
|
On my end, I can confirm this does eliminate SDXL's really slow first run for a given image resolution (30-60 mins of mostly VAE, depending on image size); see also #5759 and ROCm issue 5754. Fixing that is a big win for first impressions. But I can sympathize with the desire for a more-targeted fix, if it's affecting other setups and use cases. I see bf16-vae is also automatically enabling for me now, which is handy; I was using the launch flag before.

I'm on a Ryzen AI HX 370 (Strix Point) on Windows, with ROCm 6.4.4 PyTorch from AMD repos, following these instructions. Image-generation time has become shorter for large pics, even though VAE tiling kicks in at a smaller image size. Previously, if I assigned 24GB RAM (out of 32 total) to the GPU, I could often do 1600x1280 and 1920x1088 (similar areas) without tiling, and a 30-step gen took about 300s (5 min). Now it always resorts to tiling for those sizes, but the time is down to about 220s (<4 min).

Interestingly, if I change the CPU/GPU RAM split to 16/16 instead, I can do 1600x1280 without tiling, and it shaves off a bit more time. However, attempting 1920x1280 with a 16/16 split hits the RAM ceiling and starts freezing Windows, whereas an 8/24 split is able to safely switch to tiled VAE and finish in about 300s. Before the update, 8/24 could do 1920x1280 without tiling, but it took something like 360-380s, if I recall correctly. 1024x1024 remained pretty much the same speed before/after the update, at about 1 min for 20 steps.

I notice one of the other commits for this release improved the memory estimation for VAE on AMD by adding a 2.73x multiplier when AMD is detected. Does that correspond to the increased VAE memory demands from this change? |
No, the VAE memory estimation has always been off on AMD.
With the cudnn flag enabled, the VAE and the upscale models make my computer unusable for a few minutes the first time I run them at a specific resolution. Until they fix that, it's staying off. |
Well, it's like... your repo... but could you add a logger output line so that people know it's happening, and perhaps an override? This is what I added to zluda.py in the end; you can default it the other way around, obviously.

```python
import os
import torch

# This needs to be up here, so it can disable cudnn before anything can even think about using it
torch.backends.cudnn.enabled = os.environ.get("TORCH_BACKENDS_CUDNN_ENABLED", "1").strip().lower() not in {"0", "off", "false", "disable", "disabled", "no"}
if torch.backends.cudnn.enabled:
    print(" :: Enabled cuDNN")
else:
    print(" :: Disabled cuDNN")
```

I can adapt and submit that as a PR if you are willing? Just let me know what to name the environment variable. |
There is nothing to fix here. It's how MIOpen works on AMD. Reference: https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/find-and-immediate.html

Short version: AMD has two card types, pro and consumer. On all cards the default behaviour is to search for the optimal compute solution to any given matrix problem the first time it is encountered - that takes both time and VRAM. After that it's stored in a local database and looked up, so it's fast. The problem is that the MIOpen database is per-version: once the version is bumped, the old database is ignored and a new one is started. And each new major ROCm release usually has a new MIOpen version. There is a database of optimal compute solutions for pro cards but not consumer ones (maybe it'll change someday, but it hasn't for years), so Radeons need to run a lot of computations to figure out what is what - though even for pro cards the AMD database does not have 100% coverage for all possible matrix sizes and data types.

Rather than breaking the whole app, you can set MIOPEN_FIND_MODE="FAST" (or 2) on AMD to just skip the solution search on each new problem. On pro cards you won't even feel it much, since these are pretty well profiled by AMD, so any long-term performance loss from suboptimal solution picks is minimal. On Radeons that search is quite time consuming (especially for training and the bigger matrix sizes in SDXL; since nobody has the VRAM to run WAN in 720p mode locally on Radeons, those problems tend to be smaller and faster to find), but it usually results in 5% or so reduced compute times long term once an optimal solution is found.

TLDR: Use MIOPEN_FIND_MODE or just accept that (sometimes massive) slowdown on each first run for a few % better performance long term. It's not a ComfyUI issue; it's how MIOpen works. And consumer cards suffer the most, since there is no global database from AMD, but they also tend to find better solutions than the default ones, so it's usually worth it. |
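For anyone who wants to try this from Python instead of a launch script, a minimal sketch (assuming the variable only needs to be in the process environment before MIOpen first runs a solution search):

```python
import os

# Opt into MIOpen's fast find mode; "FAST" and "2" are equivalent per the docs linked above.
# Set it before importing torch so it is in place before any convolution runs.
os.environ.setdefault("MIOPEN_FIND_MODE", "FAST")

import torch  # noqa: E402  (imported after the variable is set on purpose)
```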
Yeah I think that's the most sensible solution. Reenabling cudnn gave my simple sdxl workflow a boost from 1.15it/s to 1.6... (6800xt) |
|
Looks like comfyanon added a log message for v0.3.66, at least. |
|
Well, that's something.

```python
try:
    arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
    if not (any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH)):
        torch.backends.cudnn.enabled = os.environ.get("TORCH_AMD_CUDNN_ENABLED", "0").strip().lower() not in {
            "0", "off", "false", "disable", "disabled", "no"}
        if not torch.backends.cudnn.enabled:
            logging.info(
                "ComfyUI has set torch.backends.cudnn.enabled to False for better AMD performance. Set environment var TORCH_AMD_CUDDNN_ENABLED=1 to enable it again.")
```

|
|
I've proposed #10448 which has the benefit of being easy to override. However, this requires some testing to verify that setting |
…isablement of cudnn for all AMD users)
|
MIOPEN_FIND_MODE=FAST only "fixes" the very first run on a new MIOpen version install. Nothing else. Sadly it does not fix the VAE issue in ComfyUI - I've run some tests with WanWrapper. System: Kubuntu 24.04 + ROCm 7.0.2 + AMD DKMS driver + Pytorch 2.8 from ROCm
EDIT: Small explanation: (xx/40) means that xx blocks were RAM-swapped to fit the project into VRAM. This affects the inference times somewhat. This used WanWrapper, so all custom nodes; it's not a specific issue with the Comfy VAE node. But it is tied to VAE, as the inference times seem unaffected by the torch.backends.cudnn.enabled state. The difference is rather massive: long VAE times can cause the run to be 2x longer, but mostly for bigger resolutions. For 608x480 and 480x480 the VAE still seems reasonable even with cudnn enabled. So it could be a ROCm issue or a Pytorch issue, but I would guess it has to do with the mixed precision used in the VAE? I've tried the VAE in both BF16 and FP16 (cast from the FP32 model) and there doesn't seem to be much difference. TLDR: On RDNA3/Linux at least, torch.backends.cudnn.enabled must be False if you want reasonable VAE times. Seems like a bug deeper in the libraries rather than a Comfy issue. |
|
@Only8Bits we (the ZLUDA community) use a cuDNN-disabling wrapper or node pair to disable cuDNN during VAE decoding. It's distributed with the install procedure, but I replicated a stand-alone version that works quite handily; see #10302 (comment). We have also had MIOPEN_FIND_MODE=2 as part of the standard launch scripts for a while, and you are very correct in what you say: VAE decoding is slower with cuDNN enabled. I can't offer any opinions on the "why" of it, though I have often thought Flux/Chroma's fp32 VAE works surprisingly fast (about 1.5 seconds). We do have a node that automatically disables cuDNN only for VAE operations (and only on AMD), but it's new and doesn't always work. Wrapping other nodes is a tricky business, and could probably be much better done from the Python side: ovum-cudnn-wrapper. It does however look pretty and comes preset with the smarts to work out what nodes it needs to wrap.
It's currently pending its tick of wonderfulness in the ComfyUI repo, I can only assume because it modifies other nodes. |
|
I do tend to use tiled VAE everywhere, including for wan encodes as untiled VAE perf can indeed be bad. My tiled vae perf seems ok with cudnn on. If I have time I'll try to test cudnn-off's effect on untiled VAE. |
|
@alexheretic I have a lovely node timer that will mark things red or green based on whether they were running with cuDNN on or off, as long as you are using the included cuDNN toggler anyway. The difference is probably about a 50% speed increase, if it exists. And oh yes, TILED VAE always... it's on my list to build an automatic "convert to tiled VAE" do-over, because sometimes I forget and don't notice that it's taking 217 seconds to do VAE decoding. Talking of which, I need to go fix a workflow!
What nobody has brought up in this conversation is the other cudnn field, |
|
VAE benches for wan & sdxl

System info: using #7764, #10238 on 426cde3 (env vars and full details collapsed in the original comment).

pytorch version: 2.9.0+rocm6.4
With cudnn disabled, wan untiled encode/decode hits OOM and falls back to tiled. However, this is slower. On sdxl, performance is about the same too, except for untiled decode; here cudnn-on is slow (~27s).
Conclusion:

Results (collapsed sections; detailed timings not reproduced here):
- wan untiled, cudnn off. Note: hits OOM fast and falls back to tiled (fallback tiled is a bit slower than 256 tiled).
- wan tiled 256, cudnn off
- wan tiled 256, cudnn on
- sdxl 1280x1832 vae, cudnn off
- sdxl 1280x1832 vae, cudnn on

pytorch version: 2.10.0a0+rocm7.10.0a2025101
For wan it's a similar story to rocm6.4. However, for sdxl, encode performance is generally
Conclusion:

Results (collapsed sections; detailed timings not reproduced here):
- wan untiled, cudnn off. Note: hits OOM fast and falls back to tiled (fallback tiled is a bit slower than 256 tiled).
- wan tiled 256, cudnn off
- wan tiled 256, cudnn on
- sdxl 1280x1832 vae, cudnn off
- sdxl 1280x1832 vae, cudnn on

An interesting difference with cudnn off is that untiled VAE tends to OOM faster. (log) |
|
Interesting results, thanks! For comparison, I did some SDXL VAE testing on my APU, all with bf16 vae and 1280x1600 dimensions:

Untiled VAE decode, cudnn enabled: ~100s, ~10GB VRAM
Untiled VAE decode, cudnn disabled: ~5s, ~19GB VRAM

So yeah, enabling cudnn halves the RAM for untiled VAE, but takes MUCH longer to run. And that's not even the extra-long first run for this resolution, which was more like 30 minutes. Tiled VAE uses similar time and RAM with/without cudnn, but still has the slow-first-run issue with cudnn enabled. (IIRC it's not as bad as full-size though, presumably since the tile size stays the same for different image dimensions, except maybe at the picture edges.)

My system is a Ryzen AI 9 HX 370 with a Radeon 890M iGPU (gfx1150), 32GB RAM with 16GB assigned to the GPU (and another 8GB shareable; oddly, assigning 24GB makes it fall back to tiled decoding for the cudnn-disabled case). |
This sounds like maybe the key issue and the reason for disabling cudnn. I didn't reproduce it in my setup though. For me, the downside of disabling cudnn is #10447, so I was hoping for a better solution than this. As suggested earlier by others, maybe change it so cudnn is disabled only during VAE? And/or maybe add some arg/env var to control this. |