feat: automatically fall back to VAE tiling when an untiled decode exceeds the backend buffer limit by RapidMark · Pull Request #1621 · leejet/stable-diffusion.cpp

RapidMark · 2026-06-08T12:33:12Z

On memory-constrained backends — integrated GPUs especially — a full-image VAE decode allocates a single compute buffer larger than the backend's maximum single-buffer/allocation size, and sd.cpp hard-fails instead of falling back to the tiling it already supports. The user has to know to pass --vae-tiling up front; otherwise the run crashes at the very end, after sampling has already completed.

Repro

AMD Radeon 8060S (Strix Halo, RDNA3.5 iGPU, 128 GB unified memory), Vulkan backend, Flux Krea-dev Q4 at 1024×1024, with no tiling flag:

[INFO ] stable-diffusion.cpp - sampling completed, taking 10.24s
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp - vae: failed to allocate the compute buffer
[ERROR] vae.hpp - vae decode compute failed
[ERROR] main.cpp - generate failed

The ~8.5 GB single-shot VAE decode buffer exceeds the iGPU's Vulkan per-buffer limit. The card has ample total memory (it shares 128 GB system RAM) — the failure is the per-buffer ceiling, not capacity. The whole gen is lost after a successful sampling pass.

Change

Add an automatic fallback to tiling, on by default, and keep it non-breaking:

--vae-tiling stays exactly as it was — a boolean flag that forces tiling on.
The auto-fallback is the new default. Before allocating the untiled decode buffer, the planned size is measured from the graph with ggml_gallocr_reserve_n_size (no-alloc planning, zero allocation) and compared against ggml_backend_buft_get_max_size(); if it won't fit, the decode goes straight to tiling. This is non-breaking — a decode that previously fit behaves identically, and one that previously OOM'd now recovers — and strictly safer. On CPU get_max_size() is SIZE_MAX, so it no-ops there.
--no-vae-tiling-fallback disables the fallback for anyone who wants the old hard-fail behavior.
A reactive backstop remains: if the untiled _compute still returns empty at runtime (e.g. the planned size fit the max but the device is genuinely full), it frees the buffer and retries once tiled — so a true OOM is also covered.

Implemented with a bool auto_tile appended to the end of sd_tiling_params_t (kept at the end so the C ABI stays backward-compatible; default true), the proactive probe in GGMLRunner::alloc_compute_buffer, and the fallback branch in VAE::decode.

Choosing the real graph-planned size (not a hardcoded bytes-per-pixel estimate) keeps it correct across every VAE architecture (SD/SDXL/Flux/Wan/LTX) and backend with no tuning.

Validation (AMD Radeon 8060S iGPU, Krea Q4, 1024²)

default (no flag) → vae: untiled decode buffer exceeded the backend limit; retrying with tiling, completes, exit 0
--no-vae-tiling-fallback → fails at decode, exit 1 (the old behavior, opt-in)
--vae-tiling → tiles from the start, exit 0

The tiled GPU decode (~6.9 s) is also far faster than the usual workaround of routing the VAE to CPU (~29.5 s) to dodge the OOM, and is visually equivalent at 0.5 tile overlap.

Helps any constrained device, not just iGPUs — an 8 GB discrete card at high resolution hits the same per-buffer wall. Scoped to decode (where the failure occurs); encode has the same shape and could get the identical treatment later.

Thanks to @wbruna for pushing toward the proactive graph-planned size, and @stduhpf for catching that the original tristate would have broken the --vae-tiling syntax (this revision keeps it a plain flag + auto-by-default + opt-out).

wbruna · 2026-06-08T15:18:55Z

I chose reactive (retry on the real allocation failure) over proactive (estimate the buffer size and compare against ggml_backend_buft_get_max_size()) deliberately: a size estimate is VAE-architecture-specific (peak activation differs across SD/SDXL/Flux/Wan/LTX VAEs), so a hardcoded bytes-per-pixel constant would be brittle

Wouldn't be possible to check with the real value, calculated from the graph before the allocation?

RapidMark · 2026-06-08T23:36:47Z

Good call — done (pushed just now).

Instead of retrying on failure, the AUTO path now measures the planned compute-buffer size up front with ggml_gallocr_reserve_n_size (no-alloc planning, zero allocation) and compares it to ggml_backend_buft_get_max_size() before allocating; if it won't fit, it goes straight to tiling — no bytes-per-pixel estimate. On CPU get_max_size() is SIZE_MAX, so it no-ops there.

I kept the original retry-on-empty as a backstop for a genuine runtime OOM (planned size fits the max, but the device is actually full). Net effect on the auto path: the backend no longer prints its raw "allocation failed" error — just an INFO line and the tiled decode.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): --vae-tiling auto now logs only the INFO "retrying with tiling" and completes; off still fails; on still tiles.

stduhpf · 2026-06-09T09:22:56Z

I think having a fallback to vae tiling is a much welcome addition, but I'm having some small issues with the user experience there. Modifying the syntax of --vae-tiling arg from a flag to a tistate option will break previously working commands, and I think we could implement the same feature without breakling anything.

For example we could add a --vae-tiling-auto flag.

Alternatively, set "auto" tiling as default and add something like a --no-auto-vae-tiling to disable it.

RapidMark · 2026-06-09T15:14:20Z

Thanks — agreed, changing --vae-tiling's syntax wasn't worth the breakage. Pushed a revision (your alternative B):

--vae-tiling is back to the original boolean flag (force tiling on).
The auto-fallback is now the default — it only tiles when an untiled decode would exceed the backend's max buffer size, so it's non-breaking and strictly safer for existing commands (an untiled decode that previously fit behaves identically; one that previously OOM'd now recovers).
--no-vae-tiling-fallback disables it for the old hard-fail behavior.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): default auto-recovers (logs vae: untiled decode buffer exceeded the backend limit; retrying with tiling, exit 0); --no-vae-tiling-fallback fails (exit 1); --vae-tiling tiles from the start (exit 0).

wbruna

These are mostly suggestions, but I'm marking as 'changes needed' anyway because Claude is listed as a co-author (see CONTRIBUTING.md).

wbruna · 2026-06-12T10:52:55Z

+    // Tristate with `enabled`: enabled => ON (always tile); else auto_tile => AUTO (tile only when
+    // an untiled VAE compute buffer can't be allocated, e.g. it exceeds the backend's max buffer
+    // size on an iGPU); else OFF (never tile, fail if the untiled buffer doesn't fit). Default AUTO.
+    // Appended (rather than folded into an enum) to keep the struct ABI backward-compatible.


I think these kind of detailed comments would be fine if the file already had detailed comments everywhere else.

Also (just suggestions, I don't know what @leejet would prefer): maybe we could use an extra_tiling_args parameter instead of a separate flag, since it'd be more useful as a workaround or for testing?

And rather than a simple on/off switch, maybe we could receive a threshold override here? Say, -1 for disabling, 0 for auto, > 0 as the new limit. That way, the user could increase it if they know the device can handle it, decrease it to save VRAM for other reasons, etc (working around the Vulkan 1G limit would be an immediate use case, too).

wbruna · 2026-06-12T12:16:35Z

+            // genuine runtime OOM (planned size <= max, but the device is full)
+            // is NOT caught here -- it still surfaces from the real reserve
+            // below, so the reactive fallback remains the backstop.
+            size_t max_size = ggml_backend_buft_get_max_size(buft);


I believe the size calculation is needlessly including the VAE weights here. By default I get:

[DEBUG] ggml_extend.hpp:1932 - vae: untiled compute buffer 2112.06 MB exceeds backend max single buffer 1024.00 MB; deferring to tiling

Fiddling with GGML_VK_SUBALLOCATION_BLOCK_SIZE env var, it shows:

[DEBUG] ggml_extend.hpp:1953 - vae compute buffer size: 1920.06 MB(VRAM)

But this may be kind of a moot point, at least on Vulkan: as far as I can tell by looking at the ggml code, the limit on Vulkan will by default be capped at 1G anyway (suballocation_block_size, which is the value reported by ggml_backend_buft_get_max_size; the more useful max_buffer_size doesn't seem to be accessible from the API 😕).

pwilkin · 2026-06-12T13:55:54Z

FWIW this is part of my sweeping memory-management / lazy-loading changes in #1470 as well.

RapidMark · 2026-06-12T17:21:48Z

Thanks @pwilkin, good to know. This one's deliberately narrow — it's an OOM safety-net in the VAE decode path: when an untiled decode would exceed the backend's max buffer size, it automatically falls back to tiling instead of erroring out. So it should be complementary to the broader memory-management / lazy-loading in #1470 rather than overlapping it. Happy to rebase on top of #1470, or defer to it entirely if you and leejet would rather land the consolidated approach — whatever keeps things cleanest.

RapidMark · 2026-06-12T17:44:44Z

@leejet — before I make any code changes here, I'd like your call on direction, since @wbruna raised two API options and deferred to your preference:

1. auto-fallback API. Currently a tristate on the VAE-tiling flag (off / on / auto — where auto tiles only when an untiled VAE compute buffer would exceed the backend's max single-buffer size). @wbruna suggested a threshold override instead — -1 disable, 0 auto, > 0 a user-set byte limit — letting a user raise the cap on a capable device, lower it to save VRAM, or work around Vulkan's ~1 GB suballocation cap. He also floated an extra_tiling_args-style parameter. Keep auto/on/off, or move to the threshold override?

2. size measurement. @wbruna noted the proactive measurement may include the VAE weights alongside the compute buffer (~2112 MB vs ~1920 MB), though it's likely moot on Vulkan since ggml_backend_buft_get_max_size returns the 1 GB suballocation block there. Refine, or leave as-is?

I'll hold the implementation until you point a direction, then make the changes.

wbruna · 2026-06-12T21:55:02Z

the more useful max_buffer_size doesn't seem to be accessible from the API 😕).

Actually, there is a way to access it: ggml_backend_vk_device_supports_op rejects any tensor larger than the max buffer size. The following attempt:

diff --git a/src/core/ggml_extend.hpp b/src/core/ggml_extend.hpp
index 4ebbc0a..2c14551 100644
--- a/src/core/ggml_extend.hpp
+++ b/src/core/ggml_extend.hpp
@@ -1922,6 +1922,21 @@ protected:
             // genuine runtime OOM (planned size <= max, but the device is full)
             // is NOT caught here -- it still surfaces from the real reserve
             // below, so the reactive fallback remains the backstop.
+            if (sd_backend_is(runtime_backend, "Vulkan")) {
+                size_t max_size = 0;
+                for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
+                    ggml_tensor* op = ggml_graph_node(gf, i);
+                    max_size = std::max(ggml_nbytes(op), max_size);
+                    if (!ggml_backend_supports_op(runtime_backend, op)) {
+                        LOG_DEBUG("%s: untiled compute op size %.2f MB exceeds backend support; deferring to tiling",
+                                  get_desc().c_str(),
+                                  max_size / 1024.0 / 1024.0);
+                        compute_buffer_deferred_to_tiling = true;
+                        return false;
+                    }
+                }
+                LOG_DEBUG("%s: max op size = %.2f MB", get_desc().c_str(), max_size / 1024.0 / 1024.0);
+            } else {
             size_t max_size = ggml_backend_buft_get_max_size(buft);
             if (max_size > 0) {
                 ggml_gallocr* probe = ggml_gallocr_new(buft);
@@ -1937,6 +1952,7 @@ protected:
                     return false;
                 }
             }
+            }
         }
 
         compute_allocr = ggml_gallocr_new(buft);

worked perfectly with my card's 4GiB limit. With an SDXL 1024x960 gen, I've got:

[DEBUG] ggml_extend.hpp:1931 - vae: untiled compute op size 4320.00 MB exceeds backend support; deferring to tiling

while the slightly smaller 960x960 worked without tiling, despite the graph as a whole getting much larger than 4GiB:

[DEBUG] ggml_extend.hpp:1938 - vae: max op size = 4050.00 MB
[DEBUG] ggml_extend.hpp:1969 - vae compute buffer size: 9031.86 MB(VRAM)

Of course, that would assume that "unsupported" means "too large"; but a truly unsupported operation would end up failing in the same way as before.

RapidMark · 2026-06-14T20:46:39Z

Thanks @wbruna — I went and measured; your supports_op approach is the right call, so I've adopted it.

On the weights: the planned compute size scales exactly with output area (416 / 1664 / 3745 / 6657 MB at 256² / 512² / 768² / 1024²) with no constant offset, so the VAE weights aren't actually being counted in it — there was nothing to exclude there. The real difference your approach makes is using the device's true per-buffer limit instead of the suballocation cap, so it only tiles when a single op genuinely won't fit.

supports_op matched in every case — it stays untiled where the untiled decode really does fit and succeed, and defers to tiling exactly where an untiled decode actually OOMs. So I've switched the Vulkan path to the per-op check and kept the get_max_size probe for the other backends; I also rebuilt and ran the CUDA and ROCm paths to confirm no change there.

I also added a max_buffer_size=<bytes> key to extra_tiling_args: in auto mode it tiles when an untiled VAE compute buffer would exceed it, so a user can cap VRAM regardless of the device limit (works on any backend). Your -1/0 cases are already covered by the existing flags (--no-vae-tiling-fallback disables, default is auto), so this just adds the > 0 override and leaves --vae-tiling untouched — no compatibility break.

pwilkin · 2026-06-14T21:08:27Z

@RapidMark we'll probably do the other way around - you'll add your solution since it's targeted and I'll rework mine without it, just linked in case I had some cases covered which you missed (eg. might want temporal tiling on LTX instead of standard tiling).

…e backend buffer limit VAE decode can hard-fail on integrated / low-VRAM GPUs because the untiled compute buffer exceeds the backend's maximum single-buffer allocation (e.g. Vulkan's suballocation limit) even when total memory is plentiful. sd.cpp already supports tiling that keeps each compute buffer small, but it had to be requested up front with --vae-tiling, so users hit a hard failure one flag away from the working path. Make the fallback automatic and on by default: - sd_tiling_params_t gains a bool auto_tile (appended, so the C ABI stays compatible). In AUTO (the default: --vae-tiling off, auto_tile on) VAE::decode tries the untiled decode and, if its compute buffer can't be allocated, frees it and retries once with tiling. - --vae-tiling stays the original boolean flag (force tiling on); --no-vae-tiling-fallback turns the auto fallback off (hard-fail like before). - GGMLRunner gets an opt-in probe (set_probe_compute_buffer_fits) so AUTO can decline a too-large untiled decode before the backend emits its raw allocation error. On Vulkan it checks each op against the device's real per-buffer limit via ggml_backend_supports_op (the reported max buffer size, not the smaller suballocation block); other backends compare the planned compute buffer against ggml_backend_buft_get_max_size. The reactive output-empty -> tile path still backstops a genuine runtime OOM. - extra_tiling_args gains a max_buffer_size=<bytes> key: in AUTO the fallback also tiles when the planned untiled compute buffer would exceed it, letting a user cap VAE VRAM on any backend.

wbruna

Working fine for me on Vulkan.

The total allocation check (with ROCm, since the Vulkan test triggers much sooner) seems less reliable: I've hit a failure by just increasing the resolution until hitting full VRAM:

Details

[INFO ] stable-diffusion.cpp:4512 - sampling completed, taking 131.77s
[INFO ] stable-diffusion.cpp:4524 - generating 1 latent images completed, taking 131.77s
[INFO ] stable-diffusion.cpp:4197 - decoding 1 latents
[DEBUG] model_loader.cpp:987  - loading 138/244 tensors from /opt/sdif/models/VAE/ae_bf16.safetensors
  |##################################################| 138/138 - 472.50MB/s
[INFO ] model_loader.cpp:1224 - loading tensors completed, taking 0.20s (read: 0.00s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.04s)
[DEBUG] model_manager.cpp:218  - model manager prepared params backend buffer ( 94.57 MB, 138 tensors, VRAM)
[DEBUG] ggml_extend.hpp:2061 - vae compute buffer size: 14978.25 MB(VRAM)
[DEBUG] ggml_extend.hpp:61   - ROCm pool[0]: alloc of 1209.60 MiB failed, flushing 935.90 MiB of cached buffers and retrying
[ERROR] ggml_extend.hpp:70   - ROCm error: out of memory
[ERROR] ggml_extend.hpp:70   -   current device: 0, in function alloc at ggml/src/ggml-cuda/ggml-cuda.cu:449
[ERROR] ggml_extend.hpp:70   -   err
ggml/src/ggml-cuda/ggml-cuda.cu:103: ROCm error
[New LWP 2249503]
[New LWP 2249484]
[New LWP 2249483]
[New LWP 2249481]
[New LWP 2249480]
[New LWP 2249468]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: Arquivo ou diretório inexistente
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007f089909b668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: Arquivo ou diretório inexistente
#2  0x00007f089909b6ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007f08991067c7 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: Arquivo ou diretório inexistente
#4  0x000055f37b9c617b in ggml_print_backtrace ()
#5  0x000055f37b9c62ce in ggml_abort ()
#6  0x000055f37af02282 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#7  0x000055f37af19640 in ggml_cuda_pool_leg::alloc(unsigned long, unsigned long*) ()
#8  0x000055f37af17e41 in ggml_cuda_op_mul_mat_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, ihipStream_t*) ()
#9  0x000055f37af16b8a in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, ihipStream_t*), void (*)(float const*, int const*, void*, ggml_type, long, long, long, long, long, long, long, long, ihipStream_t*)) ()
#10 0x000055f37af11532 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#11 0x000055f37af0c555 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) ()
#12 0x000055f37af0901c in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#13 0x000055f37b9dcf0e in ggml_backend_graph_compute ()
#14 0x000055f37ad79dcf in std::optional<sd::Tensor<float> > GGMLRunner::execute_graph<float>(ggml_cgraph*, int, bool, bool, bool, bool, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*) ()
#15 0x000055f37ad7c664 in std::optional<sd::Tensor<float> > GGMLRunner::compute<float>(std::function<ggml_cgraph* ()>, int, bool, bool, bool, bool) ()
#16 0x000055f37ad7ca83 in AutoEncoderKL::_compute(int, sd::Tensor<float> const&, bool) ()
#17 0x000055f37ac0fd0a in VAE::decode(int, sd::Tensor<float> const&, sd_tiling_params_t, bool, bool, bool, bool) [clone .isra.0] ()
#18 0x000055f37ac121e9 in StableDiffusionGGML::decode_first_stage(sd::Tensor<float> const&, bool) [clone .isra.0] ()
#19 0x000055f37ac1e14c in generate_image ()
#20 0x000055f37aad7db4 in main ()

But it worked OK when I reduced available VRAM by running something else in parallel:

Details

[INFO ] stable-diffusion.cpp:4524 - generating 1 latent images completed, taking 102.56s
[INFO ] stable-diffusion.cpp:4197 - decoding 1 latents
[DEBUG] model_loader.cpp:987  - loading 138/244 tensors from /opt/sdif/models/VAE/ae_bf16.safetensors
  |##################################################| 138/138 - 472.50MB/s
[INFO ] model_loader.cpp:1224 - loading tensors completed, taking 0.20s (read: 0.01s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.03s)
[DEBUG] model_manager.cpp:218  - model manager prepared params backend buffer ( 94.57 MB, 138 tensors, VRAM)
[ERROR] ggml_extend.hpp:70   - ggml_backend_cuda_buffer_type_alloc_buffer: allocating 12637.90 MiB on device 0: cudaMalloc failed: out of memory
[ERROR] ggml_extend.hpp:70   - ggml_gallocr_reserve_n_impl: failed to allocate ROCm0 buffer of size 13251797248
[ERROR] ggml_extend.hpp:2054 - vae: failed to allocate the compute buffer
[ERROR] ggml_extend.hpp:2504 - vae alloc compute buffer failed
[WARN ] vae.hpp:225  - vae: untiled decode buffer exceeded the backend limit; retrying with tiling
[DEBUG] ggml_extend.hpp:882  - num tiles : 3, 5 
[DEBUG] ggml_extend.hpp:883  - optimal overlap : 0.375000, 0.406250 (targeting 0.340000)
[DEBUG] ggml_extend.hpp:884  - processing 15 tiles
[DEBUG] ggml_extend.hpp:2061 - vae compute buffer size: 1664.25 MB(VRAM)
  |===>                                              | 1/15 - 1.01s/it[DEBUG] ggml_extend.hpp:61   - ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  |==================================================| 15/15 - 1.01it/s
[DEBUG] vae.hpp:263  - computing vae decode graph completed, taking 15.02s
[INFO ] stable-diffusion.cpp:4217 - latent 1 decoded, taking 15.03s
[INFO ] stable-diffusion.cpp:4221 - decode_first_stage completed, taking 15.03s
[INFO ] stable-diffusion.cpp:4661 - generate_image completed in 119.36s
[DEBUG] model_manager.cpp:759  - model manager releasing params backend buffer ( 94.57 MB, 138 tensors, VRAM)
[INFO ] main.cpp:463  - save result image 0 to './test_1781612279.png' (success)
[INFO ] main.cpp:535  - 1/1 images saved

Another attempt that worked:

Details

[INFO ] stable-diffusion.cpp:4524 - generating 1 latent images completed, taking 120.51s
[INFO ] stable-diffusion.cpp:4197 - decoding 1 latents
[DEBUG] model_loader.cpp:987  - loading 138/244 tensors from /opt/sdif/models/VAE/ae_bf16.safetensors
  |##################################################| 138/138 - 472.50MB/s
[INFO ] model_loader.cpp:1224 - loading tensors completed, taking 0.20s (read: 0.00s, memcpy: 0.00s, convert: 0.04s, copy_to_backend: 0.07s)
[DEBUG] model_manager.cpp:218  - model manager prepared params backend buffer ( 94.57 MB, 138 tensors, VRAM)
[DEBUG] ggml_extend.hpp:2061 - vae compute buffer size: 14042.11 MB(VRAM)
[DEBUG] ggml_extend.hpp:61   - ROCm pool[0]: alloc of 1134.00 MiB failed, flushing 877.50 MiB of cached buffers and retrying
[DEBUG] ggml_extend.hpp:61   - ROCm pool[0]: retry succeeded
[DEBUG] vae.hpp:263  - computing vae decode graph completed, taking 7.39s
[INFO ] stable-diffusion.cpp:4217 - latent 1 decoded, taking 7.39s
[INFO ] stable-diffusion.cpp:4221 - decode_first_stage completed, taking 7.39s
[INFO ] stable-diffusion.cpp:4661 - generate_image completed in 129.83s
[DEBUG] model_manager.cpp:759  - model manager releasing params backend buffer ( 94.57 MB, 138 tensors, VRAM)
[INFO ] main.cpp:463  - save result image 0 to './test_1781612278.png' (success)
[INFO ] main.cpp:535  - 1/1 images saved

RapidMark changed the title ~~feat: tristate --vae-tiling (off|on|auto) with automatic OOM fallback~~ feat: automatically fall back to VAE tiling when an untiled decode exceeds the backend buffer limit Jun 9, 2026

wbruna suggested changes Jun 12, 2026

View reviewed changes

RapidMark force-pushed the cloudhands/vae-auto-tiling branch from 1f3d27a to d5d134e Compare June 14, 2026 20:46

RapidMark requested a review from wbruna June 14, 2026 20:49

wbruna mentioned this pull request Jun 15, 2026

[Bug] regression: Vulkan memory management error after master-691-563137a #1659

Closed

RapidMark force-pushed the cloudhands/vae-auto-tiling branch from d5d134e to 615bdae Compare June 15, 2026 17:42

wbruna reviewed Jun 16, 2026

View reviewed changes

Conversation

RapidMark commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Repro

Change

Validation (AMD Radeon 8060S iGPU, Krea Q4, 1024²)

Uh oh!

wbruna commented Jun 8, 2026

Uh oh!

RapidMark commented Jun 8, 2026

Uh oh!

stduhpf commented Jun 9, 2026

Uh oh!

RapidMark commented Jun 9, 2026

Uh oh!

wbruna left a comment

Choose a reason for hiding this comment

Uh oh!

wbruna Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

wbruna Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

pwilkin commented Jun 12, 2026

Uh oh!

RapidMark commented Jun 12, 2026

Uh oh!

RapidMark commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wbruna commented Jun 12, 2026

Uh oh!

RapidMark commented Jun 14, 2026

Uh oh!

pwilkin commented Jun 14, 2026

Uh oh!

wbruna left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

RapidMark commented Jun 8, 2026 •

edited

Loading

RapidMark commented Jun 12, 2026 •

edited

Loading