Detect int32 shape-product overflow at MLX compute-shape boundaries #3524

qflen wants to merge 3 commits into
Conversation
Issue ml-explore#3327 reports that shapes whose per-dim values fit in int32 but whose product exceeds 2^31 silently produced wrapped results: `reshape(big, (-1,))` returned a negative inferred dim, `zeros((2^30, 2)).flatten()` returned shape (-2147483648,) and size 18446744071562067968, `take(big, ...)` failed via an internal flatten with the same wrap, and `conv_general` with output > 2^31 elements either requested an 18 EB allocation on M3 Max or silently wrote to wrapped offsets in the Metal kernel on M5 (`y[-1]` read back zeros).

PR ml-explore#3425 kept `ShapeElem = int32_t` and added a clear diagnostic at the Python binding for per-dim overflow. This patch extends the same approach to the internal C++ compute-shape boundaries that produce a Shape from int64 arithmetic, and to the Metal conv kernel offsets where the product of valid per-dim values silently wrapped.

- mlx/utils.h: new `check_shape_dim(int64_t, op)` helper using PR ml-explore#3425's error message format.
- Compute-shape sites narrow through the helper: `Flatten` and `Reshape` `output_shape`, the `unflatten` infer path, and `indices_or_default` (accumulator widened to int64). Backend-agnostic -- applies to CPU, Metal, and CUDA.
- mlx/backend/metal/conv.cpp: guard the four dispatcher sites where `int implicit_M = out.size() / O` truncates size_t or `int implicit_M = N * oS[0] * oS[1] [* oS[2]]` wraps. Widen the `inp_large` / `out_large` heuristics to int64 to remove signed-overflow UB on the dispatch predicate.
- mlx/backend/metal/kernels/steel/conv/kernels/{steel_conv.h, steel_conv_3d.h, steel_conv_general.h}: promote per-thread output pointer arithmetic to size_t. With M < 2^31 but M * O > 2^31, `c_row * (N * groups) + c_col` overflowed even after the dispatcher accepted the shape -- last batches wrote to wrapped offsets. This is the substance of PRs ml-explore#3294 / ml-explore#3320, now exercised by an end-to-end test.
- mlx/backend/cuda/conv/{gemm_conv,gemm_grouped_conv}.cu: same size_t->int truncation pattern as the Metal sites; apply the identical guard. CUDA validation pending CI -- no toolchain on the authoring machine.

Adds two regression tests in tests/gpu_tests.cpp. The kernel-offset test (varying per-batch input, allclose vs CPU reference) fails on `y[-1]` without the steel_conv_general.h patch -- verified by stash/restore. The shape-boundary test exercises each fix path; the eval branch is guarded by max_buffer_length so it skips on small-GPU devices.

Closes ml-explore#3327. Resolves the cross-dim overflow path that ml-explore#3425 diagnosed but deferred (related ml-explore#2681).
```cpp
size_t needed = size_t(n) * 64 * 64 * sizeof(float16_t);
auto max_buf = std::get<size_t>(gpu::device_info().at("max_buffer_length"));
if (max_buf >= needed) {
  CHECK_THROWS_AS(eval(y), std::overflow_error);
```
Allocating such a large array would drag down the test speed a lot; I'd prefer just removing this test since it is not a serious issue.
Force-pushed c9c584d to f2600f4.
Sorry, I meant deleting the tests that actually allocate the 4 GB array (including "test gpu conv2d large output offset"); the tests that check overflow are nice to keep.
Oh my bad, I'll rewrite.
Force-pushed f2600f4 to 1bb7bb6.
Standalone repro, confirmed on MLX 0.31.2, macOS Tahoe 26.4.1, Apple M1 Max 64 GB.

```python
import time

import mlx.core as mx

mx.set_default_device(mx.gpu)
mx.set_cache_limit(1024 * 1024 * 1024)
print("MLX:", mx.__version__)

# Output shape: (1, 457, 72, 128, 512)
# M = 1 * 457 * 72 * 128 = 4,211,712
# N = 512
# M * N = 2,156,396,544 > INT32_MAX (2,147,483,647)
#
# C=64 satisfies C_per_group % 16 == 0 (implicit path),
# and keeps input/weight memory reasonable.
x = mx.ones((1, 457, 72, 128, 64), dtype=mx.float16)
w = mx.ones((512, 3, 3, 3, 64), dtype=mx.float16)

t0 = time.perf_counter()
out = mx.conv3d(x, w, padding=1)
mx.eval(out)
print(f"conv eval: {time.perf_counter() - t0:.2f}s")

for label, d, expected in [
    ("before boundary", 454, 1728),
    ("after boundary", 455, 1728),
    ("last depth", 456, 1152),
]:
    vals = out[0, d, 36, 64, :8]
    mx.eval(vals)
    print(f"{label:16s} expected={expected:4d} actual={vals}")

print(f"peak GB: {mx.get_peak_memory() / 1e9:.2f}")
```

On 0.31.2 (before your fix), the values after the boundary do not match the expected constants. The overflow boundary is at depth 2^31 / (72 * 128 * 512) ≈ 455. This may be useful as a regression test for the `steel_conv_3d` path.
Thanks for the repro and for pinpointing the 2^31 / (72 * 128 * 512) boundary. With @zcbenz's earlier feedback, the >=4 GB-allocating tests were dropped, and this one would also need ~4.3 GB to trigger (M * N > INT32_MAX). The host-side overflow checks in `conv.cpp` are still exercised by the kept tests.
Summary
Fixes #3327.
Shapes whose per-dim values fit in `int32` but whose product exceeds `2^31` silently wrapped at five sites: `reshape(-1)`, `flatten`, `take`, the Metal conv dispatcher's `out.size() / O` cast, and per-thread output offsets inside the Metal conv kernels. All five now either succeed correctly or raise `std::overflow_error` with a clear `[op] Shape dimension ...` message, extending the diagnostic style #3425 established at the Python binding to the internal C++ boundaries. `ShapeElem` stays `int32_t`. No public API change. The `flatten` case from #2681 is also resolved.

Root cause
- `Flatten::output_shape` accumulated `flat_size *= ...` as `int32_t` via `auto`.
- `Reshape::output_shape` and `unflatten` truncated `input.size() / size` from `size_t` to `ShapeElem`.
- `indices_or_default` used `std::multiplies<int>`.
- `conv.cpp`: `int implicit_M = out.size() / O` truncated; `int implicit_M = N * oS[0] * oS[1]` wrapped. The wrapped negative dim fed `nbytes()`, which sign-extends to ~18 EB.
- `steel_conv{,_3d,_general}.h`: per-thread output offset `c_row * (N * groups) + c_col` wrapped `int32` when `M < 2^31` but `M * O > 2^31`.
- `conv/{gemm_conv,gemm_grouped_conv}.cu`: same `int mat_M = out.size() / O` truncation as the Metal sites.

Change
- `check_shape_dim(int64_t, std::string_view op)` in `mlx/utils.h`.
- `Flatten`, `Reshape`, `unflatten`, `indices_or_default` widen accumulators to `int64_t` and narrow through the helper. Backend-agnostic.
- Guard the dispatcher `implicit_M` / `mat_M` computations. Widen Metal `inp_large` / `out_large` to `int64_t`.
- Promote per-thread output offset arithmetic in the steel conv kernels to `size_t`. This is the substance of the prior attempts "Fix int32 overflow in Metal conv_general output offset for large tensors" #3294 and "Fix conv_general output offset overflow in Metal writeback" #3320, now exercised by an end-to-end test.

Tests (`tests/gpu_tests.cpp`)

- `test gpu int32 shape overflow errors` covers `flatten`, `reshape(-1)`, `take`, `eval`. Eval branch guarded by `max_buffer_length`.
- `test gpu conv2d large output offset`: output 2.15 G fp16 elements, varying batch values, allclose vs CPU. Fails on `y[-1]` without the `steel_conv_general.h` patch (verified by stash/restore).

Validation
Apple M5 32GB, `Release` + `MLX_METAL_JIT=OFF`. `test_ops.py` (139), `test_array.py` (73), `test_conv.py` (18), `test_compile.py` (54), and the reproducer on CPU stream all pass.

Performance
One host-side `int64` compare per kernel dispatch. Trimmed-mean of 500 iters, per-call µs, over these cells:

- `conv2d (32,64,64,64) -> (..,128)`, Metal
- `conv2d_general (16,64,64,1) -> (..,17)`, Metal
- `take((8192,64), 5 idx)`, Metal
- `reshape((4096,1024) -> -1)`, Metal
- `flatten((4096,1024))`, Metal
- `conv2d (4,64,64,32) -> (..,64)`, CPU
- `reshape(-1)`, CPU
- `flatten`, CPU

Patched is faster in 7/8 cells; deltas are within sub-ms jitter.
Scope
I noticed that same-class fixes exist at adjacent sites the reproducer does not reach:
- `mlx/backend/metal/matmul.cpp` gather-MM dispatchers
- `tile`/`repeat`/`kron`/`concatenate`/`pad`, CPU
- `mlx/backend/cpu/conv.cpp`, and `dilate_size`/`conv_out_axis_size`

Left for follow-up to keep this PR focused, but I can also work on them here.