Detect int32 shape-product overflow at MLX compute-shape boundaries #3524

qflen wants to merge 3 commits into
Conversation
Issue ml-explore#3327 reports that shapes whose per-dim values fit in int32 but whose product exceeds 2^31 silently produced wrapped results: `reshape(big, (-1,))` returned a negative inferred dim, `zeros((2^30, 2)).flatten()` returned shape (-2147483648,) and size 18446744071562067968, `take(big, ...)` failed via an internal flatten with the same wrap, and `conv_general` with output > 2^31 elements either requested an 18 EB allocation on M3 Max or silently wrote to wrapped offsets in the Metal kernel on M5 (`y[-1]` read back zeros).

PR ml-explore#3425 kept `ShapeElem = int32_t` and added a clear diagnostic at the Python binding for per-dim overflow. This patch extends the same approach to the internal C++ compute-shape boundaries that produce a Shape from int64 arithmetic, and to the Metal conv kernel offsets where the product of valid per-dim values silently wrapped.

- mlx/utils.h: new `check_shape_dim(int64_t, op)` helper using PR ml-explore#3425's error message format.
- Compute-shape sites narrow through the helper: `Flatten` and `Reshape` `output_shape`, the `unflatten` infer path, and `indices_or_default` (accumulator widened to int64). Backend-agnostic -- applies to CPU, Metal, and CUDA.
- mlx/backend/metal/conv.cpp: guard the four dispatcher sites where `int implicit_M = out.size() / O` truncates size_t or `int implicit_M = N * oS[0] * oS[1] [* oS[2]]` wraps. Widen the `inp_large` / `out_large` heuristics to int64 to remove signed-overflow UB on the dispatch predicate.
- mlx/backend/metal/kernels/steel/conv/kernels/{steel_conv.h, steel_conv_3d.h, steel_conv_general.h}: promote per-thread output pointer arithmetic to size_t. With M < 2^31 but M * O > 2^31, `c_row * (N * groups) + c_col` overflowed even after the dispatcher accepted the shape -- last batches wrote to wrapped offsets. This is the substance of PRs ml-explore#3294 / ml-explore#3320, now exercised by an end-to-end test.
- mlx/backend/cuda/conv/{gemm_conv,gemm_grouped_conv}.cu: same size_t->int truncation pattern as the Metal sites; apply the identical guard. CUDA validation pending CI -- no toolchain on the authoring machine.

Adds two regression tests in tests/gpu_tests.cpp. The kernel-offset test (varying per-batch input, allclose vs CPU reference) fails on `y[-1]` without the steel_conv_general.h patch -- verified by stash/restore. The shape-boundary test exercises each fix path; the eval branch is guarded by max_buffer_length so it skips on small-GPU devices.

Closes ml-explore#3327. Resolves the cross-dim overflow path that ml-explore#3425 diagnosed but deferred (related ml-explore#2681).
```cpp
size_t needed = size_t(n) * 64 * 64 * sizeof(float16_t);
auto max_buf = std::get<size_t>(gpu::device_info().at("max_buffer_length"));
if (max_buf >= needed) {
  CHECK_THROWS_AS(eval(y), std::overflow_error);
```
Allocating such a large array would drag down the test speed a lot; I'd prefer just removing this test since it is not a serious issue.
Force-pushed c9c584d to f2600f4.
Sorry, I meant deleting the tests that actually allocate the 4 GB array (including "test gpu conv2d large output offset"); the tests that check overflow are nice to keep.
Oh my bad, I'll rewrite.
Force-pushed f2600f4 to 1bb7bb6.
Standalone repro, confirmed on MLX 0.31.2, macOS Tahoe 26.4.1, Apple M1 Max 64 GB.

```python
import time

import mlx.core as mx

mx.set_default_device(mx.gpu)
mx.set_cache_limit(1024 * 1024 * 1024)
print("MLX:", mx.__version__)

# Output shape: (1, 457, 72, 128, 512)
# M = 1 * 457 * 72 * 128 = 4,211,712
# N = 512
# M * N = 2,156,396,544 > INT32_MAX (2,147,483,647)
#
# C=64 satisfies C_per_group % 16 == 0 (implicit path),
# and keeps input/weight memory reasonable.
x = mx.ones((1, 457, 72, 128, 64), dtype=mx.float16)
w = mx.ones((512, 3, 3, 3, 64), dtype=mx.float16)

t0 = time.perf_counter()
out = mx.conv3d(x, w, padding=1)
mx.eval(out)
print(f"conv eval: {time.perf_counter() - t0:.2f}s")

for label, d, expected in [
    ("before boundary", 454, 1728),
    ("after boundary", 455, 1728),
    ("last depth", 456, 1152),
]:
    vals = out[0, d, 36, 64, :8]
    mx.eval(vals)
    print(f"{label:16s} expected={expected:4d} actual={vals}")

print(f"peak GB: {mx.get_peak_memory() / 1e9:.2f}")
```

On 0.31.2 (before your fix), the values after the boundary do not match the expected constants. The overflow boundary is at depth 2^31 / (72 * 128 * 512) ≈ 455. This may be useful as a regression test for the `steel_conv_3d` path.
Thanks for the repro and for pinpointing the 2^31 / (72 * 128 * 512) boundary. With @zcbenz's earlier feedback, the >=4 GB-allocating tests were dropped, and this one would also need ~4.3 GB to trigger (M * N > INT32_MAX). The host-side overflow checks in `conv.cpp` are still exercised by the kept tests.
Summary
Fixes #3327.
Shapes whose per-dim values fit in `int32` but whose product exceeds `2^31` silently wrapped at five sites: `reshape(-1)`, `flatten`, `take`, the Metal conv dispatcher's `out.size() / O` cast, and per-thread output offsets inside the Metal conv kernels. All five now either succeed correctly or raise `std::overflow_error` with a clear `[op] Shape dimension ...` message, extending the diagnostic style #3425 established at the Python binding to the internal C++ boundaries. `ShapeElem` stays `int32_t`. No public API change. The `flatten` case from #2681 is also resolved.

Root cause
- `Flatten::output_shape` accumulated `flat_size *= ...` as `int32_t` via `auto`.
- `Reshape::output_shape` and `unflatten` truncated `input.size() / size` from `size_t` to `ShapeElem`.
- `indices_or_default` used `std::multiplies<int>`.
- `conv.cpp`: `int implicit_M = out.size() / O` truncated; `int implicit_M = N * oS[0] * oS[1]` wrapped. The wrapped negative dim fed `nbytes()`, which sign-extends to ~18 EB.
- `steel_conv{,_3d,_general}.h`: per-thread output offset `c_row * (N * groups) + c_col` wrapped `int32` when `M < 2^31` but `M * O > 2^31`.
- `conv/{gemm_conv,gemm_grouped_conv}.cu`: same `int mat_M = out.size() / O` truncation as the Metal sites.

Change
- `check_shape_dim(int64_t, std::string_view op)` in `mlx/utils.h`.
- `Flatten`, `Reshape`, `unflatten`, `indices_or_default` widen accumulators to `int64_t` and narrow through the helper. Backend-agnostic.
- Guard the dispatcher `implicit_M` / `mat_M` computations. Widen Metal `inp_large` / `out_large` to `int64_t`.
- Promote per-thread output offset arithmetic in the steel conv kernels to `size_t`. This is the substance of the prior attempts "Fix int32 overflow in Metal conv_general output offset for large tensors" #3294 and "Fix conv_general output offset overflow in Metal writeback" #3320, now exercised by an end-to-end test.

Tests (`tests/gpu_tests.cpp`)

- `test gpu int32 shape overflow errors` covers `flatten`, `reshape(-1)`, `take`, `eval`. Eval branch guarded by `max_buffer_length`.
- `test gpu conv2d large output offset`: output 2.15 G fp16 elements, varying batch values, allclose vs CPU. Fails on `y[-1]` without the `steel_conv_general.h` patch (verified by stash/restore).

Validation
Apple M5 32GB, `Release` + `MLX_METAL_JIT=OFF`. `test_ops.py` (139), `test_array.py` (73), `test_conv.py` (18), `test_compile.py` (54), and the reproducer on CPU stream all pass.

Performance
One host-side `int64` compare per kernel dispatch. Trimmed-mean of 500 iters, per-call µs, over these cells:

- `conv2d (32,64,64,64) -> (..,128)`, Metal
- `conv2d_general (16,64,64,1) -> (..,17)`, Metal
- `take((8192,64), 5 idx)`, Metal
- `reshape((4096,1024) -> -1)`, Metal
- `flatten((4096,1024))`, Metal
- `conv2d (4,64,64,32) -> (..,64)`, CPU
- `reshape(-1)`, CPU
- `flatten`, CPU

Patched is faster in 7/8 cells; deltas are within sub-ms jitter.
Scope
I noticed that same-class fixes exist at adjacent sites the reproducer does not reach:
- `mlx/backend/metal/matmul.cpp` gather-MM dispatchers
- `tile`/`repeat`/`kron`/`concatenate`/`pad`, CPU
- `mlx/backend/cpu/conv.cpp`, and `dilate_size`/`conv_out_axis_size`

Left for follow-up to keep this PR focused, but I can also work on them here.