MLX is missing a primitive for allocating an uninitialized array, equivalent to numpy.empty / torch.empty / jnp.empty. This is useful when a buffer will be fully overwritten by a subsequent kernel — the implicit zero-fill of mx.zeros is wasted work in that case.
Adding mx.empty(shape, dtype=..., stream=...) would close that gap.
Motivation
The concrete use case we hit: when a TileLang Metal kernel produces an output tensor, the host-side allocation only needs the right shape, dtype, and storage; the kernel will fully overwrite the contents. With only mx.zeros available today, we pay for a memset to zero before the kernel runs, even though the kernel then immediately overwrites every byte. For larger output tensors (e.g. attention outputs in a transformer block), the wasted zero-fill measurably hurts throughput.
The same pattern shows up any time an MLX array is used as a write-only output buffer of an external kernel (a custom Metal op, a DLPack-imported tensor about to be filled in place, etc.).
PyTorch / NumPy / JAX all expose this primitive (torch.empty, numpy.empty, jnp.empty) for the same reason.
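For reference, this is what the primitive looks like in NumPy, whose semantics mx.empty would mirror (a minimal sketch; the MLX version would allocate through MLX's own allocator rather than NumPy's):

```python
import numpy as np

# numpy.empty allocates without initializing: shape and dtype are set,
# but the contents are whatever happened to be in the underlying buffer.
out = np.empty((4, 8), dtype=np.float32)
assert out.shape == (4, 8)
assert out.dtype == np.float32

# The intended pattern: fully overwrite the buffer before any read.
out[:] = 1.0
assert (out == 1.0).all()
```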
Proposed API
mx.empty(shape, dtype=mx.float32, stream=None)
Semantics:
Allocates an array of the given shape and dtype on the active device.
Does not initialize the contents — the caller is expected to write into it before reading.
Reuses MLX's existing allocator and dtype rules, including the existing GPU float64 restriction.
Rejects negative dimensions with the standard MLX shape-validation error.
Optional stream= argument to match the rest of the MLX ops surface.
This is intentionally a thin wrapper around the existing allocation path — no new buffer-management complexity, just skipping the fill.
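To make the proposed semantics concrete, here is a NumPy-backed mock of the signature. This is purely illustrative: the `empty` function below is hypothetical, the real implementation would sit on MLX's allocation path, and the `stream` argument is accepted but unused in the mock.

```python
import numpy as np

def empty(shape, dtype=np.float32, stream=None):
    """NumPy-backed mock of the proposed mx.empty semantics (illustrative only)."""
    shape = tuple(shape)
    # Reject negative dimensions, mirroring MLX's standard shape validation.
    if any(d < 0 for d in shape):
        raise ValueError(f"negative dimensions are not allowed: {shape}")
    # stream= is accepted for API parity with other MLX ops but ignored here.
    return np.empty(shape, dtype=dtype)

out = empty((2, 3))                 # default dtype
assert out.dtype == np.float32
out16 = empty((2, 3), dtype=np.float16)  # explicit dtype
assert out16.dtype == np.float16
try:
    empty((2, -1))                  # negative dimension is rejected
    assert False, "expected ValueError"
except ValueError:
    pass
```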
Prototype
We have a working implementation in our downstream fork:
Diff: 60 LOC across 4 files: mlx/ops.cpp, mlx/ops.h, python/src/ops.cpp, python/tests/test_ops.py.
The prototype exposes the API exactly as proposed above. Tests cover default dtype, explicit dtype, negative-shape rejection, and the GPU float64 rejection path.
What we're offering
If maintainers are interested, we can rebase the prototype on current ml-explore/mlx@main and open a PR. The patch is small and independent of the DLPack work in #3531 — no shared surface, no ordering requirement.
If the team would prefer a slightly different signature (e.g. dtype as the first positional argument, or a different stream= default), we're happy to adjust before opening the PR.
Notes
One open design question: in debug builds, should mx.empty fill with NaN / sentinel values to surface uninitialized-read bugs in user code? PyTorch doesn't do this; NumPy doesn't do this. Our prototype follows the same convention (raw allocation, no debug-fill). Flagging it here in case MLX has a different preference.