We have a working downstream prototype that lets MLX consume DLPack-exporting
objects:
- CPU DLPack capsules / producers.
- MLX self round-trips through mx.array(x.__dlpack__()).
- Metal-resident kDLMetal capsules where data is a foreign MTL::Buffer*.
- TileLang / TVM-FFI Metal tensors exported as DLPack and re-wrapped as
mx.array without a host copy.
This is directly related to ml-explore/mlx issue #2848, where the current
behavior is that mx.array(...) accepts CPU tensors from other frameworks but
not device tensors. The maintainer comment said that accepting MPS / CUDA
arrays may be possible, while PyTorch MPS is blocked until PyTorch exports MPS
tensors via DLPack.
Issue link: [Feature] Construct mx.array from mps and cuda arrays from other
frameworks #2848
Proof-of-concept branch: https://github.com/DatasunriseOU/mlx
Scope note: this PoC does not implement generic CUDA DLPack import yet. It
supports CPU and Metal. kDLCUDA is explicitly rejected today, so CUDA should
be treated as a follow-up if the MLX team wants parity with the new MLX CUDA
backend.
Verified State
I re-checked the repos on 2026-05-13.
Upstream MLX:
```text
ml-explore/mlx main 8f4099d
2026-05-12 [CUDA] Guard qmm_naive scale and bias loads at tile boundaries (#3509)
```
Downstream PoC:
```text
DatasunriseOU/mlx main 3a6039d
2026-05-13 Add Python array bridge dylib
```
Merge-base:
```text
b08ec31
2026-05-11 Fix scatter_prod GPU hang on NaN with contention (#3492)
```
Ahead/behind after fetching both remotes:
```text
upstream/main...origin/main = 5 behind / 4 ahead
```
Net downstream delta from the merge-base:
```text
25 files changed, 1282 insertions(+), 24 deletions(-)
```
That is the measured diff for the current downstream branch; the count should
be refreshed before opening PRs because ml-explore/mlx@main is moving.
Why This Matters
MLX already exports DLPack from mx.array:
- mx.array.__dlpack__
- mx.array.__dlpack_device__
But upstream MLX does not currently expose the corresponding consumer side:
- no public mx.from_dlpack.
- no mx.array(...) path that accepts a DLPack capsule or producer object.
For TileLang, the missing consumer path means the zero-copy path is only
half-duplex:
- MLX array -> DLPack -> TVM/TileLang Metal tensor works.
- TVM/TileLang Metal tensor -> MLX array does not, without a copy or
downstream patches.
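The asymmetry in a minimal sketch, assuming an upstream MLX build (the
commented-out consumer calls are exactly the missing piece):
```python
import mlx.core as mx

a = mx.arange(4)
cap = a.__dlpack__()          # export: available upstream today
dev = a.__dlpack_device__()   # (device_type, device_id) tuple

# The consumer direction is what the PoC adds; upstream has neither call:
# mx.from_dlpack(cap)         # no public from_dlpack
# mx.array(cap)               # mx.array(...) does not accept capsules
```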
The practical use case is simple: TileLang emits Metal kernels, TVM-FFI
executes them, and MLX owns the surrounding model graph. All tensors are on the
same Apple GPU memory system. Without a consumer path, the boundary either
forces a copy or forces framework-specific native glue.
Chronology
MLX side
The PoC branch has four downstream commits over the merge-base:
- c0cda6e - 2026-05-11 - Fix mx.array DLPack dispatch
- 41ec3f5 - 2026-05-11 - Support DLPack Metal interop: [Metal] Add DLPack
consumer for MTLBuffer-backed tensors.
- 4acd37a - 2026-05-11 - Add uninitialized array allocation; the easiest
independent PR and should be submitted first.
- 3a6039d - 2026-05-13 - Add Python array bridge dylib; exposes the
mlx_core_wrap_mx_array_move(mx::array*) symbol and is a separate, optional
design discussion.
TileLang side
TileLang work that depends on the MLX PoC includes the
tilelang.contrib.mlx_interop.mlx_metal_output(...) helper, which uses mx.empty
when present and falls back to mx.zeros when mx.empty is unavailable. The
TileLang tests exercising this path live mainly in
testing/python/metal/test_tvm_ffi_metal_stream_dlpack.py; the core TileLang
interop code is in tilelang.contrib.mlx_interop.
What the PoC Actually Implements
1. mx.empty
Behavior:
- Allocates an array without initializing the payload.
- Rejects negative dimensions.
- Keeps the existing MLX GPU float64 restriction.
Why it matters here:
- TileLang output buffers are write-only kernel results. Zero-filling them
with mx.zeros is unnecessary work.
- tilelang.contrib.mlx_interop.mlx_metal_output(...) already uses mx.empty
when present and falls back to mx.zeros for older MLX builds.
This can be upstreamed first because it is small and useful outside DLPack.
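A minimal sketch of that fallback (the helper name is hypothetical; mx.empty
is the PoC API, not yet upstream):
```python
import mlx.core as mx

def alloc_output(shape, dtype=mx.float32):
    # Prefer uninitialized allocation for write-only kernel outputs; fall
    # back to zero-fill on MLX builds without the PoC's mx.empty.
    if hasattr(mx, "empty"):
        return mx.empty(shape, dtype=dtype)
    return mx.zeros(shape, dtype=dtype)

out = alloc_output((1024, 1024))  # payload is fully overwritten by the kernel
```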
2. CPU DLPack Consumer
Files:
- python/src/dlpack_consumer.cpp
- python/src/dlpack_consumer.h
- python/src/dlpack_consumer_no_metal.cpp
- python/src/dlpack_format.h
- python/src/convert.cpp
- python/tests/test_dlpack_consumer.py
Behavior:
- Accepts either a raw PyCapsule or an object exposing __dlpack__.
- Recognizes both legacy dltensor and versioned dltensor_versioned capsules.
- Converts DLPack dtype to MLX dtype for scalar lanes.
- Requires row-contiguous layout.
- Rejects dtype override if it would require a copy or cast.
- Leaves rejected capsules unconsumed.
- Renames consumed capsules to the correct used-capsule name and calls the
DLPack deleter when MLX releases the wrapping storage.
- Wraps CPU producer pointers through mx::allocator::make_buffer(...); if the
active allocator cannot expose the pointer zero-copy, the import fails rather
than silently staging a copy (usage sketch below).
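That behavior as a usage sketch against the PoC build (the dtype-override
line is illustrative of the zero-copy-or-error policy):
```python
import numpy as np
import mlx.core as mx  # PoC build: DatasunriseOU/mlx

x = np.arange(6, dtype=np.float32).reshape(2, 3)  # row-contiguous CPU tensor

# Raw-capsule path: on success the capsule is renamed to the used-capsule
# name and its deleter runs when MLX releases the wrapping storage.
m = mx.array(x.__dlpack__())

# Zero-copy-or-error: a dtype override that needs a cast is rejected and the
# capsule is left unconsumed, e.g.
#   mx.array(x.__dlpack__(), dtype=mx.int32)  # raises instead of copying
```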
Known upstream-readiness gap:
The code recognizes dltensor_versioned, but it does not yet validate
DLPackVersion major/minor compatibility. Current DLPack headers define
version 1.3 and explicitly require safe handling of major-version mismatch.
A PR should add that check before merge.
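A sketch of the missing validation, assuming the DLPackVersion major/minor
fields from dlpack.h; the constants and error type here are illustrative:
```python
DLPACK_MAJOR_VERSION = 1  # version the consumer is built against
DLPACK_MINOR_VERSION = 3

def check_dlpack_version(major: int, minor: int) -> None:
    # Major mismatch means an ABI-incompatible layout: fail before reading
    # any field beyond the version prefix.
    if major != DLPACK_MAJOR_VERSION:
        raise BufferError(f"incompatible DLPack major version {major}")
    # A newer minor is acceptable only because MLX touches no field that was
    # added after its build-time minor version.
    if minor > DLPACK_MINOR_VERSION:
        pass
```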
3. Metal DLPack Consumer
Files:
- python/src/dlpack_consumer_metal.cpp
- mlx/backend/metal/custom_kernel.cpp
- mlx/backend/metal/device.cpp
- mlx/backend/metal/device.h
- python/src/array.cpp
- python/src/convert.cpp
- python/src/convert.h
- python/src/metal.cpp
- python/tests/test_array.py
- python/tests/test_device.py
- python/tests/test_fast.py
- python/tests/test_dlpack_consumer.py
Behavior:
- Accepts kDLMetal tensors.
- Treats DLTensor.data as an MTL::Buffer*.
- Requires MTLStorageModeShared; rejects Managed and Private storage today.
- Rejects non-zero byte_offset.
- Rejects non-row-contiguous strides.
- Checks that shape and dtype fit inside the exported MTLBuffer (sketch
below).
- Wraps the foreign MTL::Buffer* directly in MLX storage and lets the DLPack
owner lifetime keep the producer allocation alive.
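The buffer-fit and offset checks above as a sketch (helper name and error
type illustrative):
```python
from math import prod

def validate_metal_dltensor(shape, itemsize, buffer_length, byte_offset):
    # Mirrors the validation described above: Shared-only import also
    # requires a zero offset and a tensor that fits the exported MTLBuffer.
    if byte_offset != 0:
        raise BufferError("non-zero byte_offset is rejected")
    if prod(shape) * itemsize > buffer_length:
        raise BufferError("shape/dtype need more bytes than the MTLBuffer holds")
```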
Important scope note:
- The PoC does not currently accept MTLStorageModeManaged; the code accepts
Shared only.
Why Shared-only is a reasonable first cut:
- MLX arrays normally use shared Metal buffers on Apple Silicon.
- TileLang / TVM-FFI can export shared-mode Metal buffers for MLX.
- Private buffers would need an explicit copy or command-buffer-mediated
synchronization path; that is a separate design.
4. Python Array Bridge
Files:
- python/src/array_wrapper.cpp
- python/src/CMakeLists.txt
Behavior:
- Builds libmlx_python_bridge.dylib.
- Exposes mlx_core_wrap_mx_array_move(mx::array*).
- Lets a native extension create a C++ mx::array and return a Python
mlx.core.array without going through DLPack.
This is useful for TileLang's native MLX graph primitive, but it is not part of
the minimal DLPack consumer story. It should be a later PR only if MLX
maintainers want a supported extension ABI for returning native arrays.
What This Does Not Claim
- It does not make PyTorch MPS interop work. PyTorch does not export MPS
tensors via __dlpack__, as noted in [Feature] Construct mx.array from mps and
cuda arrays from other frameworks #2848.
- It does not import kDLCUDA tensors. The PoC explicitly rejects CUDA DLPack.
- It does not implement arbitrary strided imports. Non-row-contiguous inputs
are rejected with a clear error.
- It does not implement hidden dtype conversion. DLPack import is zero-copy
or an error.
- It does not implement the new DLPack C exchange API
(__dlpack_c_exchange_api__). That can be a future optimization.
- It does not yet solve GPU stream synchronization in a general
cross-framework way. The current Metal path is sufficient for the TileLang /
MLX flow, but upstream should decide the public contract.
Promised vs Done
This section is deliberately blunt. It separates what the original downstream
story implied from what is actually finished today and what must be completed
before asking MLX maintainers to merge anything.
mx.empty exists and avoids zero-fill for write-only outputs.
Status: done in 4acd37a; TileLang uses it with a mx.zeros
fallback.
Finish: rebase, keep small, add or keep shape and dtype tests.
MLX can consume CPU DLPack producers.
Status: mostly done in c0cda6e; raw capsules and producer objects
work for row-contiguous CPU tensors.
Finish: add public mx.from_dlpack, add version checks, make error
policy final.
MLX can consume Metal DLPack producers zero-copy.
Status: partially done in 41ec3f5; works for kDLMetal, Shared
storage, row-contiguous layout, and zero byte offset.
Finish: keep Shared-only for the first PR; Managed / Private and non-zero byte
offsets stay rejected.
The implementation supports DLPack versioned capsules.
Status: partially done; the consumer recognizes dltensor_versioned and reads
the safe prefix, but does not validate DLPackVersion yet.
Finish: add major/minor validation before merge.
The RFC answers [Feature] Construct mx.array from mps and cuda arrays from
other frameworks #2848 for MPS/CUDA arrays.
Status: partially. CPU and Metal work; not CUDA. PyTorch MPS remains blocked
by PyTorch.
Finish: present this as CPU + Metal progress; split CUDA into a future PR.
mx.from_dlpack(obj) is available.
Status: not done; the PoC currently routes imports through mx.array(...).
Finish: add the explicit API and implicit mx.array(...) dispatch.
mx.array(obj) automatically consumes DLPack producers.
Status: done downstream in c0cda6e.
Finish: maintainers must confirm the protocol precedence for implicit
dispatch.
MTLStorageModeShared and Managed both work.
Status: not done; Shared only. The PoC rejects Managed and Private.
Finish: keep the first Metal PR Shared-only.
kDLCUDA works.
Status: not done; CUDA DLPack is explicitly rejected.
Finish: treat as a future PR.
Non-contiguous DLPack tensors work.
Status: not done; non-row-contiguous inputs are rejected.
Finish: decide whether strided imports fit MLX storage/view semantics.
Non-zero Metal byte_offset works.
Status: not done; rejected.
Finish: support offsets if maintainers want sub-buffer views; otherwise keep
explicit rejection.
Dtype conversion during import works.
Status: not done by design. The PoC rejects dtype overrides that require
copy/cast.
Finish: keep zero-copy-only semantics or add explicit copy mode later.
General cross-framework GPU stream synchronization is solved.
Status: not done. TileLang has graph/device-event handling for its path,
but MLX DLPack import has no generic stream contract yet.
Finish: define dlpack(stream=...) semantics for Metal before broad
device interop claims.
Native extension can return Python mlx.core.array from C++ mx::array.
Status: done downstream in 3a6039d, but it is separate from DLPack.
Finish: decide whether MLX wants this ABI; otherwise keep downstream-only.
The main unfinished items are therefore:
- Add an explicit mx.from_dlpack API.
- Add DLPack version compatibility validation.
- Decide and document Metal stream semantics.
- Decide whether Metal import remains Shared-only.
- Keep CUDA, PyTorch MPS bypasses, arbitrary strides, dtype conversion, and
native Python array wrapping out of the first consumer PR unless MLX
maintainers explicitly ask for them.
Completion Plan
Before PR 1 (mx.empty)
- Rebase the mx.empty commit on current ml-explore/mlx@main.
Before PR 2 (CPU mx.from_dlpack)
- Add the public Python binding mx.from_dlpack(obj).
- Route it to the existing dlpack_to_mlx(...) implementation.
- Add explicit tests (a sketch of one such test follows this list) for:
  - raw legacy dltensor capsule.
  - producer object exposing __dlpack__.
  - consumed capsule rejection.
  - rejected capsule remains unconsumed.
  - dtype mismatch rejection.
  - non-row-contiguous rejection.
  - versioned capsule path.
- Add DLPack major/minor validation:
  - if major mismatches, call the deleter and fail without reading unsafe
fields.
  - if minor is newer, allow only if MLX understands the fields it uses.
- Decide whether CPU import may fall back to a copy. The current PoC is
zero-copy-or-error.
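A pytest-style sketch of the consumed-capsule case, assuming the proposed
mx.from_dlpack; the exact exception type is an assumption:
```python
import numpy as np
import mlx.core as mx
import pytest

def test_consumed_capsule_is_rejected():
    cap = np.arange(4, dtype=np.float32).__dlpack__()
    mx.from_dlpack(cap)          # proposed API; consumes and renames the capsule
    with pytest.raises(Exception):
        mx.from_dlpack(cap)      # a used capsule must fail cleanly, not crash
```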
Before PR 3 (kDLMetal)
- Keep the first Metal PR Shared-only unless maintainers request Managed.
- Add tests for:
  - Shared storage import.
  - Private / Managed rejection, or supported behavior if policy changes.
  - non-zero byte offset rejection.
  - non-row-contiguous rejection.
  - shape/dtype requiring more bytes than buffer length.
  - non-Metal build rejecting kDLMetal cleanly.
- Define how __dlpack__(stream=...) should map to MLX Metal command-buffer
behavior, or explicitly state that first-cut import assumes producer-side
synchronization.
Before PR 4 (mx.array implicit dispatch)
- Decide protocol precedence with maintainers.
- Add tests for objects exposing multiple protocols (see the sketch after
this list):
  - __mlx_array__ plus __dlpack__.
  - NumPy array protocol plus __dlpack__.
  - raw PyCapsule.
  - plain DLPack producer object.
- Ensure mx.array(obj, dtype=...) has a clear copy/cast policy for DLPack
inputs.
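A sketch of the kind of multi-protocol test double this implies (class name
hypothetical; the final comment reflects the PoC's documented precedence):
```python
import numpy as np
import mlx.core as mx

class BothProtocols:
    # Test double exposing both the NumPy protocol and the DLPack producer
    # protocol, backed by the same NumPy array.
    def __init__(self, x):
        self._x = x

    def __array__(self, dtype=None):
        return np.asarray(self._x, dtype=dtype)

    def __dlpack__(self, stream=None):
        return self._x.__dlpack__()

    def __dlpack_device__(self):
        return self._x.__dlpack_device__()

obj = BothProtocols(np.arange(4, dtype=np.float32))
m = mx.array(obj)  # PoC precedence: the NumPy array path wins over DLPack
```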
Before PR 5 (native Python array bridge)
- Decide whether MLX wants to expose a native extension ABI at all.
- If yes, document ownership:
  - caller passes new mx::array(...).
  - mlx_core_wrap_mx_array_move takes ownership.
  - returned Python object owns the moved C++ array.
- Add a minimal external-extension style test, not only in-tree use.
Proposed Upstream PR Sequence
PR 1: [ops] Add mx.empty
Scope:
- mx.empty(shape, dtype=..., stream=...)
- Python binding and tests.
Why first:
- Small diff.
- Useful independently.
- Lets downstream code allocate write-only output buffers without a
zero-fill.
PR 2: [Python] Add explicit DLPack consumer for CPU
Recommended API:
- Add mx.from_dlpack(obj) first (usage example below).
- Optionally wire mx.array(obj) to call the same path after maintainers agree
on implicit dispatch precedence.
Why explicit first:
- Matches NumPy and JAX.
- Avoids surprising mx.array(...) behavior for objects that expose multiple
protocols.
- Gives tests a clear target for ownership and error-path behavior.
Scope:
- DLPack capsule / producer parsing.
- CPU zero-copy import.
- consumed-capsule lifetime handling.
- dtype/shape/stride validation.
- versioned capsule compatibility checks.
- no Metal, no CUDA.
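For comparison, the explicit-first call shape mirrors NumPy and JAX
(mx.from_dlpack is proposed, not yet merged):
```python
import numpy as np
import mlx.core as mx

x = np.ones((4, 4), dtype=np.float32)
m = mx.from_dlpack(x)  # proposed API; analogous to np.from_dlpack(x)
```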
PR 3: [Metal] Add kDLMetal DLPack consumer
Scope:
- kDLMetal import behind MLX_BUILD_METAL.
- MTL::Buffer* wrapping.
- Shared storage mode only for the first PR.
- clear errors for Private, Managed, non-zero byte offset, and
non-row-contiguous strides.
- tests skipped when Metal is unavailable.
Open design points:
- Whether Managed should be accepted on Intel-era Macs or rejected until
explicit coherency handling exists.
- Whether Private should fail or stage through a copy.
- What __dlpack__(stream=...) should mean for Metal command buffers.
PR 4: mx.array(...) implicit dispatch
Scope:
- Make mx.array(obj) consume DLPack producers when appropriate.
- Preserve MLX's existing protocol precedence.
Current PoC precedence:
1. native scalars/lists/tuples.
2. existing mlx.core.array.
3. raw DLPack capsule.
4. DLPack producer if no __mlx_array__ or NumPy array protocol is present.
5. nanobind ndarray / NumPy array path.
6. __mlx_array__.
7. final DLPack fallback.
8. generic accessor path.
That precedence works for our tests, but maintainers should choose the final
policy.
PR 5: optional native Python array bridge
Scope:
- libmlx_python_bridge.dylib.
- mlx_core_wrap_mx_array_move(mx::array*).
This is not required for DLPack. It is useful for native extensions, including
TileLang's MLX graph primitive, but it creates a public-ish ABI surface. It
should be discussed separately.
Future PR: CUDA DLPack Consumer
If MLX wants the CUDA half of #2848 addressed:
- Accept kDLCUDA tensors for parity with the new MLX CUDA backend.
- Decide whether mx.array(...) should implicitly consume CUDA buffers or
require explicit mx.from_dlpack.
The current PoC intentionally does not do this.
Open Questions for MLX Maintainers
- Do you want an explicit consumer API (mx.from_dlpack), an implicit one
(mx.array), or both?
- Should DLPack producers rank below the __mlx_array__ and NumPy-style
protocols, as in the PoC?
- Which Metal storage modes should the importer accept initially?
- What synchronization contract must the producer satisfy?
- Current DLPack headers define version 1.3; the PoC needs explicit version
validation before merge.
- Should MLX expose a supported way for native extensions to wrap a C++
mx::array into Python mlx.core.array, or should that stay downstream-only?
- Should CUDA DLPack import be a separate issue after CPU + Metal land?
What We Are Offering
If the maintainers are interested, we can:
- split the downstream branch into the PR sequence above, starting with
mx.empty and the explicit mx.from_dlpack.
- keep the Python array bridge downstream unless MLX wants that
native-extension hook.
- post this plan on [Feature] Construct mx.array from mps and cuda arrays
from other frameworks #2848 now.
If the MLX team prefers to implement the feature themselves, the downstream
code is public and we can walk through the design and test cases.
References
- [Feature] Construct mx.array from mps and cuda arrays from other frameworks
#2848
- DLPack header: 3rdparty/tvm/3rdparty/tvm-ffi/3rdparty/dlpack/include/dlpack/dlpack.h
- Proof-of-concept branch: https://github.com/DatasunriseOU/mlx
- TileLang test entry point: testing/python/metal/test_tvm_ffi_metal_stream_dlpack.py