LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Jiachun Jin1, Zetong Zhou1, Xiao Yang2, Hao Zhang3, Pengfei Liu1, Jun Zhu2, Zhijie Deng1
1Shanghai Jiao Tong University 2Tsinghua University 3UCSD
LatentUM unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.
- Shared Semantic Latent Space: Text and visual tokens share the same space, enabling direct cross-modal reasoning over generated visual content.
- MBAQ: Visual tokenizer trained to preserve VLM understanding behavior rather than pixel reconstruction.
- MoME: Decoupled understanding/generation branches with shared self-attention for cross-modal interaction.
- Decoupled Pixel Decoder: Optional diffusion decoder for pixel rendering, trained independently to keep the latent space semantics-focused.
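To make the shared-latent-space idea concrete, here is a toy, dependency-free sketch (not the actual LatentUM architecture; all names and dimensions are illustrative). It shows why a single attention pass can mix text tokens with a model-generated visual token when both live in the same embedding space, with no pixel decoding in between:

```python
# Toy illustration (NOT LatentUM code): text and visual tokens share one
# embedding space, so one attention step can reason over both directly.
import math

DIM = 4  # hypothetical shared embedding dimension

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """One scaled dot-product attention step over a mixed token sequence."""
    scores = softmax([
        sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM)
        for key in keys
    ])
    return [sum(w * v[i] for w, v in zip(scores, values)) for i in range(DIM)]

# A mixed sequence: two "text" tokens plus one generated "visual" token.
# Because all three share the same 4-d space, no modality bridge is needed.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_token = [0.0, 0.0, 1.0, 0.5]  # imagined latent visual token
sequence = text_tokens + [visual_token]

# The next text token attends directly over the generated visual latent.
query = [0.0, 0.0, 1.0, 0.0]
out = attend(query, sequence, sequence)
print(out)  # 4-d vector blending text and visual content
```

In a pixel-mediated pipeline, `visual_token` would first have to be decoded to an image and re-encoded before the language side could see it; here the generated latent is consumed as-is.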
```shell
git clone https://github.com/SJTU-DENG-Lab/LatentUM.git
cd LatentUM
uv sync
```

Pre-trained weights are available on HuggingFace:
| Model | Base | Description | Download |
|---|---|---|---|
| LatentUM_Base | InternVL3.5-4B | Base model for understanding + generation | Link |
| LatentUM_Vis-Plan | LatentUM_Base | Fine-tuned for visual spatial planning | Link |
| LatentUM_WM | LatentUM_Base | Fine-tuned for action-conditioned world modeling | Link |
| LatentUM_GenEval | LatentUM_Base | Fine-tuned for GenEval with self-reflection + pixel reward | Link |
| Pixel Decoder | stable-diffusion-3-medium | Pixel decoder | Link |
```shell
uv run python - <<'PY'
import torch

from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base unified model for image understanding.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)

answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
PY
```

```shell
uv run python - <<'PY'
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)

# The decoupled pixel decoder renders latent visual tokens into images.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
print("saved to generated.png")
PY
```

```shell
uv run python - <<'PY'
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel
from model.latentum.spatial_planning import run_frozenlake_demo

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Planning model fine-tuned for visual spatial planning.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Vis-Plan",
    device=device,
    dtype=dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

result = run_frozenlake_demo(
    model,
    decoder,
    image="asset/frozenlake_level6_000.png",
    output_dir="asset/frozenlake_demo",
    max_steps=16,
    max_text_tokens_per_step=10,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    gif_duration=500,
)
print(result["full_text"])
print("saved to asset/frozenlake_demo")
print(f"gif saved to {result['gif_path']}")
PY
```

```shell
python script/run_latentum_wm.py
```
See interleaved_sft_example.md for the full interleaved SFT example, including the JSONL data format and the shipped FrozenLake training sample in asset/frozenlake_interleaved_example/.
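As a rough illustration of what one interleaved JSONL record could look like (the authoritative schema is in interleaved_sft_example.md; every field name and file name below is hypothetical):

```python
# Hypothetical interleaved SFT record -- field names are illustrative only;
# see interleaved_sft_example.md for the real schema.
import json

record = {
    "conversations": [
        {"role": "user", "content": "Plan a safe path across the lake."},
        # Interleaved assistant turn: text reasoning alternates with image
        # placeholders, each paired with a file from the sample directory.
        {"role": "assistant",
         "content": "Move down. <image> Then move right. <image>"},
    ],
    "images": [
        "asset/frozenlake_interleaved_example/step_000.png",
        "asset/frozenlake_interleaved_example/step_001.png",
    ],
}

# JSONL stores one such record per line.
line = json.dumps(record, ensure_ascii=False)
print(line[:60], "...")
```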
If you find this work useful, please cite:

```bibtex
@article{jin2026latentum,
  title={LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author={Jin, Jiachun and Zhou, Zetong and Yang, Xiao and Zhang, Hao and Liu, Pengfei and Zhu, Jun and Deng, Zhijie},
  journal={arXiv preprint arXiv:2604.02097},
  year={2026}
}
```
We thank the authors of InternVL, BLIP3o, UniTok, and Stable Diffusion 3.5 for open-sourcing their excellent work!
This project is released under the Apache 2.0 License.





