LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Jiachun Jin1, Zetong Zhou1, Xiao Yang2, Hao Zhang3, Pengfei Liu1, Jun Zhu2, Zhijie Deng1
1Shanghai Jiao Tong University 2Tsinghua University 3UCSD
LatentUM unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.
- Shared Semantic Latent Space: Text and visual tokens share the same space, enabling direct cross-modal reasoning over generated visual content.
- MBAQ: Visual tokenizer trained to preserve VLM understanding behavior rather than pixel reconstruction.
- MoME: Decoupled understanding/generation branches with shared self-attention for cross-modal interaction.
- Decoupled Pixel Decoder: Optional diffusion decoder for pixel rendering, trained independently to keep the latent space semantics-focused.
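To make the shared-latent-space idea concrete, here is a toy, dependency-free sketch (not the actual LatentUM architecture; all names and dimensions are illustrative). It shows why a single attention pass can mix text tokens with a model-generated visual token when both live in the same embedding space, with no pixel decoding in between:

```python
# Toy illustration (NOT LatentUM code): text and visual tokens share one
# embedding space, so one attention step can reason over both directly.
import math

DIM = 4  # hypothetical shared embedding dimension

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """One scaled dot-product attention step over a mixed token sequence."""
    scores = softmax([
        sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM)
        for key in keys
    ])
    return [sum(w * v[i] for w, v in zip(scores, values)) for i in range(DIM)]

# A mixed sequence: two "text" tokens plus one generated "visual" token.
# Because all three share the same 4-d space, no modality bridge is needed.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_token = [0.0, 0.0, 1.0, 0.5]  # imagined latent visual token
sequence = text_tokens + [visual_token]

# The next text token attends directly over the generated visual latent.
query = [0.0, 0.0, 1.0, 0.0]
out = attend(query, sequence, sequence)
print(out)  # 4-d vector blending text and visual content
```

In a pixel-mediated pipeline, `visual_token` would first have to be decoded to an image and re-encoded before the language side could see it; here the generated latent is consumed as-is.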
```shell
git clone https://github.com/SJTU-DENG-Lab/LatentUM.git
cd LatentUM
uv sync
```

Pre-trained weights are available on HuggingFace:
| Model | Base | Description | Download |
|---|---|---|---|
| LatentUM_Base | InternVL3.5-4B | Base model for understanding + generation | Link |
| LatentUM_Vis-Plan | LatentUM_Base | Fine-tuned for visual spatial planning | Link |
| LatentUM_WM | LatentUM_Base | Fine-tuned for action-conditioned world modeling | Link |
| LatentUM_GenEval | LatentUM_Base | Fine-tuned for GenEval with self-reflection + pixel reward | Link |
| Pixel Decoder | stable-diffusion-3-medium | Pixel decoder | Link |
```shell
uv run python - <<'PY'
import torch

from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base unified model for image understanding.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device=device,
    dtype=dtype,
)

answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
PY
```

```shell
uv run python - <<'PY'
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",  # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device=device,
    dtype=dtype,
)

# The decoupled pixel decoder renders latent visual tokens into images.
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

images = model.generate_images(
    "a photo of a cute dog",
    decoder=decoder,
    show_progress=True,
)
images[0].save("generated.png")
print("saved to generated.png")
PY
```

```shell
uv run python - <<'PY'
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel
from model.latentum.spatial_planning import run_frozenlake_demo

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

# Planning model fine-tuned for visual spatial planning.
model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Vis-Plan",
    device=device,
    dtype=dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)

result = run_frozenlake_demo(
    model,
    decoder,
    image="asset/frozenlake_level6_000.png",
    output_dir="asset/frozenlake_demo",
    max_steps=16,
    max_text_tokens_per_step=10,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    gif_duration=500,
)
print(result["full_text"])
print("saved to asset/frozenlake_demo")
print(f"gif saved to {result['gif_path']}")
PY
```

```shell
python script/run_latentum_wm.py
```
See interleaved_sft_example.md for the full interleaved SFT example, including the JSONL data format and the shipped FrozenLake training sample in asset/frozenlake_interleaved_example/.
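As a rough illustration of what one interleaved JSONL record could look like (the authoritative schema is in interleaved_sft_example.md; every field name and file name below is hypothetical):

```python
# Hypothetical interleaved SFT record -- field names are illustrative only;
# see interleaved_sft_example.md for the real schema.
import json

record = {
    "conversations": [
        {"role": "user", "content": "Plan a safe path across the lake."},
        # Interleaved assistant turn: text reasoning alternates with image
        # placeholders, each paired with a file from the sample directory.
        {"role": "assistant",
         "content": "Move down. <image> Then move right. <image>"},
    ],
    "images": [
        "asset/frozenlake_interleaved_example/step_000.png",
        "asset/frozenlake_interleaved_example/step_001.png",
    ],
}

# JSONL stores one such record per line.
line = json.dumps(record, ensure_ascii=False)
print(line[:60], "...")
```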
If you find this work useful, please cite:

```bibtex
@article{jin2026latentum,
  title={LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author={Jin, Jiachun and Zhou, Zetong and Yang, Xiao and Zhang, Hao and Liu, Pengfei and Zhu, Jun and Deng, Zhijie},
  journal={arXiv preprint arXiv:2604.02097},
  year={2026}
}
```
We thank the authors of InternVL, BLIP3o, UniTok, and Stable Diffusion 3.5 for open-sourcing their excellent work!
This project is released under the Apache 2.0 License.





