
LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Jiachun Jin1, Zetong Zhou1, Xiao Yang2, Hao Zhang3, Pengfei Liu1, Jun Zhu2, Zhijie Deng1

1Shanghai Jiao Tong University    2Tsinghua University    3UCSD

Paper HuggingFace

Overview

LatentUM unifies all modalities within a shared semantic latent space, enabling interleaved cross-modal reasoning without pixel-space mediation. Unlike existing unified models that require pixel decoding as a bridge between understanding and generation, LatentUM reasons directly over its own generated visual content.

Key Features

  • Shared Semantic Latent Space: Text and visual tokens share the same space, enabling direct cross-modal reasoning over generated visual content.
  • MBAQ: Visual tokenizer trained to preserve VLM understanding behavior rather than pixel reconstruction.
  • MoME: Decoupled understanding/generation branches with shared self-attention for cross-modal interaction.
  • Decoupled Pixel Decoder: Optional diffusion decoder for pixel rendering, trained independently to keep the latent space semantics-focused.
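To make the shared-latent-space idea concrete, here is a minimal conceptual sketch (not the LatentUM API; all names below are illustrative): text tokens and generated visual latents live in one interleaved sequence, so later text tokens can attend to earlier visual latents directly, with pixel decoding left as an optional final step.

```python
# Conceptual sketch only -- NOT the LatentUM implementation.
# Illustrates interleaved cross-modal reasoning: text tokens and visual
# latents share one sequence, so no pixel-space round trip is needed.
from dataclasses import dataclass

@dataclass
class Token:
    modality: str    # "text" or "visual"
    payload: object  # a token id for text, a latent vector for visual

def interleave(text_ids, visual_latents):
    """Build a single shared-space sequence: text first, then the model's
    own generated visual latents, which subsequent text can attend to."""
    seq = [Token("text", t) for t in text_ids]
    seq += [Token("visual", v) for v in visual_latents]
    return seq

seq = interleave([101, 2023, 102], [[0.1, -0.2], [0.3, 0.0]])
print([t.modality for t in seq])
```

The key point the sketch mirrors: visual content stays in latent form inside the reasoning sequence; only the decoupled pixel decoder ever turns latents into pixels.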

Demos

T2I Generation

Visual Spatial Planning

(demo GIFs showing "forward" and "turn right" actions)

World Modeling

(demo GIFs showing "forward" and "turn right" actions)

Getting Started

Installation

git clone https://github.com/SJTU-DENG-Lab/LatentUM.git
cd LatentUM
uv sync

Model Weights

Pre-trained weights are available on HuggingFace:

Model             | Base                      | Description                                                 | Download
------------------|---------------------------|-------------------------------------------------------------|---------
LatentUM_Base     | InternVL3.5-4B            | Base model for understanding + generation                   | Link
LatentUM_Vis-Plan | LatentUM_Base             | Fine-tuned for visual spatial planning                      | Link
LatentUM_WM       | LatentUM_Base             | Fine-tuned for action-conditioned world modeling            | Link
LatentUM_GenEval  | LatentUM_Base             | Fine-tuned for GenEval with self-reflection + pixel reward  | Link
Pixel Decoder     | stable-diffusion-3-medium | Pixel decoder                                               | Link

Examples

Image Understanding

uv run python - <<'PY'
import torch

from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base",
    device = device,
    dtype  = dtype,
)
answer = model.answer(
    "asset/blue_apple.png",
    "Describe this image.",
)
print(answer)
PY

Image Generation

uv run python - <<'PY'
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Base", # alternative: "SJTU-DENG-Lab/LatentUM-GenEval"
    device = device,
    dtype  = dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)
images = model.generate_images(
    "a photo of a cute dog",
    decoder       = decoder,
    show_progress = True,
)
images[0].save("generated.png")
print("saved to generated.png")
PY

Visual Spatial Planning

uv run python - <<'PY'
import torch

from model.decoder import LatentUMDecoderModel
from model.latentum import LatentUMModel
from model.latentum.spatial_planning import run_frozenlake_demo

dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"

model = LatentUMModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Vis-Plan",
    device=device,
    dtype=dtype,
)
decoder = LatentUMDecoderModel.from_pretrained(
    "SJTU-DENG-Lab/LatentUM-Decoder",
    device=device,
    dtype=dtype,
)
result = run_frozenlake_demo(
    model,
    decoder,
    image                    = "asset/frozenlake_level6_000.png",
    output_dir               = "asset/frozenlake_demo",
    max_steps                = 16,
    max_text_tokens_per_step = 10,
    temperature              = 0.7,
    top_k                    = 50,
    top_p                    = 0.95,
    gif_duration             = 500,
)
print(result["full_text"])
print("saved to asset/frozenlake_demo")
print(f"gif saved to {result['gif_path']}")
PY

World Modeling

uv run python script/run_latentum_wm.py

Interleaved SFT Example

See interleaved_sft_example.md for the full interleaved SFT example, including the JSONL data format and the shipped FrozenLake training sample in asset/frozenlake_interleaved_example/.
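For orientation, the snippet below builds one hypothetical interleaved record and round-trips it through JSON. The authoritative schema lives in interleaved_sft_example.md; every field name here ("conversation", "role", "content", "type", "path") is an assumption for illustration only.

```python
# Hypothetical interleaved-SFT record -- field names are illustrative;
# consult interleaved_sft_example.md for the real schema.
import json

record = {
    "conversation": [
        {"role": "user", "content": [
            {"type": "image", "path": "asset/frozenlake_level6_000.png"},
            {"type": "text", "text": "Plan a path to the goal."},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "forward"},
            {"type": "image", "path": "step_1.png"},   # interleaved image turn
            {"type": "text", "text": "turn right"},
        ]},
    ]
}

line = json.dumps(record)           # one record per line in the JSONL file
assert json.loads(line) == record   # round-trips cleanly
print(len(record["conversation"]))
```

The interleaving shows up in the assistant turn: text and image entries alternate within a single response, matching the text/visual alternation in the FrozenLake demo.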

Citation

If you find this work useful, please cite:

@article{jin2026latentum,
  title={LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model},
  author={Jin, Jiachun and Zhou, Zetong and Yang, Xiao and Zhang, Hao and Liu, Pengfei and Zhu, Jun and Deng, Zhijie},
  journal={arXiv preprint arXiv:2604.02097},
  year={2026}
}

Acknowledgements

We thank the authors of InternVL, BLIP3o, UniTok, and Stable Diffusion 3.5 for open-sourcing their great works!

License

This project is released under the Apache 2.0 License.
