
IEEE International Conference on Robotics and Automation (ICRA) 2025
The official implementation of "Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning", which uses unimodal data to train a multimodal robotic policy.
We propose Robo-MUTUAL (Robotic Multimodal Task specification via Unimodal Learning), a new framework that enhances the cross-modality alignment capability of existing multimodal encoders by training on a broader spectrum of robot-relevant data. Specifically, we retrain DecisionNCE, a state-of-the-art robotic multimodal encoder, on an all-encompassing dataset that not only consists of large-scale robot datasets including Open-X and DROID, but also incorporates the large human-activity dataset EPIC-KITCHEN. Combined, these datasets form the most comprehensive collection to date for robotic multimodal encoder pretraining. Building on the pretrained encoders, we explore two training-free methods to bridge the modality gap within the representation space, and further introduce a cosine-similarity noise that enables efficient data augmentation in the representation space and improves generalization to new task prompts.
Extensive experiments spanning over 130 tasks and 4,000 evaluations on both the simulated LIBERO benchmark and real robot platforms showcase a promising avenue towards enabling robots to understand multimodal instructions via unimodal training.
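As a rough illustration of the representation-space augmentation, the sketch below perturbs an embedding so that its cosine similarity to the original stays near a chosen target. This is a minimal sketch of the general idea, not the exact noise scheme used in this repo; the function name and target value are illustrative.

```python
# Minimal sketch: add noise to an embedding while keeping its cosine
# similarity to the original near `target_cos`, as a form of
# representation-space data augmentation. Not the repo's exact implementation.
import torch


def cosine_similarity_noise(emb: torch.Tensor, target_cos: float = 0.9) -> torch.Tensor:
    """Return a noisy copy of `emb` whose cosine similarity to `emb` is ~target_cos."""
    direction = emb / emb.norm()                       # unit vector along the embedding
    noise = torch.randn_like(emb)                      # random direction
    noise = noise - (noise @ direction) * direction    # remove the component along `emb`
    noise = noise / noise.norm()                       # unit vector orthogonal to `emb`
    sin = (1.0 - target_cos ** 2) ** 0.5
    noisy = target_cos * direction + sin * noise       # unit vector at the desired angle
    return noisy * emb.norm()                          # restore the original magnitude


emb = torch.randn(1024)
aug = cosine_similarity_noise(emb, target_cos=0.9)
print(torch.nn.functional.cosine_similarity(emb, aug, dim=0))  # ~0.9
```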
- LIBERO Benchmark
- Real World Experiments
For more details on the method and its performance, please visit our project page.
-
First, set up this repository:
```bash
conda create -n robo_mutual python=3.9 && conda activate robo_mutual
git clone git@github.com:255isWhite/Robo_MUTUAL.git
cd Robo_MUTUAL
pip install -e . && pip install -r requirements.txt

# download ResNet34 pretrained weights from huggingface
git clone https://hf-mirror.com/timm/resnet34.a1_in1k   # for Chinese mainland users
git clone https://huggingface.co/timm/resnet34.a1_in1k  # for others
mv resnet34.a1_in1k models--timm--resnet34.a1_in1k
mv models--timm--resnet34.a1_in1k ~/.cache/huggingface/hub/
```
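Optionally, you can verify that the cached ResNet-34 weights are found. This assumes `timm` is pulled in by requirements.txt (an assumption on our part) and uses the standard timm API:

```python
# Quick check that the cached resnet34.a1_in1k weights load from the local
# huggingface cache populated above.
import timm

model = timm.create_model("resnet34.a1_in1k", pretrained=True)
print(sum(p.numel() for p in model.parameters()))  # ~21.8M parameters
```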
-
Then install LIBERO and download all datasets.
```bash
unzip <LIBERO_datasets_zip> -d Robo_MUTUAL/data/libero/
cd Robo_MUTUAL/data/libero/data_process
python hdf2jpg.py      # this will convert hdf5 to jpg
python jpg2json-ac.py  # this will format a json file
```
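For reference, the snippet below illustrates the kind of conversion `hdf2jpg.py` performs. The HDF5 key layout (`data/<demo>/obs/agentview_rgb`) and the file name are assumptions about the LIBERO dataset format; rely on the provided script for the actual conversion.

```python
# Illustrative sketch only: dump per-frame jpgs from a LIBERO-style hdf5 file.
# Key names and the input path below are assumptions, not this repo's script.
import os
import h5py
from PIL import Image

with h5py.File("libero_goal_demo.hdf5", "r") as f:          # placeholder file name
    for demo_name, demo in f["data"].items():
        frames = demo["obs"]["agentview_rgb"][:]             # (T, H, W, 3) uint8 frames
        out_dir = os.path.join("jpg_out", demo_name)
        os.makedirs(out_dir, exist_ok=True)
        for t, frame in enumerate(frames):
            Image.fromarray(frame).save(os.path.join(out_dir, f"{t:05d}.jpg"))
```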
-
Then set up DecisionNCE. Do not download the original checkpoints; instead, please download this version.
```bash
mkdir -p ~/.cache/DecisionNCE
mv <above_downloaded_ckpt> DecisionNCE-T
mv DecisionNCE-T ~/.cache/DecisionNCE
```
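To confirm the checkpoint is picked up, you can try loading the encoder. The calls below follow the interface shown in the DecisionNCE repository README (`DecisionNCE.load`, `encode_image`, `encode_text`); treat them as an assumption and adapt them if your installed version differs.

```python
# Optional sanity check that ~/.cache/DecisionNCE/DecisionNCE-T is found.
# Loader/encoder calls mirror the DecisionNCE README and are an assumption;
# the expected image format (PIL vs. tensor) may vary by version.
import torch
import DecisionNCE
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DecisionNCE.load("DecisionNCE-T", device=device)

image_emb = model.encode_image(Image.open("frame.jpg"))  # "frame.jpg" is a placeholder
text_emb = model.encode_text("open the drawer")          # any task prompt
print(torch.nn.functional.cosine_similarity(image_emb, text_emb, dim=-1))
```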
-
We provide basic training and evaluation scripts for LIBERO-GOAL. To train with language_goal:
```bash
cd <path to>/Robo_MUTUAL
# first, change the wandb key to your own
./train_scripts/libero_goal_lang.sh
```
You can find the evaluation results for both image_goal and language_goal in the experiments folder.
-
To train with image_goal:
```bash
# first, change the wandb key to your own
./train_scripts/libero_goal_img.sh
```
-
For manual evaluation:
```bash
./eval/eval_libero.sh
```
If you find our code or paper helpful, please cite our paper as:
@article{li2024robo,
title={Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning},
author={Li, Jianxiong and Wang, Zhihao and Zheng, Jinliang and Zhou, Xiaoai and Wang, Guanming and Song, Guanglu and Liu, Yu and Liu, Jingjing and Zhang, Ya-Qin and Yu, Junzhi and Zhan, Xianyuan},
journal={arXiv preprint arXiv:2410.01529},
year={2024}
}
Thanks to the great efforts of the open-source community: LIBERO, DecisionNCE, BearRobot.
All code, model weights, and data are licensed under the MIT license.