
IEEE International Conference on Robotics and Automation (ICRA) 2025
The official implementation of "Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning", which uses unimodal data to train a multimodal robotic policy.
We propose Robo-MUTUAL (Robotic Multimodal Task specification via Unimodal Learning), a new framework that enhances the cross-modality alignment capability of existing multimodal encoders by training on a broader spectrum of robot-relevant data. Specifically, we retrain DecisionNCE, a state-of-the-art robotic multimodal encoder, on an all-encompassing dataset that not only consists of large-scale robot datasets including Open-X and DROID, but also incorporates the large human-activity dataset EPIC-KITCHEN. Combined, these datasets form the most comprehensive collection to date for robotic multimodal encoder pretraining. Building on the pretrained encoders, we explore two training-free methods to bridge the modality gap within the representation space, and further introduce a cosine-similarity noise that enables efficient data augmentation in the representation space and improves generalization to new task prompts.
Extensive experiments spanning over 130 tasks and 4,000 evaluations on both the simulated LIBERO benchmark and real robot platforms showcase a promising avenue towards enabling robots to understand multimodal instructions via unimodal training.
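As a rough illustration of the representation-space augmentation, the sketch below perturbs an embedding so that its cosine similarity to the original stays near a chosen target. This is a minimal sketch of the general idea, not the exact noise scheme used in this repo; the function name and target value are illustrative.

```python
# Minimal sketch: add noise to an embedding while keeping its cosine
# similarity to the original near `target_cos`, as a form of
# representation-space data augmentation. Not the repo's exact implementation.
import torch


def cosine_similarity_noise(emb: torch.Tensor, target_cos: float = 0.9) -> torch.Tensor:
    """Return a noisy copy of `emb` whose cosine similarity to `emb` is ~target_cos."""
    direction = emb / emb.norm()                       # unit vector along the embedding
    noise = torch.randn_like(emb)                      # random direction
    noise = noise - (noise @ direction) * direction    # remove the component along `emb`
    noise = noise / noise.norm()                       # unit vector orthogonal to `emb`
    sin = (1.0 - target_cos ** 2) ** 0.5
    noisy = target_cos * direction + sin * noise       # unit vector at the desired angle
    return noisy * emb.norm()                          # restore the original magnitude


emb = torch.randn(1024)
aug = cosine_similarity_noise(emb, target_cos=0.9)
print(torch.nn.functional.cosine_similarity(emb, aug, dim=0))  # ~0.9
```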
- LIBERO Benchmark
- Real World Experiments
For more details on the method and its performance, please visit our project page.
-
First, set up this repository:
```bash
conda create -n robo_mutual python=3.9 && conda activate robo_mutual
git clone git@github.com:255isWhite/Robo_MUTUAL.git
cd Robo_MUTUAL
pip install -e . && pip install -r requirements.txt

# download ResNet34 pretrained weights from huggingface
git clone https://hf-mirror.com/timm/resnet34.a1_in1k   # for Chinese mainland users
git clone https://huggingface.co/timm/resnet34.a1_in1k  # for others
mv resnet34.a1_in1k models--timm--resnet34.a1_in1k
mv models--timm--resnet34.a1_in1k ~/.cache/huggingface/hub/
```
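Optionally, you can verify that the cached ResNet-34 weights are found. This assumes `timm` is pulled in by requirements.txt (an assumption on our part) and uses the standard timm API:

```python
# Quick check that the cached resnet34.a1_in1k weights load from the local
# huggingface cache populated above.
import timm

model = timm.create_model("resnet34.a1_in1k", pretrained=True)
print(sum(p.numel() for p in model.parameters()))  # ~21.8M parameters
```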
-
Then install LIBERO and download all datasets.
```bash
unzip <LIBERO_datasets_zip> -d Robo_MUTUAL/data/libero/
cd Robo_MUTUAL/data/libero/data_process
python hdf2jpg.py      # this will convert hdf5 to jpg
python jpg2json-ac.py  # this will format a json file
```
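For reference, the snippet below illustrates the kind of conversion `hdf2jpg.py` performs. The HDF5 key layout (`data/<demo>/obs/agentview_rgb`) and the file name are assumptions about the LIBERO dataset format; rely on the provided script for the actual conversion.

```python
# Illustrative sketch only: dump per-frame jpgs from a LIBERO-style hdf5 file.
# Key names and the input path below are assumptions, not this repo's script.
import os
import h5py
from PIL import Image

with h5py.File("libero_goal_demo.hdf5", "r") as f:          # placeholder file name
    for demo_name, demo in f["data"].items():
        frames = demo["obs"]["agentview_rgb"][:]             # (T, H, W, 3) uint8 frames
        out_dir = os.path.join("jpg_out", demo_name)
        os.makedirs(out_dir, exist_ok=True)
        for t, frame in enumerate(frames):
            Image.fromarray(frame).save(os.path.join(out_dir, f"{t:05d}.jpg"))
```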
-
Then set up DecisionNCE. Do not download the original checkpoints; instead, please download this version.
```bash
mkdir -p ~/.cache/DecisionNCE
mv <above_downloaded_ckpt> DecisionNCE-T
mv DecisionNCE-T ~/.cache/DecisionNCE
```
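To confirm the checkpoint is picked up, you can try loading the encoder. The calls below follow the interface shown in the DecisionNCE repository README (`DecisionNCE.load`, `encode_image`, `encode_text`); treat them as an assumption and adapt them if your installed version differs.

```python
# Optional sanity check that ~/.cache/DecisionNCE/DecisionNCE-T is found.
# Loader/encoder calls mirror the DecisionNCE README and are an assumption;
# the expected image format (PIL vs. tensor) may vary by version.
import torch
import DecisionNCE
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DecisionNCE.load("DecisionNCE-T", device=device)

image_emb = model.encode_image(Image.open("frame.jpg"))  # "frame.jpg" is a placeholder
text_emb = model.encode_text("open the drawer")          # any task prompt
print(torch.nn.functional.cosine_similarity(image_emb, text_emb, dim=-1))
```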
-
We provide basic training and evaluation scripts for LIBERO-GOAL. To train with language_goal:
```bash
cd <path to>/Robo_MUTUAL
# first, change the wandb key to your own
./train_scripts/libero_goal_lang.sh
```
You can find the evaluation results for both image_goal and language_goal in the experiments folder.
-
To train with image_goal:
```bash
# first, change the wandb key to your own
./train_scripts/libero_goal_img.sh
```
-
For manual evaluation:
```bash
./eval/eval_libero.sh
```
If you find our code or paper helpful, please cite our paper as:
@article{li2024robo,
title={Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning},
author={Li, Jianxiong and Wang, Zhihao and Zheng, Jinliang and Zhou, Xiaoai and Wang, Guanming and Song, Guanglu and Liu, Yu and Liu, Jingjing and Zhang, Ya-Qin and Yu, Junzhi and Zhan, Xianyuan},
journal={arXiv preprint arXiv:2410.01529},
year={2024}
}
Thanks to the great efforts of the open-source community: LIBERO, DecisionNCE, BearRobot.
All code, model weights, and data are licensed under the MIT license.