
Robo-MUTUAL:
Robotic Multimodal Task Specification via Unimodal Learning

IEEE International Conference on Robotics and Automation (ICRA) 2025

The official implementation of "Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning", which uses unimodal data to train a multimodal robotic policy.

Paper | Project page

Method

We propose Robo-MUTUAL (Robotic Multimodal Task Specification via Unimodal Learning), a framework that enhances the Cross-modality Alignment capability of existing multimodal encoders by consuming a broader spectrum of robot-relevant data. Specifically, we retrain DecisionNCE, a state-of-the-art robotic multimodal encoder, on an all-encompassing dataset that combines large-scale robot datasets, including Open-X and DROID, with the large human-activity dataset EPIC-KITCHEN; together, these form the most comprehensive collection to date for robotic multimodal encoder pretraining. Building on the pretrained encoders, we explore two training-free methods to bridge the modality gap in the representation space, and further introduce an effective cosine-similarity noise for efficient data augmentation in that space, enabling generalization to new task prompts.
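
To make the cosine-similarity noise concrete, here is a minimal Python sketch that perturbs an embedding while holding its cosine similarity to the original at a target value. The function name, the `sim` parameter, and the orthogonal-sampling scheme are our own illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def add_cosine_similarity_noise(z: torch.Tensor, sim: float = 0.9) -> torch.Tensor:
        """Illustrative sketch: returns z_aug with cos(z, z_aug) == sim."""
        z_unit = F.normalize(z, dim=-1)
        # Sample a random direction and remove its component along z
        noise = torch.randn_like(z)
        noise = noise - (noise * z_unit).sum(-1, keepdim=True) * z_unit
        noise = F.normalize(noise, dim=-1)
        # Mix the two orthonormal directions to hit the target cosine similarity
        z_aug = sim * z_unit + (1.0 - sim ** 2) ** 0.5 * noise
        # Restore the original norm
        return z_aug * z.norm(dim=-1, keepdim=True)

Because z_unit and noise are orthonormal, cos(z, z_aug) equals sim by construction; sampling sim from a range then yields the representation-space augmentation described above.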

Extensive experiments, spanning over 130 tasks and 4,000 evaluations on both simulated LIBERO environments and real robot platforms, showcase a promising avenue toward enabling robots to understand multimodal instructions through unimodal training.

  • LIBERO Benchmark
  • Real World Experiments

For more details on the method's performance, please visit our project page.
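
As one concrete (but generic) illustration of a training-free bridge, a common trick is to shift both modalities toward a shared origin by subtracting each modality's mean embedding; the helper below sketches this idea and is not necessarily one of the two methods used in the paper.

    import torch

    def center_by_modality(img_emb: torch.Tensor, txt_emb: torch.Tensor):
        # img_emb, txt_emb: (N, D) image / text embeddings from the pretrained
        # encoder. Hypothetical helper for illustration, not the repo's API.
        img_centered = img_emb - img_emb.mean(dim=0, keepdim=True)
        txt_centered = txt_emb - txt_emb.mean(dim=0, keepdim=True)
        return img_centered, txt_centered

After such centering, a policy conditioned on language embeddings during training can be conditioned on image embeddings at test time, which is the cross-modal behavior the scripts below evaluate.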

Quick Start

  1. First set up this repository

    conda create -n robo_mutual python=3.9 && conda activate robo_mutual
    git clone git@github.com:255isWhite/Robo_MUTUAL.git
    cd Robo_MUTUAL && pip install -e . && pip install -r requirements.txt
    
    # download ResNet34 pretrained weights from huggingface
    git clone https://hf-mirror.com/timm/resnet34.a1_in1k # for Chinese mainland users
    git clone https://huggingface.co/timm/resnet34.a1_in1k # for others
    
    mv resnet34.a1_in1k models--timm--resnet34.a1_in1k
    mv models--timm--resnet34.a1_in1k ~/.cache/huggingface/hub/
  2. Then install LIBERO and download all datasets.

    unzip <LIBERO_datasets_zip> -d Robo_MUTUAL/data/libero/
    cd Robo_MUTUAL/data/libero/data_process
    python hdf2jpg.py # convert the hdf5 demonstrations to jpg frames
    python jpg2json-ac.py # build the json annotation file
  3. Then set up DecisionNCE. Do not download the original checkpoints; instead, please download this version (a usage sketch for the checkpoint follows after this list).

    mkdir -p ~/.cache/DecisionNCE
    mv <above_downloaded_ckpt> DecisionNCE-T
    mv DecisionNCE-T ~/.cache/DecisionNCE
  4. We provide basic training and evaluation scripts for LIBERO-GOAL. For training with language_goal:

    cd <path to>/Robo_MUTUAL
    # First, change the wandb key in this script
    ./train_scripts/libero_goal_lang.sh

    You can see the evaluation results for both image_goal and language_goal in the experiments folder.

  5. For training with image_goal:

    # First, change the wandb key in this script
    ./train_scripts/libero_goal_img.sh
  6. For manual evaluation:

    ./eval/eval_libero.sh
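
To sanity-check the DecisionNCE-T checkpoint installed in step 3, the following minimal sketch embeds an image goal and a language goal into the shared space and compares them. It assumes the upstream DecisionNCE API (DecisionNCE.load, encode_image, encode_text) and that encode_image accepts a preprocessed image tensor; adjust to the version you installed.

    import torch
    import torch.nn.functional as F
    import DecisionNCE

    # Assumes the checkpoint was placed in ~/.cache/DecisionNCE in step 3
    model = DecisionNCE.load("DecisionNCE-T", device="cuda")

    image = torch.rand(1, 3, 224, 224, device="cuda")  # placeholder preprocessed frame
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(["put the bowl on the stove"])  # hypothetical prompt

    # Cosine similarity between the image goal and the language goal
    print(F.cosine_similarity(img_emb, txt_emb).item())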

Citation

If you find our code or paper helpful, please cite:

@article{li2024robo,
    title={Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning},
    author={Li, Jianxiong and Wang, Zhihao and Zheng, Jinliang and Zhou, Xiaoai and Wang, Guanming and Song, Guanglu and Liu, Yu and Liu, Jingjing and Zhang, Ya-Qin and Yu, Junzhi and Zhan, Xianyuan},
    journal={arXiv preprint arXiv:2410.01529},
    year={2024}
}

Acknowledgement

Thanks to the great efforts of the open-source community: LIBERO, DecisionNCE, BearRobot

License

All the code, model weights, and data are licensed under the MIT License.
