feat(policies): Add X-VLA #2405
Conversation
Pull Request Overview
This PR adds XVLA (Extended Vision-Language-Action) policy support to LeRobot. XVLA is a multi-modal policy that combines vision, language, and proprioceptive inputs with a domain-aware transformer architecture for robot manipulation tasks.
Key changes:
- Implements XVLA policy with Florence-2 vision-language backbone and soft-prompted transformer
- Adds domain-aware action spaces (EE6D, Joint, AGIBOT) with specialized loss functions
- Integrates XVLA into the LeRobot policy factory and configuration system
Reviewed Changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `train.sh` | Training script for XVLA with wandb and dataset configuration |
| `test_xvla.py` | Test script to instantiate and verify the XVLA policy |
| `src/lerobot/policies/xvla/transformer.py` | Core transformer architecture with domain-aware layers and soft prompts |
| `src/lerobot/policies/xvla/processing_xvla.py` | Multi-modal processor for images and language with padding/masking |
| `src/lerobot/policies/xvla/modeling_xvla.py` | Main policy class implementing the training/inference pipeline |
| `src/lerobot/policies/xvla/modeling_florence2.py` | Florence-2 vision-language model (encoder/decoder) |
| `src/lerobot/policies/xvla/configuration_xvla.py` | XVLA configuration with Florence-2 integration |
| `src/lerobot/policies/xvla/configuration_florence2.py` | Florence-2 model configuration classes |
| `src/lerobot/policies/xvla/action_hub.py` | Action space registry with EE6D, Joint, AGIBOT variants |
| `src/lerobot/policies/factory.py` | Factory integration for XVLA policy creation |
| `src/lerobot/policies/__init__.py` | Export of the XVLA configuration |
2toinf left a comment:
We recommend not freezing the vision and language encoders by default, as this approach may not align with the official implementation. In fact, freezing these two components often leads to a performance drop. We have observed that unfreezing them results in better task adaptation.
Additionally, we strongly advise applying a custom learning rate (typically 1/10th of the learning rate used for the VLM) during training, as suggested in the paper. This adjustment helps achieve optimal performance during fine-tuning.
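For illustration, a minimal sketch of such a parameter-group split, assuming the VLM submodules live under a hypothetical `model.florence2` prefix and that `policy` is the instantiated XVLA policy:

```python
import torch

# Hypothetical sketch: give the pretrained VLM 1/10th of the base learning
# rate, as suggested in the X-VLA paper. The "model.florence2" prefix is an
# assumption; adjust it to the real module tree.
base_lr = 1e-4
vlm_params, other_params = [], []
for name, param in policy.named_parameters():
    if not param.requires_grad:
        continue
    if name.startswith("model.florence2"):
        vlm_params.append(param)
    else:
        other_params.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": vlm_params, "lr": base_lr / 10},  # smaller LR for the pretrained VLM
        {"params": other_params, "lr": base_lr},  # full LR for the policy transformer / soft prompts
    ]
)
```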
Further, I’d like to check X-VLA’s performance after post-training with the LeRobot pipeline. Does it match the officially reported results?
Hello @2toinf, yes, this is standard in LeRobot: we run a reproducibility check where we compare the expected outputs of the preprocessor with those produced by the LeRobot implementation, and we also compare the expected logits of the produced actions with those from the original implementation, along with our Libero benchmark checker.
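For illustration, such a check might be sketched as follows; the reference file path and the `predict_action_chunk` method name are assumptions, not the actual test code:

```python
import torch

# Illustrative reproducibility check: compare LeRobot outputs against tensors
# saved from the original X-VLA implementation. Paths, method name, and
# tolerances below are made up for the sketch.
expected_actions = torch.load("reference/original_xvla_actions.pt")

policy.eval()
with torch.no_grad():
    produced_actions = policy.predict_action_chunk(batch)  # hypothetical method name

torch.testing.assert_close(produced_actions, expected_actions, atol=1e-4, rtol=1e-4)
```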
Added detailed instructions for implementing a custom optimizer and modifying parameter retrieval for X-VLA finetuning. Signed-off-by: Jinliang Zheng <[email protected]>
michel-aractingi left a comment:
Overall, great work Jade. The PR is very close to being ready.
| "ninja>=1.11.1,<2.0.0", | ||
| "flash-attn>=2.5.9,<3.0.0 ; sys_platform != 'darwin'" | ||
| ] | ||
| xlva = ["lerobot[transformers-dep]"] |
`xlva` is a typo.
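i.e., the extra should presumably read `xvla = ["lerobot[transformers-dep]"]`.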
```python
instance = cls(config, **kwargs)
# step 2: locate model.safetensors
if os.path.isdir(model_id):
    print("Loading weights from local directory")
```
use logging.info instead of print
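For example:

```python
import logging

logging.info("Loading weights from local directory")  # instead of print(...)
```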
```python
except HfHubHTTPError as e:
    raise FileNotFoundError(f"model.safetensors not found on the Hub at {model_id}") from e

print(f"Loading checkpoint from {model_file}")
```
logging.info instead of print
```python
# or deepcopy
# step 4: load into instance
instance.load_state_dict(state_dict, strict=True)
print("Loaded XVLA checkpoint")
```
same here
| """ | ||
|
|
||
| domain_id: int = 0 | ||
| device: str = "cuda" |
This seems hardcoded; can I run XVLA if I don't have a GPU?
Can we use DeviceProcessorStep, since it's already the next step in the pipeline?
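One way to avoid hardcoding CUDA would be a device default that falls back to CPU; a sketch with an illustrative config class name:

```python
from dataclasses import dataclass

import torch


@dataclass
class XVLAProcessorConfig:  # illustrative name, not the actual class
    domain_id: int = 0
    # Fall back to CPU when CUDA is unavailable instead of hardcoding "cuda".
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
```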
```python
if obs:
    for v in obs.values():
        if isinstance(v, torch.Tensor):
            batch_size = v.shape[0]
```
You can probably infer the device from obs? `device = v.device`
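A sketch of inferring both values from the first tensor found in `obs` (variable names follow the quoted snippet):

```python
import torch

# Sketch: derive batch size and device from whatever tensor is present in obs,
# rather than hardcoding the device in the config.
batch_size, device = 1, torch.device("cpu")
for v in obs.values():
    if isinstance(v, torch.Tensor):
        batch_size = v.shape[0]
        device = v.device
        break
```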
```python
)


if is_flash_attn_2_available():
```
Second `if is_flash_attn...` condition, duplicated from line 56.
| """The FLORENCE2 vision model without any head""", | ||
| FLORENCE2_START_DOCSTRING, | ||
| ) | ||
| class Florence2VisionModel(Florence2PreTrainedModel): |
This class seems unused? Remove it if true.
| """The FLORENCE2 vision model with projection layer""", | ||
| FLORENCE2_START_DOCSTRING, | ||
| ) | ||
| class Florence2VisionModelWithProjection(Florence2PreTrainedModel): |
Also, is this class used anywhere? Remove it if not.
```python
if len(self._queues[ACTION]) == 0:
    actions = self._get_action_chunk(batch)
    self._queues[ACTION].extend(actions.transpose(0, 1)[: self.config.n_action_steps])
```
Verify that the actions are trimmed according to the requested action space, as I had to manually trim them in the real-robot test.
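For illustration, a guard along these lines could trim the chunk before queueing; the `action_dim` field name is an assumption, not the actual config attribute:

```python
# Hypothetical guard: trim the predicted chunk to the action dimension of the
# requested action space before queueing. `action_dim` is an assumed field name.
actions = self._get_action_chunk(batch)
actions = actions[..., : self.config.action_dim]
self._queues[ACTION].extend(actions.transpose(0, 1)[: self.config.n_action_steps])
```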
What this does
feat(policies): Add X-VLA
X-VLA was proposed here: https://thu-air-dream.github.io/X-VLA/ and won Champion @ AgiBot World Challenge @ IROS 2025.
This is the full integration of it inside LeRobot by the LeRobot team.
Libero also got updated:
1. It can handle different control modes (delta vs. absolute).
2. You can now specify the max episode length; otherwise it falls back to a default depending on the task suite you choose.
TODO:
- Train and evaluate on Libero and report success rate
- Test on a real-world task, like picking and transferring a cube
- Add testing

For finetuning / training:
❄️ VLM vision encoder: FROZEN
❄️ VLM language encoder: FROZEN
🔥 Policy transformer: TRAINABLE
🔥 Soft prompts: TRAINABLE
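For reference, a sketch of how this split could be applied; the submodule names below are assumptions, not the actual attribute paths:

```python
# Illustrative freeze/train split matching the list above; submodule names
# are assumptions, not the real XVLA attribute paths.
for p in policy.model.vision_encoder.parameters():
    p.requires_grad_(False)  # VLM vision encoder: frozen
for p in policy.model.language_encoder.parameters():
    p.requires_grad_(False)  # VLM language encoder: frozen
for p in policy.model.transformer.parameters():
    p.requires_grad_(True)  # policy transformer: trainable
for p in policy.model.soft_prompts.parameters():
    p.requires_grad_(True)  # soft prompts: trainable
```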