
jadechoghari (Member) commented Nov 7, 2025

What this does

feat(policies): Add X-VLA

X-VLA was proposed here: https://thu-air-dream.github.io/X-VLA/ and won the AgiBot World Challenge championship at IROS 2025.

This PR is the full integration of X-VLA into LeRobot by the LeRobot team.
Libero was also updated to handle: 1) different control modes (delta vs. absolute), and 2) a configurable max episode length; if unspecified, it falls back to a default based on the chosen task suite.
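For context on the two control modes: in absolute mode the policy outputs target poses directly, while in delta mode it outputs offsets that are accumulated onto the current state. A minimal sketch in plain Python (illustrative only, not the LeRobot API):

```python
def to_absolute(state, delta_actions):
    """Accumulate delta actions onto the current state to get absolute targets.

    state: list of floats (current pose/joint values)
    delta_actions: list of per-step offset lists
    """
    absolute = []
    current = list(state)
    for delta in delta_actions:
        current = [c + d for c, d in zip(current, delta)]
        absolute.append(list(current))
    return absolute
```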

TODO:
  • Train and evaluate on Libero and report success rates
  • Test on a real-world task, such as picking or transferring a cube
  • Add tests

For fine-tuning / training:
❄️ VLM vision encoder: FROZEN
❄️ VLM language encoder: FROZEN
🔥 Policy transformer: TRAINABLE
🔥 Soft prompts: TRAINABLE
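The freeze/train split above can be sketched in PyTorch. The module names here (`vision_encoder`, `language_encoder`, `policy_transformer`, `soft_prompts`) are placeholders, not the actual LeRobot attribute paths:

```python
import torch.nn as nn

def freeze_module(module: nn.Module) -> None:
    """Disable gradients for every parameter in the module."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholder stand-in for the real XVLA policy structure.
class TinyXVLA(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)    # stands in for the Florence-2 vision tower
        self.language_encoder = nn.Linear(8, 8)  # stands in for the Florence-2 language tower
        self.policy_transformer = nn.Linear(8, 8)
        self.soft_prompts = nn.Embedding(4, 8)

policy = TinyXVLA()
freeze_module(policy.vision_encoder)    # frozen
freeze_module(policy.language_encoder)  # frozen
# policy_transformer and soft_prompts remain trainable.

trainable = [n for n, p in policy.named_parameters() if p.requires_grad]
```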

@jadechoghari jadechoghari added the enhancement Suggestions for new features or improvements label Nov 7, 2025
Copilot AI review requested due to automatic review settings November 7, 2025 11:59
@jadechoghari jadechoghari added the policies Items related to robot policies label Nov 7, 2025
@jadechoghari jadechoghari self-assigned this Nov 7, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds XVLA (Extended Vision-Language-Action) policy support to LeRobot. XVLA is a multi-modal policy that combines vision, language, and proprioceptive inputs with a domain-aware transformer architecture for robot manipulation tasks.

Key changes:

  • Implements XVLA policy with Florence-2 vision-language backbone and soft-prompted transformer
  • Adds domain-aware action spaces (EE6D, Joint, AGIBOT) with specialized loss functions
  • Integrates XVLA into the LeRobot policy factory and configuration system

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 8 comments.

Summary per file:
  • train.sh: Training script for XVLA with wandb and dataset configuration
  • test_xvla.py: Test script to instantiate and verify the XVLA policy
  • src/lerobot/policies/xvla/transformer.py: Core transformer architecture with domain-aware layers and soft prompts
  • src/lerobot/policies/xvla/processing_xvla.py: Multi-modal processor for images and language with padding/masking
  • src/lerobot/policies/xvla/modeling_xvla.py: Main policy class implementing the training/inference pipeline
  • src/lerobot/policies/xvla/modeling_florence2.py: Florence-2 vision-language model (encoder/decoder)
  • src/lerobot/policies/xvla/configuration_xvla.py: XVLA configuration with Florence-2 integration
  • src/lerobot/policies/xvla/configuration_florence2.py: Florence-2 model configuration classes
  • src/lerobot/policies/xvla/action_hub.py: Action space registry with EE6D, Joint, and AGIBOT variants
  • src/lerobot/policies/factory.py: Factory integration for XVLA policy creation
  • src/lerobot/policies/__init__.py: Export XVLA configuration


2toinf left a comment

We recommend not freezing the vision and language encoders by default, as this approach may not align with the official implementation. In fact, freezing these two components often leads to a performance drop. We have observed that unfreezing them results in better task adaptation.
Additionally, we strongly advise applying a custom learning rate (typically 1/10th of the learning rate used for the VLM) during training, as suggested in the paper. This adjustment helps achieve optimal performance during fine-tuning.
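The suggested 1/10 learning rate for the VLM encoders maps naturally onto optimizer parameter groups. A sketch with placeholder module names (not the actual LeRobot attribute paths):

```python
import torch
import torch.nn as nn

# Placeholder stand-in for the policy; real attribute paths may differ.
class TinyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(4, 4)
        self.language_encoder = nn.Linear(4, 4)
        self.transformer = nn.Linear(4, 4)

policy = TinyPolicy()
base_lr = 1e-4

# VLM encoders get 1/10 of the base learning rate, per the suggestion above.
vlm_params = list(policy.vision_encoder.parameters()) + list(policy.language_encoder.parameters())
vlm_ids = {id(p) for p in vlm_params}
rest = [p for p in policy.parameters() if id(p) not in vlm_ids]

optimizer = torch.optim.AdamW([
    {"params": vlm_params, "lr": base_lr / 10},
    {"params": rest, "lr": base_lr},
])
```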


2toinf commented Nov 26, 2025

Further, I’d like to check X-VLA’s performance after post-training with the LeRobot pipeline. Does it match the officially reported results?


jadechoghari commented Nov 26, 2025

Hello @2toinf, yes, this is standard in LeRobot. We run a reproducibility check where we compare the expected logits from the preprocessor with the logits produced by the LeRobot implementation, and we also compare the expected logits of the produced actions with those from the original implementation.
See: https://github.com/huggingface/lerobot/blob/171d50e85478537cfcae721845293b17beffd41d/tests/policies/xvla/test_xvla_original_vs_lerobot.py

Along with our Libero benchmark checker
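The core of such a parity check is an elementwise comparison within a tolerance, along these lines (illustrative, not the actual test code):

```python
import torch

# Outputs from the original implementation and the LeRobot port should agree
# within a small absolute tolerance on identical inputs.
def outputs_match(original: torch.Tensor, ported: torch.Tensor, atol: float = 1e-4) -> bool:
    return torch.allclose(original, ported, atol=atol)
```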

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

michel-aractingi (Collaborator) left a comment

Overall great work, Jade. The PR is very close to being ready.

"ninja>=1.11.1,<2.0.0",
"flash-attn>=2.5.9,<3.0.0 ; sys_platform != 'darwin'"
]
xlva = ["lerobot[transformers-dep]"]

xlva is a typo; it should be xvla.

instance = cls(config, **kwargs)
# step 2: locate model.safetensors
if os.path.isdir(model_id):
print("Loading weights from local directory")

use logging.info instead of print
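The suggested change, sketched with the standard-library logging module (the logger name is illustrative):

```python
import io
import logging

logger = logging.getLogger("xvla.loading")  # illustrative logger name
logger.setLevel(logging.INFO)
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

# Before: print("Loading weights from local directory")
# After: logging respects configured levels and handlers.
logger.info("Loading weights from local directory")

log_output = stream.getvalue()
```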

except HfHubHTTPError as e:
raise FileNotFoundError(f"model.safetensors not found on the Hub at {model_id}") from e

print(f"Loading checkpoint from {model_file}")

logging.info instead of print

# or deepcopy
# step 4: load into instance
instance.load_state_dict(state_dict, strict=True)
print("Loaded XVLA checkpoint")

same here

"""

domain_id: int = 0
device: str = "cuda"

This seems hardcoded; can I run XVLA if I don't have a GPU?
Can we use DeviceProcessorStep, since it's already the next step in the pipeline?

if obs:
for v in obs.values():
if isinstance(v, torch.Tensor):
batch_size = v.shape[0]

You can probably infer the device from obs? device = v.device
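A sketch of the suggested change: infer both batch size and device from the first tensor found in obs (the helper name is illustrative):

```python
import torch

def infer_batch_info(obs: dict):
    """Infer batch size and device from the first tensor found in obs (sketch)."""
    for v in obs.values():
        if isinstance(v, torch.Tensor):
            return v.shape[0], v.device
    raise ValueError("no tensor found in obs")
```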

)


if is_flash_attn_2_available():

second condition if is_flash_attn...
Duplicated from line 56

"""The FLORENCE2 vision model without any head""",
FLORENCE2_START_DOCSTRING,
)
class Florence2VisionModel(Florence2PreTrainedModel):

This class is unused? remove if true

"""The FLORENCE2 vision model with projection layer""",
FLORENCE2_START_DOCSTRING,
)
class Florence2VisionModelWithProjection(Florence2PreTrainedModel):
Is this class used anywhere? Remove it if not.


if len(self._queues[ACTION]) == 0:
actions = self._get_action_chunk(batch)
self._queues[ACTION].extend(actions.transpose(0, 1)[: self.config.n_action_steps])

Verify that the actions are trimmed according to the requested action space as I had to manually trim it in the real robot test
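If trimming to the requested action dimensionality is needed before queueing, it could look like this sketch (the function and argument names are illustrative, not the LeRobot API):

```python
import torch

def trim_actions(actions: torch.Tensor, action_dim: int) -> torch.Tensor:
    """Keep only the first `action_dim` dims of a predicted action chunk.

    actions: (chunk_len, batch, full_action_dim)
    """
    return actions[..., :action_dim]
```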
