After fine-tuning on my own dataset, the model performed poorly in the test set #79

Open
chenyt31 opened this issue Feb 9, 2025 · 2 comments

chenyt31 commented Feb 9, 2025

Thank you for your excellent work.

I ran into a problem while fine-tuning on a custom RLBench dataset. The model performs well in the scenarios it was trained on, but the success rate drops to 0 on the test set. The test scenarios are similar to the training ones: the objects are the same, but their positions or the colors of the distracting objects are changed.
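
To narrow down which change breaks the policy, the evaluation episodes could be bucketed by what differs from training (object positions versus distractor colors) and scored separately. The snippet below is only a minimal sketch of that tally. The function name, the condition labels, and the example outcomes are hypothetical placeholders, not measured results and not part of the RDT or RLBench code.

```python
from collections import defaultdict


def success_rate_by_condition(episodes):
    """Return the success rate per condition label.

    `episodes` is an iterable of (condition_label, succeeded) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [successes, total]
    for label, succeeded in episodes:
        counts[label][0] += int(succeeded)
        counts[label][1] += 1
    return {label: s / n for label, (s, n) in counts.items()}


# Placeholder outcomes for illustration only, not real evaluation data.
example = [
    ("train_layout", True),
    ("train_layout", True),
    ("new_position", False),
    ("new_distractor_color", False),
]
rates = success_rate_by_condition(example)
for label, rate in rates.items():
    print(f"{label}: {rate:.2f}")
```

If only one bucket fails, that points at the corresponding factor (spatial generalization or sensitivity to distractor appearance) rather than at the fine-tuning setup as a whole.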

Training Configuration:

  • 100 demos used for training
  • Image resolution: 256 × 256
  • Camera viewpoint: front
  • Action space: 7 joint angles + 1 gripper open/close
  • Task: close_jar—pick up the lid and place it on the jar
  • Variations: All 100 demos perform the same skill, but the target jar’s color, the distracting jar’s color, and the positions of objects in the scene vary.
  • Data preprocessing follows a process similar to the ManiSkill scripts.
  • Checkpoint used for testing: checkpoint-132000
  • Training script:
    • Configuration: 6 × 4090 GPUs, 170M RDT
deepspeed --include localhost:1,2,3,4,5,6 main.py \
    --deepspeed="./configs/zero2.json" \
    --pretrained_model_name_or_path="../rdt_asset/rdt-170m" \
    --pretrained_text_encoder_name_or_path=$TEXT_ENCODER_NAME \
    --pretrained_vision_encoder_name_or_path=$VISION_ENCODER_NAME \
    --output_dir=$OUTPUT_DIR \
    --train_batch_size=24 \
    --sample_batch_size=24 \
    --max_train_steps=200000 \
    --checkpointing_period=1000 \
    --sample_period=500 \
    --checkpoints_total_limit=40 \
    --lr_scheduler="constant" \
    --learning_rate=1e-4 \
    --mixed_precision="bf16" \
    --dataloader_num_workers=8 \
    --image_aug \
    --dataset_type="finetune" \
    --state_noise_snr=40 \
    --load_from_hdf5 \
    --report_to=wandb \
    --precomp_lang_embed
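
For a sense of scale, the number of training samples seen by checkpoint-132000 can be estimated from the command above. This is a rough calculation that assumes `--train_batch_size` is a per-GPU value and that no gradient accumulation is in effect. Neither is stated in this report, so both should be checked against the training script.

```python
# Values taken from the command above.
num_gpus = 6               # --include localhost:1,2,3,4,5,6
per_gpu_batch = 24         # --train_batch_size (ASSUMED to be per GPU)
checkpoint_step = 132_000  # checkpoint-132000

# Assumes one optimizer step per batch (no gradient accumulation).
global_batch = num_gpus * per_gpu_batch
samples_seen = global_batch * checkpoint_step

print(f"global batch size: {global_batch}")
print(f"samples seen by step {checkpoint_step}: {samples_seen:,}")
```

Under those assumptions the 100 demos have been sampled many times over by this checkpoint, which is relevant when judging whether the policy has memorized the training scenes.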

Training Curve:

[Image: training curve]

Video:

train_success.mp4
test_fail.mp4

Do you have any insights into this issue? Thank you!

@CZtheHusky
It seems that the video is not properly uploaded?

@chenyt31
Author

> It seems that the video is not properly uploaded?

Thanks, I have reuploaded the videos. They should work now.
