
fix backward error in DDP when running reward model finetune in RLHF #507

Merged: 1 commit merged into main from ddp_rm on Jan 8, 2024

Conversation

sywangyi (Collaborator) commented Nov 3, 2023

I am enabling RLHF on Habana. When enabling reward model finetuning on 8 Gaudi2 cards using DDP, an error happens in backward.

Code: https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/ppo_pipeline/reward_modeling.py

Command:
python ../instruction/gaudi_spawn.py --world_size 8 --use_mpi reward_modeling.py --model_name_or_path meta-llama/Llama-2-7b-hf --log_level info --num_train_epochs 3 --use_habana --output_dir output --ddp_find_unused_parameters True --logging_steps 10 --use_lazy_mode --evaluation_strategy="steps"

Error:
Traceback (most recent call last):
File "/root/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/ppo_pipeline/reward_modeling.py", line 475, in
trainer.train()
File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 504, in train
return inner_training_loop(
File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 837, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 1361, in training_step
self.accelerator.backward(loss)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1989, in backward
loss.backward(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 498, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpex/kernels/RotaryPosEmbeddingHelper.py", line 157, in backward
cos, sin, position_ids = ctx.saved_tensors
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [HPUBFloat16Type [1, 1, 512, 128]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
0%| | 0/354 [00:07<?, ?it/s]

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[3851,1],5]
Exit code: 1
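The RuntimeError says a tensor saved for backward by the fused RoPE autograd function was modified in place before backward ran. A minimal, CPU-only sketch of this failure mode (a toy stand-in, not the Habana kernel) and of the clone-based workaround used in this PR:

```python
import torch


class ScaleByCos(torch.autograd.Function):
    """Toy stand-in for a fused op that saves an input tensor for backward."""

    @staticmethod
    def forward(ctx, x, cos):
        ctx.save_for_backward(cos)
        return x * cos

    @staticmethod
    def backward(ctx, grad_out):
        # Unpacking fails with "modified by an inplace operation" if `cos`
        # was changed between forward and backward.
        (cos,) = ctx.saved_tensors
        return grad_out * cos, None


x = torch.randn(4, requires_grad=True)
cos = torch.randn(4)

out = ScaleByCos.apply(x, cos)
cos.add_(1.0)  # in-place update bumps the saved tensor's version counter
try:
    out.sum().backward()
except RuntimeError as err:
    print("reproduced:", err)

# Workaround: hand the op a private copy so later in-place updates cannot
# touch the tensor that autograd saved.
x2 = torch.randn(4, requires_grad=True)
out2 = ScaleByCos.apply(x2, cos.clone())
cos.add_(1.0)
out2.sum().backward()  # succeeds
```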

@sywangyi sywangyi requested a review from a user November 3, 2023 04:07
sywangyi (Collaborator Author) commented Nov 3, 2023

@regisss

sywangyi (Collaborator Author) commented Nov 3, 2023

I don't know why, but it only occurs in the multi-card case; a single card does not have this issue.

HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

sywangyi (Collaborator Author) commented Nov 3, 2023

@yao-matrix

@sywangyi sywangyi changed the title fix backward error in DDP when runining reward model finetune in RLHF fix backward error in DDP when running reward model finetune in RLHF Nov 3, 2023
regisss (Collaborator) commented Nov 3, 2023

@sywangyi Weird that it only occurs with DDP indeed. Does it rely on the trl library?

@mandy-li Have you ever seen this behaviour with FusedRoPE?

sywangyi (Collaborator Author) commented Nov 3, 2023

The reward modeling compute_loss is a little different from the normal one, see https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/reward_modeling.py#L268-L271. I am not sure if this is the cause of the issue.
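For reference, the loss in the linked example is roughly the following pairwise formulation (paraphrased, not verbatim); note that the chosen ("j") and rejected ("k") sequences each get their own forward pass through the same DDP-wrapped model within one training step:

```python
# Paraphrase of the linked trl reward_modeling.py compute_loss.
import torch.nn.functional as F


def compute_loss(model, inputs):
    rewards_j = model(
        input_ids=inputs["input_ids_j"], attention_mask=inputs["attention_mask_j"]
    )[0]
    rewards_k = model(
        input_ids=inputs["input_ids_k"], attention_mask=inputs["attention_mask_k"]
    )[0]
    # Pairwise ranking loss: the chosen response should be scored higher.
    return -F.logsigmoid(rewards_j - rewards_k).mean()
```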

mandy-li (Collaborator) commented Nov 3, 2023

@regisss, no, I've never seen this behavior before.

regisss (Collaborator) commented Nov 6, 2023

@sywangyi Waiting for #475 to be merged (it should happen soon) so that I can run an up-to-date CI on this branch. I'll let you know when this is done.

sywangyi (Collaborator Author) commented

@regisss, will you help merge the PR? I am enabling RLHF (PPO) on Gaudi2; the basic functionality is working now for reward modeling and reinforcement learning, and the performance looks promising. Later I would like to clean up the code and upload the PPO- and DPO-related examples to optimum-habana.

mandy-li (Collaborator) commented

@sywangyi, please file a Jira ticket with Habana including a simple test case to reproduce the problem. We need to investigate the root cause before we merge any workaround.

sywangyi (Collaborator Author) commented

@mandy-li, I have filed a Jira ticket in the Habana Jira system.

sywangyi (Collaborator Author) commented Nov 30, 2023

I still see this issue in SW release 1.13. Per @mandy-li, we could merge it as a workaround and remove it once the problem is fixed in Synapse. I tested on my side and did not see any performance regression on the finetuning or inference side.
See https://habana.atlassian.net/servicedesk/customer/portal/1/HS-1253 for the details.

regisss (Collaborator) commented Nov 30, 2023

I cannot access HS-1253, so I'll let @mandy-li and @libinta decide the way to go here.

sywangyi (Collaborator Author) commented Dec 1, 2023

@mandy-li could you comment on this?

@sywangyi sywangyi mentioned this pull request Dec 28, 2023
mandy-li (Collaborator) commented Jan 4, 2024

> @mandy-li could you comment on this?

@regisss, @sywangyi, it is OK with me to use the workaround. The problem is targeted to be fixed in SynapseAI 1.15, at which point this workaround can be removed. @sywangyi, you didn't see any perf degradation with the workaround (e.g. the extra clone ops), right?

sywangyi (Collaborator Author) commented Jan 8, 2024

> @sywangyi, you didn't see any perf degradation with the workaround (e.g. the extra clone ops), right?

Yes, I did not see any perf degradation.

regisss (Collaborator) commented Jan 8, 2024

Sounds good!
@sywangyi Could you add the following comment right above the line you modified please?

# TODO: remove `.clone()` when SynapseAI v1.15 is released

And then I'll merge it!
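In other words, the intent is roughly the following. This is a hypothetical illustration only: `apply_fused_rope` and its arguments are placeholders for the real fused RoPE call touched by this PR; the point is the clone-plus-comment pattern, not the exact signature.

```python
def rope_with_workaround(apply_fused_rope, q, k, cos, sin, position_ids):
    # TODO: remove `.clone()` when SynapseAI v1.15 is released
    return apply_fused_rope(q, k, cos.clone(), sin.clone(), position_ids)
```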

@regisss regisss merged commit fe51573 into main Jan 8, 2024
9 checks passed
@regisss regisss deleted the ddp_rm branch January 8, 2024 09:25
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024
regisss added a commit that referenced this pull request Mar 5, 2024
puneeshkhanna pushed a commit to puneeshkhanna/optimum-habana-fork that referenced this pull request Mar 11, 2024
HolyFalafel pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Mar 11, 2024