fix dpo graph compile error in evaluation #630

Merged 1 commit on Jan 9, 2024

sywangyi (Collaborator) commented Jan 9, 2024

fix graph compilation issue in evaluation

sywangyi requested a review from regisss as a code owner January 9, 2024 13:10

sywangyi (Collaborator, Author) commented Jan 9, 2024

@libinta

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss (Collaborator) left a comment

LGTM

What is the error you got? A wrong dtype in the graph?

sywangyi (Collaborator, Author) commented Jan 9, 2024

> LGTM
>
> What is the error you got? A wrong dtype in the graph?

  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3066, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1119, in evaluation_loop
    initial_output = super().evaluation_loop(
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 1578, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1051, in prediction_step
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="eval")
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 948, in get_batch_loss_metrics
    ) = self.concatenated_forward(self.ref_model, batch)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/trl/trainer/dpo_trainer.py", line 406, in concatenated_forward
    all_logits = model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1521, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1530, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 676, in forward
    return wrapped_hpugraph_forward(cache, stream, orig_fwd, args, kwargs, disable_tensor_cache, asynchronous, dry_run, max_graphs, hash_with_views)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 586, in wrapped_hpugraph_forward
    cached.graph.replay(cached.asynchronous)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 47, in replay
    _hpu_C.replay(self.hpu_graph, asynchronous)
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generice failure].

It is a graph compile failure during evaluation.

regisss (Collaborator) commented Jan 9, 2024

Thanks, the error message is not very explicit 😁

regisss merged commit 4e67153 into main Jan 9, 2024
7 of 8 checks passed
regisss deleted the graph_error_dpo branch January 9, 2024 14:18
jychen21 pushed a commit to jychen21/optimum-habana that referenced this pull request Feb 27, 2024