The following code in PPOAgent.compute_advantages ignores value predictions for final observations in the trajectory and instead passes one-before-last values to the generalized_advantage_estimation function twice:
# Arg value_preds was appended with final next_step value. Make tensors
# next_value_preds by stripping first and last elements respectively.
value_preds = value_preds[:, :-1]
if self._use_gae:
  advantages = value_ops.generalized_advantage_estimation(
      values=value_preds,
      final_value=value_preds[:, -1],
      rewards=rewards,
      discounts=discounts,
      td_lambda=self._lambda,
      time_major=False,
  )
Instead, final_value should be extracted before value_preds are stripped, e.g.:
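A minimal sketch of the suggested reordering follows. It is not the library's code: `gae` is a plain NumPy stand-in written for this illustration (it is assumed to mirror the batch-major semantics of `generalized_advantage_estimation`), and `compute_advantages_buggy` / `compute_advantages_fixed` are hypothetical helper names. The only difference between the two helpers is whether `final_value` is read before or after `value_preds` is stripped.

```python
import numpy as np


def gae(values, final_value, rewards, discounts, td_lambda):
    """Batch-major GAE stand-in (illustrative, not the TF-Agents function).

    values, rewards, discounts have shape [B, T]; final_value has shape [B].
    """
    # next_values[:, t] is the value estimate for step t + 1.
    next_values = np.concatenate([values[:, 1:], final_value[:, None]], axis=1)
    deltas = rewards + discounts * next_values - values
    advantages = np.zeros_like(deltas)
    running = np.zeros_like(final_value)
    # Accumulate discounted TD errors backwards in time.
    for t in reversed(range(deltas.shape[1])):
        running = deltas[:, t] + discounts[:, t] * td_lambda * running
        advantages[:, t] = running
    return advantages


def compute_advantages_buggy(rewards, discounts, value_preds, td_lambda):
    # value_preds has shape [B, T + 1]; the last column is the final next-step value.
    value_preds = value_preds[:, :-1]
    # Bug: after stripping, [:, -1] is the one-before-last prediction.
    final_value = value_preds[:, -1]
    return gae(value_preds, final_value, rewards, discounts, td_lambda)


def compute_advantages_fixed(rewards, discounts, value_preds, td_lambda):
    # Extract the final next-step value BEFORE stripping it from value_preds.
    final_value = value_preds[:, -1]
    value_preds = value_preds[:, :-1]
    return gae(value_preds, final_value, rewards, discounts, td_lambda)


# Toy trajectory: one batch element, two steps, three value predictions.
rewards = np.zeros((1, 2))
discounts = np.ones((1, 2))
value_preds = np.array([[1.0, 2.0, 5.0]])

buggy = compute_advantages_buggy(rewards, discounts, value_preds, td_lambda=1.0)
fixed = compute_advantages_fixed(rewards, discounts, value_preds, td_lambda=1.0)
print("buggy:", buggy)
print("fixed:", fixed)
```

On this toy input the two helpers return different advantages, because the buggy version never sees the last value prediction (5.0) and bootstraps from 2.0 instead.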
Also, the comment about next_value_preds doesn't match the code, so it could be improved.