Incorrect calculation of generalized advantage estimates in PPO #953

zhezherun · 2025-02-12T13:59:59Z

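The following code in PPOAgent.compute_advantages ignores the value prediction for the final observation of each trajectory. Because value_preds is truncated before final_value is read, the second-to-last value prediction is passed to generalized_advantage_estimation twice, once as the last element of values and again as final_value: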

    # Arg value_preds was appended with final next_step value. Make tensors
    #   next_value_preds by stripping first and last elements respectively.
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=value_preds[:, -1],
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )

Instead, final_value should be extracted before value_preds are stripped, e.g.:

    final_value_preds = value_preds[:, -1]
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=final_value_preds,
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )

Also, the comment about next_value_preds doesn't match the code (no next_value_preds tensor is created there), so it should be updated.
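
To illustrate how much the bootstrap value matters, here is a minimal NumPy sketch of the standard GAE recursion. It is not the TF-Agents implementation; the helper name gae and the toy inputs are made up for this example, and it assumes batch-major tensors of shape [B, T] with the bootstrap value appended as column T of value_preds.

```python
import numpy as np


def gae(rewards, values, final_value, discounts, td_lambda):
    """Hypothetical batch-major GAE helper (illustration only).

    rewards, values, discounts: arrays of shape [B, T].
    final_value: array of shape [B], the value of the state after step T-1.
    """
    # next_values[:, t] is V(s_{t+1}); the last column is the bootstrap value.
    next_values = np.concatenate([values[:, 1:], final_value[:, None]], axis=1)
    # One-step TD residuals.
    deltas = rewards + discounts * next_values - values
    advantages = np.zeros_like(deltas)
    running = np.zeros(deltas.shape[0])
    # Accumulate discounted residuals backwards in time.
    for t in reversed(range(deltas.shape[1])):
        running = deltas[:, t] + discounts[:, t] * td_lambda * running
        advantages[:, t] = running
    return advantages


# Toy inputs (assumed): one trajectory, T = 3, so value_preds has T + 1 columns.
value_preds = np.array([[1.0, 2.0, 3.0, 10.0]])
rewards = np.zeros((1, 3))
discounts = np.ones((1, 3))
td_lambda = 1.0

# Buggy order: truncate first, then read the "final" value from the truncated tensor.
stripped = value_preds[:, :-1]
buggy = gae(rewards, stripped, stripped[:, -1], discounts, td_lambda)

# Fixed order: read the final value before truncating.
final_value_preds = value_preds[:, -1]
correct = gae(rewards, value_preds[:, :-1], final_value_preds, discounts, td_lambda)

print("buggy:  ", buggy)
print("correct:", correct)
```

With these inputs the buggy ordering yields advantages [2, 1, 0], while the fixed ordering yields [9, 8, 7]: the last-step residual collapses to r + (discount - 1) * V, so the final value prediction never reaches any advantage.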
