
Average return very low in tf.DDPG #1077

Open
surbhi1944 opened this issue Nov 27, 2019 · 4 comments · May be fixed by #1981
Labels: bug (Something isn't working), tf

Comments
surbhi1944 commented Nov 27, 2019

Formula for off-policy methods:

total_timesteps = n_epochs * n_epoch_cycles * batch_size

Then with
n_epochs = 1400
n_epoch_cycles = 20
batch_size = 64
min_buffer_size = 10^6
total_timesteps = 1400 * 20 * 64 = 1,792,000
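
Written out as a quick Python check (plain arithmetic, nothing garage-specific):

n_epochs = 1400
n_epoch_cycles = 20
batch_size = 64

total_timesteps = n_epochs * n_epoch_cycles * batch_size
print(total_timesteps)  # 1792000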

I obtained the graph shown in the figure for DDPG_Walker2d-v3. It shows a very low average return, but most research papers report an average return of ~2500 at 1 million timesteps. How should I set the parameters to get comparable results?
My Code:

import gym
import tensorflow as tf
import time
from garage.experiment import run_experiment
from garage.np.exploration_strategies import OUStrategy
from garage.replay_buffer import SimpleReplayBuffer
from garage.tf.algos import DDPG
from garage.tf.envs import TfEnv
from garage.tf.experiment import LocalTFRunner
from garage.tf.policies import ContinuousMLPPolicy
from garage.tf.q_functions import ContinuousMLPQFunction
import random
from datetime import datetime, timedelta
import numpy as np
import os

def run_task(snapshot_config, *_):
    """Run task."""

    with LocalTFRunner(snapshot_config=snapshot_config) as runner:
        env = gym.make('Walker2d-v3')
        env = TfEnv(env)
        action_noise = OUStrategy(env.spec, sigma=0.2)

        policy = ContinuousMLPPolicy(env_spec=env.spec,
                                     hidden_sizes=[400, 300],
                                     hidden_nonlinearity=tf.nn.relu,
                                     output_nonlinearity=tf.nn.tanh)

        qf = ContinuousMLPQFunction(env_spec=env.spec,
                                    hidden_sizes=[400, 300],
                                    hidden_nonlinearity=tf.nn.relu)

        replay_buffer = SimpleReplayBuffer(env_spec=env.spec,
                                           size_in_transitions=int(1e6),
                                           time_horizon=100)

        ddpg = DDPG(env_spec=env.spec,
                    policy=policy,
                    policy_lr=1e-4,
                    qf_lr=1e-3,
                    qf=qf,
                    replay_buffer=replay_buffer,
                    target_update_tau=1e-3,
                    n_train_steps=50,
                    discount=0.99,
                    buffer_batch_size=64,
                    n_epoch_cycles=20,
                    min_buffer_size=int(1e6),
                    exploration_strategy=action_noise,
                    policy_optimizer=tf.train.AdamOptimizer,
                    qf_weight_decay=0.01,
                    qf_optimizer=tf.train.AdamOptimizer)

        runner.setup(algo=ddpg, env=env)

        runner.train(n_epochs=2000, n_epoch_cycles=20, batch_size=64)

seeds = [21]
for difsed in range(1):
    i = 0
    seed = seeds[difsed]
    start_time = time.time()
    run_experiment(
        run_task,
        snapshot_mode='last',
        seed=seed,
        exp_name=str(seed) + "_" + str(i),
        log_dir=r"/home/surabhi/Downloads/github/garage/result/ddpg/walk-v22/" + str(seed) + "/" + str(i) + "/"
    )
    # Record wall-clock time taken by this run.
    with open(r"/home/surabhi/Downloads/github/garage/result/ddpg/walk-v22/time.txt", "a") as f:
        f.write('seed ' + str(seed) + ' itr ' + str(i) + ' start ' + str(start_time) +
                ' elapsed ' + str(timedelta(seconds=time.time() - start_time)) + "\n")


I found that optimization only starts from epoch 782 (because 1,000,000 // (20 * 64) = 781). Hence the condition on line 272 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py only becomes true from that epoch, and policy optimization starts from that point onward. Is this the reason I am not getting good results?
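
Spelled out (plain Python, just to show where epoch 782 comes from):

min_buffer_size = int(1e6)
steps_per_epoch = 20 * 64  # n_epoch_cycles * batch_size = 1280 transitions added per epoch
print(min_buffer_size // steps_per_epoch)  # 781, so optimization only begins around epoch 782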

Another question I want to ask: why is this loop (line 271 of ddpg.py) repeated n_train_steps times? What is the purpose of running it n_train_steps times? Is it doing a kind of rollout repeated n_train_steps times, where the length of a rollout is either the end of an episode or a trajectory of length batch_size = 64 (line 173 of https://github.com/rlworkgroup/garage/blob/master/src/garage/sampler/off_policy_vectorized_sampler.py#L66)?

krzentner changed the title from "Too less avg_retrun" to "Average return very low in tf.DDPG" on Dec 5, 2019
krzentner (Contributor) commented Dec 5, 2019

Hi Surbhi1944, thanks for opening this issue.

Optimization starts at epoch 782 because you've set min_buffer_size to 1000000. Usually, when people are benchmarking this task, they set this parameter much lower. For example, our benchmarks set it to 10000 for all MuJoCo tasks. I believe that this is why you're seeing such low performance.
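
For concreteness, the only change to your script would be the min_buffer_size argument (int(1e4) here is the benchmark value mentioned above; every other argument is kept exactly as in your script):

ddpg = DDPG(env_spec=env.spec,
            policy=policy,
            policy_lr=1e-4,
            qf_lr=1e-3,
            qf=qf,
            replay_buffer=replay_buffer,
            target_update_tau=1e-3,
            n_train_steps=50,
            discount=0.99,
            buffer_batch_size=64,
            n_epoch_cycles=20,
            min_buffer_size=int(1e4),  # was int(1e6)
            exploration_strategy=action_noise,
            policy_optimizer=tf.train.AdamOptimizer,
            qf_weight_decay=0.01,
            qf_optimizer=tf.train.AdamOptimizer)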

About your other question: n_train_steps is a parameter we use to make our epoch size match other implementations, and we are working on removing it. Soon, we will be logging performance based on the number of time steps, which should make it easier to compare performance.

I do believe that our implementations of DDPG should perform much better than this, as indicated by this benchmark result below. If you find that it doesn't after changing min_buffer_size, then I can look into it further. By the way, in which papers do you see DDPG get an average return of 2500 after 1M time steps on Walker2d? At least in Soft Actor-Critic Algorithms and Applications and in Deep Reinforcement Learning that Matters, the expected average return is a little over 1000.

Hopefully that answers your questions, but please let me know if you have any others or there was something I missed.

surbhi1944 (Author) commented:
Thanks for the reply.

Now I understand the purpose of the min_buffer_size variable, but I am still confused about n_train_steps. I think it represents the number of weight updates we want for the neural network (one update per batch): the larger its value, the more times the optimize_policy function (line 275 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py) is called, and hence the more times the weights are updated. Please correct me if I am wrong.

If we want to train on an environment such as Humanoid (which needs ~10 million timesteps), should we increase n_train_steps, or only n_epochs and n_epoch_cycles?
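
Here is how I currently picture that loop (an illustrative sketch with made-up names, not garage's actual code):

for epoch in range(n_epochs):
    for cycle in range(n_epoch_cycles):
        # collect batch_size environment transitions with the exploration policy
        paths = sampler.obtain_samples(batch_size)
        replay_buffer.add(paths)
        # the loop at line 271 of ddpg.py, repeated n_train_steps times
        for _ in range(n_train_steps):
            # the check at line 272: only train once min_buffer_size transitions are stored
            if replay_buffer.n_transitions_stored >= min_buffer_size:
                minibatch = replay_buffer.sample(buffer_batch_size)
                optimize_policy(minibatch)  # one gradient update per sampled minibatch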

I saw:

  1. ~2000 average return for DDPG on Walker2d at 1 million timesteps in the research paper "Addressing Function Approximation Error in Actor-Critic Methods".

  2. ~1800 average episodic return at 1 million timesteps in the research paper "Mutual-Information Regularization in Markov Decision Processes and Actor-Critic Learning".

  3. ~1800 average return at 1 million timesteps at the link below:
    https://spinningup.openai.com/en/latest/spinningup/bench.html#id10
    [image]

[image]

The results presented in your benchmark graph above also do not come anywhere close; they converge at ~220 (garage_tf_trial1_seed30). Is there a formula to compare ~220 (yours) with ~1000 (others)? And what exactly is reported in that graph: the return of a single episode, the return over multiple batches, the average return over the previous 100 episodes, the average return over 1000 timesteps, or something else?

avnishn (Member) commented Dec 6, 2019

Hello all, it seems that we have an issue in the way that we log average returns over time.
The issue seems to be over here:

self._episode_rewards.extend([

Essentially, over time we're computing an average over all the returns that our sampler has observed from rolling out the policy. We should only be calculating average returns over a window of recent training epochs (30 or 100, not all 500-1000).

We'll make a fix and re run our baselines to verify that this is the case. Thank you @surbhi1944 .
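
Roughly, the idea of the fix (a minimal sketch only, not the actual patch):

from collections import deque

new_returns = [100.0, 250.0, 300.0]  # dummy episode returns reported by the sampler this epoch

# Current behaviour (simplified): every completed episode's return is kept forever,
# so the logged average covers all of training.
episode_rewards = []
episode_rewards.extend(new_returns)
lifetime_average = sum(episode_rewards) / len(episode_rewards)

# Windowed alternative: keep only the most recent 100 episode returns.
recent_rewards = deque(maxlen=100)
recent_rewards.extend(new_returns)
recent_average = sum(recent_rewards) / len(recent_rewards)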

ryanjulian (Member) commented:
Quick update -- we were able to confirm your report and found lackluster performance in our tf/DDPG implementation. We are now auditing our implementation and making this fix the highest priority.

We're also benchmarking our torch/DDPG implementation to confirm whether or not the bug is shared.

We'll keep this issue updated.
