Average return very low in tf.DDPG #1077
Comments
Hi Surbhi1944, thanks for opening this issue. Optimization starts at epoch 782 because you've set min_buffer_size to 10^6, so no policy optimization happens until the replay buffer holds that many transitions. About your other question, I do believe that our implementation of DDPG should perform much better than this, as indicated by the benchmark result below. If you find that it doesn't after changing min_buffer_size, please let us know. Hopefully that answers your questions, but please let me know if you have any others or if there was something I missed.
Thanks for the reply. Now I understand the purpose of the min_buffer_size variable, but I am still confused about n_train_steps. I think it represents the number of times we update the weights of the neural network (one update per minibatch): the larger its value, the more times optimize_policy (line 275 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py) is called, and hence the more times the weights are updated. Please correct me if I am wrong. If we want to train on an environment such as Humanoid (which needs ~10 million timesteps), should we increase n_train_steps, or only n_epochs and n_epoch_cycles?
I also saw that the results presented in your graph (above) do not come anywhere near ~1000 and converge at about ~220 (garage_tf_trial1_seed30). Is there any formula for comparing your ~220 with the ~1000 reported by others? And what exactly is reported in the graph above: the return of a single episode, the return over multiple batches, the average return over the previous 100 episodes, the average return over 1000 timesteps, or something else?
Hello all, it seems that we have an issue in the way that we log average returns over time (garage/src/garage/torch/algos/ddpg.py, line 144 in 1def654). Essentially, we're computing an average over all the returns our sampler has ever observed from rolling out the policy, when we should only be calculating average returns over a window of recent training epochs (30 or 100, say, not all 500-1000). We'll make a fix and re-run our baselines to verify that this is the case. Thank you @surbhi1944.
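For illustration only (a rough sketch, not the garage logging code), the difference between averaging over every return seen so far and averaging over a recent window looks roughly like this:

```python
from collections import deque

# Hypothetical logger sketch: contrast averaging over *all* episode returns
# collected since the start of training with averaging over only the most
# recent episodes.
all_returns = []                    # grows for the entire run
recent_returns = deque(maxlen=100)  # keeps only the last 100 episode returns

def record_episode(episode_return):
    all_returns.append(episode_return)
    recent_returns.append(episode_return)

def log_average_returns():
    if not all_returns:
        return {}
    # The all-time average is dragged down by early episodes from an
    # untrained policy; the windowed average tracks current performance.
    return {
        "AverageReturn(all-time)": sum(all_returns) / len(all_returns),
        "AverageReturn(last-100)": sum(recent_returns) / len(recent_returns),
    }
```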
Quick update -- we were able to confirm your report and found lackluster performance in our tf/DDPG implementation. We are now auditing our implementation and making this fix the highest priority. We're also benchmarking our torch/DDPG implementation to confirm whether or not the bug is shared. We'll keep this issue updated.
Formula for the off-policy method:
total_timesteps = n_epochs * n_epoch_cycles * batch_size
then if
n_epochs=1400
n_epoch_cycles=20
batch_size=64
min_buffer_size=10^6
then total_timesteps=1400 * 20 * 64=1,792,000
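To make the arithmetic concrete, here is a small standalone Python check (not garage code) of the numbers above, including the epoch at which the buffer first reaches min_buffer_size:

```python
# Standalone arithmetic check (not garage code): total environment steps
# collected over the run, and the epoch at which the replay buffer first
# holds min_buffer_size transitions so that optimization can begin.
n_epochs = 1400
n_epoch_cycles = 20
batch_size = 64          # environment steps collected per cycle
min_buffer_size = 10**6

steps_per_epoch = n_epoch_cycles * batch_size               # 1280
total_timesteps = n_epochs * steps_per_epoch                # 1,792,000
epochs_to_fill_buffer = min_buffer_size // steps_per_epoch  # 781

print(f"total_timesteps = {total_timesteps:,}")
print(f"optimization starts around epoch {epochs_to_fill_buffer + 1}")
```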
I obtained the graph shown in the figure for DDPG on Walker2d-v3. It shows a very low average return, but most research papers report an average return of about 2500 at 1 million timesteps. How should the parameters be set to get results close to that?
My Code:
I found that optimization only starts from epoch 782 (because 1000000 // (20*64) = 781): the condition on line 272 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py becomes true only from that epoch, so policy optimization begins from that point onward. Is this the reason for not getting good results?
Another question I want to ask: why is this loop (line 271 of ddpg.py) repeated n_train_steps (training steps) times? What is the purpose of running it n_train_steps times? Is this doing a kind of rollout repeated n_train_steps times, where the length of a rollout is either the end of an episode or a trajectory of length batch_size = 64 (line 173 of https://github.com/rlworkgroup/garage/blob/master/src/garage/sampler/off_policy_vectorized_sampler.py#L66)?
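To make my question concrete, here is a rough pseudocode-style sketch of how I understand the training schedule (this is not the garage implementation; collect_transitions and gradient_update are hypothetical callables supplied by the caller):

```python
# Rough sketch (not the garage implementation) of the off-policy training
# schedule under discussion. collect_transitions and gradient_update are
# hypothetical callables supplied by the caller.
def train(env, policy, replay_buffer,
          collect_transitions, gradient_update,
          n_epochs, n_epoch_cycles, n_train_steps,
          batch_size, buffer_batch_size, min_buffer_size):
    for epoch in range(n_epochs):
        for cycle in range(n_epoch_cycles):
            # Roll out the policy for batch_size environment steps and
            # store the resulting transitions in the replay buffer.
            transitions = collect_transitions(env, policy, n_steps=batch_size)
            replay_buffer.add(transitions)

            # Weight updates only begin once the buffer holds at least
            # min_buffer_size transitions (hence the ~782-epoch delay).
            if len(replay_buffer) >= min_buffer_size:
                # n_train_steps controls how many gradient updates happen
                # per cycle; each update samples one minibatch from the
                # buffer and adjusts the network weights once.
                for _ in range(n_train_steps):
                    minibatch = replay_buffer.sample(buffer_batch_size)
                    gradient_update(policy, minibatch)
```

If that reading is right, then n_train_steps only changes how many gradient updates happen per cycle, while n_epochs and n_epoch_cycles (times batch_size) determine how many environment timesteps are collected.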