Average return very low in tf.DDPG #1077
Comments
Hi Surbhi1944, thanks for opening this issue. Optimization starts at epoch 782 because you've set min_buffer_size to 10^6, so no policy optimization happens until the replay buffer holds that many transitions. About your other question, I do believe that our implementation of DDPG should perform much better than this, as indicated by the benchmark result below. If you find that it doesn't after changing min_buffer_size, please let us know. Hopefully that answers your questions, but please let me know if you have any others or if there was something I missed.
Thanks for the reply. Now I understand the purpose of the min_buffer_size variable, but I am still confused about n_train_steps. I think it represents the number of times we update the weights of the neural network (one update per minibatch): the larger its value, the more times optimize_policy (line 275 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py) is called, and hence the more times the weights are updated. Please correct me if I am wrong. If we want to train on an environment such as Humanoid (which needs ~10 million timesteps), should we increase n_train_steps, or only n_epochs and n_epoch_cycles?
I also saw that the results presented in your graph (above) do not come anywhere near ~1000 and converge at about ~220 (garage_tf_trial1_seed30). Is there any formula for comparing your ~220 with the ~1000 reported by others? And what exactly is reported in the graph above: the return of a single episode, the return over multiple batches, the average return over the previous 100 episodes, the average return over 1000 timesteps, or something else?
Hello all, it seems that we have an issue in the way that we log average returns over time (garage/src/garage/torch/algos/ddpg.py, line 144 in 1def654). Essentially, we're computing an average over all the returns our sampler has ever observed from rolling out the policy, when we should only be calculating average returns over a window of recent training epochs (30 or 100, say, not all 500-1000). We'll make a fix and re-run our baselines to verify that this is the case. Thank you @surbhi1944.
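For illustration only (a rough sketch, not the garage logging code), the difference between averaging over every return seen so far and averaging over a recent window looks roughly like this:

```python
from collections import deque

# Hypothetical logger sketch: contrast averaging over *all* episode returns
# collected since the start of training with averaging over only the most
# recent episodes.
all_returns = []                    # grows for the entire run
recent_returns = deque(maxlen=100)  # keeps only the last 100 episode returns

def record_episode(episode_return):
    all_returns.append(episode_return)
    recent_returns.append(episode_return)

def log_average_returns():
    if not all_returns:
        return {}
    # The all-time average is dragged down by early episodes from an
    # untrained policy; the windowed average tracks current performance.
    return {
        "AverageReturn(all-time)": sum(all_returns) / len(all_returns),
        "AverageReturn(last-100)": sum(recent_returns) / len(recent_returns),
    }
```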
Quick update -- we were able to confirm your report and found lackluster performance in our tf/DDPG implementation. We are now auditing our implementation and making this fix the highest priority. We're also benchmarking our torch/DDPG implementation to confirm whether or not the bug is shared. We'll keep this issue updated.
Formula for the off-policy method:
total_timesteps = n_epochs * n_epoch_cycles * batch_size
then if
n_epochs=1400
n_epoch_cycles=20
batch_size=64
min_buffer_size=10^6
then total_timesteps=1400 * 20 * 64=1,792,000
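To make the arithmetic concrete, here is a small standalone Python check (not garage code) of the numbers above, including the epoch at which the buffer first reaches min_buffer_size:

```python
# Standalone arithmetic check (not garage code): total environment steps
# collected over the run, and the epoch at which the replay buffer first
# holds min_buffer_size transitions so that optimization can begin.
n_epochs = 1400
n_epoch_cycles = 20
batch_size = 64          # environment steps collected per cycle
min_buffer_size = 10**6

steps_per_epoch = n_epoch_cycles * batch_size               # 1280
total_timesteps = n_epochs * steps_per_epoch                # 1,792,000
epochs_to_fill_buffer = min_buffer_size // steps_per_epoch  # 781

print(f"total_timesteps = {total_timesteps:,}")
print(f"optimization starts around epoch {epochs_to_fill_buffer + 1}")
```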
I obtained the graph shown in the figure for DDPG on Walker2d-v3. It shows a very low average return, but most research papers report an average return of about 2500 at 1 million timesteps. How should the parameters be set to get results close to that?
My Code:
I found that optimization only starts from epoch 782 (because 1000000 // (20*64) = 781): the condition on line 272 of https://github.com/rlworkgroup/garage/blob/master/src/garage/tf/algos/ddpg.py becomes true only from that epoch, so policy optimization begins from that point onward. Is this the reason for not getting good results?
Another question I want to ask: why is this loop (line 271 of ddpg.py) repeated n_train_steps (training steps) times? What is the purpose of running it n_train_steps times? Is this doing a kind of rollout repeated n_train_steps times, where the length of a rollout is either the end of an episode or a trajectory of length batch_size = 64 (line 173 of https://github.com/rlworkgroup/garage/blob/master/src/garage/sampler/off_policy_vectorized_sampler.py#L66)?
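To make my question concrete, here is a rough pseudocode-style sketch of how I understand the training schedule (this is not the garage implementation; collect_transitions and gradient_update are hypothetical callables supplied by the caller):

```python
# Rough sketch (not the garage implementation) of the off-policy training
# schedule under discussion. collect_transitions and gradient_update are
# hypothetical callables supplied by the caller.
def train(env, policy, replay_buffer,
          collect_transitions, gradient_update,
          n_epochs, n_epoch_cycles, n_train_steps,
          batch_size, buffer_batch_size, min_buffer_size):
    for epoch in range(n_epochs):
        for cycle in range(n_epoch_cycles):
            # Roll out the policy for batch_size environment steps and
            # store the resulting transitions in the replay buffer.
            transitions = collect_transitions(env, policy, n_steps=batch_size)
            replay_buffer.add(transitions)

            # Weight updates only begin once the buffer holds at least
            # min_buffer_size transitions (hence the ~782-epoch delay).
            if len(replay_buffer) >= min_buffer_size:
                # n_train_steps controls how many gradient updates happen
                # per cycle; each update samples one minibatch from the
                # buffer and adjusts the network weights once.
                for _ in range(n_train_steps):
                    minibatch = replay_buffer.sample(buffer_batch_size)
                    gradient_update(policy, minibatch)
```

If that reading is right, then n_train_steps only changes how many gradient updates happen per cycle, while n_epochs and n_epoch_cycles (times batch_size) determine how many environment timesteps are collected.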