Decrease in reward during training with MaskablePPO #207
Labels: custom gym env, more information needed, question
❓ Question
Hi,
During training in a custom environment with MaskablePPO, the reward decreased and then converged. Is there a specific reason for this? Does it mean the algorithm found a better policy but is outputting a different one?
[Image: training reward curve showing the mean episode reward decreasing and then converging]
My environment produces two normalized rewards that are combined as a weighted sum to form the final reward. Each episode has 19 timesteps, and my gamma was set to 0.001.
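For context on what such a small gamma implies, here is a minimal sketch (using the 19-timestep episode length and the gamma values from the question and the snippet) of the discounted return the agent actually optimizes; with gamma near zero, rewards after the first step contribute almost nothing:

```python
# Discounted return G = sum_t gamma^t * r_t over one episode.
def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical episode: constant reward of 1.0 at each of 19 timesteps.
rewards = [1.0] * 19

# With a typical gamma, all 19 steps contribute to the return.
print(discounted_return(rewards, 0.99))    # ~17.38

# With gamma = 0.001 (or 0.0001 as in the code below), the agent is
# effectively myopic: only the immediate reward matters.
print(discounted_return(rewards, 0.001))   # ~1.001
```

This is only an illustration of the objective, not a diagnosis of the training curve.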
```python
import gymnasium as gym
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker

class customenv(gym.Env): ...

env = customenv()
env = ActionMasker(env, mask_fn)
model = MaskablePPO(MaskableActorCriticPolicy, env, gamma=0.0001, verbose=0)
model.learn(4_000_000)
```
Thank you!