Environment setup needs to be changed. #16
Comments
@MrBanhBao Hao, could you please answer this? There is a cumulative reward formula that we discussed that takes into consideration when the component was repaired. Components that were repaired earlier have a higher weight in the cumulative reward. Repairing a component twice seems a bit counter-intuitive. I agree that we can add a negative reward to discourage that.
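As a rough illustration of that weighting, here is a minimal sketch of a time-weighted cumulative reward in which earlier repairs count more; the exponential decay factor and all names are assumptions, not the formula from the project:

```python
# Hypothetical sketch: weight the reward of step t by gamma**t,
# so repairs that happen earlier contribute more to the cumulative reward.
def cumulative_reward(step_rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(step_rewards))

# The same per-step repair reward contributes less the later it occurs:
print(cumulative_reward([1.0, 1.0, 1.0]))  # 1.0 + 0.9 + 0.81 = 2.71
```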
@2start @christianadriano I already stated these problems as a comment in this ticket. I am closing the old one to prevent redundancy. Regarding the negative reward: IMO it should be in a value range comparable to the reward itself. Since we know that taking an action (component, failure) more than once (if the repair was successful) is useless and unnecessary, why don't we prevent those actions? Regarding the decreasing impact of later repairs, we could implement the following:
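One way to prevent already-completed repairs, sketched under the assumption of a discrete action space with one action per (component, failure) pair; the helper and its names are illustrative, not project code:

```python
import numpy as np

# Illustrative only: build a boolean mask over the discrete action space that
# disables every action whose (component, failure) repair already succeeded.
def valid_action_mask(n_actions, repaired_action_ids):
    mask = np.ones(n_actions, dtype=bool)
    for action_id in repaired_action_ids:
        mask[action_id] = False
    return mask

print(valid_action_mask(5, repaired_action_ids={1, 3}))
# [ True False  True False  True]
```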
@MrBanhBao Yes, in retrospect it seems like a good idea to me to prohibit or punish repeated repairs. Using a dynamic action space for the environment to remove useless actions seems sensible. However, at the moment I would strongly favor a negative reward for the following pragmatic reasons:
Regarding the decreasing impact of later repairs: I think this is environment-specific and should be modeled in the environment, because it is the environment that models the reward system, and the agent's responsibility is just to learn it. If we put the decreasing-reward logic into the agent, we would be putting environment-specific logic into the agent/algorithm. A minimal sketch of this separation follows below.
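The sketch keeps both the negative reward for repeated repairs and the decaying reward for later repairs entirely inside the environment; the class, attribute, and parameter names are assumptions, not the project's API:

```python
# Hypothetical sketch: the environment computes the reward, the agent only observes it.
class RepairRewardSketch:
    def __init__(self, base_reward=1.0, decay=0.9, repeat_penalty=-1.0):
        self.base_reward = base_reward
        self.decay = decay
        self.repeat_penalty = repeat_penalty
        self.t = 0
        self.repaired = set()

    def reward_for(self, action):
        if action in self.repaired:
            reward = self.repeat_penalty                      # punish repeated repairs
        else:
            reward = self.base_reward * self.decay ** self.t  # later repairs count less
            self.repaired.add(action)
        self.t += 1
        return reward
```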
Great discussion. Hence, I believe that negative rewards are a more flexible and general way of modeling this uncertainty of fixed/not fixed/failed to fix. In the future, the environment could be extended to consider three types of reward. The last type of reward could allow the agent to learn that some components are more difficult to fix. It would even allow modeling fix dependencies between components, i.e., given two component-failure pairs, <C1,F1> and <C2,F1>, the first pair can only be fixed after the second pair has been successfully fixed; otherwise, fixing <C1,F1> will always fail. By the way, all of this discussion could be copy-pasted into the final report.
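A small illustration of the dependency described above; the dependency table and function name are hypothetical and only meant to show the rule that <C1,F1> requires <C2,F1>:

```python
# Hypothetical fix dependency: ("C1", "F1") can only succeed once ("C2", "F1") is fixed.
DEPENDENCIES = {("C1", "F1"): ("C2", "F1")}

def repair_succeeds(pair, already_fixed):
    prerequisite = DEPENDENCIES.get(pair)
    return prerequisite is None or prerequisite in already_fixed

print(repair_succeeds(("C1", "F1"), already_fixed=set()))           # False
print(repair_succeeds(("C1", "F1"), already_fixed={("C2", "F1")}))  # True
```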
Currently, it seems like the rewards are set up incorrectly.
The total reward does not seem to rise while the algorithms are training.
See Notebook
Brainstorming: