Two estimators are proposed to counteract certain environment setups.

Let's assume you have one state with transitions to n successor states, where each of the n states yields a large reward with low probability. For large n, it is quite likely that at least one of those transitions produces the large reward at least once. Because Q-learning and SARSA then act epsilon-greedily on these overestimated values, the agent keeps picking the transition whose estimate was inflated by a lucky sample, and it takes a long time for the Q-value to converge back to its true expected value.
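The effect described above can be sketched numerically. This is a toy setup I am assuming for illustration (it is not code from the issue): n successor states each pay a large reward `big` with small probability `p`, so every true value is `p * big`. A single estimator that takes the max over sampled means overshoots that true value, while the double-estimator trick (pick the argmax with one independent sample set, evaluate it with the other) does not share that upward bias:

```python
import random

def single_vs_double_estimate(n=100, samples=50, p=0.05, big=10.0, seed=0):
    """Illustrative (assumed) setup: n options, each worth p * big = 0.5
    in expectation. Returns (single, double) value estimates."""
    rng = random.Random(seed)

    def sample_mean():
        # Average of `samples` draws of a rare large reward.
        return sum(big if rng.random() < p else 0.0
                   for _ in range(samples)) / samples

    est_a = [sample_mean() for _ in range(n)]  # first estimator
    est_b = [sample_mean() for _ in range(n)]  # second, independent estimator

    # Single estimator: max over noisy means -- biased upward, since at
    # least one of the n estimates almost surely got lucky.
    single = max(est_a)

    # Double estimator: choose the argmax with est_a, but evaluate that
    # choice with the independent est_b, which removes the selection bias.
    best = max(range(n), key=lambda i: est_a[i])
    double = est_b[best]

    return single, double
```

With the defaults above, `single` lands well above the true value of 0.5, while `double` stays close to it on average, which is the overestimation that keeps the epsilon-greedy agent revisiting the lucky transition.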