RL problems where the agent is not given full knowledge of how the environment operates and instead must learn from interaction. In other words, <strong>the environment is unknown to the agent</strong>.
<div>
<figure class="figure">
</figcaption>
</figure>
</div>
</div>
<divclass="divider"></div>
<divclass="ud-atom">
<h3></h3>
<div>
<p><a name="mc-prediction-state-values"></a></p>
<h2id="-mc-prediction-state-values">MC Prediction: State Values</h2>
<ul>
<li>Algorithms that solve the <strong>prediction problem</strong> determine the value function <span class="mathquill ud-math">v_\pi</span> (or <span class="mathquill ud-math">q_\pi</span>) corresponding to a policy <span class="mathquill ud-math">\pi</span>.</li>
<p>If you are interested in learning more about the difference between first-visit and every-visit MC methods, you are encouraged to read Section 3 of [<a href="https://link.springer.com/article/10.1007/BF00114726">this</a> <a href="https://link.springer.com/content/pdf/10.1007/BF00114726.pdf">paper</a>].</p>
<br />
<p>Both the first-visit and every-visit methods are <strong>guaranteed to converge</strong> to the true value function, as the number of visits to each state approaches infinity. (<em>In other words, as long as the agent gets enough experience with each state, the value function estimate will be pretty close to the true value.</em>)
In the case of first-visit MC, convergence follows from the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers" target="_blank">Law of Large Numbers</a>, and the details are covered in section 5.1 of the <a href="http://go.udacity.com/rl-textbook" target="_blank">textbook</a>.</p>
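<p>To make the procedure concrete, here is a minimal sketch of first-visit MC prediction in Python. It is not the lesson's notebook code; the episode format (a list of <code>(state, reward)</code> pairs, with each reward received after leaving the paired state) and the function name are assumptions for illustration.</p>
<pre><code>from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate the state-value function v_pi from sampled episodes."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    for episode in episodes:
        # compute the discounted return that follows each time step
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()              # back to chronological order
        # first-visit: only the first occurrence of each state contributes
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                returns_sum[state] += G
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
</code></pre>
<p>Dropping the <code>seen</code> check turns this into the every-visit variant, which averages the return over every occurrence of a state in an episode.</p>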
<h2id="-mc-prediction-action-values">MC Prediction: Action Values</h2> In the <strong>Dynamic Programming</strong> case, we used the state value function to obtain an action value function, as given below:
<li>Algorithms designed to solve the <strong>control problem</strong> determine the optimal policy <span class="mathquill ud-math">\pi_*</span> from interaction with the environment.</li>
<p>In this concept, we learned about an algorithm that can keep a running estimate of the mean of a sequence of numbers <span class="mathquill ud-math">(x_1, x_2, \ldots, x_n)</span>. The algorithm looked at each number in the sequence in order, and successively updated the mean <span class="mathquill ud-math">\mu</span>.</p>
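<p>As a quick sketch (not the lesson's code), the same update can be written in a few lines of Python; each new number nudges the running mean toward itself by a step of <span class="mathquill ud-math">1/k</span>:</p>
<pre><code>def incremental_mean(sequence):
    """Running estimate of the mean: mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(sequence, start=1):
        mu = mu + (x - mu) / k   # nudge the old mean toward the new sample
    return mu

# e.g. incremental_mean([1, 2, 3, 4]) returns 2.5 without storing the full sum
</code></pre>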
<li>A policy is <strong>greedy</strong> with respect to an action-value function estimate <span class="mathquill ud-math">Q</span> if for every state <span class="mathquill ud-math">s\in\mathcal{S}</span>, it is guaranteed to select an action <span class="mathquill ud-math">a\in\arg\max_{a\in\mathcal{A}(s)} Q(s,a)</span>.</li>
<p>You can think of the agent who follows an <span class="mathquill ud-math">\epsilon</span>-greedy policy as always having a (potentially unfair) coin at its disposal, with probability <span class="mathquill ud-math">\epsilon</span> of landing heads. After observing a state, the agent flips the coin.</p>
<ul>
<p>In order to construct a policy <span class="mathquill ud-math">\pi</span> that is <span class="mathquill ud-math">\epsilon</span>-greedy with respect to the current action-value function estimate <span class="mathquill ud-math">Q</span>, we need only set</p>
</div>
</div>
<!-- <div class="divider"></div> -->
<divclass="ud-atom">
</figcaption>
</figure>
</div>
</div>
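<p>A minimal sketch of this construction in Python, assuming <code>Q_s</code> is a NumPy array holding the action-value estimates for a single state (the names are illustrative, not from the original notebook):</p>
<pre><code>import numpy as np

def epsilon_greedy_probs(Q_s, epsilon):
    """Action probabilities for an epsilon-greedy policy in a single state."""
    nA = len(Q_s)
    probs = np.ones(nA) * epsilon / nA       # every action gets epsilon / |A(s)|
    probs[np.argmax(Q_s)] += 1.0 - epsilon   # the greedy action gets the rest
    return probs

# sampling an action from the policy:
# action = np.random.choice(np.arange(len(Q_s)), p=epsilon_greedy_probs(Q_s, 0.1))
</code></pre>
<p>Setting <span class="mathquill ud-math">\epsilon=0</span> recovers the greedy policy, while <span class="mathquill ud-math">\epsilon=1</span> yields the equiprobable random policy.</p>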
<!-- <div class="divider"></div> -->
<divclass="ud-atom">
<divclass="divider"></div>
<divclass="ud-atom">
<h3></h3>
<div>
<p><a name="exploration-vs-exploitation"></a></p>
<h2id="-exploration-vs-exploitation">Exploration vs. Exploitation</h2>
<ul>
<li>All reinforcement learning agents face the <strong>Exploration-Exploitation Dilemma</strong>, where they must find a way to balance the drive to behave optimally based on their current knowledge (<strong>exploitation</strong>) and the need to acquire knowledge to attain better judgment (<strong>exploration</strong>).</li>
<li>(In this concept, we derived the algorithm for <strong>constant-<span class="mathquill ud-math">\alpha</span> MC control</strong>, which uses a constant step-size parameter <span class="mathquill ud-math">\alpha</span>; a sketch of the update appears below.)</li>
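<p>A minimal sketch of the constant-<span class="mathquill ud-math">\alpha</span> update applied at the end of an episode; the data structures (a dictionary <code>Q</code> of action-value arrays and episodes given as <code>(state, action, reward)</code> tuples) are assumptions for illustration, not the lesson's notebook code:</p>
<pre><code>def constant_alpha_update(Q, episode, alpha, gamma=1.0):
    """Every-visit style update: Q(s,a) += alpha * (G_t - Q(s,a))."""
    G = 0.0
    step_returns = []
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                       # discounted return from this step
        step_returns.append((state, action, G))
    for state, action, G in reversed(step_returns):  # chronological order
        Q[state][action] += alpha * (G - Q[state][action])
    return Q
</code></pre>
<p>Unlike averaging all sampled returns, the constant step size keeps weighting recent returns more heavily, which is useful when the policy (and therefore the targets) keeps changing during control.</p>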