
Commit 1b1e32d

added link refs to each list item in sidebar; minor code improvements
1 parent 633d212 commit 1b1e32d

File tree

1 file changed (+33, -52 lines)

DRL/Monte Carlo Methods/index.html

Lines changed: 33 additions & 52 deletions
@@ -12,7 +12,7 @@
 <link rel="stylesheet" href="../../assets/css/jquery.mCustomScrollbar.min.css">
 <link rel="stylesheet" href="../../assets/css/styles.css">
 <link rel="stylesheet" href="../../assets/css/cc-icons.min.css"> <!-- Creative Commons Icons -->
-<link rel="shortcut icon" type="image/png" href="../../assets/img/robo-icon.png" />
+<link rel="shortcut icon" type="image/png" href="../../assets/img/robo-icon.png">
 <style type="text/css">
 /* Three image containers (use 25% for four, and 50% for two, etc) */
 .column {
@@ -46,34 +46,34 @@
 
 <ul class="sidebar-list list-unstyled components">
 <li class="">
-<a href="#">01. Introduction</a>
+<a href="#intro">01. Introduction</a>
 </li>
 <li class="">
-<a href="#">02. MC Prediction: State Values</a>
+<a href="#mc-prediction-state-values">02. MC Prediction: State Values</a>
 </li>
 <li class="">
-<a href="#">03. MC Prediction: Action Values</a>
+<a href="#mc-prediction-action-values">03. MC Prediction: Action Values</a>
 </li>
 <li class="">
-<a href="#">04. Generalized Policy Iteration</a>
+<a href="#generalized-policy-iteration">04. Generalized Policy Iteration</a>
 </li>
 <li class="">
-<a href="#">05. MC Control: Incremental Mean</a>
+<a href="#mc-control-incremental-mean">05. MC Control: Incremental Mean</a>
 </li>
 <li class="">
-<a href="#">06. MC Control: Policy Evaluation</a>
+<a href="#mc-control-policy-evaluation">06. MC Control: Policy Evaluation</a>
 </li>
 <li class="">
-<a href="#">07. MC Control: Policy Improvement</a>
+<a href="#mc-control-policy-improvement">07. MC Control: Policy Improvement</a>
 </li>
 <li class="">
-<a href="#">08. Epsilon-Greedy Policies</a>
+<a href="#epsilon-greedy-policies">08. Epsilon-Greedy Policies</a>
 </li>
 <li class="">
-<a href="#">09. Exploration vs. Exploitation</a>
+<a href="#exploration-vs-exploitation">09. Exploration vs. Exploitation</a>
 </li>
 <li class="">
-<a href="#">10. MC Control: Constant-alpha</a>
+<a href="#mc-control-constant-alpha">10. MC Control: Constant-alpha</a>
 </li>
 </ul>
 
@@ -110,14 +110,13 @@ <h1 style="display: inline-block">Monte Carlo Methods </h1>
 <div class="row">
 <div class="col-12">
 <div class="ud-atom">
-<h3></h3>
-<!-- <div>
-<h1 id="summary">Monte Carlo Methods</h1>
-</div> -->
-
+<div>
+<p><a name="intro"></a></p>
+<h1 id="summary">Introduction</h1>
+</div>
 </div>
 <!-- <div class="divider"></div><div class="ud-atom">
-<h3></h3> -->
+<h3></h3> -->
 RL problems where the agent is not given the full knowledge of how the environement operates and instead must learn from interaction. In other words, <strong>the environment is unknown to the agent</strong>.
 <div>
 <figure class="figure">
@@ -127,13 +126,12 @@ <h3></h3> -->
 </figcaption>
 </figure>
 </div>
-
-
 </div>
+
 <div class="divider"></div>
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="mc-prediction-state-values"></a></p>
 <h2 id="-mc-prediction-state-values">MC Prediction: State Values</h2>
 <ul>
 <li>Algorithms that solve the <strong>prediction problem</strong> determine the value function <span class="mathquill ud-math">v_\pi</span> (or <span class="mathquill ud-math">q_\pi</span>) corresponding to a policy <span class="mathquill ud-math">\pi</span>.</li>
@@ -190,12 +188,7 @@ <h2 id="-mc-prediction-state-values">MC Prediction: State Values</h2>
 <img src="img/mc10.png" alt="mc10" style="width:100%">
 </div>
 </div>
-<figure class="figure">
-<img src="img/mc-pred-state.png" alt="" style="width:60%" class="img img-fluid">
-<figcaption class="figure-caption">
-
-</figcaption>
-</figure>
+<img src="img/mc-pred-state.png" alt="" style="width:60%" class="img img-fluid">
 
 <p>If you are interested in learning more about the difference between first-visit and every-visit MC methods, you are encouraged to read Section 3 of [<a href="https://link.springer.com/article/10.1007/BF00114726">this</a> <a href="https://link.springer.com/content/pdf/10.1007/BF00114726.pdf">paper</a>]
 <br
@@ -208,17 +201,13 @@ <h2 id="-mc-prediction-state-values">MC Prediction: State Values</h2>
 <p>Both the first-visit and every-visit method are <strong>guaranteed to converge</strong> to the true value function, as the number of visits to each state approaches infinity. (<em>So, in other words, as long as the agent gets enough experience with each state, the value function estimate will be pretty close to the true value.</em>)
 In the case of first-visit MC, convergence follows from the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers" target="_blank">Law of Large Numbers</a>, and the details are covered in section 5.1 of the <a href="http://go.udacity.com/rl-textbook"
 target="_blank">textbook</a>.</p>
-<figure class="figure">
-<img src="img/mc11.png" alt="" style="width:80%" class="img img-fluid">
-<figcaption class="figure-caption">
-
-</figcaption>
-</figure>
+<img src="img/mc11.png" alt="" style="width:80%" class="img img-fluid"/>
 </div>
+
 <div class="divider"></div>
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="mc-prediction-action-values"></a></p>
 <h2 id="-mc-prediction-action-values">MC Prediction: Action Values</h2> In the <strong>Dynamic Programming</strong> case, we used the state value function to obtain an action value function, as given below:
 <br />
 <br /><span class="mathquill ud-math">q_\pi(s,a) = \sum_{s'\in\mathcal{S}, r\in\mathcal{R}}p(s',r|s,a)(r+\gamma v_\pi(s'))</span>
@@ -271,7 +260,6 @@ <h2 id="-mc-prediction-action-values">MC Prediction: Action Values</h2> In the <
 <figure class="figure">
 <img src="img/mc-pred-action.png" alt="" style="width:60%" class="img img-fluid">
 <figcaption class="figure-caption">
-
 </figcaption>
 </figure>
 <div>
@@ -281,10 +269,11 @@ <h2 id="-mc-prediction-action-values">MC Prediction: Action Values</h2> In the <
 selected from each state.</p>
 </div>
 </div>
+
 <div class="divider"></div>
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="generalized-policy-iteration"></a></p>
 <h2 id="-generalized-policy-iteration">Generalized Policy Iteration</h2>
 <ul>
 <li>Algorithms designed to solve the <strong>control problem</strong> determine the optimal policy <span class="mathquill ud-math">\pi_*</span> from interaction with the environment.</li>
@@ -304,10 +293,11 @@ <h2 id="-generalized-policy-iteration">Generalized Policy Iteration</h2>
 </div>
 </div>
 </div>
+
 <div class="divider"></div>
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="mc-control-incremental-mean"></a></p>
 <h2 id="-mc-control-incremental-mean">MC Control: Incremental Mean</h2>
 <ul>
 <li>In this concept, we derived an algorithm that keeps a running average of a sequence of numbers.</li>
@@ -339,10 +329,11 @@ <h2 id="-mc-control-incremental-mean">MC Control: Incremental Mean</h2>
 <p>In this, we learned about an algorithm that can keep a running estimate of the mean of a sequence of numbers <span class="mathquill ud-math">(x_1, x_2, \ldots, x_n)</span>. The algorithm looked at each number in the sequence in
 order, and successively updated the mean <span class="mathquill ud-math">\mu</span>.</p>
 </div>
+
 <div class="divider"></div>
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="mc-control-policy-evaluation"></a></p>
 <h2 id="-mc-control-policy-evaluation">MC Control: Policy Evaluation</h2>
 <ul>
 <li>In this concept, we amended the policy evaluation step to update the value function after every episode of interaction.</li>
@@ -362,8 +353,8 @@ <h2 id="-mc-control-policy-evaluation">MC Control: Policy Evaluation</h2>
 </div>
 <div class="divider"></div>
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="mc-control-policy-improvement"></a></p>
 <h2 id="-mc-control-policy-improvement">MC Control: Policy Improvement</h2>
 <ul>
 <li>A policy is <strong>greedy</strong> with respect to an action-value function estimate <span class="mathquill ud-math">Q</span> if for every state <span class="mathquill ud-math">s\in\mathcal{S}</span>, it is guaranteed
@@ -448,13 +439,12 @@ <h2 id="-mc-control-policy-improvement">MC Control: Policy Improvement</h2>
 </div>
 
 <div>
-<h1 id="quiz-epsilon-greedy-policies">Epsilon-Greedy Policies</h1>
+<p><a name="epsilon-greedy-policies"></a></p>
+<h2 id="-epsilon-greedy-policies">Epsilon-Greedy Policies</h2>
 </div>
-
 </div>
 <!-- <div class="divider"></div> -->
 <div class="ud-atom">
-<h3></h3>
 <div>
 <p>You can think of the agent who follows an <span class="mathquill ud-math">\epsilon</span>-greedy policy as always having a (potentially unfair) coin at its disposal, with probability <span class="mathquill ud-math">\epsilon</span> of landing heads. After observing a state, the agent flips the coin.</p>
 <ul>
@@ -464,7 +454,6 @@ <h3></h3>
 <p>In order to construct a policy <span class="mathquill ud-math">\pi</span> that is <span class="mathquill ud-math">\epsilon</span>-greedy with respect to the current action-value function estimate <span class="mathquill ud-math">Q</span>,
 we need only set</p>
 </div>
-
 </div>
 <!-- <div class="divider"></div> -->
 <div class="ud-atom">
@@ -477,8 +466,6 @@ <h3></h3>
 </figcaption>
 </figure>
 </div>
-
-
 </div>
 <!-- <div class="divider"></div> -->
 <div class="ud-atom">
@@ -490,8 +477,8 @@ <h3></h3>
 
 <div class="divider"></div>
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="exploration-vs-exploitation"></a></p>
 <h2 id="-exploration-vs-exploitation">Exploration vs. Exploitation</h2>
 <ul>
 <li>All reinforcement learning agents face the <strong>Exploration-Exploitation Dilemma</strong>, where they must find a way to balance the drive to behave optimally based on their current knowledge (<strong>exploitation</strong>)
@@ -578,8 +565,8 @@ <h2 id="-setting-the-value-of-span-classmathquill-ud-mathepsilonspan-in-practice
 </div>
 <!-- <div class="divider"></div> -->
 <div class="ud-atom">
-<h3></h3>
 <div>
+<p><a name="mc-control-constant-alpha"></a></p>
 <h2 id="-mc-control-constant-alpha">MC Control: Constant-alpha</h2>
 <ul>
 <li>(In this concept, we derived the algorithm for <strong>constant-<span class="mathquill ud-math">\alpha</span> MC control</strong>, which uses a constant step-size parameter <span class="mathquill ud-math">\alpha</span>.)</li>
@@ -640,8 +627,6 @@ <h3></h3>
 </figcaption>
 </figure>
 </div>
-
-
 </div>
 <!-- <div class="divider"></div> -->
 <div class="ud-atom">
@@ -679,10 +664,7 @@ <h3></h3>
 </figure>
 </div>
 </div>
-
-
 <div class="divider"></div>
-
 </div>
 </main>
 
@@ -698,7 +680,6 @@ <h3></h3>
 </div>
 </footer>
 
-
 <script src="../../assets/js/jquery-3.3.1.min.js"></script>
 <script src="../../assets/js/plyr.polyfilled.min.js"></script>
 <script src="../../assets/js/bootstrap.min.js"></script>
