
Commit 0035e98
adds policy iteration
1 parent edf0505

2 files changed: +43 &minus;0 lines

md/Policy-Iteration.md (+21)

@@ -18,3 +18,24 @@ __function__ POLICY-ITERATION(_mdp_) __returns__ a policy
---
__Figure ??__ The policy iteration algorithm for calculating an optimal policy.
---

## AIMA4e

__function__ POLICY-ITERATION(_mdp_) __returns__ a policy
&emsp;__inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s&prime;_ &vert; _s_, _a_)
&emsp;__local variables__: _U_, a vector of utilities for states in _S_, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_&pi;_, a policy vector indexed by state, initially random

&emsp;__repeat__
&emsp;&emsp;&emsp;_U_ &larr; POLICY-EVALUATION(_&pi;_, _U_, _mdp_)
&emsp;&emsp;&emsp;_unchanged?_ &larr; true
&emsp;&emsp;&emsp;__for each__ state _s_ __in__ _S_ __do__
&emsp;&emsp;&emsp;&emsp;&emsp;_a<sup>&#x2a;</sup>_ &larr; argmax<sub>_a_ &isin; _A_(_s_)</sub> Q-VALUE(_mdp_, _s_, _a_, _U_)
&emsp;&emsp;&emsp;&emsp;&emsp;__if__ Q-VALUE(_mdp_, _s_, _a<sup>&#x2a;</sup>_, _U_) &gt; Q-VALUE(_mdp_, _s_, _&pi;_\[_s_\], _U_) __then do__
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_&pi;_\[_s_\] &larr; _a<sup>&#x2a;</sup>_ ; _unchanged?_ &larr; false
&emsp;__until__ _unchanged?_
&emsp;__return__ _&pi;_

---

__Figure ??__ The policy iteration algorithm for calculating an optimal policy.
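As a rough sketch, the pseudocode above could be rendered in Python as follows. The dict-based MDP encoding (keys `"S"`, `"A"`, `"P"`, `"R"`, `"gamma"`) and the helper names `q_value` / `policy_evaluation` are our own illustrative choices, mirroring the pseudocode rather than any particular library; POLICY-EVALUATION is simplified here to a fixed number of Bellman sweeps.

```python
import random

def q_value(mdp, s, a, U):
    """Expected utility of doing a in s: sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * U[s'])."""
    return sum(p * (mdp["R"](s, a, s2) + mdp["gamma"] * U[s2])
               for s2, p in mdp["P"](s, a).items())

def policy_evaluation(pi, U, mdp, k=20):
    """Simplified iterative policy evaluation: k sweeps of the fixed-policy Bellman update."""
    for _ in range(k):
        U = {s: q_value(mdp, s, pi[s], U) for s in mdp["S"]}
    return U

def policy_iteration(mdp):
    U = {s: 0.0 for s in mdp["S"]}                       # utilities, initially zero
    pi = {s: random.choice(mdp["A"](s)) for s in mdp["S"]}  # policy, initially random
    while True:
        U = policy_evaluation(pi, U, mdp)
        unchanged = True
        for s in mdp["S"]:
            a_star = max(mdp["A"](s), key=lambda a: q_value(mdp, s, a, U))
            if q_value(mdp, s, a_star, U) > q_value(mdp, s, pi[s], U):
                pi[s] = a_star
                unchanged = False
        if unchanged:
            return pi

# Tiny deterministic example: from states 0 and 1, "right" moves one step toward
# state 2 (reward 1 on arrival); "stay" does nothing. gamma = 0.9.
mdp = {
    "S": [0, 1, 2],
    "A": lambda s: ["stay", "right"],
    "P": lambda s, a: {min(s + 1, 2) if a == "right" else s: 1.0},
    "R": lambda s, a, s2: 1.0 if (s2 == 2 and s != 2) else 0.0,
    "gamma": 0.9,
}
print(policy_iteration(mdp))  # optimal policy moves right from states 0 and 1
```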

md/Value-Iteration.md (+22)

@@ -18,3 +18,25 @@ __function__ VALUE-ITERATION(_mdp_, _&epsi;_) __returns__ a utility function
---
__Figure ??__ The value iteration algorithm for calculating utilities of states. The termination condition is from Equation (__??__).
---

## AIMA4e

__function__ VALUE-ITERATION(_mdp_, _&epsi;_) __returns__ a utility function
&emsp;__inputs__: _mdp_, an MDP with states _S_, actions _A_(_s_), transition model _P_(_s&prime;_ &vert; _s_, _a_),
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;rewards _R_(_s_, _a_, _s&prime;_), discount _&gamma;_
&emsp;&emsp;&emsp;_&epsi;_, the maximum error allowed in the utility of any state
&emsp;__local variables__: _U_, _U&prime;_, vectors of utilities for states in _S_, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_&delta;_, the maximum change in the utility of any state in an iteration

&emsp;__repeat__
&emsp;&emsp;&emsp;_U_ &larr; _U&prime;_; _&delta;_ &larr; 0
&emsp;&emsp;&emsp;__for each__ state _s_ in _S_ __do__
&emsp;&emsp;&emsp;&emsp;&emsp;_U&prime;_\[_s_\] &larr; max<sub>_a_ &isin; _A_(_s_)</sub> Q-VALUE(_mdp_, _s_, _a_, _U_)
&emsp;&emsp;&emsp;&emsp;&emsp;__if__ &vert; _U&prime;_\[_s_\] &minus; _U_\[_s_\] &vert; &gt; _&delta;_ __then__ _&delta;_ &larr; &vert; _U&prime;_\[_s_\] &minus; _U_\[_s_\] &vert;
&emsp;__until__ _&delta;_ &lt; _&epsi;_(1 &minus; _&gamma;_)&sol;_&gamma;_
&emsp;__return__ _U_

---

__Figure ??__ The value iteration algorithm for calculating utilities of states. The termination condition is from Equation (__??__).
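A minimal Python sketch of the pseudocode above, using the same illustrative dict-based MDP encoding as before (keys `"S"`, `"A"`, `"P"`, `"R"`, `"gamma"` are our own convention, not from any library); the termination test is the one in the pseudocode and assumes 0 &lt; &gamma; &lt; 1.

```python
def q_value(mdp, s, a, U):
    """Expected utility of doing a in s: sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * U[s'])."""
    return sum(p * (mdp["R"](s, a, s2) + mdp["gamma"] * U[s2])
               for s2, p in mdp["P"](s, a).items())

def value_iteration(mdp, eps):
    U1 = {s: 0.0 for s in mdp["S"]}          # U', initially zero
    while True:
        U, delta = dict(U1), 0.0             # U <- U'; delta <- 0
        for s in mdp["S"]:
            U1[s] = max(q_value(mdp, s, a, U) for a in mdp["A"](s))
            delta = max(delta, abs(U1[s] - U[s]))
        # termination condition from the pseudocode (assumes 0 < gamma < 1)
        if delta < eps * (1 - mdp["gamma"]) / mdp["gamma"]:
            return U

# Same tiny deterministic chain as a smoke test: "right" moves toward state 2
# (reward 1 on arrival), "stay" does nothing, gamma = 0.9.
mdp = {
    "S": [0, 1, 2],
    "A": lambda s: ["stay", "right"],
    "P": lambda s, a: {min(s + 1, 2) if a == "right" else s: 1.0},
    "R": lambda s, a, s2: 1.0 if (s2 == 2 and s != 2) else 0.0,
    "gamma": 0.9,
}
print(value_iteration(mdp, eps=1e-6))  # U[1] converges to 1.0, U[0] to 0.9
```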
