
Commit 034ec6d

Merge pull request #15 from SnShine/master
adds 3 markdown files from RL chapter
2 parents 6f21610 + 9ff7b46 commit 034ec6d

File tree

3 files changed: +58 -0 lines

Diff for: md/Passive-ADP-Agent.md

+21
@@ -0,0 +1,21 @@
# PASSIVE-ADP-AGENT

## AIMA3e
__function__ Passive-ADP-Agent(_percept_) __returns__ an action
&emsp;__inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_
&emsp;__persistent__: _&pi;_, a fixed policy
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_mdp_, an MDP with model _P_, rewards _R_, discount _&gamma;_
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_U_, a table of utilities, initially empty
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>sa</sub>_, a table of frequencies for state-action pairs, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>s'|sa</sub>_, a table of outcome frequencies given state-action pairs, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_s_, _a_, the previous state and action, initially null

&emsp;__if__ _s'_ is new __then__ _U_[_s'_] &larr; _r'_; _R_[_s'_] &larr; _r'_
&emsp;__if__ _s_ is not null __then__
&emsp;&emsp;&emsp;increment _N<sub>sa</sub>_[_s_, _a_] and _N<sub>s'|sa</sub>_[_s'_, _s_, _a_]
&emsp;&emsp;&emsp;__for each__ _t_ such that _N<sub>s'|sa</sub>_[_t_, _s_, _a_] is nonzero __do__
&emsp;&emsp;&emsp;&emsp;&emsp;_P_(_t_ | _s_, _a_) &larr; _N<sub>s'|sa</sub>_[_t_, _s_, _a_] / _N<sub>sa</sub>_[_s_, _a_]
&emsp;_U_ &larr; Policy-Evaluation(_&pi;_, _U_, _mdp_)
&emsp;__if__ _s'_.Terminal? __then__ _s_, _a_ &larr; null __else__ _s_, _a_ &larr; _s'_, _&pi;_[_s'_]
&emsp;__return__ _a_

---
__Figure ??__ A passive reinforcement learning agent based on adaptive dynamic programming. The Policy-Evaluation function solves the fixed-policy Bellman equations, as described on page ??.
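As a companion to the pseudocode, here is a minimal Python sketch of a passive ADP agent for a tabular environment. The `PassiveADPAgent` class, its iterative `policy_evaluation` helper (one way to solve the fixed-policy Bellman equations), and the calling convention (state, reward, and a terminal flag passed explicitly) are illustrative assumptions, not part of the file above.

```python
# Sketch of a passive ADP agent; names and calling convention are assumptions.
from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, policy, gamma=0.9):
        self.pi = policy                 # fixed policy: state -> action
        self.gamma = gamma
        self.U = {}                      # utility estimates
        self.R = {}                      # learned rewards
        self.Nsa = defaultdict(int)      # N_sa[s, a]
        self.Ns1_sa = defaultdict(int)   # N_{s'|sa}[t, s, a]
        self.s = self.a = None

    def __call__(self, s1, r1, terminal=False):
        if s1 not in self.U:             # first visit to s'
            self.U[s1] = r1
            self.R[s1] = r1
        if self.s is not None:           # update counts for the observed transition
            self.Nsa[(self.s, self.a)] += 1
            self.Ns1_sa[(s1, self.s, self.a)] += 1
        self.policy_evaluation()
        if terminal:
            self.s = self.a = None
        else:
            self.s, self.a = s1, self.pi[s1]
        return self.a

    def P(self, t, s, a):
        """Estimated transition probability P(t | s, a) from the counts."""
        n = self.Nsa[(s, a)]
        return self.Ns1_sa[(t, s, a)] / n if n else 0.0

    def policy_evaluation(self, iterations=20):
        """Iteratively relax the fixed-policy Bellman equations
        U(s) = R(s) + gamma * sum_t P(t | s, pi[s]) U(t)."""
        for _ in range(iterations):
            for s in self.U:
                a = self.pi.get(s)
                if a is None:            # terminal states keep U(s) = R(s)
                    continue
                self.U[s] = self.R[s] + self.gamma * sum(
                    self.P(t, s, a) * self.U[t] for t in self.U)
```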

Diff for: md/Passive-TD-Agent.md

+19
@@ -0,0 +1,19 @@
# PASSIVE-TD-AGENT

## AIMA3e
__function__ Passive-TD-Agent(_percept_) __returns__ an action
&emsp;__inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_
&emsp;__persistent__: _&pi;_, a fixed policy
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_U_, a table of utilities, initially empty
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>s</sub>_, a table of frequencies for states, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_s_, _a_, _r_, the previous state, action, and reward, initially null

&emsp;__if__ _s'_ is new __then__ _U_[_s'_] &larr; _r'_
&emsp;__if__ _s_ is not null __then__
&emsp;&emsp;&emsp;increment _N<sub>s</sub>_[_s_]
&emsp;&emsp;&emsp;_U_[_s_] &larr; _U_[_s_] + _&alpha;_(_N<sub>s</sub>_[_s_])(_r_ + _&gamma;_ _U_[_s'_] - _U_[_s_])
&emsp;__if__ _s'_.Terminal? __then__ _s_, _a_, _r_ &larr; null __else__ _s_, _a_, _r_ &larr; _s'_, _&pi;_[_s'_], _r'_
&emsp;__return__ _a_

---
__Figure ??__ A passive reinforcement learning agent that learns utility estimates using temporal differences. The step-size function &alpha;(_n_) is chosen to ensure convergence, as described in the text.
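A corresponding Python sketch of the TD agent, under the same tabular assumptions as above. The decaying step size &alpha;(n) = 60 / (59 + n) is just one schedule that satisfies the convergence conditions the caption alludes to, not the only valid choice.

```python
# Sketch of a passive TD agent; the alpha schedule is an illustrative choice.
from collections import defaultdict

def alpha(n):
    """One common decaying step size satisfying the convergence conditions."""
    return 60.0 / (59.0 + n)

class PassiveTDAgent:
    def __init__(self, policy, gamma=0.9):
        self.pi = policy                 # fixed policy: state -> action
        self.gamma = gamma
        self.U = {}                      # utility estimates
        self.Ns = defaultdict(int)       # visit counts per state
        self.s = self.a = self.r = None

    def __call__(self, s1, r1, terminal=False):
        if s1 not in self.U:             # first visit to s'
            self.U[s1] = r1
        if self.s is not None:
            self.Ns[self.s] += 1
            # TD update: move U(s) toward r + gamma * U(s')
            self.U[self.s] += alpha(self.Ns[self.s]) * (
                self.r + self.gamma * self.U[s1] - self.U[self.s])
        if terminal:
            self.s = self.a = self.r = None
        else:
            self.s, self.a, self.r = s1, self.pi[s1], r1
        return self.a
```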

Diff for: md/Q-Learning-Agent.md

+18
@@ -0,0 +1,18 @@
# Q-LEARNING-AGENT

## AIMA3e
__function__ Q-Learning-Agent(_percept_) __returns__ an action
&emsp;__inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_
&emsp;__persistent__: _Q_, a table of action values indexed by state and action, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>sa</sub>_, a table of frequencies for state-action pairs, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_s_, _a_, _r_, the previous state, action, and reward, initially null

&emsp;__if__ Terminal?(_s_) __then__ _Q_[_s_, None] &larr; _r'_
&emsp;__if__ _s_ is not null __then__
&emsp;&emsp;&emsp;increment _N<sub>sa</sub>_[_s_, _a_]
&emsp;&emsp;&emsp;_Q_[_s_, _a_] &larr; _Q_[_s_, _a_] + _&alpha;_(_N<sub>sa</sub>_[_s_, _a_])(_r_ + _&gamma;_ max<sub>a'</sub> _Q_[_s'_, _a'_] - _Q_[_s_, _a_])
&emsp;_s_, _a_, _r_ &larr; _s'_, argmax<sub>a'</sub> _f_(_Q_[_s'_, _a'_], _N<sub>sa</sub>_[_s'_, _a'_]), _r'_
&emsp;__return__ _a_

---
__Figure ??__ An exploratory Q-learning agent. It is an active learner that learns the value _Q_(_s_, _a_) of each action in each situation. It uses the same exploration function _f_ as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.
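A Python sketch of the exploratory Q-learning agent under the same assumptions. The concrete exploration function `f` (an optimistic value `R_plus` for actions tried fewer than `Ne` times, in the spirit of the exploratory ADP agent), the step-size schedule, the reading of the Terminal? test as applying to the newly observed state, and the episode reset at terminal states are all illustrative choices, not dictated by the pseudocode.

```python
# Sketch of an exploratory Q-learning agent; f, alpha, and the terminal-state
# handling are assumptions made for this example.
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, gamma=0.9, Ne=5, R_plus=2.0):
        self.actions = list(actions)     # full action set, assumed known
        self.gamma = gamma
        self.Ne, self.R_plus = Ne, R_plus
        self.Q = defaultdict(float)      # Q[(s, a)], initially zero
        self.Nsa = defaultdict(int)      # visit counts per (s, a)
        self.s = self.a = self.r = None

    def f(self, u, n):
        """Exploration function: optimistic value for under-tried actions."""
        return self.R_plus if n < self.Ne else u

    def alpha(self, n):
        return 60.0 / (59.0 + n)

    def __call__(self, s1, r1, terminal=False):
        # `terminal` flags that the new state s1 is terminal (one common
        # reading of the Terminal? test in the pseudocode).
        if terminal:
            self.Q[(s1, None)] = r1
        if self.s is not None:
            self.Nsa[(self.s, self.a)] += 1
            # At a terminal s1 the only "action" is None, whose Q-value is r1.
            successors = [None] if terminal else self.actions
            best = max(self.Q[(s1, a1)] for a1 in successors)
            self.Q[(self.s, self.a)] += self.alpha(self.Nsa[(self.s, self.a)]) * (
                self.r + self.gamma * best - self.Q[(self.s, self.a)])
        if terminal:
            # Episode reset: an implementation choice, not spelled out above.
            self.s = self.a = self.r = None
            return None
        self.s = s1
        self.a = max(self.actions,
                     key=lambda a1: self.f(self.Q[(s1, a1)], self.Nsa[(s1, a1)]))
        self.r = r1
        return self.a
```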
