
Commit 034ec6d

Merge pull request #15 from SnShine/master
adds 3 markdown files from RL chapter
2 parents 6f21610 + 9ff7b46 commit 034ec6d

File tree

3 files changed: +58 -0 lines

Diff for: md/Passive-ADP-Agent.md

+21
@@ -0,0 +1,21 @@
# PASSIVE-ADP-AGENT

## AIMA3e
__function__ Passive-ADP-Agent(_percept_) __returns__ an action
&emsp;__inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_
&emsp;__persistent__: _&pi;_, a fixed policy
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_mdp_, an MDP with model _P_, rewards _R_, discount _&gamma;_
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_U_, a table of utilities, initially empty
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>sa</sub>_, a table of frequencies for state-action pairs, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>s'|sa</sub>_, a table of outcome frequencies given state-action pairs, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_s_, _a_, the previous state and action, initially null

&emsp;__if__ _s'_ is new __then__ _U_[_s'_] &larr; _r'_; _R_[_s'_] &larr; _r'_
&emsp;__if__ _s_ is not null __then__
&emsp;&emsp;&emsp;increment _N<sub>sa</sub>_[_s_, _a_] and _N<sub>s'|sa</sub>_[_s'_, _s_, _a_]
&emsp;&emsp;&emsp;__for each__ _t_ such that _N<sub>s'|sa</sub>_[_t_, _s_, _a_] is nonzero __do__
&emsp;&emsp;&emsp;&emsp;&emsp;_P_(_t_ | _s_, _a_) &larr; _N<sub>s'|sa</sub>_[_t_, _s_, _a_] / _N<sub>sa</sub>_[_s_, _a_]
&emsp;_U_ &larr; Policy-Evaluation(_&pi;_, _U_, _mdp_)
&emsp;__if__ _s'_.Terminal? __then__ _s_, _a_ &larr; null __else__ _s_, _a_ &larr; _s'_, _&pi;_[_s'_]
&emsp;__return__ _a_

---
__Figure ??__ A passive reinforcement learning agent based on adaptive dynamic programming. The Policy-Evaluation function solves the fixed-policy Bellman equations, as described on page ??.
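As a companion to the pseudocode, here is a minimal Python sketch of a passive ADP agent for a tabular environment. The `PassiveADPAgent` class, its iterative `policy_evaluation` helper (one way to solve the fixed-policy Bellman equations), and the calling convention (state, reward, and a terminal flag passed explicitly) are illustrative assumptions, not part of the file above.

```python
# Sketch of a passive ADP agent; names and calling convention are assumptions.
from collections import defaultdict

class PassiveADPAgent:
    def __init__(self, policy, gamma=0.9):
        self.pi = policy                 # fixed policy: state -> action
        self.gamma = gamma
        self.U = {}                      # utility estimates
        self.R = {}                      # learned rewards
        self.Nsa = defaultdict(int)      # N_sa[s, a]
        self.Ns1_sa = defaultdict(int)   # N_{s'|sa}[t, s, a]
        self.s = self.a = None

    def __call__(self, s1, r1, terminal=False):
        if s1 not in self.U:             # first visit to s'
            self.U[s1] = r1
            self.R[s1] = r1
        if self.s is not None:           # update counts for the observed transition
            self.Nsa[(self.s, self.a)] += 1
            self.Ns1_sa[(s1, self.s, self.a)] += 1
        self.policy_evaluation()
        if terminal:
            self.s = self.a = None
        else:
            self.s, self.a = s1, self.pi[s1]
        return self.a

    def P(self, t, s, a):
        """Estimated transition probability P(t | s, a) from the counts."""
        n = self.Nsa[(s, a)]
        return self.Ns1_sa[(t, s, a)] / n if n else 0.0

    def policy_evaluation(self, iterations=20):
        """Iteratively relax the fixed-policy Bellman equations
        U(s) = R(s) + gamma * sum_t P(t | s, pi[s]) U(t)."""
        for _ in range(iterations):
            for s in self.U:
                a = self.pi.get(s)
                if a is None:            # terminal states keep U(s) = R(s)
                    continue
                self.U[s] = self.R[s] + self.gamma * sum(
                    self.P(t, s, a) * self.U[t] for t in self.U)
```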

Diff for: md/Passive-TD-Agent.md

+19
@@ -0,0 +1,19 @@
# PASSIVE-TD-AGENT

## AIMA3e
__function__ Passive-TD-Agent(_percept_) __returns__ an action
&emsp;__inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_
&emsp;__persistent__: _&pi;_, a fixed policy
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_U_, a table of utilities, initially empty
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>s</sub>_, a table of frequencies for states, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_s_, _a_, _r_, the previous state, action, and reward, initially null

&emsp;__if__ _s'_ is new __then__ _U_[_s'_] &larr; _r'_
&emsp;__if__ _s_ is not null __then__
&emsp;&emsp;&emsp;increment _N<sub>s</sub>_[_s_]
&emsp;&emsp;&emsp;_U_[_s_] &larr; _U_[_s_] + _&alpha;_(_N<sub>s</sub>_[_s_])(_r_ + _&gamma;_ _U_[_s'_] - _U_[_s_])
&emsp;__if__ _s'_.Terminal? __then__ _s_, _a_, _r_ &larr; null __else__ _s_, _a_, _r_ &larr; _s'_, _&pi;_[_s'_], _r'_
&emsp;__return__ _a_

---
__Figure ??__ A passive reinforcement learning agent that learns utility estimates using temporal differences. The step-size function &alpha;(_n_) is chosen to ensure convergence, as described in the text.
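A corresponding Python sketch of the TD agent, under the same tabular assumptions as above. The decaying step size &alpha;(n) = 60 / (59 + n) is just one schedule that satisfies the convergence conditions the caption alludes to, not the only valid choice.

```python
# Sketch of a passive TD agent; the alpha schedule is an illustrative choice.
from collections import defaultdict

def alpha(n):
    """One common decaying step size satisfying the convergence conditions."""
    return 60.0 / (59.0 + n)

class PassiveTDAgent:
    def __init__(self, policy, gamma=0.9):
        self.pi = policy                 # fixed policy: state -> action
        self.gamma = gamma
        self.U = {}                      # utility estimates
        self.Ns = defaultdict(int)       # visit counts per state
        self.s = self.a = self.r = None

    def __call__(self, s1, r1, terminal=False):
        if s1 not in self.U:             # first visit to s'
            self.U[s1] = r1
        if self.s is not None:
            self.Ns[self.s] += 1
            # TD update: move U(s) toward r + gamma * U(s')
            self.U[self.s] += alpha(self.Ns[self.s]) * (
                self.r + self.gamma * self.U[s1] - self.U[self.s])
        if terminal:
            self.s = self.a = self.r = None
        else:
            self.s, self.a, self.r = s1, self.pi[s1], r1
        return self.a
```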

Diff for: md/Q-Learning-Agent.md

+18
@@ -0,0 +1,18 @@
# Q-LEARNING-AGENT

## AIMA3e
__function__ Q-Learning-Agent(_percept_) __returns__ an action
&emsp;__inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_
&emsp;__persistent__: _Q_, a table of action values indexed by state and action, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>sa</sub>_, a table of frequencies for state-action pairs, initially zero
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_s_, _a_, _r_, the previous state, action, and reward, initially null

&emsp;__if__ Terminal?(_s_) __then__ _Q_[_s_, None] &larr; _r'_
&emsp;__if__ _s_ is not null __then__
&emsp;&emsp;&emsp;increment _N<sub>sa</sub>_[_s_, _a_]
&emsp;&emsp;&emsp;_Q_[_s_, _a_] &larr; _Q_[_s_, _a_] + _&alpha;_(_N<sub>sa</sub>_[_s_, _a_])(_r_ + _&gamma;_ max<sub>a'</sub> _Q_[_s'_, _a'_] - _Q_[_s_, _a_])
&emsp;_s_, _a_, _r_ &larr; _s'_, argmax<sub>a'</sub> _f_(_Q_[_s'_, _a'_], _N<sub>sa</sub>_[_s'_, _a'_]), _r'_
&emsp;__return__ _a_

---
__Figure ??__ An exploratory Q-learning agent. It is an active learner that learns the value _Q_(_s_, _a_) of each action in each situation. It uses the same exploration function _f_ as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.
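A Python sketch of the exploratory Q-learning agent under the same assumptions. The concrete exploration function `f` (an optimistic value `R_plus` for actions tried fewer than `Ne` times, in the spirit of the exploratory ADP agent), the step-size schedule, the reading of the Terminal? test as applying to the newly observed state, and the episode reset at terminal states are all illustrative choices, not dictated by the pseudocode.

```python
# Sketch of an exploratory Q-learning agent; f, alpha, and the terminal-state
# handling are assumptions made for this example.
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, gamma=0.9, Ne=5, R_plus=2.0):
        self.actions = list(actions)     # full action set, assumed known
        self.gamma = gamma
        self.Ne, self.R_plus = Ne, R_plus
        self.Q = defaultdict(float)      # Q[(s, a)], initially zero
        self.Nsa = defaultdict(int)      # visit counts per (s, a)
        self.s = self.a = self.r = None

    def f(self, u, n):
        """Exploration function: optimistic value for under-tried actions."""
        return self.R_plus if n < self.Ne else u

    def alpha(self, n):
        return 60.0 / (59.0 + n)

    def __call__(self, s1, r1, terminal=False):
        # `terminal` flags that the new state s1 is terminal (one common
        # reading of the Terminal? test in the pseudocode).
        if terminal:
            self.Q[(s1, None)] = r1
        if self.s is not None:
            self.Nsa[(self.s, self.a)] += 1
            # At a terminal s1 the only "action" is None, whose Q-value is r1.
            successors = [None] if terminal else self.actions
            best = max(self.Q[(s1, a1)] for a1 in successors)
            self.Q[(self.s, self.a)] += self.alpha(self.Nsa[(self.s, self.a)]) * (
                self.r + self.gamma * best - self.Q[(self.s, self.a)])
        if terminal:
            # Episode reset: an implementation choice, not spelled out above.
            self.s = self.a = self.r = None
            return None
        self.s = s1
        self.a = max(self.actions,
                     key=lambda a1: self.f(self.Q[(s1, a1)], self.Nsa[(s1, a1)]))
        self.r = r1
        return self.a
```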
