We have an agent in an unknown environment that can obtain rewards by interacting with it. The goal is to learn a good policy for the agent from experimental trials.
The agent acts in an environment, which is defined by a model that we may or may not know. The agent can be in one of many states and picks one of many actions to move between states; when such a transition happens, the agent can collect a reward. Which state the agent transitions to may be decided stochastically. A policy is a function that maps states to actions, and the optimal policy maximizes the total reward. Transition probabilities therefore have the form $p(s' \mid s, a)$: the probability of landing in state $s'$ after taking action $a$ in state $s$.
Markov Decision Processes (MDPs) are a formalization that is often used to describe RL problems: they are tuples $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ of a state set, an action set, transition probabilities, a reward function and a discount factor.
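As a toy illustration of this formalization (all state names, action names and numbers below are invented), a small finite MDP can be written down explicitly:

```python
# A toy two-state MDP, written as nested dicts:
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 2.0)],
        "go":   [(1.0, "s0", 0.0)],
    },
}
gamma = 0.9  # discount factor

# Sanity check: outgoing probabilities sum to one for every (s, a).
for s, actions in P.items():
    for a, transitions in actions.items():
        assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-9
```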
The long-term cumulative (discounted) reward over a trajectory is called the return, $G_{t}=\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$, and the value function $v_{\pi}(s)$ is its expectation when starting from $s$ and following $\pi$.
The Bellman equations decompose the value function into the immediate reward plus the discounted value of the successor state.
We can therefore write the value function recursively as follows:
$$ v_{\pi}(s)=\sum_{a} \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left[r\left(s, a, s^{\prime}\right)+\gamma v_{\pi}\left(s^{\prime}\right)\right] $$
The Bellman equations can also be expressed in matrix form: $\mathbf{v}_{\pi}=\gamma P_{\pi} \mathbf{v}_{\pi}+\mathbf{r}_{\pi}$
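Because this equation is linear in $\mathbf{v}_{\pi}$, for small MDPs we can solve it exactly as $\mathbf{v}_{\pi}=\left(I-\gamma P_{\pi}\right)^{-1} \mathbf{r}_{\pi}$; a NumPy sketch with placeholder numbers:

```python
import numpy as np

gamma = 0.9
# P_pi[s, s'] = probability of moving from s to s' under policy pi;
# r_pi[s] = expected immediate reward in s under pi. Toy values.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([0.0, 1.0])

# Solve v_pi = (I - gamma * P_pi)^{-1} r_pi as a linear system.
v_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(v_pi)
```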
As the optimal policy is greedy, we can introduce the Bellman optimality equations. These are the same as the Bellman equations, but with the greedy, optimal policy substituted:
$$ v_{*}(s)=\max_{a} \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left[r\left(s, a, s^{\prime}\right)+\gamma v_{*}\left(s^{\prime}\right)\right] $$
Value iteration and policy iteration, techniques from Dynamic Programming, allow us to iteratively solve for the value function and the policy.
To perform policy iteration, we iteratively evaluate a policy and greedify: we start from an arbitrary policy, compute its value function (policy evaluation), then act greedily with respect to that value function to obtain an improved policy (policy improvement), and repeat until the policy stops changing. Value iteration folds the two steps into a single update that repeatedly applies the Bellman optimality backup:
$$ v_{k+1}(s) \leftarrow \max_{a} \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right)\left[r\left(s, a, s^{\prime}\right)+\gamma v_{k}\left(s^{\prime}\right)\right] $$
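A tabular sketch of this backup, assuming the same dict-of-transitions format as the toy MDP above (an illustrative implementation, not the only one):

```python
def value_iteration(P, gamma, tol=1e-8):
    """P[s][a] -> list of (prob, next_state, reward); returns state values."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of expected value.
            new_v = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```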
We can now dive into the field of Model-free Reinforcement Learning. In this area we focus more on the state-action value function, as it must be estimated from collected experience rather than computed with a one-step look-ahead over a known model. Monte Carlo methods use a simple idea: learning from episodes of raw experience without modeling the environment. The steps are simple: we generate episodes using a policy, we estimate the action-value function by averaging the returns observed after each visit to a state-action pair, and we improve the policy with respect to the estimated values.
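A first-visit Monte Carlo sketch of these steps for estimating $q_{\pi}$; `sample_episode` is an assumed placeholder returning a list of `(state, action, reward)` tuples collected with the policy:

```python
from collections import defaultdict

def mc_q_estimate(sample_episode, policy, gamma, n_episodes=10_000):
    """First-visit Monte Carlo estimation of q_pi by averaging returns."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    q = defaultdict(float)
    for _ in range(n_episodes):
        episode = sample_episode(policy)          # [(s, a, r), ...]
        G = 0.0
        # Walk backwards so G accumulates the discounted return.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            first_visit = all((s, a) != (s2, a2) for s2, a2, _ in episode[:t])
            if first_visit:
                returns_sum[(s, a)] += G
                returns_cnt[(s, a)] += 1
                q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
    return q
```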
In temporal difference (TD) learning, we exploit the link that the Bellman equations give us between the values of neighbouring states. This is the basis for many RL algorithms, like SARSA and Q-learning.
While in Monte Carlo we updated our value function with the full return observed at the end of an episode, in TD we update it after every step, towards the bootstrapped target $r+\gamma v\left(s^{\prime}\right)$:
$$ v(s) \leftarrow v(s)+\alpha\left(r+\gamma v\left(s^{\prime}\right)-v(s)\right) $$
TD is said to be a bootstrapping method, as it uses the current value estimates to construct the targets for its own updates.
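As a minimal sketch of this bootstrapped update (tabular case; `v` is assumed to be a dict or array of state-value estimates):

```python
def td0_update(v, s, r, s_next, done, alpha, gamma):
    """One TD(0) step: move v(s) toward the bootstrapped target r + gamma * v(s')."""
    target = r + (0.0 if done else gamma * v[s_next])
    v[s] += alpha * (target - v[s])
    return v
```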
SARSA and Q-learning are two variants of TD control: both aim to find the optimal action-value function. SARSA is on-policy, since its target uses the action $a^{\prime}$ that the current policy actually takes in $s^{\prime}$:
$$ q_{\pi}(s, a) \leftarrow q_{\pi}(s, a)+\alpha(\underbrace{r\left(s, a, s^{\prime}\right)+\gamma q_{\pi}\left(s^{\prime}, a^{\prime}\right)}_{\text{new info}}-q_{\pi}(s, a)) $$
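A tabular SARSA sketch built around that update; `env` (with assumed `reset`/`step` methods returning `(s', r, done)`) and `epsilon_greedy` are placeholder helpers, not part of the notes:

```python
from collections import defaultdict

def sarsa(env, epsilon_greedy, alpha=0.1, gamma=0.99, n_episodes=500):
    """On-policy TD control: the update uses the action a' actually taken next."""
    q = defaultdict(float)                  # q[(s, a)]
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # assumed interface
            a_next = epsilon_greedy(q, s_next)
            target = r + (0.0 if done else gamma * q[(s_next, a_next)])
            q[(s, a)] += alpha * (target - q[(s, a)])
            s, a = s_next, a_next
    return q
```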
Q-learning is similar to SARSA, with one difference: the target in the update rule now uses the greedy action in the next state, independently of the action the behaviour policy actually takes, which makes Q-learning an off-policy method:
$$ q^{*}(s, a) \leftarrow(1-\alpha) q^{*}(s, a)+\alpha\left(r\left(s, a, s^{\prime}\right)+\gamma \max_{a^{\prime}} q^{*}\left(s^{\prime}, a^{\prime}\right)\right) $$
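For comparison, a sketch of the corresponding Q-learning step, which differs only in bootstrapping from the greedy action in $s^{\prime}$ (same tabular setting and placeholder names as above):

```python
def q_learning_step(q, s, a, r, s_next, done, actions, alpha, gamma):
    """Off-policy TD control: bootstrap from max_a' q(s', a'), not the action taken."""
    best_next = 0.0 if done else max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * best_next)
    return q
```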
Off-policy methods improve data efficiency: they can reuse all collected samples for training, while on-policy methods must discard samples that were obtained with a different policy.
Policy gradient methods optimise the policy directly, not via a value function. This is useful, for example, in very large state spaces. We have a parametrised policy $\pi_{\theta}(a \mid s)$ and adjust its parameters $\theta$ to maximise the expected return $J(\theta)$.
In more abstract terms, we're trying to optimize an objective function $g(\theta):=\mathbb{E}_{X \sim p_{\theta}}[\phi(X)]=\int \phi(x) p(x \mid \theta) d x$, so we compute the derivative.
$$
\begin{aligned}
\frac{d}{d \theta} g(\theta) &=\int \phi(x) \frac{d}{d \theta} p(x \mid \theta) \, d x \\
&=\int \phi(x)\left[\frac{\frac{d}{d \theta} p(x \mid \theta)}{p(x \mid \theta)}\right] p(x \mid \theta) \, d x \\
&=\int \phi(x)\left[\frac{d}{d \theta} \log p(x \mid \theta)\right] p(x \mid \theta) \, d x \\
&=\mathbb{E}_{X \sim p_{\theta}}\left[\phi(X) \frac{d}{d \theta} \log p(X \mid \theta)\right]
\end{aligned}
$$
We therefore obtain the Policy Gradient theorem:
$$ \nabla_{\theta} J(\theta)=\nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]=\mathbb{E}_{\tau \sim \pi_{\theta}}\left[R(\tau) \nabla_{\theta} \log p(\tau \mid \theta)\right] $$
This is easier than it looks: the log-probability of a trajectory decomposes into a sum, and the transition dynamics do not depend on $\theta$, so its gradient reduces to terms we can compute from the policy alone, $\nabla_{\theta} \log p(\tau \mid \theta)=\sum_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)$.
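A NumPy sketch of the resulting REINFORCE-style estimator for a tabular softmax policy; `sample_episode` is again an assumed placeholder that returns `(state, action, reward)` tuples collected with the current policy:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, sample_episode, alpha=0.01, gamma=0.99):
    """theta[s, a]: logits of a tabular softmax policy pi_theta(a | s)."""
    episode = sample_episode(theta)                  # [(s, a, r), ...], s and a as ints
    # Discounted return of the whole trajectory, R(tau).
    R = sum(r * gamma**t for t, (_, _, r) in enumerate(episode))
    grad = np.zeros_like(theta)
    for s, a, _ in episode:
        pi_s = softmax(theta[s])
        # grad of log pi(a|s) for a softmax policy: one_hot(a) - pi(.|s)
        g = -pi_s
        g[a] += 1.0
        grad[s] += g
    return theta + alpha * R * grad                  # gradient ascent on J(theta)
```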
If we learn the value function in addition to the policy, we can talk about actor-critic methods: the critic updates the value function parameters $w$, while the actor updates the policy parameters $\theta$ in the direction suggested by the critic.
In Advantage Actor-Critic (A2C), we solve a problem that vanilla actor-critic has: the raw q-value says little about how much better an action is than the others available in that state, so we weight the gradient by the advantage $q(s, a)-v(s)$ instead. The update becomes:
$$ \nabla_{\theta} J(\theta) \propto \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\left(q_{\pi_{\theta}}\left(s_{t}, a_{t}\right)-v_{\pi_{\theta}}\left(s_{t}\right)\right)\right] $$
We obviously don't keep two separate value functions: we use the Bellman equations to compute a one-step look-ahead for the q-value, $q\left(s_{t}, a_{t}\right) \approx r_{t+1}+\gamma v\left(s_{t+1}\right)$, so the advantage can be estimated with the critic's value function alone.
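As a minimal sketch, the one-step advantage estimate for a single transition then only needs the critic's value estimates (placeholder names):

```python
def one_step_advantage(v, s, r, s_next, done, gamma):
    """Estimate A(s, a) = q(s, a) - v(s), approximating q(s, a) by r + gamma * v(s')."""
    q_estimate = r + (0.0 if done else gamma * v[s_next])
    return q_estimate - v[s]
```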
All this time we have been trying to approximate a function, the value or the action-value function. We do, though, know a really powerful tool for function approximation: Neural Networks. Deep Q-Networks (DQN) approximate the optimal action-value function with a network $q_{\theta}(s, a)$, trained by minimising the squared TD error and taking gradient steps on the loss:
$$
\theta=\theta-\alpha \nabla_{\theta} L(\theta)
$$
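Written out for a sampled transition (the standard DQN squared TD error; in practice the expectation is approximated over a minibatch), the loss takes the form
$$ L(\theta)=\mathbb{E}_{\left(s, a, r, s^{\prime}\right)}\left[\left(r+\gamma \max_{a^{\prime}} q_{\theta}\left(s^{\prime}, a^{\prime}\right)-q_{\theta}(s, a)\right)^{2}\right] $$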
We have a moving target: the targets are computed with the very function we are trying to approximate, so they shift after every update. A target network fixes this: a lagged copy of the network is used to compute the targets, and its weights are only synchronised with the online network every so often.
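A minimal PyTorch sketch of one such update, assuming batched tensors sampled from a replay buffer and a lagged `target_net` whose weights are periodically copied from `online_net` (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the squared TD error, with a lagged target network."""
    s, a, r, s_next, done = batch        # states, long actions, rewards, next states, float done flags
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                # targets come from the frozen, lagged network
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # theta <- theta - alpha * grad L(theta)
    return loss.item()
```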
Experiences along a trajectory are highly correlated, and information-rich experiences should be used multiple times. A replay memory stores the most recent transitions in a buffer and feeds the network random samples drawn from it, breaking the correlation and reusing past experience.
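A minimal replay memory sketch (capacity and interface are illustrative choices):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of past transitions, sampled uniformly at random."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```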
Some experiences are more informative than others, so we can sample them more often. In Prioritised Experience Replay (PER), experiences in the buffer are weighted by the loss (TD error) they produced the last time they were used to update the network.
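A simplified sketch of proportional prioritised sampling; the full PER scheme additionally uses a priority exponent and importance-sampling corrections, which are omitted here:

```python
import numpy as np

def sample_prioritized(transitions, td_errors, batch_size, eps=1e-3):
    """Sample buffer indices with probability proportional to |TD error| + eps."""
    priorities = np.abs(np.asarray(td_errors)) + eps
    probs = priorities / priorities.sum()
    return np.random.choice(len(transitions), size=batch_size, p=probs)
```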
In Double DQN, the problem we solved with the target network can be addressed in a smarter way: we keep two different networks and use one to choose the greedy action while the other evaluates it. We thereby avoid maximisation bias by disentangling action selection from the biased value estimates.
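In code, the change only affects how the target is built: one network selects the greedy action, the other evaluates it (a sketch with the same assumed tensor batch as the DQN update above; in practice the lagged target network often plays the role of the second network):

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    """Select a' with one network, evaluate it with the other."""
    with torch.no_grad():
        a_star = online_net(s_next).argmax(dim=1)                              # selection
        q_eval = target_net(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)  # evaluation
        return r + gamma * (1.0 - done) * q_eval
```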
Dueling DQN splits the Q-values into two different parts, the value function $v(s)$ and the advantage function $A(s, a)$, estimated by two separate streams of the network and recombined into $q(s, a)=v(s)+A(s, a)-\frac{1}{|\mathcal{A}|} \sum_{a^{\prime}} A\left(s, a^{\prime}\right)$.
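A PyTorch sketch of such a dueling head, with the usual mean-subtraction of the advantage stream so that the decomposition is identifiable (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)               # v(s)
        self.advantage_stream = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, s):
        h = self.body(s)
        v = self.value_stream(h)          # shape (B, 1)
        adv = self.advantage_stream(h)    # shape (B, n_actions)
        # q(s, a) = v(s) + A(s, a) - mean_a A(s, a)
        return v + adv - adv.mean(dim=1, keepdim=True)
```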