### Learning Goals

- Understand the difference between value-based and policy-based Reinforcement Learning
- Understand the REINFORCE Algorithm (Monte Carlo Policy Gradient)
- Understand Actor-Critic (AC) algorithms
- Understand Advantage Functions
- Understand Deterministic Policy Gradients (Optional)
- Understand how to scale up Policy Gradient methods using asynchronous actor-critic and Neural Networks (Optional)


### Summary

- Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
- Policy Score Function `J(theta)`: Intuitively, it measures how good our policy is. For example, we can use the average value or average reward under a policy as our objective.
- Common choices for the policy function: Softmax for discrete actions, Gaussian parameters for continuous actions.
- Policy Gradient Theorem: `grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy in a direction of more reward.
- REINFORCE (Monte Carlo Policy Gradient): We substitute a sampled return `g_t` from an episode for `Q(s, a)` to make an update. Unbiased, but high variance.
- Baseline: Instead of measuring the absolute goodness of an action we want to know how much better than "average" it is to take an action given a state. E.g. some states are naturally bad and always give a negative reward. This is called the advantage and is defined as `Q(s, a) - V(s)`. We use that for our policy update, e.g. `g_t - V(s)` for REINFORCE (see the REINFORCE sketch below).
- Actor-Critic: Instead of waiting until the end of an episode as in REINFORCE, we use bootstrapping and make an update at each step. To do that we also train a Critic `Q(theta)` that approximates the value function. Now we have two function approximators: one for the policy, one for the critic. This is basically TD, but for Policy Gradients.
- A good estimate of the advantage function in the Actor-Critic algorithm is the TD error. Our update then becomes `grad(J(theta)) = Ex[grad(log(pi(s, a))) * td_error]` (see the Actor-Critic sketch below).
- We can use policy gradients with TD(lambda), eligibility traces, and so on.
- Deterministic Policy Gradients: Useful for high-dimensional continuous action spaces where stochastic policy gradients are expensive to compute. The idea is to update the policy in the direction of the gradient of the action-value function. To ensure exploration we can use an off-policy actor-critic algorithm with added noise in action selection.
- Deep Deterministic Policy Gradients: Apply tricks from DQN to Deterministic Policy Gradients ;)
- Asynchronous Advantage Actor-Critic (A3C): Instead of using an experience replay buffer as in DQN, use multiple agents on different threads to explore the state space and make decorrelated updates to the actor and the critic.
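
Below is a minimal sketch of REINFORCE with a value-function baseline, under stated assumptions: the six-state corridor environment, the hyperparameters, and all variable names (`theta`, `V`, `alpha_theta`, ...) are made up for illustration and are not the repo's CliffWalk exercise code. It shows the `g_t - V(s)` update from the summary above; a tabular softmax policy keeps `grad(log(pi(s, a)))` explicit (it is `1 - pi(a|s)` for the chosen action's preference and `-pi(b|s)` for the others).

```python
# Sketch: REINFORCE (Monte Carlo Policy Gradient) with a baseline on a
# hypothetical corridor environment (illustrative only, not a repo exercise).
import numpy as np

n_states, n_actions, goal = 6, 2, 5           # actions: 0 = left, 1 = right
theta = np.zeros((n_states, n_actions))        # tabular softmax policy parameters
V = np.zeros(n_states)                         # baseline: Monte Carlo value estimate
alpha_theta, alpha_v, gamma = 0.1, 0.1, 0.99
rng = np.random.default_rng(0)

def policy(s):
    prefs = theta[s] - theta[s].max()          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def run_episode():
    s, episode = 0, []
    for _ in range(100):                       # cap episode length
        a = rng.choice(n_actions, p=policy(s))
        s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
        r = 1.0 if s_next == goal else -0.01   # reward only at the goal state
        episode.append((s, a, r))
        s = s_next
        if s == goal:
            break
    return episode

for _ in range(500):
    g = 0.0
    # Walk backwards so g is the return g_t from step t onward.
    for s, a, r in reversed(run_episode()):
        g = r + gamma * g
        advantage = g - V[s]                   # baseline-corrected return: g_t - V(s)
        V[s] += alpha_v * advantage            # move the baseline toward observed returns
        grad_log_pi = -policy(s)               # d log pi(a|s) / d theta[s, :]
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * advantage * grad_log_pi

print("P(right) per state:", [round(policy(s)[1], 2) for s in range(goal)])
```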
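
And a minimal sketch of one-step Actor-Critic on the same hypothetical corridor environment, again with illustrative names and hyperparameters rather than the repo's notebook code. The critic here is a tabular state-value estimate and the TD error serves as the advantage, so the actor is updated at every step instead of only at the end of an episode.

```python
# Sketch: one-step Actor-Critic with the TD error as the advantage estimate
# (hypothetical corridor environment, illustrative only).
import numpy as np

n_states, n_actions, goal = 6, 2, 5            # actions: 0 = left, 1 = right
theta = np.zeros((n_states, n_actions))         # actor: tabular softmax policy parameters
V = np.zeros(n_states)                          # critic: tabular state-value estimate
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.99
rng = np.random.default_rng(0)

def policy(s):
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs)
    return probs / probs.sum()

for _ in range(500):
    s = 0
    for _ in range(100):                        # cap episode length
        probs = policy(s)
        a = rng.choice(n_actions, p=probs)
        s_next = max(0, s - 1) if a == 0 else min(goal, s + 1)
        r = 1.0 if s_next == goal else -0.01
        done = s_next == goal

        # TD error: bootstrapped one-step target minus the current estimate.
        target = r if done else r + gamma * V[s_next]
        td_error = target - V[s]

        # Critic update: move V(s) toward the bootstrapped target.
        V[s] += alpha_critic * td_error

        # Actor update at every step: grad(log(pi(a|s))) * td_error.
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_log_pi

        s = s_next
        if done:
            break

print("P(right) per state:", [round(policy(s)[1], 2) for s in range(goal)])
```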


### Lectures & Readings

- REINFORCE with Baseline
  - Exercise
  - [Solution](CliffWalk REINFORCE with Baseline Solution.ipynb)
- Actor-Critic with Baseline
  - Exercise
  - [Solution](CliffWalk Actor-Critic Solution.ipynb)
- Actor-Critic with Baseline for Continuous Action Spaces
  - Exercise
  - [Solution](Continuous MountainCar Actor-Critic Solution.ipynb)
- Deterministic Policy Gradients for Continuous Action Spaces (WIP)
- Deep Deterministic Policy Gradients (WIP)
- Asynchronous Advantage Actor-Critic (A3C)
  - Exercise
  - [Solution](a3c/)