👋 Hello !
This repository contains the two-week Data Science project of four students from the Le Wagon
bootcamp in Paris. Our aim is to solve several Gymnasium environments using different Reinforcement Learning algorithms.
You can clone this repo, and check for yourself how the algorithms perform. All the instructions are available in the setup section.
In this very simple environment, we have :
- the agent state $s$, an integer from 0 to 24 representing its square number ;
- the possible agent actions $a$, here the discrete actions $\leftarrow$ (0), $\downarrow$ (1), $\rightarrow$ (2), $\uparrow$ (3) ;
- a reward $r = +1$ on the square where the gift lies.
Reaching the gift ends an 'episode'. Falling into one of the holes in the frozen lake also ends the episode, with no reward. In the original game, the agent state is only its square number : it does not know what lies on the neighbouring squares.
We have developed two variants of the Frozen Lake environment :
- one setting a negative reward $r = -1$ for falling into the lake ;
- one allowing the agent to know the type of the squares next to it.
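As an illustration, here is how a random agent interacts with a Frozen Lake style environment in Gymnasium. The 5x5 map below is a made-up example, and the two variants above are implemented in this repo rather than in Gymnasium itself :

```python
import gymnasium as gym

# A 5x5 map in the spirit of the description above (S = start, F = frozen,
# H = hole, G = goal/gift). FrozenLake-v1 accepts such a custom `desc`.
MAP_5X5 = ["SFFFF", "FHFFF", "FFFHF", "FHFFF", "FFFFG"]

env = gym.make("FrozenLake-v1", desc=MAP_5X5, is_slippery=False)

state, info = env.reset(seed=0)            # state is an integer in [0, 24]
done = False
while not done:
    action = env.action_space.sample()     # 0 = left, 1 = down, 2 = right, 3 = up
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated         # reward is +1 only on the gift square
env.close()
```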
In this more complex environment :
- the agent state is directly a gameplay image (by default).
- the possible actions are discrete : idle (0), left (1), right (2), gas (3) and brake (4).
- the car gets a negative reward $r = -0.1$ for each frame, and a positive reward $r \simeq 20$ for each new track tile reached by the car.
We always worked with a variant of the original environment, where the observation is a stack of several successive images (we chose 3 images). We first did it by hand, before realizing there was a wrapper doing exactly that very easily 😅 : reading the docs saves lives.
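For reference, here is a minimal frame-stacking sketch. Depending on your Gymnasium version, the wrapper is called `FrameStack` (before 1.0) or `FrameStackObservation` (1.0 and later), and the environment id may be `CarRacing-v2` or `CarRacing-v3` :

```python
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import FrameStack   # named FrameStackObservation in Gymnasium >= 1.0

env = gym.make("CarRacing-v2", continuous=False)   # CarRacing-v3 in recent releases
env = FrameStack(env, 3)                           # stack the 3 most recent frames

obs, info = env.reset(seed=0)
print(np.asarray(obs).shape)                       # (3, 96, 96, 3): 3 stacked RGB frames
```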
The environment also works with continuous actions along three axes :
- left/right, in the range $[-1,\;1]$
- gas/no gas, in the range $[0,\;1]$
- brake/no brake, in the range $[0,\;1]$
This means you can accelerate, brake and turn at the same time : a perfect recipe for drifting !
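A continuous action is then simply an array of three numbers passed to `env.step` (same version caveat on the environment id as above) :

```python
import numpy as np
import gymnasium as gym

env = gym.make("CarRacing-v2", continuous=True)    # continuous mode is the default
obs, info = env.reset(seed=0)

# [steering in [-1, 1], gas in [0, 1], brake in [0, 1]] -- all three applied at once
action = np.array([-0.5, 0.8, 0.2], dtype=np.float32)
obs, reward, terminated, truncated, info = env.step(action)
```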
In this environment :
- The state $s$ (or observation $o$) is given by the position of the car $x$ and its velocity $v$, both of which are continuous (so the number of possible states is infinite).
- The car action is actually a force $F \in [-1, 1]$ that can be applied to the car at each time step.
- The rewards are a bit more subtle : a penalty of $-0.1 F^2$ is applied to prevent the car from applying large forces all the time, and an additional reward of $+100$ is granted when the car moves past the flag.
- Termination occurs when the flag is reached, or after $1000$ steps.
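A minimal interaction sketch with the standard `MountainCarContinuous-v0` environment from Gymnasium :

```python
import numpy as np
import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")
obs, info = env.reset(seed=0)               # obs = [position x, velocity v]

force = np.array([0.5], dtype=np.float32)   # the action is a force F in [-1, 1]
obs, reward, terminated, truncated, info = env.step(force)
# the reward includes the -0.1 * F**2 penalty, plus +100 once the flag is reached
```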
In this environment :
- The observation $o$ (equal to the state) contains 8 variables :
  - the $x, y$ positions [continuous]
  - the $v_x, v_y$ linear velocities [continuous]
  - the angle $\theta$ and angular velocity $\dot{\theta}$ of the lander [continuous]
  - two variables signalling contact between the legs of the lander and the ground [boolean]
- The actions are discrete : do nothing [0], fire left orientation engine [1], fire main engine (upwards thrust) [2], fire right orientation engine [3]
- Rewards are a bit elaborate, but they encourage the agent to get closer to the ground, to not go too fast and land safely -- and smoothly -- on the surface.
- The lander starts at the centre of the "viewport", but a random initial force is applied to it.
- Termination occurs when the lander crashes, gets outside of the field of view, or reaches "not awake" states.
It is possible, as in Car Racing, to make a continuous variant of this environment, in which case we get two actions, one for the main engine (from -1 to 1) and one for the side engines (again from -1 to 1). As mentioned in the official docs, the sign of the side engine action will determine which lateral booster is used.
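Switching between the discrete and continuous variants is a single flag in Gymnasium (the environment id is `LunarLander-v2` in older releases, `LunarLander-v3` in recent ones) :

```python
import gymnasium as gym

# Discrete variant: 4 actions (do nothing / left engine / main engine / right engine)
env = gym.make("LunarLander-v2")

# Continuous variant: action = [main engine throttle, lateral booster], both in [-1, 1]
env_continuous = gym.make("LunarLander-v2", continuous=True)
```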
We are using several algorithms to solve these environments, from the simplest (Q-learning) which uses no neural network, to the most complex (Proximal Policy Optimization).
Algorithm | Neural Network ? | Infinite States | Continuous Actions |
---|---|---|---|
Q-Learning | ❌ | ❌ | ❌ |
Deep-Q Network (DQN) | ✅ | ✅ | ❌ |
Advantage Actor Critic (A2C) | ✅ | ✅ | ✅ |
Proximal Policy Optimization (PPO) | ✅ | ✅ | ✅ |
Some of them can deal with having a virtually infinite amount of states or observations. In Frozen Lake, the number of states is small (the number of squares) whereas for Car Racing, the states would be all the different possible images returned by the game, which is nearly infinite.
Some algorithms cannot manage continuous actions, since they rely on evaluating every possible action from a given state and checking which one would lead to the highest reward. This is the case of Q-learning and DQN, which are therefore not adapted to the continuous variants of Car Racing or Mountain Car.
To understand a bit how these algorithms work, we need to introduce a few quantities.
A policy $\pi$ is the rule the agent follows to choose an action $a$ in a given state $s$.
Following this policy, we can define the state value function $V^{\pi}(s)$.
Imagine starting from a checkpoint at state $s$ at $t=0$ and letting the game run in 'auto-mode' using the policy $\pi$ until you reach the end of the game. We sum all the rewards obtained by the agent (weighting distant rewards by a discount factor $\gamma < 1$) and call the result the state value of the policy $\pi$ starting from state $s$. In mathematical terms, we write it the following way :

$$ V^{\pi}(s) = \mathbb{E} \left[ \, r_t + \gamma \, r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s, \; \pi \, \right] $$

We suppose that everything on the right of the vertical bar is given : the starting state $s$ and the policy $\pi$ that we follow.
The action value function $Q^{\pi}(s, a)$ is defined in the same way, except that the agent is forced to take action $a$ as its first move from state $s$, before following the policy $\pi$.
That function would be a two-dimensional array of size (number of states $\times$ number of actions). For :
- Frozen Lake, $Q$ would be a ($25 \times 4$) array, which is manageable ;
- Car Race, the images defining the states are of size $(96 \times 96 \times 3)$ with each pixel taking a value between 0 and 255, so theoretically we have ($256^{96 \times 96 \times 3} \times 5$) entries, which is, for all practical purposes, $+\infty$. This is why regular $Q$-learning is impossible on Car Race.
I think the critical question at this stage is : how does computing Q-values help us teach an agent how to play games ? Let's assume that the true action values $Q(s, a)$ are known : the agent then simply has to pick, in every state, the action with the highest value.
The main issue is that we do not initially know the true state-action values $Q(s, a)$ : we have to estimate them.
But ... we don't really have a policy yet ! It is then hard to compute these values by letting the agent play.
Let's start with a simple case : the elf stands on a square right next to the gift and takes the action that moves it onto the gift. The value of that (state, action) pair is simply the reward received, $r = +1$. There are no additional terms to the sum since the game (episode) ends once the elf reaches the gift. We could then update the $Q$ table and set this particular entry to $+1$.
But how do we compute the values for the other actions, those that do not end the episode ? Let's start by re-writing the 'real' value of a (state, action) pair as the immediate reward plus the discounted future rewards :

$$ Q(s_t, a_t) = r_t + \gamma \, r_{t+1} + \gamma^2 r_{t+2} + \dots $$

What do these future rewards in the sum correspond to ? They correspond to choosing the optimal policy for every step after $t$, i.e. always picking the action with the highest value.
We can then rewrite the previous sum as :

$$ Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a) $$

Now, for the magic part : we arbitrarily decide to apply the above equation, but with our current estimates of the $Q$ values instead of the true (unknown) ones.
By doing that, we keep injecting some truth into the table -- the reward $r_t$ actually received -- so that, update after update, our estimates propagate the real rewards backwards through the table and converge towards the true $Q$ values.
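As an illustration, here is a minimal tabular Q-learning loop on the standard 4x4 `FrozenLake-v1` from Gymnasium. It is a sketch only : the hyperparameters and the number of episodes are indicative, not the ones used in this repo.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))         # our estimated Q table, initialised at 0

alpha, gamma, epsilon = 0.1, 0.99, 0.1      # learning rate, discount, exploration rate

for episode in range(5_000):
    state, info = env.reset()
    done = False
    while not done:
        # epsilon-greedy: explore at random with probability epsilon, otherwise exploit
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, info = env.step(action)

        # Q-learning update: nudge Q[s, a] towards r + gamma * max_a' Q[s', a']
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])

        state = next_state
        done = terminated or truncated
```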
If you want to know more, I can only recommend the excellent intro to Q-learning by Maxime Labonne on Medium. It notably shows how the Q-table is updated 'in the real world'.
Deep-Q networks were introduced in a 2013 conference article to solve one burning question : how do you build a $Q$ table when the number of states is, for all practical purposes, infinite ?
The resulting Deep-Q network can be a simple fully connected network if the state is a small vector of numbers, but it can also include convolutional neural network (CNN) layers if the state is an image, as in Car Racing.
How do we optimize / update the network ? One way to optimize the model is to create a loss that compares the current estimate $Q(s_t, a_t)$ with a target built from the reward actually received :

$$ \mathcal{L} = \left\vert \, r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \, \right\vert $$

Minimizing this loss (here, a MAE, but it could also be a Huber loss, or a MSE ...) with a learning rate $\alpha$ progressively brings the network predictions closer to the true $Q$ values, just like the tabular update did.
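To make this concrete, here is a minimal PyTorch sketch of one such update, with the network sized (arbitrarily) for Lunar Lander's 8 observation variables and 4 actions. It is a sketch of the idea above, not our exact implementation, and it leaves out the practical ingredients discussed just below.

```python
import torch
import torch.nn as nn

# A small fully connected Q-network: state in, one Q value per action out.
# For image states (Car Racing), convolutional layers would replace the first Linear layers.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on a batch of transitions.

    states/next_states: float tensors (batch, 8); actions: int64 tensor (batch,);
    rewards/dones: float tensors (batch,)."""
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                    # the target is treated as a constant
        targets = rewards + gamma * q_net(next_states).max(dim=1).values * (1 - dones)
    loss = nn.functional.smooth_l1_loss(q_values, targets)   # Huber loss; MAE or MSE also work
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```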
There are additional, more practical challenges before actually implementing a deep-Q network, notably :
- How to initially explore the environment instead of systematically trying to make the best decision ? This can be done using $\epsilon$-greedy policies, i.e. making decisions at random with a probability $\epsilon$. Check out the HuggingFace webpage on the subject.
- When do we update the deep-Q network, and using which data ? Instead of simply relying on the data from the previous step, we can build a replay memory buffer that saves the most recent (say $10,000$) transitions. We can then draw samples from that replay buffer after each step and use this "batch" to continuously train the network. Check out the HuggingFace section on the subject on the Deep-Q network page.
- How do we prevent too much correlation between the target part and the current part of the network ? One way to deal with this is to decouple these two networks, and only update the target every $N$ steps. We can also decide to update the target network more progressively than the policy network, to stabilise the convergence of the algorithm.
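A minimal sketch of these three ingredients (ε-greedy action selection, a replay buffer, a slowly updated target network) could look like this -- names and values are illustrative only :

```python
import random
from collections import deque
import torch

replay_buffer = deque(maxlen=10_000)     # keeps only the most recent transitions

def choose_action(q_net, state, epsilon, n_actions):
    """Epsilon-greedy action selection (state is a 1-D float tensor)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                 # explore
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())     # exploit

def soft_update(target_net, policy_net, tau=0.005):
    """Move the target network slowly towards the policy network."""
    for t_param, p_param in zip(target_net.parameters(), policy_net.parameters()):
        t_param.data.copy_(tau * p_param.data + (1 - tau) * t_param.data)

# After each environment step:
#   replay_buffer.append((state, action, reward, next_state, done))
#   batch = random.sample(replay_buffer, k=64)    # once the buffer is large enough
#   ... stack the batch into tensors and run the update function on it
```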
More info on Deep-Q networks can be found in the very nice article by Yuki Minai on the subject.
In Deep-Q-learning, we only have one neural network, which takes the states (or observations) as input and returns scores for each possible action, from which the best estimated decision can be taken. This means that :
- we need to compute this Q score for every possible action in order to select the best one ;
- once the best action has been determined, we have to follow it all the time (if we consider that we no longer explore with a probability $\epsilon$).
These two points raise a few issues :
- Let's imagine now that our actions are continuous, for instance the angle of a steering wheel if you are driving. How do we know what the best angle is if you want to turn left ? We could discretize the possible actions (take 360 values) and build a DQN based on these 360 values. This becomes a bit of a pickle if you start having to take into account braking (another 100 values, say) and gas (another 100 values). If you are allowed to brake, accelerate and turn at the same time (a good starting point for drifting), you have $3,600,000$ possible action combinations. It does not really sound reasonable to train a model with that many outputs.
- For some specific states $s$, it might be optimal to act with a bit of randomness instead of being completely deterministic. A good example is the one presented in the HuggingFace lecture on Policy Gradients. Imagine a bicycle that moves in a small room containing a trophy (good) and two spiders (bad). It only knows the type of the squares next to itself (a wall, an empty space, etc.). In this case, a Q-learning algorithm should eventually give the same scores to squares $1$ and $2$, and there is a real possibility that the agent gets stuck, whatever it (systematically) decides to do on these squares.
A possible initial state for the bicycle (agent), with squares 1 and 2 being identical from the point of view of the agent.
Case where the best action of the trained model is determined to be going left (or right), and we start a new episode at the left (or right) of the room. The agent is stuck in both cases.
One idea to solve the issue would be to compute probabilities of making each decision in a given state, instead of giving each decision a score. We would then need a method to modify these probabilities based on the future rewards that these decisions lead to. In that case, we can hope that the trained agent will have a 50/50 probability of going left or right when it is on squares 1 and 2.
A 'simple' -- everything is relative -- implementation of this idea is the REINFORCE algorithm. We have decided to implement a slightly more complex algorithm, called Advantage Actor Critic that solves a few issues with REINFORCE so that we can solve more complex problems.
The A2C algorithm uses two neural networks : an actor network that makes decisions, and a critic network that judges them.
It is tasked with making decisions based on the agent observations : given a state $s$, it returns a probability $\pi(s, a)$ for each possible action $a$.
It is tasked with judging the actions of the actor. It could be in charge, for instance, of estimating the action value $Q(s, a)$ of the decisions taken by the actor. Here, the state value $V^{\rm est}(s)$ -- an estimate of how good it is to be in state $s$ -- is the quantity our critic network learns to predict.
How does the actor improve itself ? And how can the critic improve at all ?!? Two excellent questions ! The actor will update the weights of its model, so the weights of its policy $\pi$, in order to maximise a gain that weights the log-probability of each chosen action by the advantage $A(s_t, a_t)$ reported by the critic -- a measure of how much better (or worse) things turned out compared to what the critic expected :

$$ G_{\rm actor} = \log \pi(s_t, a_t) \, A(s_t, a_t) $$

The actor will then adapt the weights of its policy $\pi$ so that actions with a positive advantage become more likely, and actions with a negative advantage become less likely.
The critic has to get as good as possible at knowing the rewards the actor will get when selecting a given action. In a way similar to what we saw with deep-Q networks, we can build a loss of the following type (here, we take the mean squared error) :

$$ \mathcal{L}_{\rm critic} = \left\vert \, r_t + \gamma V^{\rm est}(s_{t+1}) - V^{\rm est}(s_t) \, \right\vert ^2 $$

When we minimise that loss, we become better at predicting the rewards we get, since we are continuously 'injecting some truth' (the reward $r_t$ actually received) into the estimate.
$$ \mathcal{L}_{\rm critic} = \left\vert \, r_t + \gamma \left[ r_{t+1} + \gamma V^{\rm est}(s_{t+2}) \right] - r_t - \gamma V^{\rm est}(s_{t+1}) \, \right\vert ^2 $$
Rearranging some elements, we end up with :

$$ \mathcal{L}_{\rm critic} = \gamma^2 \left\vert \, r_{t+1} + \gamma V^{\rm est}(s_{t+2}) - V^{\rm est}(s_{t+1}) \, \right\vert ^2 $$

So our loss is essentially the square of the advantage, $A(s_t, a_t) = r_t + \gamma V^{\rm est}(s_{t+1}) - V^{\rm est}(s_t)$ : the gap between what actually happened and what the critic expected.
With the formula we currently use for the advantage, the networks can be updated after every single step, since we only need the transition from $s_t$ to $s_{t+1}$.
We can however decide to update the network at the end of every episode only. In that case, we can benefit from a much better estimate of the value of a step, since we no longer have to use the estimate $V^{\rm est}(s_{t+1})$ for the future rewards : we can use the discounted sum of the rewards actually collected until the end of the episode.
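As an illustration, here is a minimal one-step A2C update in PyTorch for a discrete-action environment (sizes again chosen for Lunar Lander). It is a sketch of the losses above, not our exact implementation :

```python
import torch
import torch.nn as nn

# Actor: state -> probability of each (discrete) action.  Critic: state -> estimated value V(s).
actor = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=3e-4)
gamma = 0.99

def a2c_update(state, action, reward, next_state, done):
    """One-step update. state/next_state: 1-D float tensors; action: int; done: 0.0 or 1.0."""
    value = critic(state).squeeze(-1)
    with torch.no_grad():
        next_value = critic(next_state).squeeze(-1) * (1 - done)
    advantage = reward + gamma * next_value - value        # TD error, used as the advantage

    log_prob = torch.log(actor(state)[action])
    actor_loss = -log_prob * advantage.detach()            # minimise the opposite of the gain
    critic_loss = advantage.pow(2)                         # square of the advantage

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```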
You would think that after all these improvements and tweaks, any problem would be easily solvable using an A2C model. Once again, researchers stumbled upon issues, this time related to large policy updates. If the critic loss or the actor loss is massive, e.g. because of an unexpectedly large positive or negative reward, the corresponding policy change will also be dramatic -- unless you choose a very low learning rate, but in that case you will basically learn nothing until you hit that large reward again. So how do we solve that ? Two main ideas have been floated around : the first is Trust-Region Policy Optimization (TRPO), the second Proximal Policy Optimization (PPO), which is the one we use.
Remember the actor loss in the A2C algorithm -- the opposite of its gain, $\mathcal{L}_{\rm actor} = -\log \pi(s_t, a_t) \, A(s_t, a_t)$.
For now, there is no notion of policy change in the loss of the actor, so we don't really know how different the new policy will be compared to the old policy. One way to remediate this is to change the loss to include the probabilities associated with a previous policy :

$$ \mathcal{L}_{\rm actor} = - \Upsilon(s_t, a_t) \, A(s_t, a_t), \qquad \Upsilon(s, a) = \frac{\pi(s, a)}{\pi^{\rm old}(s, a)} $$

The numerator of the fraction $\Upsilon$ is the probability of taking action $a$ in state $s$ under the current policy, while the denominator is the same probability under the old policy -- the one used when the data was collected.
Let's examine what happens during an optimization step, so when we try to update the policy $\pi$ for a (state, action) pair with a positive advantage $A(s, a) > 0$. Three cases can occur :
- one where we are somehow less likely now to perform this action than we used to be, i.e. $\pi(s, a) < \pi^{\rm old}(s, a)$. In this case (in green), we obviously want to increase the probability of performing this action more often.
- the second case (in orange), where we are about as likely as before to take this action. This case is defined by the region $1 - \epsilon < \Upsilon(s, a) < 1 + \epsilon$ ; we still decide to increase $\Upsilon$, as we did in case 1.
- in the third case (in red), we are already significantly more likely to perform the action than before ... so maybe let's not beat a dead horse, and not increase the value of $\Upsilon$ any further. This was the case that A2C was struggling with.
Naturally, when the sign of the advantage is negative, the reasoning is reversed : we then want to decrease $\Upsilon$, unless it has already dropped below $1 - \epsilon$.
Note that I use the opposite of the objective function used in most texts : they maximise a gain, while I minimise a loss (its opposite), to stay consistent with the 'loss' vocabulary of deep learning libraries.
The critic part of the PPO algorithm has the same job as in the A2C algorithm : it has to model the value of a state in an accurate fashion, so we will simply compute the advantage, and its square will be -- again -- the loss to minimize. In this case, we do not take advantage of the Bellman equation ; we rather use the true value of the discounted rewards collected after the state :

$$ \mathcal{L}_{\rm critic} = \left\vert \, \sum_{k \geq 0} \gamma^{k} r_{t+k} - V^{\rm est}(s_t) \, \right\vert ^2 $$
We cannot access this sum unless, as we said, we keep some memory of the states that have been visited and the rewards that have been collected. We cannot update the model at every step like A2C because of this, and also -- as mentioned above -- because we have to compare the current policy $\pi$ with the old policy $\pi^{\rm old}$ used to collect the data.
So, when do we actually update the model ? That is an excellent question. Also ... with what data ? In PPO, we collect a sequence of data (it can span multiple episodes), and we divide it into mini-batches. We then train our network on these mini-batches for several epochs, so our model policy $\pi$ progressively drifts away from the policy $\pi^{\rm old}$ that collected the data -- but never too far, thanks to the clipping mechanism described above.
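Here is a minimal sketch of the resulting clipped actor loss in PyTorch, assuming we stored the log-probabilities of the old policy when collecting the data ($\Upsilon$ is the `ratio` variable below). Again, an illustration rather than our exact implementation :

```python
import torch

def ppo_actor_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss on a mini-batch (all arguments are 1-D tensors).

    ratio is the Upsilon of the text: pi(s, a) / pi_old(s, a)."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Keep the most pessimistic of the two objectives, then negate it to obtain a loss.
    return -torch.min(unclipped, clipped).mean()
```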
Help ! People talk about on and off-policy, policy-based and model-based agents. What do they mean ?
There is no point repeating a clear explanation, so I will just leave you with the excellent explanation of Tomasz Bartkowiak on StackExchange. To summarise the networks we have seen so far, our implementation of
- DQN is
- Off-policy : we are using the replay memory to improve the agent
- Value-based : our neural network computes Q values for the states (and actions) instead of directly choosing actions
- A2C is
- On-policy : we are continuously updating the most recent version of the agent based on the most recent environment info (states, rewards)
- Actor-critic : yes, we explicitly compute a policy with the actor network, ... but the critic computes the value of each state, so it is a mix between policy and value-based.
- PPO is :
- On-policy-ish : we are letting our agent play with the most recent policy (without updating it) before doing our update with the minibatches. So maybe when the model is being updated and after a few epochs, we are slightly off-policy, but we quickly revert to on-policy when we start sampling the environment again. Plus, we cannot deviate too far from our previous policy.
- Actor-critic : yet again, we have both a policy network and a value network.
- REINFORCE, that I briefly mentioned, is :
- On-policy : as with A2C, REINFORCE continuously updates the agent.
- Policy-based : here, we have no critic, we only update the policy network based on an objective function.
We work with Python v3.11.9 and our virtual environment is named `rl`. I used to work with `pyenv` with the `virtualenv` extension and `direnv` to manage it. That stack works great on Linux, because you can just switch between different virtual environments when you cd into project folders.
However, it does not work so well on Windows, where virtual environments are not handled properly. If you are on Windows, you can still install `pyenv-win` to have several concurrent Python versions, but it is not as convenient.
Let's create our virtual environment (replace the `\` with a `/` for Linux and MacOS, but frankly just use `direnv` then !) :
```
cd ~
python -m venv .venv\rl
```
On Windows, with PowerShell, we can then activate it this way :
```
~\.venv\rl\Scripts\Activate.ps1
```
You should then see `(rl)` next to your command prompt if the activation has worked.
You can again create your virtual environment. The syntax is :
```
pyenv virtualenv 3.11.9 rl
```
Then, you can `cd` to the `from-a-to-b-with-rl` folder, and check if it is activated.
If the virtual environment is not activated automatically upon entering the folder you can run:
```
pyenv local rl
```
The `rl` virtual environment has a few dependencies, notably :
- `pytorch` for the RL models
- `numpy` to handle a few numerical outputs
- `pandas` because dataframes are neat
- `gymnasium` for the environments
- `pygame` to render some of the environments
- `moviepy` to save videos of the agent interacting with the environment
- `matplotlib` because we ultimately want graphs
- `swig` to wrap some C/C++ code with Python
- `box2d` to run some of the environments. We install it with a few additional options, since otherwise it would just not install on some of our machines.
You can then decide to install the package itself (note : for now, nothing interesting is installed apart from the dependencies) :
```
pip install .
```
Or just decide to install the `requirements.txt` file :

```
pip install -r requirements.txt
```
If your GPU is CUDA-capable, you will have to adapt your `rl` environment. If you are on Windows, you can type :
```
pip uninstall torch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
If you are on Linux, you can do :
```
pip uninstall torch
pip3 install torch torchvision torchaudio
```
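You can then quickly check that PyTorch actually sees the GPU from within the `rl` environment :

```python
import torch

print(torch.cuda.is_available())   # True if the CUDA build of PyTorch is correctly installed
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```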
If you want to monitor the GPU in the terminal, you can type :

```
nvidia-smi -l 1
```