This repo is just my learning journey and may contain buggy, naive implementations.
- Fix TRPO
- Fix PPO
- Apply GRPO to PPO
- Align some small LLMs to preference texts using PPO
- Try to build a minimum viable reasoner using GRPO, as in DeepSeek-R1 (see the group-relative advantage sketch below)
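As a reference for the GRPO items above, here is a minimal sketch of the group-relative advantage that GRPO (as described in the DeepSeek-R1 work) uses in place of a learned critic. The function name and numpy-based layout are my own illustration, not code from this repo.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: each completion sampled for the same prompt
    is scored, then normalized against the mean/std of its own group, so no
    value network (critic) is required."""
    r = np.asarray(group_rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 completions sampled for one prompt, scored by a reward/verifier function
print(grpo_advantages([1.0, 0.0, 0.5, 0.25]))
```

These advantages then plug into the usual PPO-style clipped objective.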
| Algorithm Name | Intuitive Summary |
|---|---|
| Q-Learning | Learns a "treat-value table" for situations and actions, choosing actions with highest long-term treat value (see the sketch after this table). |
| Deep Learning (in Deep RL) | Gives RL algorithms "powerful eyes and brains" using neural networks to understand complex situations. |
| Policy Gradient | Directly adjusts the "paper airplane's folds" (policy) to fly better, based on flight distance (rewards). |
| Actor-Critic | Student (Actor) learns from teacher (Critic) feedback on action quality, improving policies faster. |
| Advantage Actor-Critic (A2C) | Actor-Critic with "extra helpful feedback" - Critic tells "how much better/worse" action was than average. |
| Soft Actor-Critic (SAC) | Actor-Critic encouraged to be "curious" - entropy bonus rewards diverse actions, making policies robust in uncertainty. |
| Proximal Policy Optimization (PPO) | Actor-Critic learning in "small, careful steps" - prevents "wild leaps" in policy, ensuring stable, reliable progress. |
| Deep Q-Learning (DQN) | Q-Learning with Deep Learning "brain" - uses neural networks to estimate treat-values in complex situations. |
| Prioritized Experience Replay (PER) | RL agent replays past memories, focusing on "most surprising/important" moments (high TD-error) for efficient learning. |
| Dueling DQN | DQN "brain" split - one part for "situation goodness" (Value), another for "action goodness within situation" (Advantage) - for efficient learning. |
| Noisy Networks | Agent's "brain" with "internal randomness" - noise in network encourages natural exploration, replacing epsilon-greedy. |
| Noisy Dueling Double DQN | "All-star DQN" - combines Deep Learning, Dueling, Double DQN, PER, Noisy Nets for a powerful, improved DQN agent. |
| Soft Q-Learning (SQL) | Q-Learning encouraging "flexible choices" - "soft" values reward actions probabilistically, promoting exploration. |
| Distributional DQN (C51) | DQN learning the distribution of "treat-values," not just the average - understanding the range of possible outcomes. |
| Trust Region Policy Optimization (TRPO) | Policy Gradient with "trust region" - limits policy change per step, ensuring reliable, monotonic improvement like a cautious climber. |
| Deep Deterministic Policy Gradient (DDPG) | "Deterministic guidance with Critic feedback" - Actor directly controlled by Critic's evaluation in continuous action spaces. |
| Twin Delayed Deep Deterministic Policy Gradient (TD3) | "Skeptical Twin Critics, Smoothing, Delayed Guidance" - improved DDPG with twin critics, target smoothing, delayed updates for robustness. |
| Hierarchical DQN (h-DQN) | DQN with a "boss and worker" - Meta-Controller sets high-level goals, Controller executes low-level actions to achieve them. |
| N-step DQN | DQN using "multi-step learning" - updates Q-values based on rewards over N steps, bridging 1-step TD and Monte Carlo methods. |
| QR-DQN (Quantile Regression DQN) | Distributional DQN with "quantile view" - represents value distribution using flexible quantiles, adapting to distribution shapes. |
| IQN (Implicit Quantile Networks) | "Smarter, efficient quantile generator" - learns a function to generate quantiles "on-demand" for any quantile fraction. |
| FQF (Fully Parameterized Quantile Function) | "Ultimate Distributional RL" - learns to model the entire CDF shape directly, adaptively choosing key quantiles for data-efficient, powerful representation. |
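To make the first row concrete, here is a minimal sketch of the tabular Q-learning "treat-value table" update. The function name and state/action indexing are illustrative assumptions, not code from this repo.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Q-learning step: nudge Q[s, a] toward the TD target
    r + gamma * max_a' Q[s_next, a']."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# A tiny 5-state, 2-action treat-value table
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)
```

Acting greedily (with some exploration) over this table is the "choose the action with the highest long-term treat value" part.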
Python 3.12
sudo apt install libsdl2-dev swig python3.12-tk
sudo apt install cmake zlib1g-dev libjpeg-dev libboost-all-dev gcc libsdl2-dev wget unzip
virtualenv -p python3.12 env && source env/bin/activate && pip install -r requirements.txt
git clone https://github.com/Jeetu95/Rocket_Lander_Gym.git
Change the CONTINUOUS variable in Rocket_Lander_Gym/rocket_lander_gym/envs/rocket_lander.py to False.
cd Rocket_Lander_Gym && pip install .
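If the install succeeds, a quick smoke test might look like the sketch below. The environment id `RocketLander-v0` is my assumption about how Rocket_Lander_Gym registers itself; check its registration if it differs.

```python
import gym
import rocket_lander_gym  # importing should register the env with gym (assumed)

env = gym.make('RocketLander-v0')  # assumed id; verify in Rocket_Lander_Gym
obs = env.reset()
for _ in range(10):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        break
env.close()
```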
https://github.com/google/jax
https://github.com/google/flax
These variables can vary; change them according to your machine specs.
PYTHON_VERSION=cp38 # alternatives: cp36, cp37, cp38
CUDA_VERSION=cuda101 # alternatives: cuda100, cuda101, cuda102, cuda110
PLATFORM=manylinux2010_x86_64 # alternatives: manylinux2010_x86_64
BASE_URL='https://storage.googleapis.com/jax-releases'
pip install --upgrade $BASE_URL/$CUDA_VERSION/jaxlib-0.1.51-$PYTHON_VERSION-none-$PLATFORM.whl
pip install --upgrade jax # install jax
pip install --upgrade flax
export XLA_PYTHON_CLIENT_ALLOCATOR=platform

https://github.com/joaogui1/RL-JAX/tree/master/DQN
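After installing, a quick sanity check (not part of the original steps) that the CUDA build of jaxlib is actually being used:

```python
import jax

print(jax.__version__)
# Should list GPU devices if the CUDA wheel was picked up; otherwise only CPU.
print(jax.devices())
```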