This project implements a Proximal Policy Optimization (PPO) agent using PyTorch for a platform environment featuring a parameterized action space (combining both discrete and continuous actions). The implementation is structured across multiple modules, including definitions for actor-critic networks, the PPO algorithm, utility functions, and training/testing scripts.
The flowchart below illustrates the overall structure and workflow of the Proximal Policy Optimization (PPO) agent implemented in this project:
- Python 3.6+
- PyTorch
- Gym
- Weights & Biases (wandb)
- PyYAML (for configuration management)
- Clone the repository:

  ```bash
  git clone https://github.com/clairebb1005/rl-challenge
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install the Platform environment: follow the instructions at gym-platform.
- Configure the training parameters in `config.yaml`.
- Run the training script:

  ```bash
  python train.py
  ```

- Run the testing script:

  ```bash
  python test.py --model_path "model/PPO_PlatformEnv.pth" --num_episodes 100 --render --plot
  ```
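The test flags above map naturally onto a standard `argparse` interface. The snippet below is a minimal sketch of such a CLI; the argument names mirror the command shown, while the defaults and help strings are assumptions rather than the repository's actual `test.py` code.

```python
import argparse

def parse_args():
    # Hypothetical CLI matching the flags in the command above.
    parser = argparse.ArgumentParser(description="Evaluate a trained PPO agent.")
    parser.add_argument("--model_path", type=str, default="model/PPO_PlatformEnv.pth",
                        help="Path to the saved actor-critic weights.")
    parser.add_argument("--num_episodes", type=int, default=100,
                        help="Number of evaluation episodes to run.")
    parser.add_argument("--render", action="store_true",
                        help="Render the environment while testing.")
    parser.add_argument("--plot", action="store_true",
                        help="Plot per-episode rewards and lengths afterwards.")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```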
Below is an overview of each module and its functionalities.
- `config.yaml`: Configuration file for setting up various parameters of the project.
- `train.py`: Main training script for the PPO agent.
- `test.py`: Script for testing the trained agent.
- `network.py`: Defines the neural network architecture for the actor and critic.
- `ppo.py`: Contains the `PPOAgent` class implementing the PPO algorithm (a sketch of the clipped objective follows below).
- `rollout.py`: Implements the `RolloutBuffer` for storing states, actions, and rewards.
- `utils.py`: Utility functions, including configuration loading and tensor transformations.
- `model/`: Stores the trained model generated by the PPO algorithm.
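To make the role of `ppo.py` concrete, here is a minimal sketch of the heart of a PPO update: the clipped surrogate objective combined with a value loss and an entropy bonus. It is a generic PyTorch illustration of the algorithm, not the repository's actual `PPOAgent` code; the tensor names and coefficient defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Generic PPO clipped-surrogate loss (illustrative, not the repo's exact code)."""
    # Probability ratio between the updated policy and the behaviour policy.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective: take the pessimistic minimum of the two terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value head regresses toward the empirical returns.
    value_loss = F.mse_loss(values, returns)

    # Entropy bonus encourages exploration.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```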
In the gym-platform environment, the agent navigates a platform world with the goal of reaching a destination while avoiding enemies and gaps. Episodes terminate upon the agent reaching the goal, encountering an enemy, or falling into a gap.
The state space for the agent consists of the following components:
- Agent's position.
- Agent's velocity.
- Enemy's position.
- Enemy's velocity.
- Additional features derived from the platform's layout and characteristics.
The action space is parameterized: the agent picks one of three discrete actions, each taking a continuous parameter dx (a sketch of a matching actor network follows the list below):
- run(dx): Move the agent a certain distance along the x-axis.
- hop(dx): Make the agent jump a short distance.
- leap(dx): Enable the agent to leap over gaps between platforms.
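A common way to handle such a parameterized action space is to give the actor two heads: a categorical distribution over the discrete actions and a Gaussian over the continuous parameter. The sketch below illustrates this pattern in PyTorch; the layer sizes and the state-independent log-std are assumptions, not necessarily how `network.py` is implemented.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class HybridActor(nn.Module):
    """Illustrative actor for a discrete action plus a continuous parameter."""

    def __init__(self, state_dim, num_actions=3, hidden_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        # Head for the discrete choice: run, hop, or leap.
        self.action_logits = nn.Linear(hidden_dim, num_actions)
        # Head for the continuous parameter dx, one mean per discrete action.
        self.param_mean = nn.Linear(hidden_dim, num_actions)
        self.param_log_std = nn.Parameter(torch.zeros(num_actions))

    def forward(self, state):
        h = self.backbone(state)
        action_dist = Categorical(logits=self.action_logits(h))
        param_dist = Normal(self.param_mean(h), self.param_log_std.exp())
        return action_dist, param_dist
```

At action-selection time, the discrete action is sampled first and the corresponding entry of the parameter distribution supplies dx; the joint log-probability is the sum of the discrete and continuous log-probabilities.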
The reward is dense, based on the distance traveled towards the goal, and the cumulative reward over an episode is normalized to 1.
During training, both the average reward and the average episode length were monitored, as depicted in the graphs below. The figures are based on 85,000 logging events recorded with wandb.
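Logging these quantities with wandb typically amounts to a few calls like the ones below. This is only a sketch of the pattern; the project name, metric keys, and placeholder statistics are assumptions rather than what `train.py` actually logs.

```python
import random
import wandb

# Hypothetical project/run names; the real values would come from config.yaml.
wandb.init(project="rl-challenge", name="ppo-platform")

for update in range(100):
    # Placeholder statistics standing in for the rollout buffer's real averages.
    avg_reward = random.random()
    avg_episode_length = random.randint(5, 50)
    wandb.log({
        "average_reward": avg_reward,
        "average_episode_length": avg_episode_length,
    }, step=update)
```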
For an in-depth evaluation of the trained PPO agent, the following metrics were employed over 200 test episodes:
- Average Reward: The agent achieved an average reward of 0.64, indicating its general performance level in the environment.
- Reward Variability: The standard deviation of the reward was measured at 0.28, providing insight into the consistency of the agent's performance.
- Success Rate: With a success rate of 0.24, the agent was able to meet the defined success criteria (such as reaching the final destination) in approximately 24% of the episodes.
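These statistics are easy to reproduce from per-episode returns. The snippet below is a small sketch of how they could be computed; the success threshold (treating a near-1.0 normalized cumulative reward as reaching the goal) is an assumption rather than the repository's exact criterion.

```python
import numpy as np

def summarize_returns(episode_returns, success_threshold=0.99):
    """Compute average reward, reward std, and success rate from per-episode returns.

    `success_threshold` is a hypothetical criterion: with the cumulative reward
    normalized to 1, an episode counts as a success when it (nearly) reaches it.
    """
    returns = np.asarray(episode_returns, dtype=np.float64)
    return returns.mean(), returns.std(), (returns >= success_threshold).mean()

# Example with dummy returns standing in for 200 test episodes.
rng = np.random.default_rng(0)
dummy_returns = rng.uniform(0.0, 1.0, size=200)
print(summarize_returns(dummy_returns))
```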
The following metrics (Episode Reward and Episode Length) collectively provide a holistic view of the agent's performance under test conditions.
Note: The episode length metric, while useful, has limitations. Specifically, in scenarios where the agent might repeatedly jump in the same location (especially at the last platform), this metric may not accurately reflect the agent's progress towards reaching the end of the platform.
Below is an example of a successful episode generated by the environment, showcasing the agent's ability to navigate the platform effectively.
For a more detailed understanding of the code and the architecture of the Proximal Policy Optimization (PPO) agent, please refer to the `code-structure.pdf` flowchart available in the root directory of this repository. This flowchart provides a visual representation of the various components and their interactions within the project.
Contributions, issues, and feature requests are welcome.