This project implements a Proximal Policy Optimization (PPO) agent using PyTorch for a platform environment featuring a parameterized action space (combining both discrete and continuous actions). The implementation is structured across multiple modules, including definitions for actor-critic networks, the PPO algorithm, utility functions, and training/testing scripts.
The flowchart below illustrates the overall structure and workflow of the Proximal Policy Optimization (PPO) agent implemented in this project:
- Python 3.6+
- PyTorch
- Gym
- Weights & Biases (wandb)
- PyYAML (for configuration management)
- Clone the repository:

  ```bash
  git clone https://github.com/clairebb1005/rl-challenge
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install the Platform environment: follow the instructions at gym-platform.
- Configure the training parameters in `config.yaml`.
- Run the training script:

  ```bash
  python train.py
  ```

- Run the testing script:

  ```bash
  python test.py --model_path "model/PPO_PlatformEnv.pth" --num_episodes 100 --render --plot
  ```
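The test flags above map naturally onto a standard `argparse` interface. The snippet below is a minimal sketch of such a CLI; the argument names mirror the command shown, while the defaults and help strings are assumptions rather than the repository's actual `test.py` code.

```python
import argparse

def parse_args():
    # Hypothetical CLI matching the flags in the command above.
    parser = argparse.ArgumentParser(description="Evaluate a trained PPO agent.")
    parser.add_argument("--model_path", type=str, default="model/PPO_PlatformEnv.pth",
                        help="Path to the saved actor-critic weights.")
    parser.add_argument("--num_episodes", type=int, default=100,
                        help="Number of evaluation episodes to run.")
    parser.add_argument("--render", action="store_true",
                        help="Render the environment while testing.")
    parser.add_argument("--plot", action="store_true",
                        help="Plot per-episode rewards and lengths afterwards.")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```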
Below is an overview of each module and its functionalities.
- `config.yaml`: Configuration file for setting up various parameters of the project.
- `train.py`: Main training script for the PPO agent.
- `test.py`: Script for testing the trained agent.
- `network.py`: Defines the neural network architecture for the actor and critic.
- `ppo.py`: Contains the `PPOAgent` class implementing the PPO algorithm (a sketch of the clipped objective follows below).
- `rollout.py`: Implements the `RolloutBuffer` for storing states, actions, and rewards.
- `utils.py`: Utility functions, including configuration loading and tensor transformations.
- `model/`: Stores the trained model generated by the PPO algorithm.
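To make the role of `ppo.py` concrete, here is a minimal sketch of the heart of a PPO update: the clipped surrogate objective combined with a value loss and an entropy bonus. It is a generic PyTorch illustration of the algorithm, not the repository's actual `PPOAgent` code; the tensor names and coefficient defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Generic PPO clipped-surrogate loss (illustrative, not the repo's exact code)."""
    # Probability ratio between the updated policy and the behaviour policy.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective: take the pessimistic minimum of the two terms.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value head regresses toward the empirical returns.
    value_loss = F.mse_loss(values, returns)

    # Entropy bonus encourages exploration.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```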
In the gym-platform environment, the agent navigates a platform world with the goal of reaching a destination while avoiding enemies and gaps. Episodes terminate upon the agent reaching the goal, encountering an enemy, or falling into a gap.
The state space for the agent consists of the following components:
- Agent's position.
- Agent's velocity.
- Enemy's position.
- Enemy's velocity.
- Additional features derived from the platform's layout and characteristics.
The action space is parameterized: the agent picks one of three discrete actions, each taking a continuous parameter dx (a sketch of a matching actor network follows the list below):
- run(dx): Move the agent a certain distance along the x-axis.
- hop(dx): Make the agent jump a short distance.
- leap(dx): Enable the agent to leap over gaps between platforms.
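A common way to handle such a parameterized action space is to give the actor two heads: a categorical distribution over the discrete actions and a Gaussian over the continuous parameter. The sketch below illustrates this pattern in PyTorch; the layer sizes and the state-independent log-std are assumptions, not necessarily how `network.py` is implemented.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class HybridActor(nn.Module):
    """Illustrative actor for a discrete action plus a continuous parameter."""

    def __init__(self, state_dim, num_actions=3, hidden_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        # Head for the discrete choice: run, hop, or leap.
        self.action_logits = nn.Linear(hidden_dim, num_actions)
        # Head for the continuous parameter dx, one mean per discrete action.
        self.param_mean = nn.Linear(hidden_dim, num_actions)
        self.param_log_std = nn.Parameter(torch.zeros(num_actions))

    def forward(self, state):
        h = self.backbone(state)
        action_dist = Categorical(logits=self.action_logits(h))
        param_dist = Normal(self.param_mean(h), self.param_log_std.exp())
        return action_dist, param_dist
```

At action-selection time, the discrete action is sampled first and the corresponding entry of the parameter distribution supplies dx; the joint log-probability is the sum of the discrete and continuous log-probabilities.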
The reward is dense, based on the distance traveled towards the goal, and the cumulative reward over an episode is normalized to 1.
During training, both the average reward and the average episode length were monitored, as depicted in the graphs below. The figures are based on 85,000 logging events recorded with wandb.
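Logging these quantities with wandb typically amounts to a few calls like the ones below. This is only a sketch of the pattern; the project name, metric keys, and placeholder statistics are assumptions rather than what `train.py` actually logs.

```python
import random
import wandb

# Hypothetical project/run names; the real values would come from config.yaml.
wandb.init(project="rl-challenge", name="ppo-platform")

for update in range(100):
    # Placeholder statistics standing in for the rollout buffer's real averages.
    avg_reward = random.random()
    avg_episode_length = random.randint(5, 50)
    wandb.log({
        "average_reward": avg_reward,
        "average_episode_length": avg_episode_length,
    }, step=update)
```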
For an in-depth evaluation of the trained PPO agent, the following metrics were employed over 200 test episodes:
- Average Reward: The agent achieved an average reward of 0.64, indicating its general performance level in the environment.
- Reward Variability: The standard deviation of the reward was measured at 0.28, providing insight into the consistency of the agent's performance.
- Success Rate: With a success rate of 0.24, the agent was able to meet the defined success criteria (such as reaching the final destination) in approximately 24% of the episodes.
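These statistics are easy to reproduce from per-episode returns. The snippet below is a small sketch of how they could be computed; the success threshold (treating a near-1.0 normalized cumulative reward as reaching the goal) is an assumption rather than the repository's exact criterion.

```python
import numpy as np

def summarize_returns(episode_returns, success_threshold=0.99):
    """Compute average reward, reward std, and success rate from per-episode returns.

    `success_threshold` is a hypothetical criterion: with the cumulative reward
    normalized to 1, an episode counts as a success when it (nearly) reaches it.
    """
    returns = np.asarray(episode_returns, dtype=np.float64)
    return returns.mean(), returns.std(), (returns >= success_threshold).mean()

# Example with dummy returns standing in for 200 test episodes.
rng = np.random.default_rng(0)
dummy_returns = rng.uniform(0.0, 1.0, size=200)
print(summarize_returns(dummy_returns))
```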
The following metrics (Episode Reward and Episode Length) collectively provide a holistic view of the agent's performance under test conditions.
Note: The episode length metric, while useful, has limitations. Specifically, in scenarios where the agent might repeatedly jump in the same location (especially at the last platform), this metric may not accurately reflect the agent's progress towards reaching the end of the platform.
Below is an example of a successful episode generated by the environment, showcasing the agent's ability to navigate the platform effectively.
For a more detailed understanding of the code and the architecture of the Proximal Policy Optimization (PPO) agent, please refer to the `code-structure.pdf` flowchart available in the root directory of this repository. This flowchart provides a visual representation of the various components and their interactions within the project.
Contributions, issues, and feature requests are welcome.