Proximal Policy Optimization (PPO) for Platform Environment

Overview

This project implements a Proximal Policy Optimization (PPO) agent using PyTorch for a platform environment featuring a parameterized action space (combining both discrete and continuous actions). The implementation is structured across multiple modules, including definitions for actor-critic networks, the PPO algorithm, utility functions, and training/testing scripts.
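At the core of the algorithm is PPO's clipped surrogate objective. The function below is a minimal, illustrative sketch of that objective (it is not taken from ppo.py; the tensor names and the clip value of 0.2 are assumptions):

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Clipped surrogate objective from PPO, expressed as a loss to minimize.
    # new_log_probs / old_log_probs: log-probabilities of the taken actions under
    # the current and the behaviour policy; advantages: advantage estimates.
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negate for gradient descent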

PPO Agent Structure

The flowchart below illustrates the overall structure and workflow of the Proximal Policy Optimization (PPO) agent implemented in this project:

Installation

Prerequisites

  • Python 3.6+
  • PyTorch
  • Gym
  • Weights & Biases (wandb)
  • PyYAML (for configuration management)

Steps

  1. Clone the repository: git clone https://github.com/clairebb1005/rl-challenge

  2. Install dependencies: pip install -r requirements.txt

  3. Install the Platform environment: Follow the instructions at gym-platform.

Usage

Training PPO agent

  • Configure the training parameters in config.yaml (a loading sketch follows this list).
  • Run the training script: python train.py
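As a rough sketch of how the YAML configuration might be consumed in train.py (the keys shown are hypothetical, not necessarily those defined in config.yaml):

import yaml

def load_config(path="config.yaml"):
    # Load training hyperparameters from a YAML file using PyYAML.
    with open(path, "r") as f:
        return yaml.safe_load(f)

config = load_config()
# Hypothetical keys for illustration only; see config.yaml for the actual ones.
learning_rate = config.get("learning_rate", 3e-4)
max_episodes = config.get("max_episodes", 50000)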

Testing a trained PPO agent

  • Run the testing script: python test.py --model_path "model/PPO_PlatformEnv.pth" --num_episodes 100 --render --plot
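The flags above suggest an argument parser roughly along these lines (a sketch inferred from the command, not the actual contents of test.py):

import argparse

parser = argparse.ArgumentParser(description="Evaluate a trained PPO agent.")
parser.add_argument("--model_path", type=str, default="model/PPO_PlatformEnv.pth",
                    help="Path to the saved model checkpoint.")
parser.add_argument("--num_episodes", type=int, default=100,
                    help="Number of evaluation episodes to run.")
parser.add_argument("--render", action="store_true", help="Render the environment.")
parser.add_argument("--plot", action="store_true", help="Plot evaluation results.")
args = parser.parse_args()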

Modules

Below is an overview of each module and its functionality.

Root Directory

  • config.yaml: Configuration file for setting up various parameters of the project.
  • train.py: Main training script for the PPO agent.
  • test.py: Script for testing the trained agent.

src/ Directory

  • network.py: Defines the neural network architecture for the actor and critic (a simplified sketch follows this list).
  • ppo.py: Contains the PPOAgent class implementing the PPO algorithm.
  • rollout.py: Implements the RolloutBuffer for storing states, actions, and rewards.
  • utils.py: Utility functions including configuration loading and tensor transformations.
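To give a sense of what an actor-critic for a parameterized action space can look like, here is a simplified sketch (the layer sizes, state dimension, and head layout are assumptions and do not reproduce the exact architecture in network.py):

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    # Shared-body actor-critic: a discrete head selects the action type,
    # a continuous head produces its parameter, and a critic head estimates V(s).
    def __init__(self, state_dim=9, num_actions=3, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.action_logits = nn.Linear(hidden, num_actions)  # run / hop / leap
        self.param_mean = nn.Linear(hidden, num_actions)     # one dx mean per action
        self.param_log_std = nn.Parameter(torch.zeros(num_actions))
        self.value = nn.Linear(hidden, 1)                     # state-value estimate

    def forward(self, state):
        h = self.body(state)
        return (self.action_logits(h), self.param_mean(h),
                self.param_log_std.exp(), self.value(h))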

model/ Directory

Stores the trained model generated by the PPO algorithm.

Platform Environment

In the gym-platform environment, the agent navigates a platform world with the goal of reaching a destination while avoiding enemies and gaps. Episodes terminate upon the agent reaching the goal, encountering an enemy, or falling into a gap.

State Space

The state space for the agent consists of the following components:

  • Agent's position.
  • Agent's velocity.
  • Enemy's position.
  • Enemy's velocity.
  • Additional features derived from the platform's layout and characteristics.

Action Space

The action space consists of a discrete action and its continuous parameter (a sampling sketch follows the list below):

  • run(dx): Move the agent a certain distance along the x-axis.
  • hop(dx): Make the agent jump a short distance.
  • leap(dx): Enable the agent to leap over gaps between platforms.
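In practice the policy first samples which discrete action to take and then samples the continuous parameter for that action. A rough sketch of this two-step sampling (the distribution choices are assumptions, and the exact action format expected by gym-platform may differ):

import torch
from torch.distributions import Categorical, Normal

def sample_action(action_logits, param_mean, param_std):
    # Inputs are 1-D tensors of length num_actions for a single (unbatched) state.
    action_dist = Categorical(logits=action_logits)
    action = action_dist.sample()                       # 0: run, 1: hop, 2: leap
    param_dist = Normal(param_mean[action], param_std[action])
    dx = param_dist.sample()                            # continuous parameter for that action
    log_prob = action_dist.log_prob(action) + param_dist.log_prob(dx)
    return action.item(), dx.item(), log_prob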

Reward

A dense reward is given based on the distance traveled towards the goal, and the cumulative reward per episode is normalized to 1.

Results

Training - Average Reward and Episode Length

  • During training, both the average reward and the average episode length were monitored, as depicted in the graphs below. The figures are based on 85,000 logging events recorded with wandb.

Evaluation Strategy

For an in-depth evaluation of the trained PPO agent, the following metrics were computed over 200 test episodes (a computation sketch follows the list):

  • Average Reward: The agent achieved an average reward of 0.64, indicating its general performance level in the environment.
  • Reward Variability: The standard deviation of the reward was measured at 0.28, providing insight into the consistency of the agent's performance.
  • Success Rate: With a success rate of 0.24, the agent was able to meet the defined success criteria (such as reaching the final destination) in approximately 24% of the episodes.
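These statistics can be reproduced from per-episode results roughly as follows (a sketch; how an episode is marked as successful is an assumption):

import numpy as np

def summarize(episode_rewards, episode_successes):
    # Aggregate test-episode results into the reported metrics.
    rewards = np.asarray(episode_rewards, dtype=np.float64)
    return {
        "average_reward": rewards.mean(),
        "reward_std": rewards.std(),
        "success_rate": float(np.mean(episode_successes)),  # fraction of episodes reaching the goal
    }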

The following metrics (Episode Reward and Episode Length) collectively provide a holistic view of the agent's performance under test conditions.

Note: The episode length metric, while useful, has limitations. Specifically, in scenarios where the agent might repeatedly jump in the same location (especially at the last platform), this metric may not accurately reflect the agent's progress towards reaching the end of the platform.

Snapshot

Below is an example of a successful episode generated by the environment, showcasing the agent's ability to navigate the platform effectively.

Additional Resources

Code Structure Flowchart

For a more detailed understanding of the code and the architecture of the Proximal Policy Optimization (PPO) agent, please refer to the code-structure.pdf flowchart available in the root directory of this repository. This flowchart provides a visual representation of the various components and their interactions within the project.

Contributing

Contributions, issues, and feature requests are welcome.

License

MIT License.
