Swinging Up Acrobot with n-Step Q-Learning

The Acrobot is a robotic arm with two links suspended vertically against gravity. It is an underactuated robot: torque can only be exerted at its elbow. Our goal is to raise its last link above a specified height, indicated by a horizontal line. To fulfill this objective, we can use the n-step Q-learning algorithm, a member of the TD(n) family. TD(n) is a multi-step extension of TD learning (e.g., Q-learning). In the context of the Acrobot, n-step Q-learning learns to select optimal actions (applying torque at the elbow) based on the current state (joint angles and angular velocities) and the expected future rewards. The reward function can be designed to give positive rewards for reaching the target height and penalties for inefficient movements or exceeding the time limit. Instead of updating the Q-value from just the immediate reward and the next state's Q-value (as in TD(0), i.e., standard TD learning), TD(n) uses the rewards collected over the next n steps plus the discounted Q-value at the n-th step. This multi-step approach allows for better credit assignment over longer horizons, potentially speeding up learning.
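
Concretely, the n-step target bootstraps from the greedy Q-value n steps ahead:

$$
G_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \max_a Q(s_{t+n}, a),
\qquad
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( G_t^{(n)} - Q(s_t, a_t) \right).
$$

The sketch below illustrates this update on a tabular Q-function over a discretized Acrobot state. It is a minimal illustration, not the repository's actual implementation; the hyperparameter values, the `discretize` helper, and its bin bounds are all assumptions.

```python
# Minimal n-step Q-learning sketch for Acrobot-v1 (tabular, discretized state).
from collections import defaultdict, deque

import gymnasium as gym
import numpy as np

N_STEPS = 4    # n in n-step Q-learning (assumed)
GAMMA = 0.99   # discount factor (assumed)
ALPHA = 0.1    # learning rate (assumed)
EPSILON = 0.1  # epsilon-greedy exploration rate (assumed)

env = gym.make("Acrobot-v1")
n_actions = env.action_space.n  # 3 discrete torques at the elbow: -1, 0, +1

# Acrobot-v1 observations: [cos t1, sin t1, cos t2, sin t2, w1, w2],
# with |w1| <= 4*pi and |w2| <= 9*pi.
LOW = np.array([-1.0, -1.0, -1.0, -1.0, -4 * np.pi, -9 * np.pi])
HIGH = -LOW

def discretize(obs, bins=10):
    """Bucket each observation dimension into `bins` intervals (assumed scheme)."""
    ratios = (np.clip(obs, LOW, HIGH) - LOW) / (HIGH - LOW)
    return tuple((ratios * (bins - 1)).astype(int))

Q = defaultdict(lambda: np.zeros(n_actions))

# One training episode shown; real training repeats this over many episodes.
obs, _ = env.reset(seed=0)
state = discretize(obs)
buffer = deque()  # the last up-to-n (state, action, reward) transitions
done = False

while not done:
    # Epsilon-greedy action selection.
    if np.random.rand() < EPSILON:
        action = env.action_space.sample()
    else:
        action = int(np.argmax(Q[state]))

    obs, reward, terminated, truncated, _ = env.step(action)
    next_state = discretize(obs)
    done = terminated or truncated
    buffer.append((state, action, reward))

    # Once n transitions are buffered, update the oldest one with the n-step
    # return: n discounted rewards plus the bootstrapped greedy Q-value at
    # the n-th successor state.
    if len(buffer) == N_STEPS:
        G = sum(GAMMA**i * r for i, (_, _, r) in enumerate(buffer))
        if not terminated:
            G += GAMMA**N_STEPS * np.max(Q[next_state])
        s0, a0, _ = buffer.popleft()
        Q[s0][a0] += ALPHA * (G - Q[s0][a0])

    state = next_state

# Flush the episode's tail with truncated (shorter-than-n) returns,
# bootstrapping only if the episode was cut off by the time limit.
while buffer:
    G = sum(GAMMA**i * r for i, (_, _, r) in enumerate(buffer))
    if not terminated:
        G += GAMMA ** len(buffer) * np.max(Q[next_state])
    s0, a0, _ = buffer.popleft()
    Q[s0][a0] += ALPHA * (G - Q[s0][a0])
```

Note that with n = 1 this reduces to ordinary Q-learning; larger n propagates reward information further back per update at the cost of higher-variance targets.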

Experiment

A simple experiment has been conducted to showcase the Acrobot's movement under the n-step Q-learning policy. It is provided in this notebook.

Result

Reward Curve

reward_curve
A very noisy reward curve over the course of 10,201 episodes of the Acrobot's training session.

Qualitative Result

The GIF below displays the movement of the Acrobot following the policy learned via n-step Q-learning.

qualitative_acrobot
The Acrobot's main challenge: raising the last link above the horizontal threshold line as quickly as possible.
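
A greedy rollout like the one in the GIF could be recorded with a sketch such as the following, reusing `Q` and `discretize` from the training sketch above. The use of imageio here is an assumption; the repository may produce its GIF differently.

```python
# Minimal sketch: record a greedy rollout of the learned policy as a GIF.
import gymnasium as gym
import imageio
import numpy as np

env = gym.make("Acrobot-v1", render_mode="rgb_array")
obs, _ = env.reset(seed=0)
frames, done = [], False

while not done:
    action = int(np.argmax(Q[discretize(obs)]))  # greedy w.r.t. the learned Q-table
    obs, _, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    frames.append(env.render())  # RGB frame as a NumPy array

imageio.mimsave("acrobot.gif", frames, fps=30)
```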

Credit