This work was developed as a Project Work for the Autonomous Agents and Intelligent Robotics course, taught by Professor Giorgio Battistelli, as part of the Master's Degree in Artificial Intelligence at the University of Florence, Italy.
The main objective is to evaluate and compare different Reinforcement Learning algorithms for solving an Autonomous Platoon Control problem by partially reproducing, on a smaller scale, the experimental results obtained in the following reference paper:
Autonomous Platoon Control with Integrated Deep Reinforcement Learning and Dynamic Programming, Tong Liu, Lei Lei, Kan Zheng, Kuan Zhang; 2022.
Autonomous Platoon Control is a key task for the future of intelligent transportation systems. By automatically coordinating the vehicles of a platoon, it is possible to optimize traffic flow, reduce fuel consumption, and improve road safety. The main challenge is maintaining the desired distance between the queued vehicles while they adapt to the leader's speed variations.
The setup of this problem follows exactly the one implemented in the reference paper, with the only simplification being the presence of a single agent vehicle and a single preceding vehicle, the leader. All vehicles follow first-order dynamics:

$$\dot{p}_i(t) = v_i(t), \qquad \dot{v}_i(t) = acc_i(t), \qquad \dot{acc}_i(t) = -\frac{1}{\tau_i}\,acc_i(t) + \frac{1}{\tau_i}\,u_i(t)$$

where $p_i$, $v_i$ and $acc_i$ are the position, velocity and acceleration of vehicle $i$, $u_i$ is its control input (the commanded acceleration), and $\tau_i$ is the time constant modeling the driveline dynamics.
To prevent divergences and unrealistic acceleration spikes that could compromise the agent's training, constraints are imposed on the agent's acceleration and action, which are clipped to the intervals $[acc_{min}, acc_{max}]$ and $[u_{min}, u_{max}]$ respectively.
The success of the platoon control task strongly depends on maintaining the correct distance between vehicles. In the reference paper, the headway is defined as the bumper-to-bumper distance between two consecutive vehicles:

$$d_i(t) = p_{i-1}(t) - p_i(t) - L_{i-1}$$

where $p_{i-1}$ and $p_i$ are the positions of the preceding vehicle and of vehicle $i$, and $L_{i-1}$ is the length of the preceding vehicle.
At any time instant $t$, the desired headway is defined by a constant time-gap spacing policy:

$$d_{r,i}(t) = r_i + h_i\,v_i(t)$$

where $r_i$ is the standstill distance and $h_i$ is the constant time gap of vehicle $i$.
Optimal platoon control is achieved when each vehicle adjusts its motion so as to maintain the desired distance from the preceding vehicle over time. Platoon control can therefore be turned into a minimization problem over two error signals: the gap-keeping error $e_{p,i}(t) = d_i(t) - d_{r,i}(t)$, which measures how far the vehicle is from the desired headway, and the velocity error $e_{v,i}(t) = v_{i-1}(t) - v_i(t)$, which measures how well it matches the speed of the preceding vehicle, ensuring that the desired distance is not only reached but also maintained over time.
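As a concrete illustration, the following is a minimal Python sketch of how these quantities can be computed; the function name `compute_errors` and the numeric values for the standstill distance, time gap and vehicle length are illustrative assumptions, not taken from the reference implementation.

```python
def compute_errors(p_prev, v_prev, L_prev, p_i, v_i, r=2.0, h=1.0):
    """Gap-keeping and velocity errors for a follower (r, h are illustrative values)."""
    d = p_prev - p_i - L_prev   # bumper-to-bumper headway
    d_des = r + h * v_i         # desired headway (constant time-gap policy)
    e_p = d - d_des             # gap-keeping (position) error
    e_v = v_prev - v_i          # velocity error w.r.t. the preceding vehicle
    return e_p, e_v

# Example: leader 25 m ahead and 4.5 m long, both vehicles driving at 20 m/s
print(compute_errors(p_prev=25.0, v_prev=20.0, L_prev=4.5, p_i=0.0, v_i=20.0))
```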
The state space consists, at each timestep $k$, of three components: the gap-keeping error, the velocity error and the agent's current acceleration, i.e. $x_i(k) = [e_{p,i}(k),\, e_{v,i}(k),\, acc_i(k)]$.
The action space consists of a single value: the control input $u_i(k)$, bounded to the interval $[u_{min}, u_{max}]$.
The system evolves according to two distinct discrete dynamic models for the leader and the follower, obtained by discretizing the continuous dynamics with sampling period $T$ (forward Euler):

Leader:

$$p_0(k+1) = p_0(k) + T\,v_0(k), \qquad v_0(k+1) = v_0(k) + T\,acc_0(k), \qquad acc_0(k+1) = \left(1 - \tfrac{T}{\tau_0}\right) acc_0(k) + \tfrac{T}{\tau_0}\,u_0(k)$$

Follower $i$:

$$e_{p,i}(k+1) = e_{p,i}(k) + T\,e_{v,i}(k) - h_i\,T\,acc_i(k)$$
$$e_{v,i}(k+1) = e_{v,i}(k) + T\,acc_{i-1}(k) - T\,acc_i(k)$$
$$acc_i(k+1) = \left(1 - \tfrac{T}{\tau_i}\right) acc_i(k) + \tfrac{T}{\tau_i}\,u_i(k)$$
For the leader, the evolution depends only on its current state and control input. For the follower, however, the evolution depends on its own state, its own control input, and the acceleration of the preceding vehicle. This dependence on the predecessor's acceleration allows the follower to anticipate speed variations of the preceding vehicle, thus making the system more stable.
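As a minimal sketch of what one simulation step for the follower looks like under the discretization above, assuming placeholder values for the sampling period, driveline time constant, time gap and clipping bounds (not the ones used in the experiments):

```python
def follower_step(e_p, e_v, acc_i, u_i, acc_prev,
                  T=0.1, tau=0.5, h=1.0,
                  acc_min=-3.0, acc_max=3.0, u_min=-3.0, u_max=3.0):
    """One forward-Euler step of the follower state [e_p, e_v, acc_i]."""
    u_i = max(u_min, min(u_max, u_i))                     # clip the control input
    e_p_next = e_p + T * e_v - h * T * acc_i              # gap-keeping error update
    e_v_next = e_v + T * (acc_prev - acc_i)               # velocity error update
    acc_next = (1.0 - T / tau) * acc_i + (T / tau) * u_i  # first-order acceleration lag
    acc_next = max(acc_min, min(acc_max, acc_next))       # clip the acceleration
    return e_p_next, e_v_next, acc_next
```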
A Huber-like reward function is used: it penalizes the gap-keeping error, the velocity error, the control input and the jerk, growing quadratically for small values and linearly for large ones, so that large deviations do not produce disproportionately large penalties during training (a sketch of one possible implementation is shown after the list below).

The parameters $a$, $b$ and $c$ weight the different terms of the reward:
- $a$ balances the importance of the velocity error relative to the position error
- $b$ penalizes overly aggressive control inputs, promoting smoother behavior
- $c$ penalizes sudden acceleration changes (jerk), contributing to driving comfort
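As an illustration only, here is one way such a Huber-like penalty could be implemented; the exact functional form and normalization used in the reference paper may differ, and the threshold `delta` and the default weights are placeholder values.

```python
def huber(x, delta=1.0):
    """Quadratic near zero, linear for large |x| (standard Huber penalty)."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)

def reward(e_p, e_v, u, jerk, a=0.1, b=0.1, c=0.1):
    """Negative weighted sum of Huber penalties on the four terms."""
    return -(huber(e_p) + a * huber(e_v) + b * huber(u) + c * huber(jerk))
```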
Given the expected cumulative reward $\mathbb{E}\left[\sum_{k=0}^{K-1} \gamma^{k} r_k\right]$, where $\gamma$ is the discount factor and $K$ the episode length, the agent's objective is to learn a policy that maximizes this return.
The reference paper proposes an integrated approach combining Deep Reinforcement Learning and Dynamic Programming, using an algorithm called FH-DDPG-SS. This method is based on DDPG (Deep Deterministic Policy Gradient) and is designed to handle a complex multi-agent system with multiple vehicles in platoon.
In this work, we implemented a simplified single-agent environment, where the agent must adjust its dynamics to match those of a single leading vehicle, whose trajectory is set in advance. The scenario therefore contains only two vehicles, one of which is the agent itself. The agent is trained using two different Q-Learning algorithms, whose performance is compared:
- Tabular Q-Learning: this is the most "classical" approach to Reinforcement Learning, in which the Q-function is represented explicitly as a table. Whereas DQL works on a continuous state space, in Tabular Q-Learning both the state space and the action space are uniformly quantized. The Q-Table, which stores a value for every possible state-action pair, is initialized with random values in the range [-0.1, 0.1] and is updated with the Q-Learning rule, derived from the Bellman equation (a sketch of this update is shown after this list):

  $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

  where $Q(s_t, a_t)$ is the current Q-value for the state-action pair, $\alpha$ is the learning rate, $r_t$ is the immediate reward, $\gamma$ is the discount factor, $\max_{a} Q(s_{t+1}, a)$ is the maximum Q-value attainable in the next state, and the term in square brackets is the TD error.
- Deep Q-Learning (DQL): Deep Q-Learning extends classic Q-Learning by using a deep neural network to approximate the Q-function, which makes it possible to work with a continuous state space. The implementation for this problem includes a uniform quantization of the action space over the interval $[u_{min}, u_{max}]$; an Experience Replay Buffer to store and sample state transitions; a Target Network, updated periodically, to stabilize the learning targets; and an ε-greedy policy to balance exploration and exploitation.
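A minimal NumPy sketch of the tabular update with an ε-greedy policy, as referenced above; the table sizes after quantization and the hyperparameter values are illustrative, not the ones actually used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 1000, 11                          # illustrative sizes after quantization
Q = rng.uniform(-0.1, 0.1, size=(n_states, n_actions))  # random init in [-0.1, 0.1]

def select_action(state_idx, eps=0.1):
    """ε-greedy action selection over the quantized action set."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state_idx]))

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-Learning step: move Q(s, a) toward the TD target."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
```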
The implementation was developed in Python using PyTorch for DQL and NumPy for tabular Q-Learning. Training was monitored through Weights & Biases.
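For reference, a compact PyTorch sketch of the DQL ingredients listed above (Q-network, target network, replay buffer and ε-greedy action selection); the network architecture, the 3-dimensional state, the number of quantized actions and all hyperparameters are assumptions made for illustration and do not reproduce the project's actual configuration.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 3, 11        # assumed: state [e_p, e_v, acc], 11 quantized actions

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, x):
        return self.net(x)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())   # the target network starts as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)                   # replay buffer of (s, a, r, s', done) tuples

def act(state, eps):
    """ε-greedy policy over the quantized actions."""
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

def train_step(batch_size=64, gamma=0.99):
    """One DQN update on a mini-batch sampled from the replay buffer."""
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(1).values * (1.0 - done)
    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Every fixed number of steps the target network would be re-synchronized with the online network, e.g. `target_net.load_state_dict(q_net.state_dict())`.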
Explanation of the attributes and methods used to create the environment, comparing them with the implementation described in the reference paper. Brief explanation of the additions made in this work, mainly related to the ability to properly visualize the episodes, treating each vehicle no longer as a point but as a solid object.
Brief explanation of how Panda3D is used to visualize each episode, together with a short guide on how to use and interpret the visualization.
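Since the visualization code is not reported here, the following is only a minimal Panda3D sketch of the idea (rendering each vehicle as a solid box and updating its position every frame); the model path, the box dimensions and the update logic are illustrative assumptions.

```python
from direct.showbase.ShowBase import ShowBase
from panda3d.core import Vec3

class PlatoonViewer(ShowBase):
    def __init__(self, trajectory):
        """trajectory: list of (leader_x, follower_x) positions, one pair per timestep."""
        super().__init__()
        self.trajectory, self.step = trajectory, 0
        # Represent each vehicle as a scaled box instead of a point.
        self.leader = self.loader.loadModel("models/box")
        self.follower = self.loader.loadModel("models/box")
        for node in (self.leader, self.follower):
            node.setScale(4.5, 1.8, 1.5)   # roughly car-sized solid
            node.reparentTo(self.render)
        self.taskMgr.add(self.update, "update-positions")

    def update(self, task):
        """Advance the replay by one timestep per rendered frame."""
        if self.step >= len(self.trajectory):
            return task.done
        leader_x, follower_x = self.trajectory[self.step]
        self.leader.setPos(Vec3(leader_x, 0, 0))
        self.follower.setPos(Vec3(follower_x, 0, 0))
        self.step += 1
        return task.cont
```

A recorded episode would then be replayed with `PlatoonViewer(trajectory).run()`.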
Rendering of episodes using Panda3D
Hardware and software setup; hyperparameters and their ranges. Explanation of the logic adopted to collect the results and run the tests.
Sequence of plots with the corresponding explanation.
| Method | Mean Episode Reward |
|---|---|
| DDPG | -0.0680 |
| FH-DDPG | -0.0736 |
| HCFS | -0.0673 |
| FH-DDPG-SS | -0.0600 |
| Tabular QL | ? |
| Deep QL | -0.1639 |
A general recap of the project, mentioning the results obtained by each method.