python pytorch sumo Code style: black Ruff License: MIT

This repository applies reinforcement learning to the control of traffic light systems. The code is abstracted so that it can be applied to different scenarios, and a real-life use case is provided for illustration.
Toolkit-wise, stable-baselines3 is used in conjunction with the Simulation of Urban MObility (SUMO) software to learn on multiple traffic simulations in parallel. Key highlights of this implementation include:

  • PyTorch as backend.
  • Vectorized environments.
  • Frame-stacking.
  • Curriculum learning.
  • Custom conv3d feature extractor.
  • Playable setup for obtaining human baselines.
  • Designed to be reproducible on other SUMO networks.
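To make the frame-stacking highlight concrete, here is a minimal, library-free sketch of the idea: the agent sees the last few observations instead of only the latest one, which gives it temporal context. The `FrameStack` class and its padding rule (repeating the first observation on reset) are assumptions for illustration and are not taken from this repository; stable-baselines3 also ships a `VecFrameStack` wrapper for vectorized environments.

```python
from collections import deque


class FrameStack:
    """Keep the last n_frames observations, oldest first (hypothetical helper)."""

    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_obs):
        # Assumed padding rule: fill the stack with copies of the first observation.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return list(self.frames)

    def step(self, obs):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(obs)
        return list(self.frames)


stack = FrameStack(n_frames=3)
stacked = stack.reset("obs_0")
stacked = stack.step("obs_1")
```

With matrix observations, the stacked frames form the depth axis that a 3D convolution can slide over.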

(A legacy keras + tensorflow implementation is still available in the aptly named branch.)


  1. Install sumo software from
  2. Run conda install -f environments/

Quickstart for testing the provided use-case

The traffic lights at a 4-way intersection are controlled by a PPO model. The origins and destinations of the cars, which define the general simulation, are randomized every episode (though they are fixed for the final evaluation runs).
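The difference between randomized training episodes and fixed evaluation runs can be sketched with a seeded random generator. This is only an illustration: the function `sample_trips` and the edge ids are hypothetical, and the repository's actual route generation may work differently.

```python
import random


def sample_trips(sources, destinations, n_trips, seed=None):
    """Draw (origin, destination) pairs; passing a seed reproduces the same draw."""
    rng = random.Random(seed)
    return [(rng.choice(sources), rng.choice(destinations)) for _ in range(n_trips)]


# Hypothetical edge ids for a 4-way intersection.
SOURCES = ["north_in", "east_in", "south_in", "west_in"]
DESTINATIONS = ["north_out", "east_out", "south_out", "west_out"]

# Training: no seed, so every episode gets a new traffic pattern.
training_trips = sample_trips(SOURCES, DESTINATIONS, n_trips=5)

# Evaluation: a fixed seed, so every policy is scored on the same traffic.
eval_trips = sample_trips(SOURCES, DESTINATIONS, n_trips=5, seed=0)
```

Fixing the seed for evaluation keeps comparisons between the RL model, the fixed policies, and human play on equal footing.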

The following snapshots illustrate the parameters pertaining to the road network.

To test the model, simply run python -m scripts.rl.test. You can also try to beat it yourself by running python -m scripts.baseline.human.

The final model acting on the simulation is shown below, with the best-performing fixed policy as a reference:

RL Example Fixed Policy Example

The results from the different policies are shown below:

If you wish to retrain or explore the training process, check out scripts/rl/

A note on the agent design

In terms of general model improvement decisions, these were the most prominent:

  • Baseline mlp with multi-input spaces.Dict observations:
    • (1, n_actions) for the phase observation vs.
    • (1, n_obs, n_obs) for the speed, position and wait matrices.
  • Dropping the position matrix in favor of encoding vehicle absence in the speed and wait matrices (absent vehicles as -1, regular values in the range [0, 1]).
  • The inclusion of the accel matrix for a richer representation.
  • Changing phase encoding to (1, n_obs, n_obs) instead of (1, n_actions).
  • Introduction of a weighted (w2) unshaped long-term reward, balanced against the weighted (w1) shaped myopic reward.
  • Transition from the fixed w1/w2 balance above to a curriculum approach for faster convergence.
  • Multi-input cnn treating each matrix separately (though with the same conv block).
  • Single-input cnn with observation types as channels.
  • Frame-stacking and Conv3D introduction for temporal encoding.
  • Self-attention mechanism on depth and channels.
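The reward curriculum from the list above can be illustrated with a small sketch. The schedule here is an assumption, not the repository's actual one: it shifts weight linearly from the shaped myopic reward (w1) to the unshaped long-term reward (w2) as training progresses. The function names are hypothetical.

```python
def curriculum_weights(progress):
    """Return (w1, w2) for a training progress in [0, 1].

    Assumed schedule: start fully on the shaped myopic reward (w1 = 1)
    and end fully on the unshaped long-term reward (w2 = 1).
    """
    p = min(max(progress, 0.0), 1.0)  # clamp out-of-range progress values
    return 1.0 - p, p


def combined_reward(shaped, unshaped, progress):
    """Blend the two reward terms with the current curriculum weights."""
    w1, w2 = curriculum_weights(progress)
    return w1 * shaped + w2 * unshaped


# Halfway through training both terms count equally.
reward = combined_reward(shaped=2.0, unshaped=4.0, progress=0.5)
```

A fixed w1/w2 balance corresponds to always calling `combined_reward` with the same `progress` value.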

Designs not retained (yet):

  • Residual blocks
  • (Cross)-attention mechanisms (as we've moved away from the multi-input design)

How to apply to a new use-case:

  • Create a new network and replace the file in the /sumo/*/ folders
  • Change the sumo-env.cfg values accordingly (see also the Quickstart above for some more details), specifically:
    • Find the x- and y-coordinates of your observation window's center (obs_center)
    • Set your observation window's precision (obs_length) and its size (obs_nrows)
    • Identify the traffic light id to be controlled (tls_id)
    • List the traffic light's incoming lanes (tls_lanes) and non-yellow phases (tls_phases)
    • List the network's sources (rnd_src) and destinations (rnd_dst)
  • You may also need to rename the network and config arguments in the SumoEnv or SumoEnvFactory initialization
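To help with choosing obs_center, obs_length and obs_nrows, the sketch below shows one way these values could map vehicle coordinates onto an observation matrix. It assumes a square window of obs_nrows by obs_nrows cells, each obs_length wide, centered on obs_center, with row 0 at the top; the helper names and `max_speed` are hypothetical, and the repository's own mapping may differ. Empty cells use the -1 absence encoding described earlier.

```python
def to_cell(x, y, obs_center, obs_length, obs_nrows):
    """Map network coordinates to a (row, col) cell, or None if outside the window."""
    half = obs_length * obs_nrows / 2.0
    cx, cy = obs_center
    col = int((x - (cx - half)) // obs_length)
    row = int(((cy + half) - y) // obs_length)  # row 0 is the top edge of the window
    if 0 <= row < obs_nrows and 0 <= col < obs_nrows:
        return row, col
    return None


def speed_matrix(vehicles, obs_center, obs_length, obs_nrows, max_speed):
    """Build a speed matrix: -1 where no vehicle is present, else speed scaled to [0, 1]."""
    grid = [[-1.0] * obs_nrows for _ in range(obs_nrows)]
    for x, y, speed in vehicles:
        cell = to_cell(x, y, obs_center, obs_length, obs_nrows)
        if cell is not None:
            row, col = cell
            grid[row][col] = min(speed / max_speed, 1.0)
    return grid


# One vehicle at the window center, one outside the 20 x 20 window.
vehicles = [(100.0, 100.0, 5.0), (250.0, 100.0, 8.0)]
grid = speed_matrix(vehicles, obs_center=(100.0, 100.0), obs_length=5.0,
                    obs_nrows=4, max_speed=10.0)
```

A smaller obs_length gives a finer grid over the same area only if obs_nrows grows with it, since the window covers obs_length times obs_nrows per side.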

Future developments

  • General clean-up
  • Get better results
  • Increase the traffic scenario variability
  • Generalize to multiple traffic lights
  • Add multi-(hierarchical)-agent support