
Conversation


@dai-weiye (Author)

Summary

This PR modernizes the original DRQN-based agent by:

  1. Introducing a dueling network head on top of the recurrent encoder.
  2. Implementing a Double DQN training scheme with a separate target network.
  3. Providing a TF2‑compatible training path and simple CSV logging for reproducible experiments.

The original environment, state/action definitions, and cooperative reward design are left unchanged.

Technical changes

  • drqn.py

    • Replaced the plain DRQN head with a dueling architecture:
      • LSTM encoder over the history of states.
      • Shared fully connected layer.
      • Value stream V(s) and advantage stream A(s,a), combined as
        Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)) (a sketch of this head follows after this list).
    • Updated to use tensorflow.compat.v1 APIs so the code runs on modern TF2 / tensorflow‑macOS.
  • train.py

    • Added online (mainQN) and target (targetQN) networks and implemented Double DQN (sketched after this list):
      • Use mainQN to select next‑step actions.
      • Use targetQN to evaluate those actions for the target Q values.
      • Periodically sync target parameters from the main network.
    • Added a small CLI interface:
      • --time-slots, --num-channels, --num-users, --attempt-prob.
    • Added CSV logging (results_summary.csv) every 5000 time slots, recording:
      • total time slots, number of channels/users, attempt probability,
      • last time step, cumulative reward, cumulative collisions.
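
For concreteness, here is a minimal sketch of the dueling head described under drqn.py, written against tf.compat.v1 as in this PR. The sizes and variable names (state_size, state_history, etc.) are illustrative assumptions, not the exact code in drqn.py:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Illustrative sizes only; the real drqn.py derives these from the environment.
state_size = 6      # per-step feature size (assumed)
num_actions = 3     # assumed, e.g. NUM_CHANNELS + 1 including "do not transmit"
lstm_units = 32
hidden_units = 64

# LSTM encoder over the history of states: [batch, time, features].
state_history = tf.placeholder(tf.float32, [None, None, state_size])
lstm_cell = tf.nn.rnn_cell.LSTMCell(lstm_units)
outputs, _ = tf.nn.dynamic_rnn(lstm_cell, state_history, dtype=tf.float32)
h_last = outputs[:, -1, :]          # encoding of the last step in the history

# Shared fully connected layer on top of the recurrent encoder.
shared = tf.layers.dense(h_last, hidden_units, activation=tf.nn.relu)

# Dueling streams: scalar value V(s) and per-action advantages A(s, a).
value = tf.layers.dense(shared, 1)
advantage = tf.layers.dense(shared, num_actions)

# Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
q_values = value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
```

Subtracting the per-state mean advantage keeps the value and advantage streams identifiable and matches the combination rule quoted above.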
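
Likewise, a minimal sketch of the Double DQN target computation, the periodic target sync, and the CSV logging described under train.py. The names mainQN and targetQN mirror the PR description, but the function signatures, the discount factor, and the CSV handling below are assumptions for illustration:

```python
import csv

import numpy as np
import tensorflow.compat.v1 as tf

GAMMA = 0.99  # assumed discount factor; the PR may use a different value

def double_dqn_targets(rewards, q_next_main, q_next_target, gamma=GAMMA):
    """Double DQN: select next actions with mainQN, evaluate them with targetQN.

    rewards        [batch]     rewards of the sampled transitions
    q_next_main    [batch, A]  mainQN's Q-values for the next states
    q_next_target  [batch, A]  targetQN's Q-values for the same next states
    """
    next_actions = np.argmax(q_next_main, axis=1)                   # selection: mainQN
    next_q = q_next_target[np.arange(len(rewards)), next_actions]   # evaluation: targetQN
    return rewards + gamma * next_q

def make_sync_ops(main_vars, target_vars):
    """Build the copy ops once; run them periodically to sync targetQN from mainQN."""
    return [t.assign(m) for m, t in zip(main_vars, target_vars)]

def log_summary(path, row):
    """Append one summary row (fields as in results_summary.csv) every 5000 time slots."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

# Usage (illustrative): collect each network's variables by variable scope, e.g.
#   main_vars   = tf.trainable_variables(scope="main")
#   target_vars = tf.trainable_variables(scope="target")
# and sess.run(make_sync_ops(main_vars, target_vars)) every N training steps.
```

The selection/evaluation split in double_dqn_targets is the only conceptual change relative to a single-network target.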

Empirical observations (from my run)

  • Setting: TIME_SLOTS = 100000, NUM_CHANNELS = 2, NUM_USERS = 3.
  • In early windows, the cumulative reward per 5000-slot window is on the order of a few hundred.
  • In later windows (e.g., windows 10–20), the cumulative reward per 5000-slot window rises to roughly 3000–3800, while cumulative collisions grow roughly linearly.
  • Compared to the original DRQN implementation under the same environment and reward, the dueling Double DQN variant:
    • Converges faster,
    • Achieves significantly higher average throughput in later windows.

I hope this variant can serve as an additional baseline implementation for the paper’s environment.

@dai-weiye (Author)

For reference, here are two sample windows (each 5000 time slots):
(Attached figures: Figure_4, Figure_20_NEW)

  • Original DRQN (last window, 5000 slots): cumulative reward of roughly 1000.
  • Dueling Double DQN (this PR; last window, 5000 slots): cumulative reward of roughly 3800, with a similar collision slope.
