
Conversation


@dai-weiye (Author)

Summary

This PR modernizes the original DRQN-based agent by:

  1. Introducing a dueling network head on top of the recurrent encoder.
  2. Implementing a Double DQN training scheme with a separate target network.
  3. Providing a TF2‑compatible training path and simple CSV logging for reproducible experiments.

The original environment, state/action definitions, and cooperative reward design are left unchanged.

Technical changes

  • drqn.py

    • Replaced the plain DRQN head with a dueling architecture:
      • LSTM encoder over the history of states.
      • Shared fully connected layer.
      • Value stream V(s) and advantage stream A(s,a), combined as
        Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)) (a sketch of this head follows after this list).
    • Updated to use tensorflow.compat.v1 APIs so the code runs on modern TF2 / tensorflow‑macOS.
  • train.py

    • Added online (mainQN) and target (targetQN) networks and implemented Double DQN (sketched after this list):
      • Use mainQN to select next‑step actions.
      • Use targetQN to evaluate those actions for the target Q values.
      • Periodically sync target parameters from the main network.
    • Added a small CLI interface:
      • --time-slots, --num-channels, --num-users, --attempt-prob.
    • Added CSV logging (results_summary.csv) every 5000 time slots, recording:
      • total time slots, number of channels/users, attempt probability,
      • last time step, cumulative reward, cumulative collisions.
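
For concreteness, here is a minimal sketch of the dueling head described under drqn.py, written against tf.compat.v1 as in this PR. The sizes and variable names (state_size, state_history, etc.) are illustrative assumptions, not the exact code in drqn.py:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Illustrative sizes only; the real drqn.py derives these from the environment.
state_size = 6      # per-step feature size (assumed)
num_actions = 3     # assumed, e.g. NUM_CHANNELS + 1 including "do not transmit"
lstm_units = 32
hidden_units = 64

# LSTM encoder over the history of states: [batch, time, features].
state_history = tf.placeholder(tf.float32, [None, None, state_size])
lstm_cell = tf.nn.rnn_cell.LSTMCell(lstm_units)
outputs, _ = tf.nn.dynamic_rnn(lstm_cell, state_history, dtype=tf.float32)
h_last = outputs[:, -1, :]          # encoding of the last step in the history

# Shared fully connected layer on top of the recurrent encoder.
shared = tf.layers.dense(h_last, hidden_units, activation=tf.nn.relu)

# Dueling streams: scalar value V(s) and per-action advantages A(s, a).
value = tf.layers.dense(shared, 1)
advantage = tf.layers.dense(shared, num_actions)

# Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
q_values = value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
```

Subtracting the per-state mean advantage keeps the value and advantage streams identifiable and matches the combination rule quoted above.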
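
Likewise, a minimal sketch of the Double DQN target computation, the periodic target sync, and the CSV logging described under train.py. The names mainQN and targetQN mirror the PR description, but the function signatures, the discount factor, and the CSV handling below are assumptions for illustration:

```python
import csv

import numpy as np
import tensorflow.compat.v1 as tf

GAMMA = 0.99  # assumed discount factor; the PR may use a different value

def double_dqn_targets(rewards, q_next_main, q_next_target, gamma=GAMMA):
    """Double DQN: select next actions with mainQN, evaluate them with targetQN.

    rewards        [batch]     rewards of the sampled transitions
    q_next_main    [batch, A]  mainQN's Q-values for the next states
    q_next_target  [batch, A]  targetQN's Q-values for the same next states
    """
    next_actions = np.argmax(q_next_main, axis=1)                   # selection: mainQN
    next_q = q_next_target[np.arange(len(rewards)), next_actions]   # evaluation: targetQN
    return rewards + gamma * next_q

def make_sync_ops(main_vars, target_vars):
    """Build the copy ops once; run them periodically to sync targetQN from mainQN."""
    return [t.assign(m) for m, t in zip(main_vars, target_vars)]

def log_summary(path, row):
    """Append one summary row (fields as in results_summary.csv) every 5000 time slots."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

# Usage (illustrative): collect each network's variables by variable scope, e.g.
#   main_vars   = tf.trainable_variables(scope="main")
#   target_vars = tf.trainable_variables(scope="target")
# and sess.run(make_sync_ops(main_vars, target_vars)) every N training steps.
```

The selection/evaluation split in double_dqn_targets is the only conceptual change relative to a single-network target.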

Empirical observations (from my run)

  • Setting: TIME_SLOTS = 100000, NUM_CHANNELS = 2, NUM_USERS = 3.
  • In early windows, the cumulative reward per 5000-slot window is on the order of a few hundred.
  • In later windows (e.g., windows 10–20), the cumulative reward per 5000-slot window rises to roughly 3000–3800, while cumulative collisions grow roughly linearly.
  • Compared to the original DRQN implementation under the same environment and reward, the dueling Double DQN variant:
    • Converges faster,
    • Achieves significantly higher average throughput in later windows.

I hope this variant can serve as an additional baseline implementation for the paper’s environment.

@dai-weiye (Author)

For reference, here are two sample windows (each 5000 time slots):
(Attached figures: Figure_4, Figure_20_NEW)

  • Original DRQN (last window, 5000 slots): cumulative reward of roughly 1000.
  • Dueling Double DQN (this PR; last window, 5000 slots): cumulative reward of roughly 3800, with a similar collision slope.
