Reproduction
Memory usage is not the same across ranks (0-6), and I find that it grows considerably as the number of training steps increases.
Step 0, using the original code:

Then I wrote an efficient GRPO loss kernel in Triton.
step 0

step 5

step 20
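
To compare the per-rank numbers at each step as text, a small logging helper along these lines could be called once per training step. This is only a sketch: `format_memory_report` and `log_cuda_memory` are hypothetical names, not part of trl, and it assumes the launcher (for example `torchrun` or `accelerate launch`) sets the `RANK` environment variable.

```python
import os


def format_memory_report(step, rank, allocated_bytes, reserved_bytes, peak_bytes):
    """Format one log line with memory figures converted to GiB."""
    gib = 1024 ** 3
    return (
        f"step={step} rank={rank} "
        f"allocated={allocated_bytes / gib:.2f}GiB "
        f"reserved={reserved_bytes / gib:.2f}GiB "
        f"peak={peak_bytes / gib:.2f}GiB"
    )


def log_cuda_memory(step):
    """Print the current process's CUDA memory usage; returns None without a GPU."""
    import torch  # imported lazily so the formatter works on machines without torch

    if not torch.cuda.is_available():
        return None
    # Assumption: RANK is exported by the distributed launcher.
    rank = int(os.environ.get("RANK", "0"))
    device = torch.cuda.current_device()
    line = format_memory_report(
        step,
        rank,
        torch.cuda.memory_allocated(device),      # memory held by live tensors
        torch.cuda.memory_reserved(device),       # memory held by the caching allocator
        torch.cuda.max_memory_allocated(device),  # peak tensor memory so far
    )
    print(line, flush=True)
    return line
```

Logging both allocated and reserved memory helps tell tensors that are actually kept alive apart from memory that the caching allocator holds but is not using, which tools like `nvidia-smi` report as one number.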

System Info
trl = 0.14.0
torch = 2.5.1+cuda12.4
vllm = 0.7.1
Checklist