Add mlp support for qwen3vl series and little refactor #957
Conversation
Signed-off-by: Tcc0403 <76503978+Tcc0403@users.noreply.github.com>
I commented on #897 about a regression in FLCE training speed. I applied this patch locally to change the CUDA syncing and make FLCE FSDP2 shard-aware, and it seems to fix the regression and squeeze out a little more t/s. It may be worth pulling on this thread further: f22ce38
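A minimal sketch of what an FSDP2 shard-aware path might look like, assuming the fused linear cross entropy receives the `lm_head` weight as an FSDP2 `DTensor`; the helper name `materialize_lm_head_weight` is hypothetical, not code from the linked commit:

```python
import torch

# Public since PyTorch 2.5; older releases expose it as torch.distributed._tensor.
from torch.distributed.tensor import DTensor


def materialize_lm_head_weight(weight: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: hand the fused kernel a plain local tensor.

    Under FSDP2, parameters are DTensors sharded across ranks. Gathering
    explicitly here, rather than relying on an implicit CUDA sync elsewhere,
    keeps the fused linear cross entropy kernel on an ordinary torch.Tensor.
    """
    if isinstance(weight, DTensor):
        # full_tensor() all-gathers the shards into a regular torch.Tensor.
        return weight.full_tensor()
    return weight
```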
Thank you, that's an interesting fix! We are planning our 2026 Q1 roadmap, including FSDP2 (multi-GPU) aware testing, optimizations, and so on. Feel free to open a PR so we can discuss how to integrate your work in line with our roadmap!
Summary
This PR aims to fix #956, plus some refactors, including:
Note that MoE layers aren't patched, since there will be a major change in transformers v5; see #958. A rough sketch of the MLP patching itself follows below.
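As an illustration of the kind of MLP patching this PR adds (not the exact code), a SwiGLU-style HF MLP class can be swapped for Liger's fused one at the modeling-module level; the class name `Qwen3VLMLP` and the function `apply_liger_mlp_to_qwen3_vl` are assumptions modeled on how Liger-Kernel patches other model families:

```python
from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
from transformers.models.qwen3_vl import modeling_qwen3_vl


def apply_liger_mlp_to_qwen3_vl() -> None:
    # Sketch only: the MLP class name inside modeling_qwen3_vl is an assumption.
    # Replacing the class attribute means newly constructed decoder layers
    # instantiate the fused SwiGLU MLP; models that are already instantiated
    # would instead need their modules' forward methods patched in place.
    modeling_qwen3_vl.Qwen3VLMLP = LigerSwiGLUMLP
```

Swapping the class rather than the forward method keeps the patch cheap and works because Liger's MLP accepts the same `config` constructor argument as the HF module it replaces.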
Testing Done
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence