Hi! Does anyone have details on how DeepSeek distilled R1 into the smaller models? The technical report gives almost no information beyond saying that they used SFT and the 800k-example dataset that was used to train R1.
If they just ran SFT on Qwen with these examples, that isn't really distillation. Distillation would mean using R1 to produce scores (i.e., output logits) and then fine-tuning Qwen to match those scores.
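To make the distinction concrete, here is a minimal PyTorch sketch of the two losses I mean (shapes, temperature, and data are purely illustrative, not what DeepSeek actually did): (a) plain SFT, i.e. cross-entropy on tokens the teacher generated, which is all the report seems to describe, versus (b) logit distillation, where the student is trained to match the teacher's softened token distribution via KL divergence.

```python
import torch
import torch.nn.functional as F

def sft_loss(student_logits, target_ids):
    # (a) Ordinary SFT on teacher-generated text: cross-entropy against the
    # token ids that the teacher (e.g. R1) produced. No teacher logits needed.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target_ids.view(-1),
    )

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # (b) Classic distillation: minimize KL between the teacher's softened
    # per-token distribution and the student's, scaled by T^2 as usual.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy tensors: batch=2, seq_len=4, vocab=10 (real models would be far larger).
student_logits = torch.randn(2, 4, 10, requires_grad=True)
teacher_logits = torch.randn(2, 4, 10)     # would come from R1 in case (b)
target_ids = torch.randint(0, 10, (2, 4))  # R1-generated tokens in case (a)

print("SFT loss:", sft_loss(student_logits, target_ids).item())
print("KD  loss:", logit_distillation_loss(student_logits, teacher_logits).item())
```

Case (a) only needs R1's sampled outputs, while case (b) needs access to R1's logits at training time, which is a much heavier setup.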
Does anyone know more about this?