
Finetune from pre-trained models #1300


Open: wants to merge 5 commits into main

Conversation

@vwxyzjn commented Jun 15, 2025

This PR adds two main changes:

  1. Add a max_seq_len option so the model can load a pre-trained Llama 3.1 8B checkpoint. Note that I had to revert to the old checkpointing code; otherwise I got the weird error trace shown at the bottom of this PR description.
  2. Allow starting from a checkpoint without enable_checkpoint. Use case: a user might want to fine-tune without saving intermediate checkpoints.

Tested with the following commands:

# Download the tokenizer and model weights
rm -rf tmp
uv run huggingface-cli download meta-llama/Llama-3.1-8B original/tokenizer.model --local-dir tmp
uv run huggingface-cli download meta-llama/Llama-3.1-8B original/consolidated.00.pth --local-dir tmp
uv run huggingface-cli download meta-llama/Llama-3.1-8B original/params.json --local-dir tmp
# Convert the model weights to the DCP format and move it and the tokenizer to the assets folder
mkdir -p assets/tokenizer && cp tmp/original/tokenizer.model assets/tokenizer/Meta-Llama-3.1-8B-tokenizer.model
uv run python -m scripts.convert_llama_to_dcp tmp/original/ assets/models/dcp/llama3.1-8B
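
As a sanity check (an illustrative sketch, not one of the PR's tested commands), the converted DCP checkpoint can be inspected by reading its metadata via the public torch.distributed.checkpoint API, assuming the output path used above:

```python
# Minimal sketch: list what the converted DCP checkpoint contains by reading
# its metadata. Assumes the output directory used in the commands above.
from torch.distributed.checkpoint import FileSystemReader

reader = FileSystemReader("assets/models/dcp/llama3.1-8B")
metadata = reader.read_metadata()

for name, item in metadata.state_dict_metadata.items():
    # Tensor entries carry a .size; non-tensor (bytes) entries may not.
    print(name, getattr(item, "size", None))
```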

Then you can fine-tune from the checkpoint:

CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" uv run ./run_train.sh \
  --model.tokenizer_path assets/tokenizer/Meta-Llama-3.1-8B-tokenizer.model \
  --training.max_seq_len 131072 \
  --checkpoint.initial_load_path "assets/models/dcp/llama3.1-8B" \
  --profiling.no_enable_profiling \
  --activation_checkpoint.mode full \
  --training.global_batch_size 64 \
  --lr_scheduler.warmup_steps 40 \
  --optimizer.lr 1e-5

Error trace with the new load checkpoint code

If I don't revert to the old checkpointing code, I get:

    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/19:41:24 [255/770]
oint/utils.py", line 465, in inner_func
      return func(*args, **kwargs)
    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/checkp
oint/state_dict_loader.py", line 177, in load
      _load_state_dict(
      ~~~~~~~~~~~~~~~~^
          state_dict=statetful_sd,
          ^^^^^^^^^^^^^^^^^^^^^^^^
      ...<3 lines>...
          planner=planner,
          ^^^^^^^^^^^^^^^^
      )
      ^
    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/checkp
oint/state_dict_loader.py", line 234, in _load_state_dict
      central_plan: LoadPlan = distW.reduce_scatter("plan", local_step, global_step)
                               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/checkp
oint/utils.py", line 196, in reduce_scatter
      all_data = self.gather_object(local_data)
    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/checkp
oint/utils.py", line 135, in gather_object
      dist.gather_object(
      ~~~~~~~~~~~~~~~~~~^
          obj=object,
          ^^^^^^^^^^^
      ...<2 lines>...
          group=self.group,
          ^^^^^^^^^^^^^^^^^
      )
      ^
    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/c10d_l
ogger.py", line 81, in wrapper
      return func(*args, **kwargs)
    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/distri
buted_c10d.py", line 3139, in gather_object
      input_tensor, local_size = _object_to_tensor(obj, current_device, group)
                                 ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/distri
buted_c10d.py", line 2935, in _object_to_tensor
      _pickler(f).dump(obj)
      ~~~~~~~~~~~~~~~~^^^^^
  TypeError: cannot pickle code objects
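
For context, the final TypeError is pickle's own refusal to serialize code objects; the same failure can be reproduced in isolation (an illustrative sketch, unrelated to what exactly ends up in the DCP load plan):

```python
# Standalone illustration of the final error above: pickle refuses code objects,
# so a code object hidden anywhere in the object passed to dist.gather_object()
# bottoms out in _object_to_tensor with this same TypeError.
import pickle

code_obj = compile("x + 1", "<example>", "eval")  # any code object will do
try:
    pickle.dumps(code_obj)
except TypeError as exc:
    print(exc)  # e.g. "cannot pickle code objects"
```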

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 15, 2025
@tianyu-l (Contributor) left a comment

Having instructions or a usage example for the script convert_llama_to_dcp could be helpful.

allow for starting from a checkpoint without enable_checkpoint. Use case: the user might want to do fine-tuning without saving intermediate checkpoints.

We can think more about the UI, e.g. separate enable_load from enable_save. However, in your case, can't you just specify the interval to be a very large number?

I disabled it partially because it takes 6 mins to save an 8B model w/ non-async mode.

We will need to root cause and solve the issue.

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" uv run ./run_train.sh \
--model.tokenizer_path assets/tokenizer/Meta-Llama-3.1-8B-tokenizer.model \
--training.max_seq_len 131072 \
```
Contributor

I wonder if it's necessary to create this config -- how is it different from specifying --training.seq_len 131072?

Author

One example use case is when I don't actually have documents up to seq_len 131072, but the pre-trained model has a default seq_len of 131072.

Contributor

If I understand correctly, the seq_len field is only used when generating freqs_cis https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model/model.py#L397
This should be input-agnostic, so I feel you can just specify --training.seq_len to be however long you need (as long as it doesn't exceed the model's capability).
Let me know if it's not the case.

Author

Ah, I see the problem. The issue is that HuggingFaceDataset uses --training.seq_len, so the packed dataset also has the same length.

In that case, we should probably reuse the same seq_len but allow the HuggingFaceDataset to use a separate packed_len. WDYT?
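
For illustration, here is a generic packing sketch (made-up names, not torchtitan's actual HuggingFaceDataset): every packed sample comes out exactly seq_len tokens long, which is why the dataset length and the model's freqs_cis length are coupled today.

```python
# Generic illustration of sequence packing: documents are tokenized,
# concatenated, and chopped into fixed-length rows of seq_len tokens, so
# whatever value --training.seq_len takes becomes the length of every
# packed training sample the model sees.
def pack_tokens(token_streams, seq_len):
    buffer = []
    for tokens in token_streams:          # tokens: list[int] per document
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]        # one packed sample of exactly seq_len
            buffer = buffer[seq_len:]

# e.g. packing short documents into rows of 8 tokens
samples = list(pack_tokens([[1, 2, 3], [4, 5, 6, 7, 8, 9], [10] * 10], seq_len=8))
assert all(len(s) == 8 for s in samples)
```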

Contributor

My question is why you'd want them to be different.
The requirement is: if the HF dataset uses seq_len_hf, then we need seq_len_transformer >= seq_len_hf to make sure freqs_cis is initialized with enough length.
But we don't need seq_len_transformer > seq_len_hf (or do we?), so it can just be seq_len_transformer = seq_len_hf = training.seq_len.

Author

I am worried that setting seq_len_transformer=131072 would make training OOM vs. seq_len_transformer=8192.

However, it appears I need to set seq_len_transformer=131072 if I am trying to load a pretrained model such as Llama 3.1 8B. Is this correct?

Contributor

it appears I need to set seq_len_transformer=131072 if I am trying to load a pretrained model such as llama 3.1 8B. Is this correct?

oh I see your worry.

I don't think it should be the case. Like I said, the only place seq_len matters in the transformer is for freqs_cis, which is a non-persistent buffer and shouldn't be included in the model checkpoint.
(Previously in torchtitan it could be, but after https://github.com/pytorch/torchtitan/pull/1236/files#diff-27a108fa6d4885d9c66306785cb36029c0b4f5a1542e63ae24e84eb7e9a273d1R87 it shouldn't.)

For your finetuning job, the model capability shouldn't be affected by specifying a smaller max_seq_len.

BTW, you could consider using CP in torchtitan for long-sequence finetuning.
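
To illustrate the non-persistent-buffer point in plain PyTorch (a minimal sketch with illustrative names, not torchtitan's actual module): a buffer registered with persistent=False never appears in state_dict(), so its precomputed length cannot conflict with the loaded checkpoint.

```python
# Minimal sketch of why a non-persistent buffer doesn't constrain checkpoint
# loading: it is excluded from state_dict(), so its length can differ between
# the pretraining and finetuning configs. Names here are illustrative only.
import torch
import torch.nn as nn

class WithFreqs(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        self.register_buffer("freqs_cis", torch.zeros(seq_len, 64), persistent=False)

m_pretrain = WithFreqs(seq_len=131072)
m_finetune = WithFreqs(seq_len=8192)

assert "freqs_cis" not in m_pretrain.state_dict()      # not saved in checkpoints
m_finetune.load_state_dict(m_pretrain.state_dict())    # loads fine despite different lengths
```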

Author

@vwxyzjn Jun 17, 2025

For your finetuning job, the model capability shouldn't be affected by specifying a smaller max_seq_len.

I guess an important question is this: if we have a pretrained model with seq_len=131072, should we always compute freqs_cis using seq_len=131072?

If the answer is yes, it would make sense to set up an arg called seq_len_transformer (in place of my current max_seq_len) and set it to 131072 when loading Llama 3.1 8B.

I see. It looks like, because of how freqs_cis is used in reshape_for_broadcast, it's fine if we calculate it without the full 131072. Then it doesn't make sense to save / load it.

Thanks. I will adjust the PR accordingly.
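
For reference, a sketch of the Llama-style RoPE precomputation being discussed (paraphrased, not torchtitan's exact model.py): freqs_cis is built up to a chosen maximum length and then sliced to each batch's actual sequence length, so precomputing fewer rows only caps the batch length, not the loaded weights.

```python
# Sketch of Llama-style RoPE precomputation (paraphrased, not torchtitan's exact code):
# freqs_cis has shape [max_len, head_dim // 2]; at runtime only the first seqlen
# rows are used, so the precomputed length just caps the batch length.
import torch

def precompute_freqs_cis(head_dim: int, max_len: int, theta: float = 500000.0) -> torch.Tensor:
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(max_len, dtype=torch.float32)
    return torch.polar(torch.ones(max_len, head_dim // 2), torch.outer(t, freqs))

freqs_cis = precompute_freqs_cis(head_dim=128, max_len=8192)
seqlen = 4096
freqs_for_batch = freqs_cis[:seqlen]    # what reshape_for_broadcast-style code consumes
print(freqs_for_batch.shape)            # torch.Size([4096, 64])
```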


export HF_TOKEN=... # get your HF token from https://huggingface.co/settings/tokens
# Download the tokenizer and model weights
rm -rf tmp
uv run huggingface-cli download meta-llama/Llama-3.1-8B original/tokenizer.model --local-dir tmp
Contributor

We covered downloading the tokenizer above, in the section "Downloading a tokenizer".

Author

Yeah, true. I was going to ask: do you want to replace that with huggingface-cli commands? We could use them for downloading both the tokenizer and the actual models.

Contributor

oh I see, maybe let's first put the complete huggingface-cli flow inside finetune.md. If people get used to it, we can change the version in the main README later.

Author

Sounds good!

@@ -114,6 +114,36 @@ Llama 3 8B model locally on 8 GPUs
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

### Fine-tuning from an existing checkpoint
Contributor

Can we put this under docs/finetune.md instead of main README? We can create a link to the doc around here.

Author

Of course. Will do.

Comment on lines +126 to +130
uv run huggingface-cli download meta-llama/Llama-3.1-8B original/consolidated.00.pth --local-dir tmp
uv run huggingface-cli download meta-llama/Llama-3.1-8B original/params.json --local-dir tmp
# Convert the model weights to the DCP format and move it and the tokenizer to the assets folder
mkdir -p assets/tokenizer && cp tmp/original/tokenizer.model assets/tokenizer/Meta-Llama-3.1-8B-tokenizer.model
uv run python -m scripts.convert_llama_to_dcp tmp/original/ assets/models/dcp/llama3.1-8B
Contributor

Using uv is fine, but as a general instruction we shouldn't assume users have to use uv.

Contributor

Instead of tmp and assets/models/dcp, which look arbitrarily chosen, let's try to use generic placeholders.

Author

Ah, I forgot to remove the uv part. What do you mean by generic placeholders?

Contributor

Like, instead of tmp, use [original_model_dir], [dcp_model_dir], [tokenizer_dir], so that people know what to replace.

Contributor

oh we shouldn't just revert the changes -- instead we should investigate the root cause

cc @fegin please take a look at whether recent changes break anything

@lkhphuc (Contributor) commented Jun 16, 2025

    File "/home/ubuntu/code/thirdparty/torchtitan/.venv/lib/python3.13/site-packages/torch/distributed/distri
buted_c10d.py", line 2935, in _object_to_tensor
      _pickler(f).dump(obj)
      ~~~~~~~~~~~~~~~~^^^^^
  TypeError: cannot pickle code objects

I find that if you use Python 3.13, any error in loading the checkpoint, like missing keys or a shape mismatch, always results in this error.
If you use Python 3.12 or earlier, it throws the actual error explaining why it fails.

In addition, you cannot load a checkpoint created in a venv with Python 3.13 in a venv with Python 3.12. It resulted in some internal Python error, _Pathlib.___ something missing.

@fegin (Contributor) commented Jun 16, 2025

For checkpointing, I don't think we should separate enable_save from enable_load. It is too fine-grained and I don't think there is a real production use case; you will still need to save the final checkpoint anyway.

The first-step checkpoint is always enabled to allow users to quickly understand if there are errors. In the past, several users complained that they had to train X steps before they discovered there were checkpointing issues. I think we can disable this feature or make it configurable.

Async checkpointing being slow is a bug we need to figure out.

@vwxyzjn (Author) commented Jun 16, 2025

The first step checkpoint is always enabled to allow users to quick understand if there are errors.

This is such a great point! My first reaction was "why am I saving on step 1?"

I wonder if we should log something like "saving a first checkpoint to ensure it works".

@fegin (Contributor) commented Jun 17, 2025

You can consider using #1310 to avoid checkpoint overhead.
