Getting gradient of loss during inference #3371
For my understanding, did you confirm that the gradient being
@BenjaminBossan
But with changes for the distributed setup, it fails:
The loss in this case is an empty tensor.
The logits in the single-GPU setup (without accelerate+deepspeed) form a tensor with a grad_fn. But with multiple GPUs there is no grad_fn.
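A quick way to verify this is to inspect the output tensor right after the forward pass; a minimal sketch (the helper name is made up, not from this thread):

```python
import torch

def check_grad_flow(model, input_ids):
    # Run a forward pass without torch.no_grad(): if autograd is tracking
    # the computation, the logits carry a grad_fn; if the parameters are not
    # participating in autograd (as reported here under multi-GPU DeepSpeed),
    # grad_fn comes back as None.
    logits = model(input_ids=input_ids.unsqueeze(0)).logits
    print("requires_grad:", logits.requires_grad, "grad_fn:", logits.grad_fn)
    return logits.grad_fn is not None
```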
Thanks for providing some code. Unfortunately, I could not get it to run. I always run into errors, like wrong device, dimension error, etc. Each time I fix an error, the next one occurs. Could you please double-check that the scripts run the way that you show them? Moreover, I don't see where accelerate enters the picture, could you explain? Finally, please show how you call the scripts and what your
Hi,
My transformers version is 4.43.4 and torch is 2.4.0+cu118.
Also, this is the minimal code using accelerate which does not work:
Thanks for the updates. I could make some progress, but could not fully replicate yet. The single GPU script worked for me. For the DS script, I wanted to trim it down to be as close as possible to the single GPU script. This is what I came up with:

```python
import torch
import deepspeed
from accelerate import Accelerator
from accelerate.state import AcceleratorState
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_gradients(model, input_ids, targets):
    valid_positions = (targets != -100).nonzero(as_tuple=True)[0]
    input_slice = slice(0, valid_positions[0].item())
    end_input_slice = valid_positions[-1].item()

    embeddings = model.get_input_embeddings()
    with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
        embedding_weights = embeddings.weight
        embedding_size = embedding_weights.shape[0]
        one_hot = torch.zeros(
            input_ids[input_slice].shape[0],
            embedding_size,
            device=model.device,
            dtype=embeddings.weight.dtype,
        )
        one_hot.scatter_(
            1,
            input_ids[input_slice].unsqueeze(1),
            torch.ones(one_hot.shape[0], 1, device=model.device, dtype=embeddings.weight.dtype),
        )
        one_hot.requires_grad_()

    with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
        input_embeds = one_hot @ embeddings.weight
        input_embeds.requires_grad_()
        input_embeds.retain_grad()
        print('input_embeds grad ', input_embeds.grad, ' input_embeds ', input_embeds.shape)

        input_ids = input_ids.cpu().tolist()
        # embeddings corresponding to only input ids
        embeds = embeddings.weight[input_ids[:end_input_slice + 1], :]
        full_embeds = torch.cat(
            [
                embeds[:input_slice.start, :],
                input_embeds,
                embeds[input_slice.stop:, :],
            ],
            dim=0,
        )

    full_embeds = full_embeds.unsqueeze(0)
    print('full_embeds ', full_embeds.shape)
    logits = model(inputs_embeds=full_embeds).logits
    loss = torch.nn.CrossEntropyLoss()(logits[0, :, :], targets[:end_input_slice + 1])
    accelerator.backward(loss)
    return one_hot.grad.clone(), input_embeds.grad.clone()


model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, fused=True)
accelerator = Accelerator()
# this line is only necessary because we don't prepare a dataset
AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'] = 8
model, optimizer = accelerator.prepare(model, optimizer)
model.train()

input = torch.tensor([
    1, 894, 29901, 5122, 10753, 304, 14294, 670, 6567, 9098, 491, 14051, 10549,
    963, 29889, 8449, 19309, 7101, 674, 7738, 278, 1556, 12871, 29973, 13,
    22550, 29901, 15589, 5112, 1516,
]).to(model.device)
target = torch.tensor([
    -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
    -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
    22550, 29901, 15589, 5112, 1516,
]).to(model.device)

onehot_grad, inputembed_grad = token_gradients(model, input, target)
print(onehot_grad.shape, ' ', inputembed_grad.shape)
```

As you can see, this is almost identical to the single GPU script. When calling this with
No idea how this comes about. Can you reproduce? Regarding your config, it says:
Do you have a specific config that you pass to
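If it helps to pin this down, the ZeRO stage can also be fixed programmatically rather than through an external config file; a minimal sketch, assuming accelerate's DeepSpeedPlugin (this is not the config actually used in this thread):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Hypothetical stand-in for the config discussed above:
# ZeRO stage 3 with no offloading, everything else left at defaults.
ds_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
```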
Hi,
Thanks for the additional info. Using this DS config and setting
So it's not the exact same error you reported initially (gradients being None), but it is similar: gradients can be calculated on a single GPU but not on multi-GPU with DeepSpeed. I'm not knowledgeable about the inner workings of DeepSpeed, but from the stack trace, the error comes from DeepSpeed. I don't know if accelerate could do anything differently to prevent this error. What I still don't understand is why this needs to run in eval mode. If you don't want the gradients to update the parameters, could you not switch to train mode, calculate the gradients, and once you're finished, zero out the gradients?
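A rough sketch of that suggestion, reusing the names from the script above (token_gradients, model, optimizer, input, target); treat it as an outline under those assumptions, not a verified recipe for DeepSpeed:

```python
# Compute the input gradients in train mode, then discard any parameter
# gradients so no update is ever applied (optimizer.step() is never called).
model.train()
onehot_grad, inputembed_grad = token_gradients(model, input, target)
optimizer.zero_grad()  # drop parameter gradients; only the input gradients are kept
model.eval()           # back to evaluation mode
```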
Hi,
Let's see how that goes.
Yes, that should work, then
I am fine-tuning Llama 2 using accelerate + DeepSpeed ZeRO-3. During evaluation, which is run after every checkpoint step, I need to calculate the gradient of the loss w.r.t. certain input ids. As per my understanding, the embedding matrix is sharded, and when I try to get the gradient, I get an error saying that grad is set to None. Is there a cleaner way to do this using accelerate APIs?
My code:
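For illustration only (a hypothetical sketch, not the snippet from this issue): under ZeRO-3 the embedding weight is a sharded parameter, so reading it outside a forward pass generally requires an explicit gather, e.g.:

```python
import deepspeed

# Hypothetical sketch: under ZeRO-3, embeddings.weight is a sharded
# placeholder unless it is gathered first; indexing it directly is what
# leads to empty tensors or None gradients.
embeddings = model.get_input_embeddings()
with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
    # Inside this context the full weight is materialized on each rank,
    # so rows for specific input ids can be read.
    rows = embeddings.weight[input_ids]
```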