
eos_token_id for Textual Inversion #10754

Open
solim-i opened this issue Feb 10, 2025 · 1 comment
Labels
bug (Something isn't working), stale (Issues that haven't received updates)

Comments


solim-i commented Feb 10, 2025

Describe the bug

Hi, I implemented textual inversion following this link, but I think there is something wrong with eos_token_id in the stable-diffusion-v1-5 text encoder config.

The config file looks like this, which means eos_token_id == 2:

{
  "_name_or_path": "openai/clip-vit-large-patch14",
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 768,
  "torch_dtype": "float32",
  "transformers_version": "4.22.0.dev0",
  "vocab_size": 49408
}

but in transformers modeling_clip.py:

        if self.eos_token_id == 2:
            # The `eos_token_id` was incorrect before PR #24773: Let's keep what have been done here.
            # A CLIP model with such `eos_token_id` in the config can't work correctly with extra new tokens added
            # ------------------------------------------------------------
            # text_embeds.shape = [batch_size, sequence_length, transformer.width]
            # take features from the eot embedding (eot_token is the highest number in each sequence)
            # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
            pooled_output = last_hidden_state[
                torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
                input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1),
            ]

I think this means the current code is not compatible with textual inversion, because the argmax picks the position of the newly added token (token id 49408, the highest id in the sequence) rather than the EOS token, so we get the embedding of the placeholder token instead of the EOS token (a short sketch below illustrates this).
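To make the concern concrete, here is a minimal sketch of the difference between the two pooling branches. This is my own illustration, not code from diffusers or transformers; the prompt token ids 320, 1125 and 539 just stand in for ordinary words, and the new placeholder token is assumed to have been appended with id 49408 (since vocab_size is 49408).

import torch

# Assumed ids from the CLIP tokenizer: <|startoftext|> = 49406, <|endoftext|> = 49407,
# and a newly added placeholder token appended with id 49408.
eos_token_id = 49407
input_ids = torch.tensor([[49406, 320, 1125, 539, 49408, 49407, 49407]])

# Branch taken when config.eos_token_id == 2 (quoted above):
# argmax returns the position of the largest id, which is now the placeholder token.
legacy_pos = input_ids.to(torch.int).argmax(dim=-1)
print(legacy_pos)  # tensor([4]) -> position of the placeholder, not the EOS token

# Branch taken when config.eos_token_id is the real EOS id (after PR #24773, as I
# understand it): the first occurrence of eos_token_id is used instead.
fixed_pos = (input_ids.to(torch.int) == eos_token_id).int().argmax(dim=-1)
print(fixed_pos)   # tensor([5]) -> position of the first EOS token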

I might be wrong, but any comments would be really helpful.

Thank you.

Reproduction

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="" \
  --initializer_token="dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0

System Info

diffusers 0.32.0.dev0

Who can help?

No response

solim-i added the bug (Something isn't working) label on Feb 10, 2025
github-actions bot commented Mar 12, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label on Mar 12, 2025