
eos_token_id for Textual Inversion #10754

Open
solim-i opened this issue Feb 10, 2025 · 1 comment
Labels
bug (Something isn't working), stale (Issues that haven't received updates)

Comments


solim-i commented Feb 10, 2025

Describe the bug

Hi, I implemented textual inversion following this link, but I think there is something wrong with eos_token_id in the stable-diffusion-v1-5 text encoder config.

The config file looks like this, which means eos_token_id == 2:

{
  "_name_or_path": "openai/clip-vit-large-patch14",
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 768,
  "torch_dtype": "float32",
  "transformers_version": "4.22.0.dev0",
  "vocab_size": 49408
}

but in transformers modeling_clip.py:

        if self.eos_token_id == 2:
            # The `eos_token_id` was incorrect before PR #24773: Let's keep what have been done here.
            # A CLIP model with such `eos_token_id` in the config can't work correctly with extra new tokens added
            # ------------------------------------------------------------
            # text_embeds.shape = [batch_size, sequence_length, transformer.width]
            # take features from the eot embedding (eot_token is the highest number in each sequence)
            # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
            pooled_output = last_hidden_state[
                torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
                input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1),
            ]

I think this means the current code is not compatible with textual inversion, because the argmax picks the position of the newly added token (token id 49408, the highest id in the sequence) rather than the EOS token, so we get the embedding of the placeholder token instead of the EOS token (a short sketch below illustrates this).
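To make the concern concrete, here is a minimal sketch of the difference between the two pooling branches. This is my own illustration, not code from diffusers or transformers; the prompt token ids 320, 1125 and 539 just stand in for ordinary words, and the new placeholder token is assumed to have been appended with id 49408 (since vocab_size is 49408).

import torch

# Assumed ids from the CLIP tokenizer: <|startoftext|> = 49406, <|endoftext|> = 49407,
# and a newly added placeholder token appended with id 49408.
eos_token_id = 49407
input_ids = torch.tensor([[49406, 320, 1125, 539, 49408, 49407, 49407]])

# Branch taken when config.eos_token_id == 2 (quoted above):
# argmax returns the position of the largest id, which is now the placeholder token.
legacy_pos = input_ids.to(torch.int).argmax(dim=-1)
print(legacy_pos)  # tensor([4]) -> position of the placeholder, not the EOS token

# Branch taken when config.eos_token_id is the real EOS id (after PR #24773, as I
# understand it): the first occurrence of eos_token_id is used instead.
fixed_pos = (input_ids.to(torch.int) == eos_token_id).int().argmax(dim=-1)
print(fixed_pos)   # tensor([5]) -> position of the first EOS token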

I might be wrong, but any comments would be really helpful.

Thank you.

Reproduction

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="" \
  --initializer_token="dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0

System Info

diffusers 0.32.0.dev0

Who can help?

No response

solim-i added the bug (Something isn't working) label on Feb 10, 2025
github-actions bot commented Mar 12, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label on Mar 12, 2025