[Config] Token ID mismatch between config.json and tokenizer_config.json #67

@yuanheng-zhao

Issue

The model config and the tokenizer config disagree on a token ID.

In the llm_config section of config.json in the HF model repo:
https://huggingface.co/inclusionAI/Ming-flash-omni-2.0/blob/main/config.json#L96-L99

"image_patch_token": 157157,
"video_patch_token": 157175,
"image_start_token": 157158,
"video_start_token": 157159,

Here video_start_token is 157159.

However, in tokenizer_config.json and tokenizer.json, ID 157159 maps to a different token:

Ming/tokenizer_config.json

Lines 2149 to 2156 in 2a0c02a

"157159": {
  "content": "</image>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},

which appears to be the image end token, not the video start token.

In the tokenizer config, the video start token has ID 157160:

Ming/tokenizer_config.json

Lines 2157 to 2164 in 2a0c02a

"157160": {
  "content": "<video>",
  "lstrip": false,
  "normalized": false,
  "rstrip": false,
  "single_word": false,
  "special": true
},

Should video_start_token be updated to 157160 in the HF repo's config.json?
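
For anyone who wants to reproduce the check, here is a minimal sketch that compares the token IDs in llm_config against the tokenizer entries. The values are copied from the snippets in this issue. The helper name check_token_ids and the expected_content mapping are hypothetical, and the assumption that video_start_token should resolve to "<video>" is inferred from the names, not confirmed by the maintainers.

```python
def check_token_ids(llm_config, added_tokens_decoder, expected_content):
    """Return (key, token_id, actual, expected) for every config entry whose
    ID does not map to the expected token string in the tokenizer."""
    mismatches = []
    for key, expected in expected_content.items():
        token_id = llm_config[key]
        # Tokenizer entries are keyed by the token ID as a string.
        entry = added_tokens_decoder.get(str(token_id))
        actual = entry["content"] if entry else None
        if actual != expected:
            mismatches.append((key, token_id, actual, expected))
    return mismatches


# Values copied from config.json (llm_config section) as quoted in this issue.
llm_config = {
    "image_patch_token": 157157,
    "video_patch_token": 157175,
    "image_start_token": 157158,
    "video_start_token": 157159,
}

# The two tokenizer_config.json entries quoted in this issue.
added_tokens_decoder = {
    "157159": {"content": "</image>"},
    "157160": {"content": "<video>"},
}

# Assumption: video_start_token is meant to point at "<video>".
expected_content = {"video_start_token": "<video>"}

mismatches = check_token_ids(llm_config, added_tokens_decoder, expected_content)
for key, token_id, actual, expected in mismatches:
    print(f"{key}: id {token_id} maps to {actual!r}, expected {expected!r}")
# prints: video_start_token: id 157159 maps to '</image>', expected '<video>'
```

To run this against the real files, load them with json.load and pass config["llm_config"] together with the tokenizer's ID-to-token table. In tokenizer_config.json that table is assumed to live under the "added_tokens_decoder" key; the excerpts above show only the individual entries, so adjust the key if the file is laid out differently.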
