Skip to content

added better guidance for if deprecated tokenizer path fails #1568

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 15, 2025

Conversation

wesleytruong
Copy link
Contributor

Adds a check to see if the old tokenizer path is being used when tokenizer path fails. This way it can provide guidance to people to update to the supported hf_assets_path and download_hf_assets.py script

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 14, 2025
@wesleytruong wesleytruong changed the title added better guidance for when tokenizer path fails added better guidance for if deprecated tokenizer path fails Aug 14, 2025
Copy link
Contributor

@XilunWu XilunWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but i have some suggestions.

if "assets/tokenizer" in tokenizer_path:
raise FileNotFoundError(
"Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path "
"and download to --model.hf_assets_path using ./scripts/download_hf_assets.py"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


if "assets/tokenizer" in tokenizer_path:
raise FileNotFoundError(
"Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding more info about path deprecation:
"detected ... path that's deprecated in #1526"

Comment on lines 168 to 178
if "assets/tokenizer" in tokenizer_path:
raise FileNotFoundError(
"Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path "
"and download to --model.hf_assets_path using ./scripts/download_hf_assets.py"
)
else:
raise FileNotFoundError(
f"No supported tokenizer files found in '{tokenizer_path}'. "
f"Available files: {available_files}. "
"Looking for: tokenizer.json, vocab.txt+merges.txt, or vocab.json+merges.txt"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to change the check here because the logic was:

  • check "tokenizer_path" folder for available tokenizer files.

While the new logic we add is to remind users of the deprecation of old tokenizer path. Since we have verified the correctness of "tokenizer_path", there's no need to perform it again -- if the execution goes to the else branch, that simply means the specified tokenizer file is not available and the existing information is detailed enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the logic is not required as @XilunWu mentioned, please removed it. If the logic is required, this if/else plus the error messages look almost identical to the error handling block above, please consider to make it as a function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to change the check here because the logic was:

  • check "tokenizer_path" folder for available tokenizer files.

While the new logic we add is to remind users of the deprecation of old tokenizer path. Since we have verified the correctness of "tokenizer_path", there's no need to perform it again -- if the execution goes to the else branch, that simply means the specified tokenizer file is not available and the existing information is detailed enough.

Ok I think I'll remove the second check. I added it because I wanted to cover the use case of people who may have this path due to using legacy script but are missing the tokenizer file, but I realize people will probably either not have this path at all or have the path with the files as well.

Copy link
Contributor

@fegin fegin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamp to unblock, please address the comments before landing.


if "assets/tokenizer" in tokenizer_path:
raise FileNotFoundError(
"Detected ./assets/tokenizer path which was deprecated in https://github.com/pytorch/torchtitan/pull/1540. \n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason for the extra space?

@wesleytruong wesleytruong force-pushed the tokenizer_path_warning branch 2 times, most recently from b768745 to 58bd436 Compare August 14, 2025 21:05
@wesleytruong wesleytruong force-pushed the tokenizer_path_warning branch from 58bd436 to 293876e Compare August 14, 2025 21:06
@wesleytruong wesleytruong merged commit a59abea into main Aug 15, 2025
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants