-
Notifications
You must be signed in to change notification settings - Fork 477
added better guidance for if deprecated tokenizer path fails #1568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but i have some suggestions.
torchtitan/components/tokenizer.py
Outdated
if "assets/tokenizer" in tokenizer_path: | ||
raise FileNotFoundError( | ||
"Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path " | ||
"and download to --model.hf_assets_path using ./scripts/download_hf_assets.py" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we add link https://github.com/pytorch/torchtitan/tree/main/torchtitan/models/deepseek_v3#download-tokenizer for better user guidance.
torchtitan/components/tokenizer.py
Outdated
|
||
if "assets/tokenizer" in tokenizer_path: | ||
raise FileNotFoundError( | ||
"Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about adding more info about path deprecation:
"detected ... path that's deprecated in #1526"
torchtitan/components/tokenizer.py
Outdated
if "assets/tokenizer" in tokenizer_path: | ||
raise FileNotFoundError( | ||
"Detected deprecated ./assets/tokenizer path. Remove --model.tokenizer_path " | ||
"and download to --model.hf_assets_path using ./scripts/download_hf_assets.py" | ||
) | ||
else: | ||
raise FileNotFoundError( | ||
f"No supported tokenizer files found in '{tokenizer_path}'. " | ||
f"Available files: {available_files}. " | ||
"Looking for: tokenizer.json, vocab.txt+merges.txt, or vocab.json+merges.txt" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to change the check here because the logic was:
- check "tokenizer_path" folder for available tokenizer files.
While the new logic we add is to remind users of the deprecation of old tokenizer path. Since we have verified the correctness of "tokenizer_path", there's no need to perform it again -- if the execution goes to the else branch, that simply means the specified tokenizer file is not available and the existing information is detailed enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the logic is not required as @XilunWu mentioned, please removed it. If the logic is required, this if/else plus the error messages look almost identical to the error handling block above, please consider to make it as a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to change the check here because the logic was:
- check "tokenizer_path" folder for available tokenizer files.
While the new logic we add is to remind users of the deprecation of old tokenizer path. Since we have verified the correctness of "tokenizer_path", there's no need to perform it again -- if the execution goes to the else branch, that simply means the specified tokenizer file is not available and the existing information is detailed enough.
Ok I think I'll remove the second check. I added it because I wanted to cover the use case of people who may have this path due to using legacy script but are missing the tokenizer file, but I realize people will probably either not have this path at all or have the path with the files as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stamp to unblock, please address the comments before landing.
torchtitan/components/tokenizer.py
Outdated
|
||
if "assets/tokenizer" in tokenizer_path: | ||
raise FileNotFoundError( | ||
"Detected ./assets/tokenizer path which was deprecated in https://github.com/pytorch/torchtitan/pull/1540. \n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason for the extra space?
b768745
to
58bd436
Compare
58bd436
to
293876e
Compare
Adds a check to see if the old tokenizer path is being used when tokenizer path fails. This way it can provide guidance to people to update to the supported
hf_assets_path
anddownload_hf_assets.py
script