Asking for training details about the tokenizer #12
Comments
Hi @SSK-0723, that's interesting. Could you check whether the tokens are well trained? A log file should be saved (see LMTrajectory/utils/tokenizer.py, lines 33 to 46 at commit 4723b39).

Additionally, it's possible that the SentencePiece library used for training has been updated to work more efficiently. I haven't trained it recently, but as long as the tokens are well defined, there shouldn't be any issues.
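For reference, a minimal sketch of how a SentencePiece tokenizer of this kind is typically trained and then inspected is shown below; the input file name, vocabulary size, and model prefix are illustrative assumptions, not the repository's exact settings:

```python
import sentencepiece as spm

# Train a BPE tokenizer on a plain-text file of serialized trajectories.
# "trajectories.txt", the vocab size, and the model prefix are placeholders.
spm.SentencePieceTrainer.train(
    input="trajectories.txt",
    model_prefix="trajectory_bpe",
    model_type="bpe",
    vocab_size=1000,
    character_coverage=1.0,
)

# A well-trained tokenizer should split coordinates into sensible pieces
# rather than fusing, say, a number and a parenthesis into one token.
sp = spm.SentencePieceProcessor()
sp.load("trajectory_bpe.model")
print("vocab size:", sp.get_piece_size())
print(sp.encode_as_pieces("(3.52, 7.81) (3.61, 7.95)"))
```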
Thank you for your prompt response! Unfortunately, I encountered the same issues while attempting to run the provided tokenizer training code on two different servers. The results were not as expected, specifically when training with your preprocessed data (…).

Additionally, I tested the tokenization performance on the test data. The tokenizer you pre-trained achieved excellent results, but the one I trained using the provided code performed noticeably worse. Besides the short training time, RAM usage was also very low during training. This leads me to believe there might be an issue with the training process. By the way, I used (…).

Looking forward to your response!
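A simple way to compare a pretrained tokenizer against a re-trained one on held-out strings might look like the sketch below; the model file names and the sample string are assumptions for illustration:

```python
import sentencepiece as spm

# Load both tokenizers; the file names below are placeholder assumptions.
pretrained = spm.SentencePieceProcessor()
pretrained.load("pretrained.model")
retrained = spm.SentencePieceProcessor()
retrained.load("retrained.model")

# Compare how each model splits the same held-out coordinate string.
sample = "(3.52, 7.81) (3.61, 7.95) (3.70, 8.10)"
print("pretrained:", pretrained.encode_as_pieces(sample))
print("re-trained:", retrained.encode_as_pieces(sample))
```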
It's unusual for it to finish so quickly. During my training, I didn't encounter cases where numbers and parentheses were tokenized together as a single piece. Does the coverage in the log file show 1.0?

To improve tokenizer training, leveraging additional synthetic data with random coordinates could be helpful. Using the (…)
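One possible way to generate such synthetic training sentences is sketched below; the coordinate range, precision, sequence length, and output file name are arbitrary assumptions:

```python
import random

# Write synthetic "sentences" of random (x, y) coordinates so the tokenizer
# sees a wide variety of digit combinations during training.
random.seed(0)
with open("synthetic_coords.txt", "w") as f:
    for _ in range(100_000):
        coords = [
            f"({random.uniform(-20.0, 20.0):.2f}, {random.uniform(-20.0, 20.0):.2f})"
            for _ in range(12)
        ]
        f.write(" ".join(coords) + "\n")
```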
Yes, character_coverage=1 in my log. I had similar results on both (…).
Hi, I noticed that training seems to work when model=unigram. However, after changing only unigram to bpe, the training ends relatively quickly.
When choosing unigram as the model type, fine-tuning on my own data also ended quickly and gave poor results. Looking forward to your details. ☀
Hi, I encountered a similar issue. When training BPE with the ETH data, RAM usage is around 20GB and training completes in just a few minutes. With Unigram, however, I ran into the problem of the vocabulary size being too large; the ETH data can use a maximum of 866 tokens. I also noticed that the tokenizer you provided is not partitioned by dataset. Should training use data from all four datasets together, rather than just one dataset as set in the tokenizer.py script?
Hi @SSK-0723 and @SuperiorDtj, to clarify the environment used for training the tokenizer, I've conducted some investigations over the past few days. I used (…).

For a fair comparison, it is necessary to train a separate tokenizer for each scene, and I followed this approach. The reason I provided only one pretrained tokenizer is that the output vocabularies turned out to be identical across all scenes in the ETH-UCY dataset. I'll continue looking into the matter to see if there are any other factors that might be causing issues.
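To verify that claim on one's own per-scene tokenizers, one could compare the generated .vocab files, which list one piece per line with its score; the file naming scheme and scene list below are assumptions:

```python
# Compare the piece sets of separately trained per-scene tokenizers.
# File names follow an assumed "tokenizer_<scene>.vocab" convention.
scenes = ["eth", "hotel", "univ", "zara1", "zara2"]

def pieces(path):
    with open(path, encoding="utf-8") as f:
        return {line.split("\t")[0] for line in f}

reference = pieces(f"tokenizer_{scenes[0]}.vocab")
for scene in scenes[1:]:
    same = pieces(f"tokenizer_{scene}.vocab") == reference
    print(f"{scene}: {'identical' if same else 'different'} vocabulary")
```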
Hello! Thanks for your great work! I have a question: you say the tokenizer training process requires a system with more than 2TB of RAM and takes approximately 12 hours for each run. But when I reproduced it (by running python ./utils/tokenizer.py --dataset eth --model bpe --metric pixel), it took only a few minutes to get the result. I wonder if any other operations are needed? Thank you!