
Ask for the training details about tokenizer #12

Open
SSK-0723 opened this issue Dec 22, 2024 · 8 comments

Comments

@SSK-0723

Hello! Thanks for your great work! I have a question: you say the tokenizer training process requires a system with more than 2 TB of RAM and takes approximately 12 hours each. But when I reproduced it (by running python ./utils/tokenizer.py --dataset eth --model bpe --metric pixel), it took only a few minutes to get the result. I wonder if any other steps are needed? Thank you!

@InhwanBae
Owner

Hi @SSK-0723,

That's interesting. Could you check if the tokens are well-trained? A log file should be saved as a .txt in the same directory. You might also consider updating the tokenizer's training parameters, which could improve results.

import sentencepiece as spm

# tokenizer_basedir, filename, dataset, metric, and modeltype are assumed to be
# defined elsewhere in the training script.
spm.SentencePieceTrainer.train(
    input=tokenizer_basedir + f"{dataset}-data-8-12-{metric}-multimodal.txt",
    model_prefix=tokenizer_basedir + filename,
    vocab_size=1224,
    unk_id=3,
    bos_id=1,
    eos_id=2,
    pad_id=0,
    control_symbols="[PAD],[UNK],[CLS],[SEP],[MASK]",
    model_type=modeltype,
    train_extremely_large_corpus=True,
    # use_all_vocab=True,
    character_coverage=1.0,  # 0.99995
)

Additionally, it’s possible that the SentencePiece library used for training has been updated to work more efficiently. I haven’t trained it recently, but as long as the tokens are well-defined, there shouldn’t be any issues.
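
As a quick sanity check (not from the original reply), loading the trained model and encoding a short coordinate string shows whether parentheses, digits, and decimal points end up as separate pieces; the model file name and sample string below are assumptions based on the names and examples used in this thread.

import sentencepiece as spm

# Minimal sketch: inspect how a trained tokenizer splits a coordinate string.
# The model file name is assumed from the vocab names mentioned in this thread.
sp = spm.SentencePieceProcessor(model_file="trajectoryspiece-meter-bpe.model")

sample = "(1.37, 5.34) (3.29, 2.18)"  # hypothetical trajectory snippet
print("vocab size:", sp.get_piece_size())
print("pieces:", sp.encode(sample, out_type=str))
# A well-trained tokenizer should not produce merged pieces such as "▁(1.37".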

@SSK-0723
Author

Thank you for your prompt response!

Unfortunately, I have encountered the same issue while attempting to run the provided tokenizer training code on two different servers. The results were not as expected. Specifically, when training with your preprocessed data (eth-train-8-12-meter-multimodal.json), the resulting vocabulary file trajectoryspiece-meter-bpe_myversion.vocab did not split decimal points correctly. Some randomly selected entries are as follows:
▁(1.37, -650, ▁(5.34, -651, ▁(3.29, -652 ….
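
As an illustration (not part of the original comment), a short script like the one below can list vocab entries where an opening parenthesis was merged with digits; the vocab file name follows the one mentioned above.

import re

# Sketch: scan a SentencePiece .vocab file (tab-separated piece<TAB>score) for pieces
# that merge an opening parenthesis with a digit, e.g. "▁(1.37".
with open("trajectoryspiece-meter-bpe_myversion.vocab", encoding="utf-8") as f:
    for line in f:
        piece = line.split("\t")[0]
        if re.search(r"\(\d", piece):
            print(piece)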

Additionally, I tested the tokenization performance on the test data. The tokenizer you pre-trained achieved excellent results, but the one I trained using the provided code performed noticeably worse. Besides the short training time, RAM usage during training was also very low. This leads me to believe there might be an issue with the training process.

By the way, I used sentencepiece version 0.2.0 and transformers version 4.46.3. I am not sure if these versions could affect the training process.

Looking forward to your response!

@InhwanBae
Owner

That’s unusual for it to finish so quickly. During my training, I didn’t encounter cases where numbers and parentheses were merged into a single token. Does the coverage in the log file show 1.0?

To improve tokenizer training, leveraging additional synthetic data with random coordinates could be helpful. Using the eth-train-8-12-pixel-multimodal.json file might also provide further improvements. Hope this helps you!
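
As a rough sketch of the synthetic-data idea (not from the original reply), one could generate random coordinate lines and pass them to the trainer alongside the existing corpus; the output file name, coordinate ranges, and "(x.xx, y.yy)" format below are assumptions modeled on the examples quoted in this thread.

import random

# Sketch: build a synthetic corpus of random coordinates so every digit/decimal
# combination appears during tokenizer training. File name, ranges, and format
# are illustrative assumptions, not values from the repository.
random.seed(0)
with open("synthetic-coordinates.txt", "w", encoding="utf-8") as f:
    for _ in range(100_000):
        line = " ".join(
            f"({random.uniform(-30, 30):.2f}, {random.uniform(-30, 30):.2f})"
            for _ in range(12)
        )
        f.write(line + "\n")
# This file could then be passed to spm.SentencePieceTrainer.train together with the
# existing *-multimodal.txt corpus (the input argument accepts a comma-separated list of files).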

@SSK-0723
Author

Yes, character_coverage=1 in my log. I had similar results with both eth-train-8-12-meter-multimodal.json and eth-train-8-12-pixel-multimodal.json.
I would like to confirm whether your pre-trained tokenizer trajectoryspiece-meter-bpe was obtained directly by running the command python ./utils/tokenizer.py --dataset eth --model bpe --metric meter. If so, could you share the versions of the core packages you used? Thanks a lot!

@SSK-0723
Author

Hi, I noticed that training seems to work when model=unigram. However, simply changing unigram to bpe makes the training finish very quickly again.

@SSK-0723
Author

When choosing unigram as the model type, fine-tuning on my own data also finished quickly and gave poor results. Looking forward to your details. ☀

@SuperiorDtj

Hi, I encountered a similar issue. When training BPE with the ETH data, RAM usage is around 20 GB and training completes in just a few minutes. However, with unigram, I ran into the problem of the vocabulary size being too large; the ETH data can support a maximum of only 866 tokens.

I also noticed that the tokenizer you provided is not split by dataset. Should training use data from all four datasets together, rather than just one dataset as set in the tokenizer.py script?
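
Not part of the original comment, but one possible workaround when SentencePiece's unigram trainer reports that the requested vocabulary size is too large is to lower vocab_size or relax the hard limit; the parameter values below are illustrative, and the corpus file name follows the pattern used in the earlier snippet.

import sentencepiece as spm

# Sketch: if unigram training fails because the corpus cannot support vocab_size=1224
# (only 866 tokens for the ETH data, per the comment above), either lower vocab_size
# or treat it as a soft target with hard_vocab_limit=False. Values are illustrative.
spm.SentencePieceTrainer.train(
    input="eth-data-8-12-pixel-multimodal.txt",  # assumed name, following the earlier snippet
    model_prefix="trajectoryspiece-pixel-unigram",
    model_type="unigram",
    vocab_size=1224,
    hard_vocab_limit=False,
    character_coverage=1.0,
)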

@InhwanBae
Owner

Hi @SSK-0723 and @SuperiorDtj,

To clarify the environment used to train the tokenizer, I’ve done some investigation over the past few days. I used sentencepiece==0.1.99 and, due to memory limitations on my Linux server, I trained the tokenizer on a Windows system with 1 TB of physical RAM and an 8 TB SSD for the page file. Other than this, all settings remained the same.

For a fair comparison, it is necessary to train a separate tokenizer for each scene, and I followed this approach. The reason I provided only one pretrained tokenizer is that the output vocabs turned out to be identical across all scenes in the ETH-UCY dataset. I’ll continue looking into the matter to see if there are any other factors that might be causing issues.
