
Ask for the training details about tokenizer #12

Open
SSK-0723 opened this issue Dec 22, 2024 · 8 comments

Comments

@SSK-0723

Hello! Thanks for your great work! I have a question: you say the tokenizer training process requires a system with more than 2 TB of RAM and takes approximately 12 hours each. But when I reproduced it (by running python ./utils/tokenizer.py --dataset eth --model bpe --metric pixel), it took only a few minutes to get the result. I wonder if any other steps are needed? Thank you!

@InhwanBae
Owner

Hi @SSK-0723,

That's interesting. Could you check if the tokens are well-trained? A log file should be saved as a .txt in the same directory. You might also consider updating the tokenizer's training parameters, which could improve results.

import sentencepiece as spm

# tokenizer_basedir, filename, dataset, metric, and modeltype are assumed to be
# defined elsewhere in the training script.
spm.SentencePieceTrainer.train(
    input=tokenizer_basedir + f"{dataset}-data-8-12-{metric}-multimodal.txt",
    model_prefix=tokenizer_basedir + filename,
    vocab_size=1224,
    unk_id=3,
    bos_id=1,
    eos_id=2,
    pad_id=0,
    control_symbols="[PAD],[UNK],[CLS],[SEP],[MASK]",
    model_type=modeltype,
    train_extremely_large_corpus=True,
    # use_all_vocab=True,
    character_coverage=1.0,  # 0.99995
)

Additionally, it’s possible that the SentencePiece library used for training has been updated to work more efficiently. I haven’t trained it recently, but as long as the tokens are well-defined, there shouldn’t be any issues.
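
As a quick sanity check (not from the original reply), loading the trained model and encoding a short coordinate string shows whether parentheses, digits, and decimal points end up as separate pieces; the model file name and sample string below are assumptions based on the names and examples used in this thread.

import sentencepiece as spm

# Minimal sketch: inspect how a trained tokenizer splits a coordinate string.
# The model file name is assumed from the vocab names mentioned in this thread.
sp = spm.SentencePieceProcessor(model_file="trajectoryspiece-meter-bpe.model")

sample = "(1.37, 5.34) (3.29, 2.18)"  # hypothetical trajectory snippet
print("vocab size:", sp.get_piece_size())
print("pieces:", sp.encode(sample, out_type=str))
# A well-trained tokenizer should not produce merged pieces such as "▁(1.37".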

@SSK-0723
Author

Thank you for your prompt response!

Unfortunately, I have encountered the same issue while attempting to run the provided tokenizer training code on two different servers. The results were not as expected. Specifically, when training with your preprocessed data (eth-train-8-12-meter-multimodal.json), the resulting vocabulary file trajectoryspiece-meter-bpe_myversion.vocab did not split decimal points correctly. Some randomly selected entries are as follows:
▁(1.37, -650, ▁(5.34, -651, ▁(3.29, -652 ….
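
As an illustration (not part of the original comment), a short script like the one below can list vocab entries where an opening parenthesis was merged with digits; the vocab file name follows the one mentioned above.

import re

# Sketch: scan a SentencePiece .vocab file (tab-separated piece<TAB>score) for pieces
# that merge an opening parenthesis with a digit, e.g. "▁(1.37".
with open("trajectoryspiece-meter-bpe_myversion.vocab", encoding="utf-8") as f:
    for line in f:
        piece = line.split("\t")[0]
        if re.search(r"\(\d", piece):
            print(piece)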

Additionally, I tested the tokenization performance on the test data. The tokenizer you pre-trained achieved excellent results, but the one I trained using the provided code performed noticeably worse. Besides the short training time, RAM usage during training was also very low. This leads me to believe there might be an issue with the training process.

By the way, I used sentencepiece version 0.2.0 and transformers version 4.46.3. I am not sure if these versions could affect the training process.

Looking forward to your response!

@InhwanBae
Owner

That’s unusual for it to finish so quickly. During my training, I didn’t encounter cases where numbers and parentheses were merged into a single token. Does the coverage in the log file show 1.0?

To improve tokenizer training, leveraging additional synthetic data with random coordinates could be helpful. Using the eth-train-8-12-pixel-multimodal.json file might also provide further improvements. Hope this helps you!
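
As a rough sketch of the synthetic-data idea (not from the original reply), one could generate random coordinate lines and pass them to the trainer alongside the existing corpus; the output file name, coordinate ranges, and "(x.xx, y.yy)" format below are assumptions modeled on the examples quoted in this thread.

import random

# Sketch: build a synthetic corpus of random coordinates so every digit/decimal
# combination appears during tokenizer training. File name, ranges, and format
# are illustrative assumptions, not values from the repository.
random.seed(0)
with open("synthetic-coordinates.txt", "w", encoding="utf-8") as f:
    for _ in range(100_000):
        line = " ".join(
            f"({random.uniform(-30, 30):.2f}, {random.uniform(-30, 30):.2f})"
            for _ in range(12)
        )
        f.write(line + "\n")
# This file could then be passed to spm.SentencePieceTrainer.train together with the
# existing *-multimodal.txt corpus (the input argument accepts a comma-separated list of files).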

@SSK-0723
Author

Yes, character_coverage=1 in my log. I had similar results with both eth-train-8-12-meter-multimodal.json and eth-train-8-12-pixel-multimodal.json.
I would like to confirm whether your pre-trained tokenizer trajectoryspiece-meter-bpe was obtained directly by running the command python ./utils/tokenizer.py --dataset eth --model bpe --metric meter. If so, could you share the versions of the core packages you used? Thanks a lot!

@SSK-0723
Author

Hi, I noticed that training seems to work when model=unigram. However, simply changing unigram to bpe makes the training finish very quickly again.

@SSK-0723
Author

When choosing unigram as the model type, fine-tuning on my own data also finished quickly and gave poor results. Looking forward to your details. ☀

@SuperiorDtj

Hi, I encountered a similar issue. When training BPE with the ETH data, RAM usage is around 20 GB and training completes in just a few minutes. However, with unigram, I ran into the problem of the vocabulary size being too large; the ETH data can support a maximum of only 866 tokens.

I also noticed that the tokenizer you provided is not split by dataset. Should training use data from all four datasets together, rather than just one dataset as set in the tokenizer.py script?
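
Not part of the original comment, but one possible workaround when SentencePiece's unigram trainer reports that the requested vocabulary size is too large is to lower vocab_size or relax the hard limit; the parameter values below are illustrative, and the corpus file name follows the pattern used in the earlier snippet.

import sentencepiece as spm

# Sketch: if unigram training fails because the corpus cannot support vocab_size=1224
# (only 866 tokens for the ETH data, per the comment above), either lower vocab_size
# or treat it as a soft target with hard_vocab_limit=False. Values are illustrative.
spm.SentencePieceTrainer.train(
    input="eth-data-8-12-pixel-multimodal.txt",  # assumed name, following the earlier snippet
    model_prefix="trajectoryspiece-pixel-unigram",
    model_type="unigram",
    vocab_size=1224,
    hard_vocab_limit=False,
    character_coverage=1.0,
)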

@InhwanBae
Owner

Hi @SSK-0723 and @SuperiorDtj,

To clarify the environment used to train the tokenizer, I’ve done some investigation over the past few days. I used sentencepiece==0.1.99 and, due to memory limitations on my Linux server, I trained the tokenizer on a Windows system with 1 TB of physical RAM and an 8 TB SSD for the page file. Other than this, all settings remained the same.

For a fair comparison, it is necessary to train a separate tokenizer for each scene, and I followed this approach. The reason I provided only one pretrained tokenizer is that the output vocabs turned out to be identical across all scenes in the ETH-UCY dataset. I’ll continue looking into the matter to see if there are any other factors that might be causing issues.
