
Loading a checkpoint under model parallelism causes a word embedding size mismatch #3

Open

Billijk opened this issue Aug 13, 2021 · 1 comment

Billijk commented Aug 13, 2021

Hi author,

I tried splitting a checkpoint with change_mp.py and then loading it under model parallelism, but loading fails with a word embedding size mismatch. After reading the code, I found that at load time the vocabulary size is padded to a multiple of a certain number (to improve compute efficiency), and that number is args.make_vocab_size_divisible_by * mpu.get_model_parallel_world_size(). So when the MP degree changes, the padded vocabulary size changes as well, and the model parameters can no longer be loaded.

before = num_tokens
after = before
multiple = args.make_vocab_size_divisible_by * \
    mpu.get_model_parallel_world_size()
while (after % multiple) != 0:
    after += 1
print_rank_0('> padded vocab (size: {}) with {} dummy '
             'tokens (new size: {})'.format(
                 before, after - before, after))
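
To make the mismatch concrete, here is a minimal standalone sketch of the same padding rule, with assumed numbers (a hypothetical 50257-token vocabulary and make_vocab_size_divisible_by=128; both are illustrative, not taken from this repo's config):

def padded_vocab_size(num_tokens, divisible_by, mp_world_size):
    # Same padding rule as the snippet above: round the vocab size up
    # to the next multiple of divisible_by * mp_world_size.
    multiple = divisible_by * mp_world_size
    after = num_tokens
    while after % multiple != 0:
        after += 1
    return after

# Hypothetical numbers: 50257-token vocab, make_vocab_size_divisible_by=128.
print(padded_vocab_size(50257, 128, 1))  # 50304
print(padded_vocab_size(50257, 128, 4))  # 50688 -- a different embedding shape, so loading fails

Since the padded size depends on mp_world_size, a checkpoint saved at one MP degree has an embedding matrix whose first dimension no longer matches the model built at another MP degree.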

A temporary fix is to hard-code the multiple variable at line 671 here to args.make_vocab_size_divisible_by.
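
A minimal sketch of that fix, assuming the padding code shown above (only the assignment to multiple changes):

before = num_tokens
after = before
# Temporary fix: drop the model-parallel factor so the padded vocab size
# no longer depends on the MP degree, and checkpoints load across MP settings.
multiple = args.make_vocab_size_divisible_by
while (after % multiple) != 0:
    after += 1

One caveat: with a vocab-parallel embedding, the padded size must still split evenly across model-parallel ranks, so this only works when the MP world size divides make_vocab_size_divisible_by (true for the usual value of 128 and MP degrees of 2, 4, or 8).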

makeme-zgz commented

Did you manage to finetune or pretrain successfully? @Billijk On my side, the checkpoint obtained from the download link raises a runtime error when loading.
