Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others


huggingface transformers - IndexError: index out of range in self when trying to fine-tune a RoBERTa model after adding special tokens

I am trying to fine-tune a RoBERTa model after adding some special tokens to its tokenizer:

    special_tokens_dict = {'additional_special_tokens': ['[Tok1]','[Tok2]']}

    tokenizer.add_special_tokens(special_tokens_dict)

I get this error when I try to train the model (on CPU):

IndexError                                Traceback (most recent call last)
<ipython-input-75-d63f8d3c6c67> in <module>()
     50         l = model(b_input_ids, 
     51                      attention_mask=b_input_mask,
---> 52                     labels=b_labels)
     53         loss,logits = l
     54         total_train_loss += l[0].item()

8 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1850         # remove once script supports set_grad_enabled
   1851         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1852     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1853 
   1854 

IndexError: index out of range in self

P.S. If I comment out the add_special_tokens call, the code works.



1 Answer


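You also need to resize the model's embedding matrix so that it can learn vector representations for the two new tokens. Adding tokens to the tokenizer assigns them the IDs 50265 and 50266, but the pretrained embedding layer only has 50265 rows (indices 0 to 50264), so looking those IDs up raises the IndexError: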

    from transformers import RobertaTokenizer, RobertaForQuestionAnswering

    t = RobertaTokenizer.from_pretrained('roberta-base')
    m = RobertaForQuestionAnswering.from_pretrained('roberta-base')
    # roberta-base 'knows' 50265 tokens
    print(m.roberta.embeddings.word_embeddings)

    special_tokens_dict = {'additional_special_tokens': ['[Tok1]', '[Tok2]']}
    t.add_special_tokens(special_tokens_dict)
    # Tell the model about the new tokens by growing its embedding matrix
    # to match the tokenizer's vocabulary size:
    m.resize_token_embeddings(len(t))
    # Make sure the resized layer keeps padding_idx=1 (RoBERTa's <pad> token ID)
    m.roberta.embeddings.word_embeddings.padding_idx = 1
    print(m.roberta.embeddings.word_embeddings)

Output:

Embedding(50265, 768, padding_idx=1)
Embedding(50267, 768, padding_idx=1)
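
If you want to confirm that this is the cause before resizing, a quick check is to compare the token IDs in a batch against the number of rows in the embedding table. The following is only a minimal sketch in plain Python. The helper name `out_of_range_ids` and the sample batch are made up for illustration and are not part of the transformers API; the only facts taken from above are the 50265 rows of roberta-base and the two added tokens.

```python
from typing import Iterable, List


def out_of_range_ids(batch_ids: Iterable[Iterable[int]], num_embeddings: int) -> List[int]:
    """Return the sorted, de-duplicated token IDs that an embedding
    table with `num_embeddings` rows cannot look up."""
    bad = {
        token_id
        for sequence in batch_ids
        for token_id in sequence
        if token_id < 0 or token_id >= num_embeddings
    }
    return sorted(bad)


# roberta-base ships with 50265 embedding rows (valid indices 0..50264).
OLD_VOCAB_SIZE = 50265
# The two added special tokens receive the next free IDs.
NEW_TOKEN_IDS = [OLD_VOCAB_SIZE, OLD_VOCAB_SIZE + 1]

# Hypothetical batch for illustration: <s> [Tok1] (arbitrary token) [Tok2] </s>
batch = [[0, NEW_TOKEN_IDS[0], 713, NEW_TOKEN_IDS[1], 2]]

# Before resizing, the new IDs fall outside the table; after resizing they fit.
before_resize = out_of_range_ids(batch, OLD_VOCAB_SIZE)
after_resize = out_of_range_ids(batch, OLD_VOCAB_SIZE + len(NEW_TOKEN_IDS))
print(before_resize)  # [50265, 50266]
print(after_resize)   # []
```

With the real objects from the question, you would pass something like `b_input_ids.tolist()` as the batch and `model.get_input_embeddings().num_embeddings` as the table size. A non-empty result means the tokenizer and the model disagree about the vocabulary size.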
