
pytorch - Is there a way to use GPU instead of CPU for BERT tokenization?

I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words):

from transformers import BertTokenizer

# creating a BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

#encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

As-is, it runs on the CPU and only on a single core. I tried to parallelize it, but that would speed up processing by at most 16x on my 16-core CPU, which would still take ages to tokenize the full dataset.
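For reference, a minimal sketch of what such a CPU-side parallelization could look like, using Python's multiprocessing.Pool; the helper names, worker count and chunk size are illustrative assumptions, not taken from my actual code:

from multiprocessing import Pool

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

def encode_chunk(texts):
    # each worker process tokenizes its own chunk of sentences
    return tokenizer.batch_encode_plus(
        texts,
        add_special_tokens=True,
        return_attention_mask=True,
        pad_to_max_length=True,
        max_length=256,
        return_tensors='pt'
    )

def encode_parallel(texts, n_workers=16, chunk_size=10_000):
    # split the corpus into chunks and tokenize them on separate cores;
    # the speed-up is bounded by the number of available cores
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with Pool(n_workers) as pool:
        return pool.map(encode_chunk, chunks)

Even with all 16 cores busy, this only divides the wall-clock time by roughly the core count, which is the limitation described above.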

Is there any way to make it run on GPU or to speed this up some other way?

EDIT: I have also tried using a fast tokenizer:

from transformers import BertTokenizerFast

# creating a fast BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased',
                                              do_lower_case=True)

Then calling batch_encode_plus the same way as before:

#encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

But batch_encode_plus raises the following error:

TypeError: batch_text_or_text_pairs has to be a list (got <class 'numpy.ndarray'>)
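The message indicates that batch_text_or_text_pairs must be a plain Python list rather than a numpy array. A minimal, untested sketch of passing a list instead of the ndarray (the .tolist() conversion and the train_texts name are assumptions for illustration, not something I have tried):

# convert the ndarray of comments into a Python list before encoding
train_texts = df[df.data_type == 'train'].comment.values.tolist()

encoded_dict = tokenizer.batch_encode_plus(
    train_texts,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)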

question from: https://stackoverflow.com/questions/65857708/is-there-a-way-to-use-gpu-instead-of-cpu-for-bert-tokenization


1 Answer

Waiting for answers
