xmodaler.tokenization

class xmodaler.tokenization.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, **kwargs)[source]

Bases: PreTrainedTokenizer

Constructs a BertTokenizer. BertTokenizer runs end-to-end tokenization: punctuation splitting followed by WordPiece tokenization. A minimal usage sketch follows the parameter list below.

Parameters:
  • vocab_file – Path to a one-wordpiece-per-line vocabulary file.

  • do_lower_case – Whether to lower-case the input. Only has an effect when do_basic_tokenize=True.

  • do_basic_tokenize – Whether to do basic tokenization (whitespace and punctuation splitting) before WordPiece.

  • max_len – An artificial maximum length to truncate tokenized sequences to; the effective maximum length is always the minimum of this value (if specified) and the underlying BERT model’s sequence length.

  • never_split – List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.
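For illustration, a minimal sketch of constructing the tokenizer from a tiny hand-written vocabulary file (the file name and vocabulary contents below are made up for the example; a real BERT vocabulary has roughly 30k entries):

    from xmodaler.tokenization import BertTokenizer

    # Write a toy one-wordpiece-per-line vocabulary file.
    vocab_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world", "##s"]
    with open("tiny_vocab.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(vocab_tokens))

    tokenizer = BertTokenizer("tiny_vocab.txt", do_lower_case=True)
    print(tokenizer.tokenize("Hello worlds"))  # e.g. ['hello', 'world', '##s']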

add_special_tokens_sentences_pair(token_ids_0, token_ids_1)[source]

Adds special tokens to a sequence pair for sequence classification tasks. A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]
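A short sketch of the pair layout, reusing the toy tokenizer constructed above (tokenize, convert_tokens_to_ids and convert_ids_to_tokens come from the PreTrainedTokenizer base class):

    ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello"))
    ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("world"))
    pair_ids = tokenizer.add_special_tokens_sentences_pair(ids_a, ids_b)
    # Expected layout: [CLS] hello [SEP] world [SEP]
    print(tokenizer.convert_ids_to_tokens(pair_ids))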

add_special_tokens_single_sentence(token_ids)[source]

Adds special tokens to a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]
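For example, with the same toy tokenizer:

    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))
    single_ids = tokenizer.add_special_tokens_single_sentence(ids)
    print(tokenizer.convert_ids_to_tokens(single_ids))  # ['[CLS]', 'hello', 'world', '[SEP]']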

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings) into a single string.
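For example, the WordPiece continuation marker "##" is removed when joining:

    print(tokenizer.convert_tokens_to_string(["hello", "world", "##s"]))  # 'hello worlds'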

max_model_input_sizes = {'bert-base-cased': 512, 'bert-base-cased-finetuned-mrpc': 512, 'bert-base-chinese': 512, 'bert-base-german-cased': 512, 'bert-base-multilingual-cased': 512, 'bert-base-multilingual-uncased': 512, 'bert-base-uncased': 512, 'bert-large-cased': 512, 'bert-large-cased-whole-word-masking': 512, 'bert-large-cased-whole-word-masking-finetuned-squad': 512, 'bert-large-uncased': 512, 'bert-large-uncased-whole-word-masking': 512, 'bert-large-uncased-whole-word-masking-finetuned-squad': 512}
pretrained_init_configuration = {'bert-base-cased': {'do_lower_case': False}, 'bert-base-cased-finetuned-mrpc': {'do_lower_case': False}, 'bert-base-chinese': {'do_lower_case': False}, 'bert-base-german-cased': {'do_lower_case': False}, 'bert-base-multilingual-cased': {'do_lower_case': False}, 'bert-base-multilingual-uncased': {'do_lower_case': True}, 'bert-base-uncased': {'do_lower_case': True}, 'bert-large-cased': {'do_lower_case': False}, 'bert-large-cased-whole-word-masking': {'do_lower_case': False}, 'bert-large-cased-whole-word-masking-finetuned-squad': {'do_lower_case': False}, 'bert-large-uncased': {'do_lower_case': True}, 'bert-large-uncased-whole-word-masking': {'do_lower_case': True}, 'bert-large-uncased-whole-word-masking-finetuned-squad': {'do_lower_case': True}}
pretrained_vocab_files_map = {'vocab_file': {'bert-base-cased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt', 'bert-base-cased-finetuned-mrpc': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt', 'bert-base-chinese': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt', 'bert-base-german-cased': 'https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt', 'bert-base-multilingual-cased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt', 'bert-base-multilingual-uncased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt', 'bert-base-uncased': '../pretrain/BERT/bert-base-uncased-vocab.txt', 'bert-large-cased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt', 'bert-large-cased-whole-word-masking': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt', 'bert-large-cased-whole-word-masking-finetuned-squad': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt', 'bert-large-uncased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt', 'bert-large-uncased-whole-word-masking': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt', 'bert-large-uncased-whole-word-masking-finetuned-squad': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt'}}
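These class-level maps can be inspected directly; for a given shortcut name they give the vocabulary location, the default init overrides, and the maximum input size:

    name = "bert-base-uncased"
    print(BertTokenizer.pretrained_vocab_files_map["vocab_file"][name])
    print(BertTokenizer.pretrained_init_configuration[name])  # {'do_lower_case': True}
    print(BertTokenizer.max_model_input_sizes[name])          # 512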
save_vocabulary(vocab_path)[source]

Save the tokenizer vocabulary to a directory or file.
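A short sketch (the directory name is illustrative); when a directory is passed, the file name is taken from vocab_files_names ('vocab.txt'):

    import os

    os.makedirs("saved_tokenizer", exist_ok=True)
    saved = tokenizer.save_vocabulary("saved_tokenizer")
    print(saved)  # e.g. ('saved_tokenizer/vocab.txt',)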

vocab_files_names = {'vocab_file': 'vocab.txt'}
property vocab_size

Size of the base vocabulary (without the added tokens).
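For the toy vocabulary above, for instance:

    print(tokenizer.vocab_size)  # 8: base vocabulary only, excluding any added tokens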