xmodaler.tokenization¶
- class xmodaler.tokenization.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, **kwargs)[source]¶
Bases: PreTrainedTokenizer
Constructs a BertTokenizer. BertTokenizer runs end-to-end tokenization: punctuation splitting followed by WordPiece.
- Parameters:
vocab_file – Path to a one-wordpiece-per-line vocabulary file
do_lower_case – Whether to lowercase the input. Only has an effect when do_basic_tokenize=True.
do_basic_tokenize – Whether to do basic tokenization before wordpiece.
max_len – An artificial maximum length to truncate tokenized sequences to; the effective maximum length is always the minimum of this value (if specified) and the underlying BERT model’s maximum sequence length.
never_split – List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.
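The WordPiece stage mentioned above can be sketched as greedy longest-match against the vocabulary. The tiny vocabulary below is hypothetical and purely illustrative; a real tokenizer loads the full vocabulary from vocab_file.

```python
# Hypothetical toy vocabulary; "##" marks a word-continuation piece.
vocab = {"[UNK]", "un", "##aff", "##able", "runs"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match WordPiece over a single basic-tokenized word."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]  # no match: the whole word maps to [UNK]
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```

Basic tokenization (whitespace and punctuation splitting, optional lowercasing) runs first; WordPiece then decomposes each resulting word as shown.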
- add_special_tokens_sentences_pair(token_ids_0, token_ids_1)[source]¶
Adds special tokens to a sequence pair for sequence classification tasks. A BERT sequence pair has the following format: [CLS] A [SEP] B [SEP]
- add_special_tokens_single_sentence(token_ids)[source]¶
Adds special tokens to a sequence for sequence classification tasks. A BERT sequence has the following format: [CLS] X [SEP]
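A minimal sketch of the two documented layouts, using hypothetical ids CLS=101 and SEP=102 (the actual ids come from the vocabulary file):

```python
CLS, SEP = 101, 102  # hypothetical ids; real values depend on vocab_file

def single_with_special_tokens(ids):
    # [CLS] X [SEP]
    return [CLS] + ids + [SEP]

def pair_with_special_tokens(ids_a, ids_b):
    # [CLS] A [SEP] B [SEP]
    return [CLS] + ids_a + [SEP] + ids_b + [SEP]

print(single_with_special_tokens([7, 8]))     # [101, 7, 8, 102]
print(pair_with_special_tokens([7, 8], [9]))  # [101, 7, 8, 102, 9, 102]
```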
- convert_tokens_to_string(tokens)[source]¶
Converts a sequence of tokens (strings) into a single string.
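Detokenization is essentially the inverse of WordPiece splitting: join the tokens with spaces and strip the "##" continuation markers. A sketch of that behavior:

```python
def tokens_to_string(tokens):
    """Join wordpiece tokens, gluing '##' continuation pieces back on."""
    return " ".join(tokens).replace(" ##", "").strip()

print(tokens_to_string(["un", "##aff", "##able", "runs"]))  # unaffable runs
```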
- max_model_input_sizes = {'bert-base-cased': 512, 'bert-base-cased-finetuned-mrpc': 512, 'bert-base-chinese': 512, 'bert-base-german-cased': 512, 'bert-base-multilingual-cased': 512, 'bert-base-multilingual-uncased': 512, 'bert-base-uncased': 512, 'bert-large-cased': 512, 'bert-large-cased-whole-word-masking': 512, 'bert-large-cased-whole-word-masking-finetuned-squad': 512, 'bert-large-uncased': 512, 'bert-large-uncased-whole-word-masking': 512, 'bert-large-uncased-whole-word-masking-finetuned-squad': 512}¶
- pretrained_init_configuration = {'bert-base-cased': {'do_lower_case': False}, 'bert-base-cased-finetuned-mrpc': {'do_lower_case': False}, 'bert-base-chinese': {'do_lower_case': False}, 'bert-base-german-cased': {'do_lower_case': False}, 'bert-base-multilingual-cased': {'do_lower_case': False}, 'bert-base-multilingual-uncased': {'do_lower_case': True}, 'bert-base-uncased': {'do_lower_case': True}, 'bert-large-cased': {'do_lower_case': False}, 'bert-large-cased-whole-word-masking': {'do_lower_case': False}, 'bert-large-cased-whole-word-masking-finetuned-squad': {'do_lower_case': False}, 'bert-large-uncased': {'do_lower_case': True}, 'bert-large-uncased-whole-word-masking': {'do_lower_case': True}, 'bert-large-uncased-whole-word-masking-finetuned-squad': {'do_lower_case': True}}¶
- pretrained_vocab_files_map = {'vocab_file': {'bert-base-cased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt', 'bert-base-cased-finetuned-mrpc': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt', 'bert-base-chinese': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt', 'bert-base-german-cased': 'https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt', 'bert-base-multilingual-cased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt', 'bert-base-multilingual-uncased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt', 'bert-base-uncased': '../pretrain/BERT/bert-base-uncased-vocab.txt', 'bert-large-cased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt', 'bert-large-cased-whole-word-masking': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt', 'bert-large-cased-whole-word-masking-finetuned-squad': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt', 'bert-large-uncased': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt', 'bert-large-uncased-whole-word-masking': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt', 'bert-large-uncased-whole-word-masking-finetuned-squad': 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt'}}¶
- vocab_files_names = {'vocab_file': 'vocab.txt'}¶
- property vocab_size¶
Size of the base vocabulary (without the added tokens).
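The distinction matters when tokens are added after loading: vocab_size reflects only the vocabulary read from vocab_file. A sketch with a hypothetical five-token base vocabulary:

```python
# Hypothetical base vocabulary loaded from vocab_file.
base_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4}
added_tokens = {"[NEW]": 5}  # hypothetical token added after loading

vocab_size = len(base_vocab)  # added tokens are NOT counted
print(vocab_size)  # 5
```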