Using Hugging Face Tokenizers¶
Loading a Tokenizer¶
In this notebook, we'll explore Hugging Face's tokenizers by using a pretrained model. Hugging Face has many tokenizers available that have already been trained for specific models and tasks!
from transformers import AutoTokenizer

# Choose a pretrained tokenizer to use
my_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
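As a quick sanity check, we can look at which tokenizer class AutoTokenizer resolved to for this checkpoint and how large its vocabulary is (the values in the comments are illustrative):
# Inspect the tokenizer class and its vocabulary size
print(type(my_tokenizer).__name__)  # e.g. BertTokenizerFast
print(my_tokenizer.vocab_size)      # e.g. 28996 for bert-base-cased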
Encoding: Text to Tokens¶
Tokens: String Representations¶
# Simple method for getting tokens from text
raw_text = '''Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!'''
tokens = my_tokenizer.tokenize(raw_text)
print(tokens)
['Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!']
# This method also returns special tokens depending on the pretrained tokenizer
detailed_tokens = my_tokenizer(raw_text).tokens()
print(detailed_tokens)
['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
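The '##' prefix marks WordPiece continuation pieces. If we ever need to turn a list of string tokens back into readable text, convert_tokens_to_string rejoins them (a small illustrative check):
# Rejoin the subword pieces back into plain text
print(my_tokenizer.convert_tokens_to_string(tokens))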
Tokens: Integer ID Representations¶
# One way to get the tokens as integer IDs
print(my_tokenizer.encode(raw_text))
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
print(detailed_tokens)
# Tokenizer method to get the IDs if we already have the tokens as strings
detailed_ids = my_tokenizer.convert_tokens_to_ids(detailed_tokens)
print(detailed_ids)
['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
Another approach looks a little more complex, but it can be useful when working with tokenizers for certain tasks.
# Returns an object that has a few different keys available
my_tokenizer(raw_text)
{'input_ids': [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Focus on `input_ids`, which holds the IDs associated with the tokens
print(my_tokenizer(raw_text).input_ids)
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
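The two routes agree: .encode() returns the same list as the input_ids from calling the tokenizer directly. The call syntax also handles batches, padding, and truncation, which is what models typically consume (a small sketch; the example sentences are made up):
# .encode() and the call syntax produce the same IDs for the same text
assert my_tokenizer.encode(raw_text) == my_tokenizer(raw_text).input_ids

# The call syntax also accepts a batch of texts and can pad them to equal length
batch = my_tokenizer(
    ["A short sentence.", "A second, slightly longer sentence!"],
    padding=True,
    truncation=True,
)
print(batch["attention_mask"])  # 0s mark the padding positions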
Decoding: Tokens to Text¶
Of course, we can also use the tokenizer to go the other way: from token IDs back to tokens and to text!
# Integer IDs for tokens
ids = my_tokenizer.encode(raw_text)
# The inverse of the .encode() method: .decode()
my_tokenizer.decode(ids)
"[CLS] Rory's shoes are magenta and so are Corey's but they aren't nearly as dark! [SEP]"
# To ignore special tokens (depending on pretrained tokenizer)
my_tokenizer.decode(ids, skip_special_tokens=True)
"Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!"
# List of tokens as strings instead of one long string
my_tokenizer.convert_ids_to_tokens(ids)
['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
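If we're decoding several sequences at once, batch_decode works the same way (a small illustrative call reusing the IDs from above):
# Decode multiple ID sequences in one call
my_tokenizer.batch_decode([ids, ids[:5]], skip_special_tokens=True)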
A Note on the Unknown¶
One thing to consider is what happens when part of a string falls outside of the tokenizer's vocabulary: it gets mapped to an "unknown" token. Unknown tokens are typically represented with [UNK] or some similar variant.
phrase = '🥱 the dog next door kept barking all night!!'
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))
🥱 the dog next door kept barking all night!!
['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept', 'barking', 'all', 'night', '!', '!', '[SEP]']
[CLS] [UNK] the dog next door kept barking all night!! [SEP]
phrase = '''wow my dad thought mcdonalds sold tacos \N{SKULL}'''
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))
wow my dad thought mcdonalds sold tacos 💀
['[CLS]', 'w', '##ow', 'my', 'dad', 'thought', 'm', '##c', '##don', '##ald', '##s', 'sold', 'ta', '##cos', '[UNK]', '[SEP]']
[CLS] wow my dad thought mcdonalds sold tacos [UNK] [SEP]
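If we want to detect unknowns programmatically, we can compare the encoded IDs against the tokenizer's unk_token_id (a small illustrative check on the IDs from the last example, where the skull emoji became [UNK]):
# Find positions where the tokenizer fell back to the unknown token
unk_positions = [i for i, token_id in enumerate(ids) if token_id == my_tokenizer.unk_token_id]
print(unk_positions)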
More Properties of Hugging Face's Tokenizers¶
model_names = (
    "bert-base-cased",
    "xlm-roberta-base",
    "google/pegasus-xsum",
    "allenai/longformer-base-4096",
)

model_tokenizers = {
    model_name: AutoTokenizer.from_pretrained(model_name) for model_name in model_names
}
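Before looking at individual properties, it's worth seeing that the same sentence can be split quite differently by each of these tokenizers (the sample sentence is arbitrary):
# Compare how each pretrained tokenizer splits the same text
sample = "Tokenization underpins every large language model."
for model_name, temp_tokenizer in model_tokenizers.items():
    print(f"{model_name}: {temp_tokenizer.tokenize(sample)}")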
model_max_length¶
for model_name, temp_tokenizer in model_tokenizers.items():
    max_length = temp_tokenizer.model_max_length
    print(f"{model_name}\n\tmax length: {max_length}")
    print("\n")
bert-base-cased
    max length: 512

xlm-roberta-base
    max length: 512

google/pegasus-xsum
    max length: 512

allenai/longformer-base-4096
    max length: 4096
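model_max_length matters when encoding long inputs: with truncation enabled, the tokenizer caps the sequence at that length (a small sketch using a repeated dummy string):
# Inputs longer than model_max_length are cut off when truncation is enabled
long_text = "tokenizers are neat " * 1000
encoded = model_tokenizers["bert-base-cased"](long_text, truncation=True)
print(len(encoded.input_ids))  # capped at 512 for bert-base-cased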
Special Tokens¶
We've already mentioned special tokens like the "unknown" token. Different models mark special tokens in different ways, and not every model defines every special token, since that depends on the task the model was trained for.
for model_name, temp_tokenizer in model_tokenizers.items():
    special_tokens = temp_tokenizer.all_special_tokens
    print(f"{model_name}\n\tspecial tokens: {special_tokens}")
    print("\n")
bert-base-cased
    special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

xlm-roberta-base
    special tokens: ['<s>', '</s>', '<unk>', '<pad>', '<mask>']

google/pegasus-xsum
    special tokens: ['</s>', '<unk>', '<pad>', '<mask_2>', '<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>', '<unk_63>', '<unk_64>', '<unk_65>', '<unk_66>', '<unk_67>', '<unk_68>', '<unk_69>', '<unk_70>', '<unk_71>', '<unk_72>', '<unk_73>', '<unk_74>', '<unk_75>', '<unk_76>', '<unk_77>', '<unk_78>', '<unk_79>', '<unk_80>', '<unk_81>', '<unk_82>', '<unk_83>', '<unk_84>', '<unk_85>', '<unk_86>', '<unk_87>', '<unk_88>', '<unk_89>', '<unk_90>', '<unk_91>', '<unk_92>', '<unk_93>', '<unk_94>', '<unk_95>', '<unk_96>', '<unk_97>', '<unk_98>', '<unk_99>', '<unk_100>', '<unk_101>', '<unk_102>']

allenai/longformer-base-4096
    special tokens: ['<s>', '</s>', '<unk>', '<pad>', '<mask>']
You can also access the specific token you're interested in to see its representation.
model_tokenizers["bert-base-cased"].unk_token
'[UNK]'
for model_name, temp_tokenizer in model_tokenizers.items():
    print(f"{model_name}")
    print(f"\tUnknown: \n\t\t{temp_tokenizer.unk_token=}")
    print(f"\tBeginning of Sequence: \n\t\t{temp_tokenizer.bos_token=}")
    print(f"\tEnd of Sequence: \n\t\t{temp_tokenizer.eos_token=}")
    print(f"\tMask: \n\t\t{temp_tokenizer.mask_token=}")
    print(f"\tSentence Separator: \n\t\t{temp_tokenizer.sep_token=}")
    print(f"\tClass of Input: \n\t\t{temp_tokenizer.cls_token=}")
    print("\n")
bert-base-cased
    Unknown: temp_tokenizer.unk_token='[UNK]'
    Beginning of Sequence: temp_tokenizer.bos_token=None
    End of Sequence: temp_tokenizer.eos_token=None
    Mask: temp_tokenizer.mask_token='[MASK]'
    Sentence Separator: temp_tokenizer.sep_token='[SEP]'
    Class of Input: temp_tokenizer.cls_token='[CLS]'

xlm-roberta-base
    Unknown: temp_tokenizer.unk_token='<unk>'
    Beginning of Sequence: temp_tokenizer.bos_token='<s>'
    End of Sequence: temp_tokenizer.eos_token='</s>'
    Mask: temp_tokenizer.mask_token='<mask>'
    Sentence Separator: temp_tokenizer.sep_token='</s>'
    Class of Input: temp_tokenizer.cls_token='<s>'

google/pegasus-xsum
    Unknown: temp_tokenizer.unk_token='<unk>'
    Beginning of Sequence: temp_tokenizer.bos_token=None
    End of Sequence: temp_tokenizer.eos_token='</s>'
    Mask: temp_tokenizer.mask_token='<mask_2>'
    Sentence Separator: temp_tokenizer.sep_token=None
    Class of Input: temp_tokenizer.cls_token=None

allenai/longformer-base-4096
    Unknown: temp_tokenizer.unk_token='<unk>'
    Beginning of Sequence: temp_tokenizer.bos_token='<s>'
    End of Sequence: temp_tokenizer.eos_token='</s>'
    Mask: temp_tokenizer.mask_token='<mask>'
    Sentence Separator: temp_tokenizer.sep_token='</s>'
    Class of Input: temp_tokenizer.cls_token='<s>'
Different tokenizers will have different special tokens defined. They might have tokens representing:
- Unknown token
- Beginning of sequence token
- Separator token
- Token used for padding
- Classifier token
- Token used for masking values
Additionally, there may be multiple subtypes of each special token. For example, some tokenizers have multiple different unknown tokens (e.g. <unk> and <unk_2>).
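These special tokens are inserted for us automatically at encoding time; passing add_special_tokens=False shows the difference (an illustrative comparison with a made-up phrase):
# With and without the automatically added special tokens ([CLS]/[SEP] for BERT)
print(model_tokenizers["bert-base-cased"]("hello there").input_ids)
print(model_tokenizers["bert-base-cased"]("hello there", add_special_tokens=False).input_ids)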
Hugging Face Tokenizers Takeaways¶
Different tokenizers can create very different tokens for the same piece of text. When choosing a tokenizer, consider what properties are important to you, such as the maximum length and the special tokens.
If none of the available tokenizers split text the way you need, you can also train a new tokenizer on your own corpus, starting from an existing one, to adapt it to your use case.
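As a rough sketch of that last point: fast tokenizers expose train_new_from_iterator, which learns a new vocabulary from your own text while keeping the original tokenization algorithm (the corpus and vocab_size below are placeholders):
# A minimal sketch: learn a new vocabulary from a small in-memory corpus
corpus = [
    "replace this with sentences from your own domain",
    "the more representative text you provide, the better the vocabulary",
]
new_tokenizer = my_tokenizer.train_new_from_iterator(corpus, vocab_size=1000)
print(new_tokenizer.tokenize("replace this with sentences from your own domain"))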
Documentation on Hugging Face Tokenizers and Models¶
Documentation on some available models:
Created: 2024-10-23