Using Hugging Face Tokenizers¶
Loading a Tokenizer¶
In this notebook, we'll explore Hugging Face's tokenizers by using a pretrained model. Hugging Face has many tokenizers available that have already been trained for specific models and tasks!
from transformers import AutoTokenizer

# Choose a pretrained tokenizer to use
my_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
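As a quick sanity check, we can look at which tokenizer class AutoTokenizer resolved to for this checkpoint and how large its vocabulary is (the values in the comments are illustrative):
# Inspect the tokenizer class and its vocabulary size
print(type(my_tokenizer).__name__)  # e.g. BertTokenizerFast
print(my_tokenizer.vocab_size)      # e.g. 28996 for bert-base-cased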
Encoding: Text to Tokens¶
Tokens: String Representations¶
# Simple method for getting tokens from text
raw_text = '''Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!'''
tokens = my_tokenizer.tokenize(raw_text)
print(tokens)
['Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!']
# This method also returns special tokens depending on the pretrained tokenizer
detailed_tokens = my_tokenizer(raw_text).tokens()
print(detailed_tokens)
['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
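The '##' prefix marks WordPiece continuation pieces. If we ever need to turn a list of string tokens back into readable text, convert_tokens_to_string rejoins them (a small illustrative check):
# Rejoin the subword pieces back into plain text
print(my_tokenizer.convert_tokens_to_string(tokens))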
Tokens: Integer ID Representations¶
# One way to get the tokens as integer IDs
print(my_tokenizer.encode(raw_text))
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
print(detailed_tokens)
# Tokenizer method to get the IDs if we already have the tokens as strings
detailed_ids = my_tokenizer.convert_tokens_to_ids(detailed_tokens)
print(detailed_ids)
['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
Another approach looks a little more complex, but it can be useful when working with tokenizers for certain tasks.
# Returns an object that has a few different keys available
my_tokenizer(raw_text)
{'input_ids': [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Focus on `input_ids`, which holds the IDs associated with the tokens
print(my_tokenizer(raw_text).input_ids)
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]
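The two routes agree: .encode() returns the same list as the input_ids from calling the tokenizer directly. The call syntax also handles batches, padding, and truncation, which is what models typically consume (a small sketch; the example sentences are made up):
# .encode() and the call syntax produce the same IDs for the same text
assert my_tokenizer.encode(raw_text) == my_tokenizer(raw_text).input_ids

# The call syntax also accepts a batch of texts and can pad them to equal length
batch = my_tokenizer(
    ["A short sentence.", "A second, slightly longer sentence!"],
    padding=True,
    truncation=True,
)
print(batch["attention_mask"])  # 0s mark the padding positions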
Decoding: Tokens to Text¶
Of course, we can also use the tokenizer to go the other way: from token IDs back to tokens and to text!
# Integer IDs for tokens
ids = my_tokenizer.encode(raw_text)
# The inverse of the .encode() method: .decode()
my_tokenizer.decode(ids)
"[CLS] Rory's shoes are magenta and so are Corey's but they aren't nearly as dark! [SEP]"
# To ignore special tokens (depending on pretrained tokenizer)
my_tokenizer.decode(ids, skip_special_tokens=True)
"Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!"
# List of tokens as strings instead of one long string
my_tokenizer.convert_ids_to_tokens(ids)
['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
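If we're decoding several sequences at once, batch_decode works the same way (a small illustrative call reusing the IDs from above):
# Decode multiple ID sequences in one call
my_tokenizer.batch_decode([ids, ids[:5]], skip_special_tokens=True)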
A Note on the Unknown¶
One thing to consider is what happens when part of a string falls outside of the tokenizer's vocabulary: it gets mapped to an "unknown" token. Unknown tokens are typically represented with [UNK] or some similar variant.
phrase = '🥱 the dog next door kept barking all night!!'
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))
🥱 the dog next door kept barking all night!!
['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept', 'barking', 'all', 'night', '!', '!', '[SEP]']
[CLS] [UNK] the dog next door kept barking all night!! [SEP]
phrase = '''wow my dad thought mcdonalds sold tacos \N{SKULL}'''
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))
wow my dad thought mcdonalds sold tacos 💀
['[CLS]', 'w', '##ow', 'my', 'dad', 'thought', 'm', '##c', '##don', '##ald', '##s', 'sold', 'ta', '##cos', '[UNK]', '[SEP]']
[CLS] wow my dad thought mcdonalds sold tacos [UNK] [SEP]
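If we want to detect unknowns programmatically, we can compare the encoded IDs against the tokenizer's unk_token_id (a small illustrative check on the IDs from the last example, where the skull emoji became [UNK]):
# Find positions where the tokenizer fell back to the unknown token
unk_positions = [i for i, token_id in enumerate(ids) if token_id == my_tokenizer.unk_token_id]
print(unk_positions)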
More Properties of Hugging Face's Tokenizers¶
model_names = (
    "bert-base-cased",
    "xlm-roberta-base",
    "google/pegasus-xsum",
    "allenai/longformer-base-4096",
)

model_tokenizers = {
    model_name: AutoTokenizer.from_pretrained(model_name) for model_name in model_names
}
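Before looking at individual properties, it's worth seeing that the same sentence can be split quite differently by each of these tokenizers (the sample sentence is arbitrary):
# Compare how each pretrained tokenizer splits the same text
sample = "Tokenization underpins every large language model."
for model_name, temp_tokenizer in model_tokenizers.items():
    print(f"{model_name}: {temp_tokenizer.tokenize(sample)}")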
model_max_length¶
for model_name, temp_tokenizer in model_tokenizers.items():
    max_length = temp_tokenizer.model_max_length
    print(f"{model_name}\n\tmax length: {max_length}")
    print("\n")
bert-base-cased
    max length: 512

xlm-roberta-base
    max length: 512

google/pegasus-xsum
    max length: 512

allenai/longformer-base-4096
    max length: 4096
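model_max_length matters when encoding long inputs: with truncation enabled, the tokenizer caps the sequence at that length (a small sketch using a repeated dummy string):
# Inputs longer than model_max_length are cut off when truncation is enabled
long_text = "tokenizers are neat " * 1000
encoded = model_tokenizers["bert-base-cased"](long_text, truncation=True)
print(len(encoded.input_ids))  # capped at 512 for bert-base-cased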
Special Tokens¶
We've already mentioned special tokens like the "unknown" token. Different models mark special tokens in different ways, and not every model defines every special token, since that depends on the task the model was trained for.
for model_name, temp_tokenizer in model_tokenizers.items():
    special_tokens = temp_tokenizer.all_special_tokens
    print(f"{model_name}\n\tspecial tokens: {special_tokens}")
    print("\n")
bert-base-cased
    special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

xlm-roberta-base
    special tokens: ['<s>', '</s>', '<unk>', '<pad>', '<mask>']

google/pegasus-xsum
    special tokens: ['</s>', '<unk>', '<pad>', '<mask_2>', '<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>', '<unk_63>', '<unk_64>', '<unk_65>', '<unk_66>', '<unk_67>', '<unk_68>', '<unk_69>', '<unk_70>', '<unk_71>', '<unk_72>', '<unk_73>', '<unk_74>', '<unk_75>', '<unk_76>', '<unk_77>', '<unk_78>', '<unk_79>', '<unk_80>', '<unk_81>', '<unk_82>', '<unk_83>', '<unk_84>', '<unk_85>', '<unk_86>', '<unk_87>', '<unk_88>', '<unk_89>', '<unk_90>', '<unk_91>', '<unk_92>', '<unk_93>', '<unk_94>', '<unk_95>', '<unk_96>', '<unk_97>', '<unk_98>', '<unk_99>', '<unk_100>', '<unk_101>', '<unk_102>']

allenai/longformer-base-4096
    special tokens: ['<s>', '</s>', '<unk>', '<pad>', '<mask>']
You can also access the specific token you're interested in to see its representation.
model_tokenizers["bert-base-cased"].unk_token
'[UNK]'
for model_name, temp_tokenizer in model_tokenizers.items():
    print(f"{model_name}")
    print(f"\tUnknown: \n\t\t{temp_tokenizer.unk_token=}")
    print(f"\tBeginning of Sequence: \n\t\t{temp_tokenizer.bos_token=}")
    print(f"\tEnd of Sequence: \n\t\t{temp_tokenizer.eos_token=}")
    print(f"\tMask: \n\t\t{temp_tokenizer.mask_token=}")
    print(f"\tSentence Separator: \n\t\t{temp_tokenizer.sep_token=}")
    print(f"\tClass of Input: \n\t\t{temp_tokenizer.cls_token=}")
    print("\n")
bert-base-cased
    Unknown: temp_tokenizer.unk_token='[UNK]'
    Beginning of Sequence: temp_tokenizer.bos_token=None
    End of Sequence: temp_tokenizer.eos_token=None
    Mask: temp_tokenizer.mask_token='[MASK]'
    Sentence Separator: temp_tokenizer.sep_token='[SEP]'
    Class of Input: temp_tokenizer.cls_token='[CLS]'

xlm-roberta-base
    Unknown: temp_tokenizer.unk_token='<unk>'
    Beginning of Sequence: temp_tokenizer.bos_token='<s>'
    End of Sequence: temp_tokenizer.eos_token='</s>'
    Mask: temp_tokenizer.mask_token='<mask>'
    Sentence Separator: temp_tokenizer.sep_token='</s>'
    Class of Input: temp_tokenizer.cls_token='<s>'

google/pegasus-xsum
    Unknown: temp_tokenizer.unk_token='<unk>'
    Beginning of Sequence: temp_tokenizer.bos_token=None
    End of Sequence: temp_tokenizer.eos_token='</s>'
    Mask: temp_tokenizer.mask_token='<mask_2>'
    Sentence Separator: temp_tokenizer.sep_token=None
    Class of Input: temp_tokenizer.cls_token=None

allenai/longformer-base-4096
    Unknown: temp_tokenizer.unk_token='<unk>'
    Beginning of Sequence: temp_tokenizer.bos_token='<s>'
    End of Sequence: temp_tokenizer.eos_token='</s>'
    Mask: temp_tokenizer.mask_token='<mask>'
    Sentence Separator: temp_tokenizer.sep_token='</s>'
    Class of Input: temp_tokenizer.cls_token='<s>'
Different tokenizers will have different special tokens defined. They might have tokens representing:
- Unknown token
- Beginning of sequence token
- Separator token
- Token used for padding
- Classifier token
- Token used for masking values
Additionally, there may be multiple subtypes of each special token. For example, some tokenizers have multiple different unknown tokens (e.g. <unk> and <unk_2>).
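These special tokens are inserted for us automatically at encoding time; passing add_special_tokens=False shows the difference (an illustrative comparison with a made-up phrase):
# With and without the automatically added special tokens ([CLS]/[SEP] for BERT)
print(model_tokenizers["bert-base-cased"]("hello there").input_ids)
print(model_tokenizers["bert-base-cased"]("hello there", add_special_tokens=False).input_ids)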
Hugging Face Tokenizers Takeaways¶
Different tokenizers can create very different tokens for the same piece of text. When choosing a tokenizer, consider what properties are important to you, such as the maximum length and the special tokens.
If none of the available tokenizers split text the way you need, you can also train a new tokenizer on your own corpus, starting from an existing one, to adapt it to your use case.
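As a rough sketch of that last point: fast tokenizers expose train_new_from_iterator, which learns a new vocabulary from your own text while keeping the original tokenization algorithm (the corpus and vocab_size below are placeholders):
# A minimal sketch: learn a new vocabulary from a small in-memory corpus
corpus = [
    "replace this with sentences from your own domain",
    "the more representative text you provide, the better the vocabulary",
]
new_tokenizer = my_tokenizer.train_new_from_iterator(corpus, vocab_size=1000)
print(new_tokenizer.tokenize("replace this with sentences from your own domain"))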
Documentation on Hugging Face Tokenizers and Models¶
Documentation on some available models:
Created: 2024-10-23