Hugging Face Intro¶
Hugging Face is a company making waves in the technology world with its tools for understanding and using human language in computers. It offers everything from tokenizers, which help computers make sense of text, to a huge variety of ready-to-go language models, and even a treasure trove of datasets suited for language tasks.
Tokenizers¶
These work like a translator, converting the words we use into smaller parts and creating a secret code that computers can understand and work with.
HuggingFace tokenizers help us break down text into smaller, manageable pieces called tokens. These tokenizers are easy to use and also remarkably fast due to their use of the Rust programming language.
Tokenization: It's like cutting a sentence into individual pieces, such as words or characters, to make it easier to analyze.
Tokens: These are the pieces you get after cutting up text during tokenization, kind of like individual Lego blocks that can be words, parts of words, or even single letters. These tokens are converted to numerical values for models to understand.
Pre-trained Model: This is a ready-made model that has been previously taught with a lot of data.
Uncased: This means that the model treats uppercase and lowercase letters as the same.
from transformers import BertTokenizer
# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# See how many tokens are in the vocabulary
tokenizer.vocab_size
30522
# Tokenize the sentence
tokens = tokenizer.tokenize("I heart Generative AI")
# Print the tokens
print(tokens)
# ['i', 'heart', 'genera', '##tive', 'ai']
# Show the token ids assigned to each token
print(tokenizer.convert_tokens_to_ids(tokens))
# [1045, 2540, 11416, 6024, 9932]
['i', 'heart', 'genera', '##tive', 'ai']
[1045, 2540, 11416, 6024, 9932]
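Calling the tokenizer directly on a sentence does the whole job in one step: it adds BERT's special [CLS] and [SEP] tokens and returns the token ids, and tokenizer.decode maps the ids back to text (lowercased, because the model is uncased). A minimal sketch using the same tokenizer as above; the exact ids shown in the comments are what you should expect, but treat them as illustrative:
# Encode a full sentence: special tokens are added and ids are returned in one call
encoded = tokenizer("I heart Generative AI")
print(encoded["input_ids"])
# Expect something like [101, 1045, 2540, 11416, 6024, 9932, 102],
# where 101 and 102 are the [CLS] and [SEP] special tokens
# Decode the ids back into text
print(tokenizer.decode(encoded["input_ids"]))
# '[CLS] i heart generative ai [SEP]' -- note the lowercasing from the uncased model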
Models¶
These are like the brain for computers, allowing them to learn and make decisions based on information they've been fed.
Hugging Face models provide a quick way to get started using models trained by the community. With only a few lines of code, you can load a pre-trained model and start using it on tasks such as sentiment analysis.
from transformers import BertForSequenceClassification, BertTokenizer
import torch
# Load a pre-trained sentiment analysis model
model_name = "textattack/bert-base-uncased-imdb"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Tokenize the input sequence
tokenizer = BertTokenizer.from_pretrained(model_name)
inputs = tokenizer("I love Generative AI", return_tensors="pt")
# Make prediction
with torch.no_grad():
    outputs = model(**inputs).logits
probabilities = torch.nn.functional.softmax(outputs, dim=1)
predicted_class = torch.argmax(probabilities)
# Display sentiment result
if predicted_class == 1:
    print(f"Sentiment: Positive ({probabilities[0][1] * 100:.2f}%)")
else:
    print(f"Sentiment: Negative ({probabilities[0][0] * 100:.2f}%)")
Sentiment: Positive (88.68%)
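For quick experiments, the pipeline API wraps the tokenizer, model, and post-processing into a single callable. A short sketch with the same checkpoint; note that the label string in the output comes from the checkpoint's config, so it may show up as 'LABEL_1' rather than a human-readable name:
from transformers import pipeline

# Bundle tokenizer + model + post-processing into one object
classifier = pipeline("sentiment-analysis", model="textattack/bert-base-uncased-imdb")
print(classifier("I love Generative AI"))
# e.g. [{'label': 'LABEL_1', 'score': 0.89}] -- label names depend on the model's config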
Datasets¶
Think of datasets as textbooks for computer models. They are collections of information that models study to learn and improve.
The HuggingFace Datasets library is a powerful tool for managing a variety of data types, like text and images, efficiently and easily. It is incredibly fast and light on memory, making it great for handling big projects without any hassle.
from datasets import load_dataset
from IPython.display import HTML, display
# Load the IMDB dataset, which contains movie reviews
# and sentiment labels (positive or negative)
dataset = load_dataset("imdb")
# Fetch a review from the training set
review_number = 42
sample_review = dataset["train"][review_number]
display(HTML(sample_review["text"][:450] + "..."))
With a cast like this, you wonder whether or not the actors and actresses knew exactly what they were getting into. Did they see the script and say, `Hey, Close Encounters of the Third Kind was such a hit that this one can't fail.' Unfortunately, it does. Did they even think to check on the director's credentials...
if sample_review["label"] == 1:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")
Sentiment: Negative
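The loaded object is a DatasetDict with train, test, and unsupervised splits, and each split behaves like a column-addressable table backed by Apache Arrow. A small sketch of the kind of exploration you can do (the 12500 count relies on the IMDB train split being evenly balanced between positive and negative reviews):
# Inspect the available splits and their sizes
print(dataset)
# Column names and types, including the label class names
print(dataset["train"].features)
# Filter down to only the positive training reviews
positive_reviews = dataset["train"].filter(lambda example: example["label"] == 1)
print(len(positive_reviews))  # 12500 -- the train split is evenly balanced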
Trainers¶
Trainers are the coaches for computer models. They help these models get better at their tasks by practicing and providing guidance. HuggingFace Trainers implement the PyTorch training loop for you, so you can focus instead on other aspects of working on the model.
Hugging Face trainers offer a simplified approach to training generative AI models, making it easier to set up and run complex machine learning tasks. This tool wraps up the hard parts, like handling data and carrying out the training process, allowing us to focus on the big picture and achieve better outcomes with our AI endeavors.
Truncating: Shortening longer pieces of text so they fit within a fixed size limit.
Padding: Adding filler tokens to shorter texts so every sequence reaches the same length (both are demonstrated in the short sketch after this list).
Batches: Batches are small, evenly divided parts of data that the AI looks at and learns from each step of the way.
Batch Size: The number of data samples that the machine considers in one go during training.
Epochs: A complete pass through the entire training dataset. The more epochs, the more the computer goes over the material to learn.
Dataset Splits: Dividing the dataset into parts for different uses, such as training the model and testing how well it works.
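Here is a quick, self-contained look at what padding and truncation do before they show up in the training code below; the two sentences and the max_length of 16 are purely illustrative choices:
from transformers import AutoTokenizer

# Load the same uncased tokenizer used in the training code below
demo_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

batch = demo_tokenizer(
    [
        "A short review.",
        "A much, much longer review that keeps going and going, piling on detail after detail until it runs well past the limit.",
    ],
    padding="max_length",  # pad shorter sequences up to max_length
    truncation=True,       # cut longer sequences down to max_length
    max_length=16,
)

for ids in batch["input_ids"]:
    print(len(ids))
# Both sequences come out with exactly 16 ids: the short one padded with [PAD],
# the long one truncated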
from transformers import (
DistilBertForSequenceClassification,
DistilBertTokenizer,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
model = DistilBertForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=2
)
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
dataset = load_dataset("imdb")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
training_args = TrainingArguments(
per_device_train_batch_size=64,
output_dir="./results",
learning_rate=2e-5,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
)
trainer.train()
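trainer.train() runs the full fine-tuning loop. Once it finishes, the same Trainer machinery can score the model on held-out data. A minimal sketch; the compute_metrics function and the 1,000-example subsample are assumptions added here just to keep the evaluation short and fast:
import numpy as np

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels) for the evaluation set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

eval_trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_datasets["test"].shuffle(seed=42).select(range(1000)),
    compute_metrics=compute_metrics,
)

print(eval_trainer.evaluate())  # reports accuracy alongside the evaluation loss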
Created: 2024-10-23