Building Question Answering Datasets¶
In this exercise, we want to construct a dataset for training a question answering model. In general, this process is highly manual: it requires collecting source documents and then writing pairs of questions and answers that can be found in the text. In our exercise, we provide the data plus question and answer pairs, and work through how to construct the dataset from that starting point.
SQuAD¶
One of the most well-studied datasets in question answering is the Stanford Question Answering Dataset (SQuAD), introduced in a paper by Rajpurkar et al. and its follow-up, SQuAD 2.0. Formatting our dataset like SQuAD makes it much easier to use prebuilt models and trainers like those from HuggingFace, so it makes sense to first understand the structure of the dataset. What follows is a single example from the dataset.
{'id': '573387acd058e614000b5cb5',
'title': 'University_of_Notre_Dame',
'context': 'One of the main driving forces in the growth of the University was its football team, the Notre Dame Fighting Irish. Knute Rockne became head coach in 1918. Under Rockne, the Irish would post a record of 105 wins, 12 losses, and five ties. During his 13 years the Irish won three national championships, had five undefeated seasons, won the Rose Bowl in 1925, and produced players such as George Gipp and the "Four Horsemen". Knute Rockne has the highest winning percentage (.881) in NCAA Division I/FBS football history. Rockne\'s offenses employed the Notre Dame Box and his defenses ran a 7–2–2 scheme. The last game Rockne coached was on December 14, 1930 when he led a group of Notre Dame all-stars against the New York Giants in New York City.',
'question': 'In what year did the team lead by Knute Rockne win the Rose Bowl?',
'answers': {'text': ['1925'],
'answer_start': [354]}
}
As we can see, each entry in the dataset is a Python dict object with the structure:
{'id': str,
'title': str,
'context': str,
'question': str,
'answers': dict
}
For our purposes, we can simply ignore the 'title' entry if we wish -- it has no impact on our ability to use the dataset, though it can matter in other settings, as it maps to the title of the article from which the context is drawn. The answers are themselves a dict with structure:
{'text': list(str),
'answer_start': list(int)
}
This is important because there may be multiple annotated answer spans for a single question within a context.
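For instance, in the Notre Dame context above, a question such as "Who became head coach in 1918?" might carry two annotated spans in the same entry. The values below are invented purely for illustration:
# Hypothetical 'answers' entry with two annotated spans for one question
# (the character offsets here are made up for illustration).
{'text': ['Knute Rockne', 'Rockne'],
 'answer_start': [119, 125]}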
Imports¶
Let's start by importing the libraries we're going to need.
from pathlib import Path
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, default_data_collator, pipeline
from transformers.trainer_utils import PredictionOutput, EvalPrediction, speed_metrics
import math
import time
import collections
import numpy as np
from tqdm.notebook import tqdm
Building our Dictionaries¶
Our question-answer pairs are stored in a CSV with columns: "question", "answer" and "filename".
df = pd.read_csv("data/qa.csv")
df.head()
|   | question | answer | filename |
|---|---|---|---|
| 0 | Who is the manufacturer of the product? | Zyxel | CVE-2020-29583.txt |
| 1 | Who reported the vulnerability? | researchers from EYE Netherlands | CVE-2020-29583.txt |
| 2 | What is the vulnerability? | A hardcoded credential vulnerability was ident... | CVE-2020-29583.txt |
| 3 | How do users protect themselves? | we urge users to install the applicable updates | CVE-2020-29583.txt |
| 4 | What products are affected? | firewalls and AP controllers | CVE-2020-29583.txt |
Given the source of our question-answer pairs, we'll want a function that takes the question, answer, and filename, and returns a nicely formatted dictionary.
Since we want each question to have a unique identifier, we will also take that as a function argument so we can use an iterator.
Within the function, we also want to read the context from the provided file and locate the starting index of the provided answer. We can do this with the .find() method of the context string object.
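As a reminder, str.find() returns the character index of the first occurrence of the substring, or -1 if the substring is not present at all:
# str.find() returns the index of the first match, or -1 when the substring is absent
"one two three".find("two")    # 2
"one two three".find("four")   # -1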
def qa_to_squad(question, answer, filename, identifier):
    # Read the context from the file this question/answer pair refers to
    filepath = "data/" + filename
    with open(filepath, "r") as f:
        context = f.read()
    # Locate the answer within the context; .find() returns -1 if it is not present verbatim
    start_location = context.find(answer)
    qa_pair = {
        'id': identifier,
        'title': filepath,
        'context': context,
        'question': question,
        'answers': {'text': [answer],
                    'answer_start': [start_location]}
    }
    return qa_pair
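Before converting the whole file, it's worth spot-checking a single row (a small sketch, illustrative only): if the answer text does not occur verbatim in the context, .find() returns -1, and that pair should be corrected by hand before it goes into the dataset.
# Quick sanity check on the first CSV row
row = df.iloc[0]
sample = qa_to_squad(row['question'], row['answer'], row['filename'], 0)
assert sample['answers']['answer_start'][0] != -1, "answer not found verbatim in the context"
print(sample['answers'])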
Building Dictionaries from Data¶
Now that we can turn our CSV of questions, answers, and filenames into dictionaries, we'll need to construct a dictionary for each question and answer pair in our file.
Let's use the iterrows() method of the dataframe to iterate through each question and answer pair, and use the qa_to_squad function to generate a SQuAD-formatted dictionary. Then we will append the formatted dictionary onto a list. At the end, you will have a list of SQuAD-formatted questions and answers.
In production, you may want to use the uuid library, or take the hash of the question + context + answers to ensure unique identifiers can be reconstructed. To simplify the process, we can use the index of the row from the iterrows() method.
qa_list = list()
for i, row in df.iterrows():
    q = row['question']
    a = row['answer']
    f = row['filename']
    SQuAD_dict = qa_to_squad(q, a, f, i)
    qa_list.append(SQuAD_dict)
Dicts to Datasets¶
Unfortunately, older releases of HuggingFace's datasets library do not have a great way to turn a list of dictionaries into a Dataset (newer versions do add a Dataset.from_list helper). Luckily for us, datasets plays very nicely with pandas, which is happy to construct a DataFrame from a list of dictionaries. First, we'll need to turn our list of SQuAD-formatted questions into a DataFrame. Then, we'll need to use the Dataset.from_pandas method to create our HuggingFace-friendly dataset!
qa_df = pd.DataFrame(data=qa_list)
data = Dataset.from_pandas(qa_df)
print(data[0])
{'id': 0, 'title': 'data/CVE-2020-29583.txt', 'context': 'CVE: CVE-2020-29583 Summary Zyxel has released a patch for the hardcoded credential vulnerability of firewalls and AP controllers recently reported by researchers from EYE Netherlands. Users are advised to install the applicable firmware updates for optimal protection. What is the vulnerability? A hardcoded credential vulnerability was identified in the “zyfwp” user account in some Zyxel firewalls and AP controllers. The account was designed to deliver automatic firmware updates to connected access points through FTP. What versions are vulnerable—and what should you do? After a thorough investigation, we’ve identified the vulnerable products and are releasing firmware patches to address the issue, as shown in the table below. For optimal protection, we urge users to install the applicable updates. For those not listed, they are not affected. Contact your local Zyxel support team if you require further assistance or visit our forum for more information. Got a question or a tipoff? Please contact your local service rep for further information or assistance. If you’ve found a vulnerability, we want to work with you to fix it—contact security@zyxel.com.tw and we’ll get right back to you. Acknowledgment Thanks to Niels Teusink at EYE for reporting the issue to us. Revision history 2020-12-23: Initial release 2020-12-24: Updated the acknowledgement section 2021-01-04: Updated the patch schedule for AP controllers 2021-01-08: Added the forum link', 'question': 'Who is the manufacturer of the product?', 'answers': {'answer_start': [30], 'text': ['Zyxel']}}
Saving our Dataset¶
Having done all this hard work, the last thing we want to do is reconstruct our dataset from scratch at run time. To avoid that, call the .save_to_disk() method on your Dataset object.
data.save_to_disk("qa_data.hf")
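When you need the dataset again, you can reload it directly rather than rebuilding it from the CSV (a minimal sketch using the load_from_disk helper from datasets):
from datasets import load_from_disk

# Reload the dataset saved above
data = load_from_disk("qa_data.hf")
print(data)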
Fine-Tuning DistilBERT on our Data¶
Now that we have our dataset, let's fine-tune a pretrained model on it! This code is written for you, but we're using HuggingFace's transformers library and AutoModelForQuestionAnswering to do the training here.
# Load the tokenizer for DistilBERT
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# Load the model
# Note: This will throw warnings, which is expected!
model = AutoModelForQuestionAnswering.from_pretrained('distilbert-base-uncased')
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias'] - This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
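These warnings are expected: the masked-language-modeling head used during pretraining is discarded, and a fresh question-answering head (qa_outputs) is randomly initialized -- which is exactly the part we are about to train. If you're curious, you can inspect that head directly (a small sketch; the 768 input size is DistilBERT's hidden dimension):
# The newly initialized QA head: maps each token's hidden state to a start logit and an end logit
print(model.qa_outputs)
# Linear(in_features=768, out_features=2, bias=True)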
# The Trainer subclass here is lightly modified from HuggingFace
# Original source at https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/trainer_qa.py
class QuestionAnsweringTrainer(Trainer):
def __init__(self, *args, post_process_function=None, **kwargs):
super().__init__(*args, **kwargs)
self.post_process_function = post_process_function
def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"):
predict_dataloader = self.get_test_dataloader(predict_dataset)
# Temporarily disable metric computation, we will do it in the loop here.
compute_metrics = self.compute_metrics
self.compute_metrics = None
eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
start_time = time.time()
try:
output = eval_loop(
predict_dataloader,
description="Prediction",
# No point gathering the predictions if there are no metrics, otherwise we defer to
# self.args.prediction_loss_only
prediction_loss_only=True if compute_metrics is None else None,
ignore_keys=ignore_keys,
metric_key_prefix=metric_key_prefix,
)
finally:
self.compute_metrics = compute_metrics
total_batch_size = self.args.eval_batch_size * self.args.world_size
if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:
start_time += output.metrics[f"{metric_key_prefix}_jit_compilation_time"]
output.metrics.update(
speed_metrics(
metric_key_prefix,
start_time,
num_samples=output.num_samples,
num_steps=math.ceil(output.num_samples / total_batch_size),
)
)
if self.post_process_function is None or self.compute_metrics is None:
return output
predictions = self.post_process_function(predict_examples, predict_dataset, output.predictions, "predict")
metrics = self.compute_metrics(predictions)
# Prefix all keys with metric_key_prefix + '_'
for key in list(metrics.keys()):
if not key.startswith(f"{metric_key_prefix}_"):
metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key)
metrics.update(output.metrics)
return PredictionOutput(predictions=predictions.predictions, label_ids=predictions.label_ids, metrics=metrics)
# Training preprocessing
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding. Only the context (the second sequence)
    # is truncated, never the question, and everything is padded out to max_length.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=512,
        padding="max_length",
        return_offsets_mapping=True
    )
    # The offset mappings will give us a map from token to character position in the original
    # context. This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")
    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []
    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        # Grab the sequence ids of the example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        answers = examples["answers"][i]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])
            # Move token_start_index and token_end_index to the boundaries of the context
            # (sequence_ids is 1 for context tokens, 0 for question tokens, None for special tokens).
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1
            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)
    return tokenized_examples
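If the sequence_ids() call looks mysterious, here is a small sketch of what it reports for a question/context pair: None for special tokens, 0 for question tokens, and 1 for context tokens (the exact number of tokens depends on the tokenizer).
# Inspect which tokens belong to the question and which to the context
enc = tokenizer("Who made it?", "Zyxel made the firewall.")
print(enc.sequence_ids(0))
# e.g. [None, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, None]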
def postprocess_qa_predictions(
examples,
features,
predictions,
version_2_with_negative = False,
n_best_size = 20,
max_answer_length = 30,
null_score_diff_threshold = 0.0,
):
"""
Post-processes the predictions of a question-answering model to convert them to answers that are substrings of the
original contexts. This is the base postprocessing functions for models that only return start and end logits.
Args:
examples: The non-preprocessed dataset (see the main script for more information).
features: The processed dataset (see the main script for more information).
predictions (:obj:`Tuple[np.ndarray, np.ndarray]`):
The predictions of the model: two arrays containing the start logits and the end logits respectively. Its
first dimension must match the number of elements of :obj:`features`.
version_2_with_negative (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the underlying dataset contains examples with no answers.
n_best_size (:obj:`int`, `optional`, defaults to 20):
The total number of n-best predictions to generate when looking for an answer.
max_answer_length (:obj:`int`, `optional`, defaults to 30):
The maximum length of an answer that can be generated. This is needed because the start and end predictions
are not conditioned on one another.
null_score_diff_threshold (:obj:`float`, `optional`, defaults to 0):
The threshold used to select the null answer: if the best answer has a score that is less than the score of
the null answer minus this threshold, the null answer is selected for this example (note that the score of
the null answer for an example giving several features is the minimum of the scores for the null answer on
each feature: all features must be aligned on the fact they `want` to predict a null answer).
Only useful when :obj:`version_2_with_negative` is :obj:`True`.
"""
if len(predictions) != 2:
raise ValueError("`predictions` should be a tuple with two elements (start_logits, end_logits).")
all_start_logits, all_end_logits = predictions
if len(predictions[0]) != len(features):
raise ValueError(f"Got {len(predictions[0])} predictions and {len(features)} features.")
# Build a map example to its corresponding features.
example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
features_per_example[example_id_to_index[feature["example_id"]]].append(i)
# The dictionaries we have to fill.
all_predictions = collections.OrderedDict()
all_nbest_json = collections.OrderedDict()
if version_2_with_negative:
scores_diff_json = collections.OrderedDict()
# Let's loop over all the examples!
for example_index, example in enumerate(tqdm(examples)):
# Those are the indices of the features associated to the current example.
feature_indices = features_per_example[example_index]
min_null_prediction = None
prelim_predictions = []
# Looping through all the features associated to the current example.
for feature_index in feature_indices:
# We grab the predictions of the model for this feature.
start_logits = all_start_logits[feature_index]
end_logits = all_end_logits[feature_index]
# This is what will allow us to map some the positions in our logits to span of texts in the original
# context.
offset_mapping = features[feature_index]["offset_mapping"]
# Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context
# available in the current feature.
token_is_max_context = features[feature_index].get("token_is_max_context", None)
# Update minimum null prediction.
feature_null_score = start_logits[0] + end_logits[0]
if min_null_prediction is None or min_null_prediction["score"] > feature_null_score:
min_null_prediction = {
"offsets": (0, 0),
"score": feature_null_score,
"start_logit": start_logits[0],
"end_logit": end_logits[0],
}
# Go through all possibilities for the `n_best_size` greater start and end logits.
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
for start_index in start_indexes:
for end_index in end_indexes:
# Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
# to part of the input_ids that are not in the context.
if (
start_index >= len(offset_mapping)
or end_index >= len(offset_mapping)
or offset_mapping[start_index] is None
or len(offset_mapping[start_index]) < 2
or offset_mapping[end_index] is None
or len(offset_mapping[end_index]) < 2
):
continue
# Don't consider answers with a length that is either < 0 or > max_answer_length.
if end_index < start_index or end_index - start_index + 1 > max_answer_length:
continue
# Don't consider answer that don't have the maximum context available (if such information is
# provided).
if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False):
continue
prelim_predictions.append(
{
"offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]),
"score": start_logits[start_index] + end_logits[end_index],
"start_logit": start_logits[start_index],
"end_logit": end_logits[end_index],
}
)
if version_2_with_negative and min_null_prediction is not None:
# Add the minimum null prediction
prelim_predictions.append(min_null_prediction)
null_score = min_null_prediction["score"]
# Only keep the best `n_best_size` predictions.
predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size]
# Add back the minimum null prediction if it was removed because of its low score.
if (
version_2_with_negative
and min_null_prediction is not None
and not any(p["offsets"] == (0, 0) for p in predictions)
):
predictions.append(min_null_prediction)
# Use the offsets to gather the answer text in the original context.
context = example["context"]
for pred in predictions:
offsets = pred.pop("offsets")
pred["text"] = context[offsets[0] : offsets[1]]
# In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
# failure.
if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""):
predictions.insert(0, {"text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0})
# Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using
# the LogSumExp trick).
scores = np.array([pred.pop("score") for pred in predictions])
exp_scores = np.exp(scores - np.max(scores))
probs = exp_scores / exp_scores.sum()
# Include the probabilities in our predictions.
for prob, pred in zip(probs, predictions):
pred["probability"] = prob
# Pick the best prediction. If the null answer is not possible, this is easy.
if not version_2_with_negative:
all_predictions[example["id"]] = predictions[0]["text"]
else:
# Otherwise we first need to find the best non-empty prediction.
i = 0
while predictions[i]["text"] == "":
i += 1
best_non_null_pred = predictions[i]
# Then we compare to the null prediction using the threshold.
score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"]
scores_diff_json[example["id"]] = float(score_diff) # To be JSON-serializable.
if score_diff > null_score_diff_threshold:
all_predictions[example["id"]] = ""
else:
all_predictions[example["id"]] = best_non_null_pred["text"]
# Make `predictions` JSON-serializable by casting np.float back to float.
all_nbest_json[example["id"]] = [
{k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()}
for pred in predictions
]
return all_predictions
# Post-processing:
def post_processing_function(examples, features, predictions, stage="eval"):
    # Post-processing: we match the start logits and end logits to answers in the original context.
    # Our dataset has no unanswerable questions, so we keep SQuAD v1-style defaults here.
    predictions = postprocess_qa_predictions(
        examples=examples,
        features=features,
        predictions=predictions,
        version_2_with_negative=False,
        n_best_size=20,
        max_answer_length=30,
        null_score_diff_threshold=0.0,
    )
    # Format the result to the format the metric expects.
    formatted_predictions = [{"id": str(k), "prediction_text": v} for k, v in predictions.items()]
    references = [{"id": str(ex["id"]), "answers": ex["answers"]} for ex in examples]
    return EvalPrediction(predictions=formatted_predictions, label_ids=references)
data = data.map(prepare_train_features, batched=True)
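As a quick sanity check (a sketch, not part of the original pipeline), we can decode the span that prepare_train_features labeled for the first example and compare it with the stored answer text; the uncased tokenizer lowercases the text, but the span should otherwise line up:
# Decode the labeled answer span for the first example
example = data[0]
labeled_span = example["input_ids"][example["start_positions"]:example["end_positions"] + 1]
print(tokenizer.decode(labeled_span))   # e.g. "zyxel"
print(example["answers"]["text"][0])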
# Set up our trainer
trainer = QuestionAnsweringTrainer(
model=model,
train_dataset=data,
tokenizer=tokenizer,
data_collator=default_data_collator,
post_process_function=post_processing_function
)
# Run the trainer
trainer.train()
# Save our model
trainer.save_model("./ft-distilbert")
| Step | Training Loss |
|------|---------------|
# Let's evaluate our model!
# Specify an input question and context
question = "What can an attacker do with XSS?"
with open("./data/xss.txt", "r") as f:
context = f.read()
# Use HuggingFace pipeline to answer the question
question_answerer = pipeline("question-answering", model="./ft-distilbert")
question_answerer(question=question, context=context)
{'score': 7.277844997588545e-05, 'start': 7834, 'end': 7875, 'answer': 'the payload to modify their own profiles,'}
Hopefully your answer was satisfactory! If not, don't worry too much: our dataset was extremely small and we only trained for three epochs, so some issues are to be expected. This is why so many LLM datasets are so big!