Wikipedia QA OpenAI example¶
import openai
import os
import json
OPEN_AI_API_KEY = os.getenv("OPENAI_API_KEY")
openai.api_key = OPEN_AI_API_KEY
client = openai.OpenAI()
OpenAI Model Responses without Customization¶
# creating a prompt
question_prompt = """
Who is the owner of twitter?
Answer:
"""
# Use the chat completions endpoint
completion = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{
"role": "system",
"content": "You are a helpful assistant",
},
{
"role": "user",
"content": question_prompt,
},
],
)
Extracting the Model Response¶
json.loads(completion.json())
{'id': 'chatcmpl-8uIdWIIINa0IAxgBK3KqxKQ5NXbVV', 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': 'The CEO and co-founder of Twitter is Jack Dorsey.', 'role': 'assistant', 'function_call': None, 'tool_calls': None}}], 'created': 1708428698, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion', 'system_fingerprint': 'fp_69829325d0', 'usage': {'completion_tokens': 12, 'prompt_tokens': 27, 'total_tokens': 39}}
Extracting Response Text¶
print(completion.choices[0].message)
ChatCompletionMessage(content='The CEO and co-founder of Twitter is Jack Dorsey.', role='assistant', function_call=None, tool_calls=None)
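The cell above prints the whole message object rather than just its text. On the live object, the text is available as `completion.choices[0].message.content`. The sketch below shows the same lookup on a trimmed copy of the dictionary printed earlier, so it runs without an API call; `first_answer` is a hypothetical helper name, not part of the OpenAI library.

```python
# Trimmed copy of the response dictionary shown in the output above.
response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "The CEO and co-founder of Twitter is Jack Dorsey.",
                "role": "assistant",
            },
        }
    ],
    "model": "gpt-3.5-turbo-0125",
}


def first_answer(response: dict) -> str:
    """Return the text of the first choice (hypothetical helper)."""
    return response["choices"][0]["message"]["content"]


answer_text = first_answer(response)
print(answer_text)
```

Note that this answer is out of date, which is the motivation for supplying external data in the next section.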
Get external data¶
import requests
import pandas as pd
from dateutil.parser import parse, ParserError
# Get the Wikipedia page for "2022", since the model's training data ends in 2021
params = {
"action": "query",
"prop": "extracts",
"exlimit": 1,
"titles": "2022",
"explaintext": 1,
"formatversion": 2,
"format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
response_dict["query"]["pages"][0]["extract"].split("\n")[:10]
['2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd year of the 3rd millennium and the 21st century, and the 3rd year of the 2020s decade. ', 'The year 2022 saw the removal of nearly all COVID-19 restrictions and the reopening of international borders in most countries, and the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022, though the year also witnessed numerous natural disasters, including two devastating Atlantic hurricanes (Fiona and Ian), and the most powerful volcano eruption of the century so far. The later part of the year also saw the first public release of ChatGPT by OpenAI starting an arms race in artificial intelligence which increased in intensity into 2023, as well as the collapse of the cryptocurrency exchange FTX.', '2022 was also dominated by wars and armed conflicts. While escalations into the internal conflict in Myanmar and the Tigray War dominated the heightening of tensions within their regions and each caused over 10,000 deaths, 2022 was most notable for the Russian invasion of Ukraine, the largest armed conflict in Europe since World War II. 
The invasion caused the displacement of 15.7 million Ukrainians (8 million internally displaced persons and 7.7 million refugees), and led to international condemnations and sanctions and nuclear threats, the withdrawal of hundreds of companies from Russia, and the exclusion of Russia from major sporting events.', '', '', '== Events ==', '', '', '=== January ===', ' January 1 – The Regional Comprehensive Economic Partnership, the largest free trade area in the world, comes into effect for Australia, Brunei, Cambodia, China, Indonesia, Japan, South Korea, Laos, Malaysia, Myanmar, New Zealand, the Philippines, Singapore, Thailand, and Vietnam.']
Clean the data¶
import pandas as pd
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
df.head()
index | text |
---|---|
0 | 2022 (MMXXII) was a common year starting on Sa... |
1 | The year 2022 saw the removal of nearly all CO... |
2 | 2022 was also dominated by wars and armed conf... |
3 | |
4 | |
def clean_wikipedia_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop empty lines and section headings (lines starting with "==")
    df_cleaned = df.copy()
    df_cleaned = df_cleaned[
        (df_cleaned["text"].str.len() > 0) & (~df_cleaned["text"].str.startswith("=="))
    ]
    return df_cleaned
def parse_dates(df_cleaned: pd.DataFrame) -> pd.DataFrame:
    # In some cases dates are used as headings instead of being part of the
    # text sample; adjust so dated text samples start with dates
    df_cleaned = df_cleaned.copy()
    prefix = ""
    for i, row in df_cleaned.iterrows():
        # If the row already has " – ", it already has the needed date prefix
        if " – " not in row["text"]:
            try:
                # If the row's text is a date, set it as the new prefix
                parse(row["text"])
                prefix = row["text"]
            except ParserError:
                # If the row's text isn't a date, add the prefix. Write through
                # .loc, since the row yielded by iterrows() may be a copy
                df_cleaned.loc[i, "text"] = prefix + " – " + row["text"]
    df_cleaned = df_cleaned[df_cleaned["text"].str.contains(" – ")].reset_index(
        drop=True
    )
    return df_cleaned
df_cleaned = clean_wikipedia_data(df)
df_cleaned = parse_dates(df_cleaned)
df_cleaned.tail()
index | text |
---|---|
177 | December 21–December 26 – A major winter storm... |
178 | December 24 – 2022 Fijian general election: Th... |
179 | December 29 – Brazilian football legend Pelé d... |
180 | December 31 – Former Pope Benedict XVI dies at... |
181 | December 7 – The world population was estimate... |
df_cleaned.to_csv("data/wikipedia_data.csv")
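To make the date-prefix step easier to follow, here is a minimal standard-library sketch of the same idea on a plain list of strings. It is a simplified variant, not the notebook's function: `is_date_heading` and `add_date_prefixes` are hypothetical names, the sample lines are made up, the year 2022 is appended only so the month and day can be parsed unambiguously, and lines that appear before any date heading are dropped rather than given an empty prefix.

```python
from datetime import datetime


def is_date_heading(line: str) -> bool:
    """Return True if the line is only a month and day, e.g. "December 7"."""
    try:
        # Assumption: all headings refer to 2022, the year of the page.
        datetime.strptime(line.strip() + " 2022", "%B %d %Y")
        return True
    except ValueError:
        return False


def add_date_prefixes(lines: list[str]) -> list[str]:
    """Prefix undated lines with the most recent date heading."""
    prefix = ""
    dated = []
    for line in lines:
        if " – " in line:
            # Already starts with its own date
            dated.append(line)
        elif is_date_heading(line):
            # Remember the heading; it is not a text sample itself
            prefix = line.strip()
        elif prefix:
            dated.append(f"{prefix} – {line}")
    return dated


# Made-up sample lines in the same shape as the page text
sample = ["January 1 – Event A happens.", "December 7", "Event B happens."]
result = add_date_prefixes(sample)
print(result)
```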
Convert into embeddings¶
To create our chatbot, we'll need to convert our natural language data into numeric representations that our machine learning model can process. We need these representations to capture the relationships within the data so that the model can recognize patterns and identify the most relevant content.
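As a rough illustration of what "capturing relationships" means, the sketch below compares hand-written three-dimensional vectors using cosine similarity. The vectors and labels are invented for the example; real embeddings from the API have 1,536 dimensions and are produced by the model, not written by hand.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors: the first two point in nearly the same direction
toy_embeddings = {
    "Twitter acquisition": [0.9, 0.1, 0.0],
    "Social media buyout": [0.8, 0.2, 0.1],
    "Winter storm": [0.0, 0.2, 0.9],
}

query = toy_embeddings["Twitter acquisition"]
similar = cosine_similarity(query, toy_embeddings["Social media buyout"])
unrelated = cosine_similarity(query, toy_embeddings["Winter storm"])
print(similar > unrelated)
```

Texts with related meaning are expected to end up with vectors that point in similar directions, which is what the distance calculation later in this notebook relies on.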
import os
import pandas as pd
import openai
import numpy as np
OPEN_AI_API_KEY = os.getenv("OPENAI_API_KEY")
openai.api_key = OPEN_AI_API_KEY
client = openai.OpenAI()
df = pd.read_csv("data/wikipedia_data.csv", index_col=0)
df.sample(5)
index | text |
---|---|
35 | March 2 – Russian invasion of Ukraine: The Uni... |
138 | September 26 – The Nord Stream pipeline sabota... |
114 | August 6 – Terrance Drew is sworn in as prime ... |
86 | May 28 – Spanish club Real Madrid beat English... |
98 | July 6 – July 31 – UEFA Women's Euro 2022 is h... |
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
def get_embedding(text, model=EMBEDDING_MODEL_NAME):
    # Replace newlines with spaces before requesting the embedding
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
# df["embeddings"] = df.text.apply(
# lambda x: get_embedding(x, model=EMBEDDING_MODEL_NAME)
# )
# df.to_csv("data/wikipedia_embeddings.csv")
df = pd.read_csv("data/wikipedia_embeddings.csv", index_col=0)
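import ast
# The CSV stores each embedding as a string; parse it as a literal rather than with eval
df["embeddings"] = df.embeddings.apply(ast.literal_eval).apply(np.array)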
df.sample(5)
index | text | embeddings |
---|---|---|
161 | November 11 – Russian invasion of Ukraine: Ukr... | [-0.01199280098080635, -0.013951582834124565, ... |
42 | March 7 – COVID-19 pandemic: The global death ... | [0.005762410815805197, -0.006090453360229731, ... |
148 | October 25 – Amid a government crisis, Rishi S... | [0.0018328798469156027, -0.02193058282136917, ... |
45 | March 9 – Russian invasion of Ukraine: Russia ... | [0.0011836738558486104, -0.010346542112529278,... |
8 | January 7 – COVID-19 pandemic: The number of C... | [0.006454604212194681, -0.0033079846762120724,... |
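Because the CSV file stores each embedding as the string form of a list, it has to be parsed back into numbers after loading. The sketch below shows one way to do that with `ast.literal_eval` from the standard library, which accepts only Python literals and refuses to run arbitrary expressions. The stored value is made up and shortened to three numbers.

```python
import ast

# Made-up, shortened value in the format found in the CSV cell
stored = "[-0.0119, -0.0139, 0.0057]"
vector = ast.literal_eval(stored)
print(type(vector).__name__, len(vector))

# literal_eval rejects anything that is not a plain literal
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```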
Finding relevant data¶
question = "Who is the owner of twitter?"
questionn_embeddings = get_embedding(question, model=EMBEDDING_MODEL_NAME)
len(questionn_embeddings)
1536
from scipy import spatial
def distances_from_embeddings(
    query_embedding: list[float],
    embeddings: list[list[float]],
    distance_metric="cosine",
) -> list[float]:
    """Return the distances between a query embedding and a list of embeddings."""
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    distances = [
        distance_metrics[distance_metric](query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances
df = df.assign(
distances=distances_from_embeddings(
questionn_embeddings, df["embeddings"], distance_metric="cosine"
)
)
df.sort_values(by='distances').head(3)['text'].tolist()
['October 28 – Elon Musk completes his $44 billion acquisition of Twitter.', 'April 25 – Elon Musk reaches an agreement to acquire the social media network Twitter (which he later rebrands as X) for $44 billion USD, which later closes in October.', 'January 24 – The federal government under Scott Morrison announces that, after more than three years of confidential negotiations, copyright ownership of the Australian Aboriginal Flag has been transferred to the Commonwealth.']
df.to_csv("data/wikipedia_embeddings_distance.csv")
Providing context in a custom prompt¶
So far we have prepared our dataset, created embeddings, and ranked the rows by their cosine distance to the question.
Now we're getting to the magic! Our next task is to write a custom prompt that will include the most relevant parts of our dataset. We want our prompt to contain an instruction, the most relevant rows as context, and then the question.
How much data should we include?¶
Great question! Our data is sorted from most to least relevant -- but how many of those rows can we include?
While we could choose an arbitrary number, e.g. the top 5 or top 50 most relevant rows, a better approach is to count the number of tokens we use as we compose our text prompt and use all of the available tokens for each prompt.
Review: A token is the basic unit of text processing in an NLP model. It represents a sequence of characters that the model uses to understand and generate language.
Model usage on OpenAI is priced by the token, and each model supports a limited number of tokens. You can view this limit under the "max request" column on the OpenAI documentation about any given model.
In this course, the demo videos use the gpt-3.5-turbo-instruct model, which has a limit of about 4,096 tokens. That limit includes both the custom prompt and the response generated by the model.
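Since the limit covers the prompt and the response together, one way to budget is to reserve some tokens for the answer and fill the rest with context. The sketch below illustrates that idea under stated assumptions: `count_tokens` is a stand-in that counts whitespace-separated words (the notebook itself uses `tiktoken`), and `select_context`, `model_limit`, and `reserved_for_answer` are hypothetical names chosen for the example.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words (assumption)."""
    return len(text.split())


def select_context(
    rows: list[str], question: str, model_limit: int, reserved_for_answer: int
) -> list[str]:
    """Greedily keep rows, most relevant first, until the budget runs out."""
    # Tokens left for context after the question and the reserved answer space
    budget = model_limit - reserved_for_answer - count_tokens(question)
    selected = []
    for row in rows:  # rows are assumed to be sorted most relevant first
        cost = count_tokens(row)
        if cost > budget:
            break
        selected.append(row)
        budget -= cost
    return selected


rows = ["one two three", "four five", "six seven eight nine"]
selected = select_context(
    rows, "who owns it", model_limit=12, reserved_for_answer=3
)
print(selected)
```

With a limit of 12, 3 tokens reserved, and a 3-word question, 6 tokens remain for context, so the first two rows fit and the third does not.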
import os
import pandas as pd
import openai
import numpy as np
import tiktoken
df = pd.read_csv("data/wikipedia_embeddings_distance.csv", index_col=0).sort_values("distances")
tokenizer = tiktoken.get_encoding("cl100k_base")
len(tokenizer.encode("Answer the question based on the context"))
7
def get_number_of_tokens(text: str, tokenizer) -> int:
    return len(tokenizer.encode(text))
df = df.assign(
length_token = df['text'].apply(lambda x: get_number_of_tokens(x, tokenizer))
)
def create_prompt(question, df, tokenizer, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"
Context:
{}
---
Question: {}
Answer:"""
    current_token_count = len(tokenizer.encode(prompt_template)) + len(
        tokenizer.encode(question)
    )
    context = []
    for text in df.sort_values("distances")["text"].values:
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break
    # Separate the context rows with a "###" line
    return prompt_template.format("\n\n###\n\n".join(context), question)
question = "Who is the owner of twitter?"
max_token_count = 200
print(create_prompt(question, df, tokenizer, max_token_count))
Answer the question based on the context below, and if the question can't be answered based on the context, say "I don't know" Context: October 28 – Elon Musk completes his $44 billion acquisition of Twitter. ### April 25 – Elon Musk reaches an agreement to acquire the social media network Twitter (which he later rebrands as X) for $44 billion USD, which later closes in October. ### January 24 – The federal government under Scott Morrison announces that, after more than three years of confidential negotiations, copyright ownership of the Australian Aboriginal Flag has been transferred to the Commonwealth. ### October 25 – Amid a government crisis, Rishi Sunak becomes Prime Minister of the United Kingdom, following the resignation of Liz Truss the previous week resulting in a 50-day tenure. --- Question: Who is the owner of twitter? Answer:
completion = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{
"role": "system",
"content": "You are a helpful assistant",
},
{
"role": "user",
"content": create_prompt(question, df, tokenizer, max_token_count),
},
],
)
completion.choices[0].message
ChatCompletionMessage(content='Elon Musk is the owner of Twitter.', role='assistant', function_call=None, tool_calls=None)
Created: 2024-10-23