Wikipedia QA OpenAI example¶

In [34]:

import openai
import os
import json

OPEN_AI_API_KEY = os.getenv("OPENAI_API_KEY")
openai.api_key = OPEN_AI_API_KEY
client = openai.OpenAI()

OpenAI Model Responses without Customization¶

In [11]:

# creating a prompt
question_prompt = """
Who is the owner of twitter?
Answer: 
"""

# Call the chat completions endpoint
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": question_prompt,
        },
    ],
)

Extracting the Model Response¶

In [23]:

json.loads(completion.json())

Out[23]:

{'id': 'chatcmpl-8uIdWIIINa0IAxgBK3KqxKQ5NXbVV',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': 'The CEO and co-founder of Twitter is Jack Dorsey.',
    'role': 'assistant',
    'function_call': None,
    'tool_calls': None}}],
 'created': 1708428698,
 'model': 'gpt-3.5-turbo-0125',
 'object': 'chat.completion',
 'system_fingerprint': 'fp_69829325d0',
 'usage': {'completion_tokens': 12, 'prompt_tokens': 27, 'total_tokens': 39}}

Extracting Response Text¶

In [25]:

print(completion.choices[0].message)

ChatCompletionMessage(content='The CEO and co-founder of Twitter is Jack Dorsey.', role='assistant', function_call=None, tool_calls=None)
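
The printed object is a `ChatCompletionMessage`; the answer text itself lives in its `content` field. As a small offline sketch, the same lookup can be done on the dictionary form shown in `Out[23]`. The dictionary below is a trimmed copy of that output, and `extract_answer` is a hypothetical helper name, not part of the OpenAI library.

```python
# Trimmed copy of the dictionary printed in Out[23]
response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "The CEO and co-founder of Twitter is Jack Dorsey.",
                "role": "assistant",
            },
        }
    ],
    "model": "gpt-3.5-turbo-0125",
}


def extract_answer(response: dict) -> str:
    """Return the text of the first choice in a chat completion dictionary."""
    return response["choices"][0]["message"]["content"]


answer = extract_answer(response)
print(answer)
```

On the live object, the equivalent attribute access is `completion.choices[0].message.content`.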

Get external data¶

In [42]:

import requests
import pandas as pd
from dateutil.parser import parse, ParserError

Wikipedia API documentation

In [31]:

# Get the Wikipedia page for "2022" since OpenAI's models stop in 2021
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2022",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
response_dict["query"]["pages"][0]["extract"].split("\n")[:10]

Out[31]:

['2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd  year of the 3rd millennium and the 21st century, and the  3rd   year of the 2020s decade.  ',
 'The year 2022 saw the removal of nearly all COVID-19 restrictions and the reopening of international borders in most countries, and the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022, though the year also witnessed numerous natural disasters, including two devastating Atlantic hurricanes (Fiona and Ian), and the most powerful volcano eruption of the century so far. The later part of the year also saw the first public release of ChatGPT by OpenAI starting an arms race in artificial intelligence which increased in intensity into 2023, as well as the collapse of the cryptocurrency exchange FTX.',
 '2022 was also dominated by wars and armed conflicts. While escalations into the internal conflict in Myanmar and the Tigray War dominated the heightening of tensions within their regions and each caused over 10,000 deaths, 2022 was most notable for the Russian invasion of Ukraine, the largest armed conflict in Europe since World War II. The invasion caused the displacement of 15.7 million Ukrainians (8 million internally displaced persons and 7.7 million refugees), and led to international condemnations and sanctions and nuclear threats, the withdrawal of hundreds of companies from Russia, and the exclusion of Russia from major sporting events.',
 '',
 '',
 '== Events ==',
 '',
 '',
 '=== January ===',
 ' January 1 – The Regional Comprehensive Economic Partnership, the largest free trade area in the world, comes into effect for Australia, Brunei, Cambodia, China, Indonesia, Japan, South Korea, Laos, Malaysia, Myanmar, New Zealand, the Philippines, Singapore, Thailand, and Vietnam.']

Clean the data¶

In [32]:

import pandas as pd

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
df.head()

Out[32]:

	text
0	2022 (MMXXII) was a common year starting on Sa...
1	The year 2022 saw the removal of nearly all CO...
2	2022 was also dominated by wars and armed conf...
3
4

In [55]:

def clean_wikipedia_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop empty lines and section headings such as "== Events =="
    keep = (df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))
    return df[keep].copy()


def parse_dates(df_cleaned: pd.DataFrame) -> pd.DataFrame:
    # In some cases dates are used as headings instead of being part of the
    # text sample; adjust so dated text samples start with dates
    df_cleaned = df_cleaned.copy()
    prefix = ""
    for i, row in df_cleaned.iterrows():
        # If the row already has " – ", it already has the needed date prefix
        if " – " not in row["text"]:
            try:
                # If the row's text is a date, set it as the new prefix
                parse(row["text"])
                prefix = row["text"]
            except ParserError:
                # If the row's text isn't a date, add the prefix. Write through
                # .loc because assigning to `row` may only change a copy
                df_cleaned.loc[i, "text"] = prefix + " – " + row["text"]

    # Keep only the dated samples; this also drops the bare date headings
    df_cleaned = df_cleaned[df_cleaned["text"].str.contains(" – ")].reset_index(
        drop=True
    )

    return df_cleaned


df_cleaned = clean_wikipedia_data(df)
df_cleaned = parse_dates(df_cleaned)

In [57]:

df_cleaned.tail()

Out[57]:

	text
177	December 21–December 26 – A major winter storm...
178	December 24 – 2022 Fijian general election: Th...
179	December 29 – Brazilian football legend Pelé d...
180	December 31 – Former Pope Benedict XVI dies at...
181	December 7 – The world population was estimate...
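
The date handling in `parse_dates` can be easier to follow on a plain list. The sketch below is an illustration only: it uses `datetime.strptime` with a "Month day" pattern as a simplified stand-in for `dateutil.parser.parse`, and the sample lines and the helper names `is_date_heading` and `prefix_dates` are made up for the example.

```python
from datetime import datetime


def is_date_heading(text: str) -> bool:
    """Return True if the text is exactly a 'Month day' date such as 'March 3'."""
    try:
        datetime.strptime(text.strip(), "%B %d")
        return True
    except ValueError:
        return False


def prefix_dates(lines: list[str]) -> list[str]:
    """Attach the most recent date heading to the undated lines that follow it."""
    result = []
    prefix = ""
    for line in lines:
        if " – " in line:
            # The line already starts with its own date
            result.append(line)
        elif is_date_heading(line):
            # Remember the heading; it is not a text sample itself
            prefix = line
        else:
            result.append(prefix + " – " + line)
    return result


# Hypothetical input lines
sample = [
    "January 1 – Hypothetical event A.",
    "March 3",
    "Hypothetical event B.",
    "Hypothetical event C.",
]
dated = prefix_dates(sample)
print(dated)
```

The heading "March 3" disappears from the result and both undated events inherit it as their prefix, which mirrors the filtering step at the end of `parse_dates`.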

In [58]:

df_cleaned.to_csv("data/wikipedia_data.csv")

Convert into embeddings¶

To create our chatbot, we'll need to convert our natural language data into numeric representations that our machine learning model can process. We need these representations to capture the relationships within the data so that the model can recognize patterns and identify the most relevant content.
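
To make the idea concrete, here is a toy sketch of how numeric representations can capture relevance. The three-dimensional vectors are invented for illustration (real embedding models return far more dimensions), and `cosine_distance` is a hand-written helper, not a library call.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical "embeddings": similar texts get vectors pointing the same way
toy_embeddings = {
    "Elon Musk acquires Twitter": [0.9, 0.1, 0.0],
    "A winter storm hits North America": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for "Who is the owner of twitter?"

# Smaller distance means more relevant, so sort ascending
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_distance(query, toy_embeddings[text]),
)
print(ranked[0])
```

The text whose vector points in nearly the same direction as the query ends up first, which is the property the rest of the notebook relies on when it ranks rows by distance.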

In [1]:

import os
import pandas as pd
import openai
import numpy as np


OPEN_AI_API_KEY = os.getenv("OPENAI_API_KEY")
openai.api_key = OPEN_AI_API_KEY
client = openai.OpenAI()

df = pd.read_csv("data/wikipedia_data.csv", index_col=0)
df.sample(5)

Out[1]:

	text
35	March 2 – Russian invasion of Ukraine: The Uni...
138	September 26 – The Nord Stream pipeline sabota...
114	August 6 – Terrance Drew is sworn in as prime ...
86	May 28 – Spanish club Real Madrid beat English...
98	July 6 – July 31 – UEFA Women's Euro 2022 is h...

In [4]:

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"


def get_embedding(text, model=EMBEDDING_MODEL_NAME):
    # Newlines are replaced with spaces before the text is embedded
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


# Uncomment to compute the embeddings (one API call per row) and cache them
# df["embeddings"] = df.text.apply(
#     lambda x: get_embedding(x, model=EMBEDDING_MODEL_NAME)
# )
# df.to_csv("data/wikipedia_embeddings.csv")

In [5]:

df = pd.read_csv("data/wikipedia_embeddings.csv", index_col=0)
df["embeddings"] = df.embeddings.apply(eval).apply(np.array)
df.sample(5)

Out[5]:

	text	embeddings
161	November 11 – Russian invasion of Ukraine: Ukr...	[-0.01199280098080635, -0.013951582834124565, ...
42	March 7 – COVID-19 pandemic: The global death ...	[0.005762410815805197, -0.006090453360229731, ...
148	October 25 – Amid a government crisis, Rishi S...	[0.0018328798469156027, -0.02193058282136917, ...
45	March 9 – Russian invasion of Ukraine: Russia ...	[0.0011836738558486104, -0.010346542112529278,...
8	January 7 – COVID-19 pandemic: The number of C...	[0.006454604212194681, -0.0033079846762120724,...
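
The embeddings come back from the CSV as strings, which is why the cell above converts them before use. Because `eval` runs arbitrary Python, a stricter parser may be preferable for files you did not write yourself. A minimal sketch using the standard library's `ast.literal_eval`, with a made-up three-value string standing in for a real CSV cell:

```python
import ast

# Hypothetical CSV cell: a list of floats serialized as text
raw_cell = "[-0.0119, -0.0139, 0.0057]"

# literal_eval only accepts Python literals (lists, numbers, strings, ...)
embedding = ast.literal_eval(raw_cell)
print(type(embedding).__name__, len(embedding))

# Anything that is not a plain literal is rejected instead of executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

In the notebook this would amount to passing `ast.literal_eval` instead of `eval` to the first `apply` call.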

Finding relevant data¶

In [6]:

question = "Who is the owner of twitter?"
questionn_embeddings = get_embedding(question, model=EMBEDDING_MODEL_NAME)

In [7]:

len(questionn_embeddings)

Out[7]:

In [5]:

from scipy import spatial


def distances_from_embeddings(
    query_embedding: list[float],
    embeddings: list[list[float]],
    distance_metric="cosine",
) -> list[float]:
    """Return the distances between a query embedding and a list of embeddings."""
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    distances = [
        distance_metrics[distance_metric](query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances

In [10]:

df = df.assign(
    distances=distances_from_embeddings(
        questionn_embeddings, df["embeddings"], distance_metric="cosine"
    )
)

In [15]:

df.sort_values(by='distances').head(3)['text'].tolist()

Out[15]:

['October 28 – Elon Musk completes his $44 billion acquisition of Twitter.',
 'April 25 – Elon Musk reaches an agreement to acquire the social media network Twitter (which he later rebrands as X) for $44 billion USD, which later closes in October.',
 'January 24 – The federal government under Scott Morrison announces that, after more than three years of confidential negotiations, copyright ownership of the Australian Aboriginal Flag has been transferred to the Commonwealth.']

In [17]:

df.to_csv("data/wikipedia_embeddings_distance.csv")

Providing context in a custom prompt¶

So far we have prepared our dataset, created embeddings, and ranked the rows by their cosine distance to the question so that the most relevant text comes first.

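Now we're getting to the magic! Our next task is to write a custom prompt that will include the most relevant parts of our dataset. We want our prompt to look something like this:

    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context:

    {}

    ---

    Question: {}
    Answer: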

How much data should we include?¶

Great question! Our data is sorted from most to least relevant -- but how many of those rows can we include?

While we could choose an arbitrary number, e.g. the top 5 or top 50 most relevant rows, a better approach is to count the tokens as we compose the prompt and fill the available token budget for each prompt.

Review: A token is the basic unit of text processing in an NLP model. It represents a sequence of characters that the model uses to understand and generate language.

Model usage on OpenAI is priced by the token, and each model supports a limited number of tokens. You can view this limit under the "max request" column on the OpenAI documentation about any given model.

In this course, the demo videos use the gpt-3.5-turbo-instruct model, which has a limit of about 4,096 tokens. That limit includes both the custom prompt and the response generated by the model.
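
Because the limit covers the prompt and the response together, the room left for context rows is whatever remains after the template, the question, and some space for the answer. A rough sketch of that arithmetic follows; the token counts are hypothetical (in practice they would come from the tokenizer), and the `reserved_for_answer` parameter is an assumption of this example rather than something the notebook defines.

```python
def context_token_budget(
    model_limit: int,
    template_tokens: int,
    question_tokens: int,
    reserved_for_answer: int,
) -> int:
    """Return how many tokens remain for context rows, never below zero."""
    used = template_tokens + question_tokens + reserved_for_answer
    return max(model_limit - used, 0)


# Hypothetical counts for illustration
budget = context_token_budget(
    model_limit=4096,
    template_tokens=40,
    question_tokens=8,
    reserved_for_answer=150,
)
print(budget)
```

Context rows can then be added in order of relevance until their combined token count would exceed this budget.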

In [1]:

import os
import pandas as pd
import openai
import numpy as np
import tiktoken

In [13]:

df = pd.read_csv("data/wikipedia_embeddings_distance.csv", index_col=0).sort_values("distances")

In [14]:

tokenizer = tiktoken.get_encoding("cl100k_base")
len(tokenizer.encode("Answer the question based on the context"))

Out[14]:

In [15]:

def get_number_of_tokens(text: str, tokenizer) -> int:
    return len(tokenizer.encode(text))

df = df.assign(
    length_token=df['text'].apply(lambda x: get_number_of_tokens(x, tokenizer))
)

In [32]:

def create_prompt(question, df, tokenizer, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    distances to the question, return a text prompt to send to the model
    """

    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + len(
        tokenizer.encode(question)
    )

    context = []
    for text in df.sort_values("distances")["text"].values:
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    # Separate the context rows with a "###" line
    return prompt_template.format("\n\n###\n\n".join(context), question)

In [33]:

question = "Who is the owner of twitter?"
max_token_count = 200

print(create_prompt(question, df, tokenizer, max_token_count))

    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    October 28 – Elon Musk completes his $44 billion acquisition of Twitter.

###

April 25 – Elon Musk reaches an agreement to acquire the social media network Twitter (which he later rebrands as X) for $44 billion USD, which later closes in October.

###

January 24 – The federal government under Scott Morrison announces that, after more than three years of confidential negotiations, copyright ownership of the Australian Aboriginal Flag has been transferred to the Commonwealth.

###

October 25 – Amid a government crisis, Rishi Sunak becomes Prime Minister of the United Kingdom, following the resignation of Liz Truss the previous week resulting in a 50-day tenure.

    ---

    Question: Who is the owner of twitter?
    Answer:

In [35]:

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": create_prompt(question, df, tokenizer, max_token_count),
        },
    ],
)

In [40]:

completion.choices[0].message

Out[40]:

ChatCompletionMessage(content='Elon Musk is the owner of Twitter.', role='assistant', function_call=None, tool_calls=None)

Last update: 2024-10-23
Created: 2024-10-23