Promting techninques reviewĀ¶
InĀ [2]:
Copied!
import requests
import os
import requests import os
InĀ [4]:
Copied!
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
InĀ [5]:
Copied!
ENDPOINT = 'https://api.together.xyz/inference'
ENDPOINT = 'https://api.together.xyz/inference'
InĀ [6]:
Copied!
# Decoding parameters
TEMPERATURE = 0.0
MAX_TOKENS = 512
TOP_P = 1.0
REPITIION_PENALTY = 1.0
# https://huggingface.co/meta-llama/Llama-2-7b-hf
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
# Decoding parameters TEMPERATURE = 0.0 MAX_TOKENS = 512 TOP_P = 1.0 REPITIION_PENALTY = 1.0 # https://huggingface.co/meta-llama/Llama-2-7b-hf B_INST, E_INST = "[INST]", "[/INST]" B_SYS, E_SYS = "<>\n", "\n< >\n\n"
InĀ [7]:
Copied!
def query_together_endpoint(prompt):
return requests.post(ENDPOINT, json={
"model": "togethercomputer/llama-2-7b-chat",
"max_tokens": MAX_TOKENS,
"prompt": prompt,
"request_type": "language-model-inference",
"temperature": TEMPERATURE,
"top_p": TOP_P,
"repetition_penalty": REPITIION_PENALTY,
"stop": [
E_INST,
E_SYS
],
"negative_prompt": "",
}, headers={
"Authorization": f"Bearer {TOGETHER_API_KEY}",
}).json()['output']['choices'][0]['text']
def query_together_endpoint(prompt): return requests.post(ENDPOINT, json={ "model": "togethercomputer/llama-2-7b-chat", "max_tokens": MAX_TOKENS, "prompt": prompt, "request_type": "language-model-inference", "temperature": TEMPERATURE, "top_p": TOP_P, "repetition_penalty": REPITIION_PENALTY, "stop": [ E_INST, E_SYS ], "negative_prompt": "", }, headers={ "Authorization": f"Bearer {TOGETHER_API_KEY}", }).json()['output']['choices'][0]['text']
Helper functionsĀ¶
InĀ [8]:
Copied!
def query_model(prompt, trigger = None, verbose=True, **kwargs):
inst_prompt = f"{B_INST} {prompt} {E_INST}"
if trigger:
inst_prompt = inst_prompt + trigger
generation = query_together_endpoint(inst_prompt)
if verbose:
print(f"*** Prompt ***\n{inst_prompt}")
print(f"*** Generation ***\n{generation}")
return generation
def query_model(prompt, trigger = None, verbose=True, **kwargs): inst_prompt = f"{B_INST} {prompt} {E_INST}" if trigger: inst_prompt = inst_prompt + trigger generation = query_together_endpoint(inst_prompt) if verbose: print(f"*** Prompt ***\n{inst_prompt}") print(f"*** Generation ***\n{generation}") return generation
System PromptsĀ¶
InĀ [9]:
Copied!
ANSWER_STAGE = "Provide the direct answer to the user question."
REASONING_STAGE = "Describe the step by step reasoning to find the answer."
ANSWER_STAGE = "Provide the direct answer to the user question." REASONING_STAGE = "Describe the step by step reasoning to find the answer."
InĀ [10]:
Copied!
# System prompt can be constructed in two ways:
# 1) Answering the question first or
# 2) Providing the reasoning first
# Similar ablation performed in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"
# https://arxiv.org/pdf/2201.11903.pdf
SYSTEM_PROMPT_TEMPLATE = """{b_sys}Answer the user's question using the following format:
1) {stage_1}
2) {stage_2}{e_sys}"""
# System prompt can be constructed in two ways: # 1) Answering the question first or # 2) Providing the reasoning first # Similar ablation performed in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" # https://arxiv.org/pdf/2201.11903.pdf SYSTEM_PROMPT_TEMPLATE = """{b_sys}Answer the user's question using the following format: 1) {stage_1} 2) {stage_2}{e_sys}"""
Response triggersĀ¶
InĀ [11]:
Copied!
# Chain of thought trigger from "Large Language Models are Zero-Shot Reasoners"
# https://arxiv.org/abs/2205.11916
COT_TRIGGER = "\n\nA: Lets think step by step:"
A_TRIGGER = "\n\nA:"
# Chain of thought trigger from "Large Language Models are Zero-Shot Reasoners" # https://arxiv.org/abs/2205.11916 COT_TRIGGER = "\n\nA: Lets think step by step:" A_TRIGGER = "\n\nA:"
User prompt for our taskĀ¶
InĀ [12]:
Copied!
user_prompt_template = "Q: Llama 2 has a context window of {atten_window} tokens. \
If we are reserving {max_token} of them for the LLM response, \
the system prompt uses {sys_prompt_len}, \
the chain of thought trigger uses only {trigger_len}, \
and finally the conversational history uses {convo_history_len}, \
how many can we use for the user prompt?"
user_prompt_template = "Q: Llama 2 has a context window of {atten_window} tokens. \ If we are reserving {max_token} of them for the LLM response, \ the system prompt uses {sys_prompt_len}, \ the chain of thought trigger uses only {trigger_len}, \ and finally the conversational history uses {convo_history_len}, \ how many can we use for the user prompt?"
InĀ [13]:
Copied!
atten_window = 4096
max_token = 512
sys_prompt_len = 124
trigger_len = 11
convo_history_len = 390
user_prompt = user_prompt_template.format(
atten_window=atten_window,
max_token=max_token,
sys_prompt_len=sys_prompt_len,
trigger_len=trigger_len,
convo_history_len=convo_history_len
)
atten_window = 4096 max_token = 512 sys_prompt_len = 124 trigger_len = 11 convo_history_len = 390 user_prompt = user_prompt_template.format( atten_window=atten_window, max_token=max_token, sys_prompt_len=sys_prompt_len, trigger_len=trigger_len, convo_history_len=convo_history_len )
InĀ [14]:
Copied!
print(user_prompt)
print(user_prompt)
Q: Llama 2 has a context window of 4096 tokens. If we are reserving 512 of them for the LLM response, the system prompt uses 124, the chain of thought trigger uses only 11, and finally the conversational history uses 390, how many can we use for the user prompt?
InĀ [15]:
Copied!
desired_numeric_answer = atten_window - max_token - sys_prompt_len - trigger_len - convo_history_len
desired_numeric_answer
desired_numeric_answer = atten_window - max_token - sys_prompt_len - trigger_len - convo_history_len desired_numeric_answer
Out[15]:
3059
Testing the promptsĀ¶
User prompt onlyĀ¶
InĀ [16]:
Copied!
r = query_model(user_prompt)
r = query_model(user_prompt)
*** Prompt *** [INST] Q: Llama 2 has a context window of 4096 tokens. If we are reserving 512 of them for the LLM response, the system prompt uses 124, the chain of thought trigger uses only 11, and finally the conversational history uses 390, how many can we use for the user prompt? [/INST] *** Generation *** Great, let's do the calculation! You've mentioned that Llama 2 has a context window of 4096 tokens. If we are reserving 512 of them for the LLM response, that means we have 4096 - 512 = 3584 tokens available for other uses. You've also mentioned that the system prompt uses 124 tokens, the chain of thought trigger uses 11 tokens, and the conversational history uses 390 tokens. So, the remaining tokens available for the user prompt are: 3584 - 124 - 11 - 390 = 2089 tokens Therefore, you can use up to 2089 tokens for the user prompt.
User prompt + system prompt v1: answering firstĀ¶
InĀ [17]:
Copied!
system_prompt = SYSTEM_PROMPT_TEMPLATE.format(
b_sys = B_SYS,
stage_1=ANSWER_STAGE,
stage_2=REASONING_STAGE,
e_sys=E_SYS
)
prompt = "".join([system_prompt, user_prompt])
r2 = query_model(prompt)
system_prompt = SYSTEM_PROMPT_TEMPLATE.format( b_sys = B_SYS, stage_1=ANSWER_STAGE, stage_2=REASONING_STAGE, e_sys=E_SYS ) prompt = "".join([system_prompt, user_prompt]) r2 = query_model(prompt)
*** Prompt *** [INST] <<SYS>> Answer the user's question using the following format: 1) Provide the direct answer to the user question. 2) Describe the step by step reasoning to find the answer. <</SYS>> Q: Llama 2 has a context window of 4096 tokens. If we are reserving 512 of them for the LLM response, the system prompt uses 124, the chain of thought trigger uses only 11, and finally the conversational history uses 390, how many can we use for the user prompt? [/INST] *** Generation *** Sure, I'd be happy to help you with that! Here's the answer to your question: 1. Direct answer: The user can use up to 390 tokens for their prompt. Here's the reasoning behind this answer: * The context window for Llama 2 is 4096 tokens. * You mentioned that 512 tokens are reserved for the LLM response. * The system prompt uses 124 tokens. * The chain of thought trigger uses only 11 tokens. * The conversational history uses 390 tokens. So, the remaining tokens available for the user prompt are 4096 - 512 - 124 - 11 = 390 tokens. I hope this helps! Let me know if you have any further questions.
User prompt + system prompt v2: reasoning firstĀ¶
InĀ [18]:
Copied!
system_prompt = SYSTEM_PROMPT_TEMPLATE.format(b_sys = B_SYS, stage_1=REASONING_STAGE, stage_2=ANSWER_STAGE, e_sys=E_SYS)
prompt = "".join([system_prompt, user_prompt])
r3 = query_model(prompt)
system_prompt = SYSTEM_PROMPT_TEMPLATE.format(b_sys = B_SYS, stage_1=REASONING_STAGE, stage_2=ANSWER_STAGE, e_sys=E_SYS) prompt = "".join([system_prompt, user_prompt]) r3 = query_model(prompt)
*** Prompt *** [INST] <<SYS>> Answer the user's question using the following format: 1) Describe the step by step reasoning to find the answer. 2) Provide the direct answer to the user question. <</SYS>> Q: Llama 2 has a context window of 4096 tokens. If we are reserving 512 of them for the LLM response, the system prompt uses 124, the chain of thought trigger uses only 11, and finally the conversational history uses 390, how many can we use for the user prompt? [/INST] *** Generation *** Great, let's break down the calculation to find out how many tokens are available for the user prompt: 1. First, let's calculate the total number of tokens reserved for the LLM response, chain of thought trigger, and conversational history: LLM response: 512 tokens Chain of thought trigger: 11 tokens Conversational history: 390 tokens Total reserved tokens: 512 + 11 + 390 = 903 tokens 2. Now, let's subtract the total reserved tokens from the context window of 4096 tokens to find out how many tokens are available for the user prompt: 4096 - 903 = 3193 tokens Therefore, the user can use up to 3193 tokens for their prompt.
InĀ [23]:
Copied!
print("Correct answer:", 4096-512-124-11-390)
print("Correct answer:", 4096-512-124-11-390)
Correct answer: 3059
User prompt + cot triggerĀ¶
InĀ [24]:
Copied!
r4 = query_model(user_prompt, trigger=COT_TRIGGER)
r4 = query_model(user_prompt, trigger=COT_TRIGGER)
*** Prompt *** [INST] Q: Llama 2 has a context window of 4096 tokens. If we are reserving 512 of them for the LLM response, the system prompt uses 124, the chain of thought trigger uses only 11, and finally the conversational history uses 390, how many can we use for the user prompt? [/INST] A: Lets think step by step: *** Generation *** 1. The context window of Llama 2 is 4096 tokens. 2. You want to reserve 512 tokens for the LLM response. 3. The system prompt uses 124 tokens. 4. The chain of thought trigger uses only 11 tokens. 5. The conversational history uses 390 tokens. Now, let's calculate how many tokens are left for the user prompt: 4096 - 512 = 3584 So, you have 3584 tokens available for the user prompt.
User prompt + "A:" triggerĀ¶
InĀ [21]:
Copied!
r5 = query_model(user_prompt, trigger=A_TRIGGER)
r5 = query_model(user_prompt, trigger=A_TRIGGER)
*** Prompt *** [INST] Q: Llama 2 has a context window of 4096 tokens. If we are reserving 512 of them for the LLM response, the system prompt uses 124, the chain of thought trigger uses only 11, and finally the conversational history uses 390, how many can we use for the user prompt? [/INST] A: *** Generation *** To determine how many context tokens are available for the user prompt, we need to subtract the number of tokens reserved for the LLM response, the system prompt, the chain of thought trigger, and the conversational history from the total context window of 4096 tokens. Reserved tokens for LLM response: 512 Reserved tokens for system prompt: 124 Reserved tokens for chain of thought trigger: 11 Reserved tokens for conversational history: 390 Total reserved tokens: 1037 Now, let's check how many tokens are available for the user prompt: 4096 - 1037 = 3059 So, there are 3059 context tokens available for the user prompt.
Last update: 2024-10-23
Created: 2024-10-23
Created: 2024-10-23