
Commit 16a6cd6: Termination word testing
1 parent 50ed87a

198 files changed: +8351 -11 lines changed


.gitignore

+2 -1

@@ -1 +1,2 @@
-testoutput.py
+testoutput.py
+group_chat_sandpit.py

README.md

+55 -1

@@ -109,4 +109,58 @@ Solar 10.7b Instruct | :x: | :x: | Produced a chat conversation
 StarCoder2 3b | | |
 StarCoder2 7b | | |
 StarCoder2 15b | | |
-Yi-34b Chat | :large_orange_diamond: | :large_orange_diamond: | Close to a valid drawing, outdated libraries
+Yi-34b Chat | :large_orange_diamond: | :large_orange_diamond: | Close to a valid drawing, outdated libraries
+
+---
+---
+
+### Non-coding tests
+
+#### Termination word
+
+Background: Tests the ability of an LLM to incorporate a termination word into its response.
+
+Scenario: Uses a Group Chat with a Story_writer and a Product_manager. The Story_writer writes some story ideas and the Product_manager reviews them, terminating when satisfied by outputting a specific word (e.g. "TERMINATE", "BAZINGA!", etc.).
+
+Story_writer's system message: **An ideas person, loves coming up with ideas for kids books.**
+
+Product_manager's system message: **Great in evaluating story ideas from your writers and determining whether they would be unique and interesting for kids. Reply with suggested improvements if they aren't good enough, otherwise reply `{termination_word}` at the end when you're satisfied there's one good story idea.**
+
+Prompt for the chat manager: **Come up with 3 story ideas for Grade 3 kids.**
+
+See the [results](results) folder for code outputs.
+
+Note 1: `TERMINATE` is the standard termination word used by AutoGen.
+Note 2: Some LLMs included the termination word but the quality of the full response was not perfect.
+
+| Key | Meaning |
+| --- | --- |
+| :white_check_mark: | Output termination word correctly |
+| :x: | Performed task, didn't output termination word |
+| :thumbsdown: | Didn't understand/participate in task |
+
+There were two runs for each model and termination word.
+
+**Model** | **TERMINATE** | **ACBDEGFHIKJL** | **AUTOGENDONE** | **BAZINGA!** | **DONESKI** | **Notes**
+---|---|---|---|---|---|---
+CodeLlama 7b Python | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | |
+CodeLlama 13b Python | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | |
+CodeLlama 34b Instruct | :white_check_mark: :white_check_mark: | :x: :white_check_mark: | :x: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :x: | |
+CodeLlama 34b Python | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | |
+DeepSeek Coder 6.7b | :x: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | |
+Llama2 7b Chat | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Llama2 13b Chat | :x: :white_check_mark: | :x: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Mistral 7b 0.2 Instruct | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Mixtral 8x7b Q4 | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Mixtral 8x7b Q5 | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Neural Chat 7b Chat | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Nexus Raven | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | :thumbsdown: :thumbsdown: | Tried to call a Python function to create the stories |
+OpenHermes 7b Mistral | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Orca2 13b | :white_check_mark: :white_check_mark: | :x: :white_check_mark: | :x: :white_check_mark: | :x: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Phi-2 | :thumbsdown: :thumbsdown: | :thumbsdown: :x: | :x: :x: | :x: :x: | :x: :thumbsdown: | |
+Phind-CodeLlama34b | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Qwen 14b | :x: :x: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+Solar 10.7b Instruct | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
+StarCoder2 3b | | | | | | |
+StarCoder2 7b | | | | | | |
+StarCoder2 15b | | | | | | |
+Yi-34b Chat | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | :white_check_mark: :white_check_mark: | |
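
The termination check described above is implemented in `group_chat_terminate.py` (added further down in this commit). As a minimal sketch of the same idea, a "contains"-style check is just a custom `is_termination_msg` callable passed to AutoGen's `UserProxyAgent`; the function name, model name, and endpoint URL below are placeholders, not part of the commit:

```python
import autogen

termination_word = "BAZINGA!"  # any of the words tested above

# True if the reply contains the termination word anywhere (case-sensitive "contains", not "equals").
def is_done(msg):
    return isinstance(msg, dict) and termination_word in str(msg.get("content"))

llm_config = {
    "config_list": [{"model": "local-model", "api_key": "NotRequired",
                     "base_url": "http://localhost:11434/v1"}],  # placeholder Ollama endpoint
    "cache_seed": None,  # no caching between test runs
}

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    is_termination_msg=is_done,  # the custom check that ends the group chat
    code_execution_config=False,
    human_input_mode="NEVER",
)
```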

code_exec_jupyter_llm_fib.py

+3 -3

@@ -86,9 +86,9 @@ def close(self):
 {"model_name": "phind-codellama:34b-v2", "display_name" : "Phind_CodeLlama_34b_v2"},
 {"model_name": "qwen:14b-chat-q6_K", "display_name" : "Qwen_14b_Chat"},
 {"model_name": "solar:10.7b-instruct-v1-q5_K_M", "display_name" : "Solar_107b_Instruct"},
-{"model_name": "starcoder2:3b", "display_name" : "StarCoder2_3b"},
-{"model_name": "starcoder2:7b", "display_name" : "StarCoder2_7b"},
-{"model_name": "starcoder2:15b", "display_name" : "StarCoder2_15b"},
+# {"model_name": "starcoder2:3b", "display_name" : "StarCoder2_3b"},
+# {"model_name": "starcoder2:7b", "display_name" : "StarCoder2_7b"},
+# {"model_name": "starcoder2:15b", "display_name" : "StarCoder2_15b"},
 {"model_name": "yi:34b-chat-q4_K_M", "display_name" : "Yi_34b_Chat_Q4"},
 ]

code_exec_jupyter_llm_functioncall.py

+3 -3

@@ -57,9 +57,9 @@ def close(self):
 {"model_name": "phind-codellama:34b-v2", "display_name" : "Phind_CodeLlama_34b_v2"},
 {"model_name": "qwen:14b-chat-q6_K", "display_name" : "Qwen_14b_Chat"},
 {"model_name": "solar:10.7b-instruct-v1-q5_K_M", "display_name" : "Solar_107b_Instruct"},
-{"model_name": "starcoder2:3b", "display_name" : "StarCoder2_3b"},
-{"model_name": "starcoder2:7b", "display_name" : "StarCoder2_7b"},
-{"model_name": "starcoder2:15b", "display_name" : "StarCoder2_15b"},
+# {"model_name": "starcoder2:3b", "display_name" : "StarCoder2_3b"},
+# {"model_name": "starcoder2:7b", "display_name" : "StarCoder2_7b"},
+# {"model_name": "starcoder2:15b", "display_name" : "StarCoder2_15b"},
 {"model_name": "yi:34b-chat-q4_K_M", "display_name" : "Yi_34b_Chat_Q4"},
 ]

code_exec_jupyter_llm_stocks.py

+3 -3

@@ -87,9 +87,9 @@ def close(self):
 {"model_name": "phind-codellama:34b-v2", "display_name" : "Phind_CodeLlama_34b_v2"},
 {"model_name": "qwen:14b-chat-q6_K", "display_name" : "Qwen_14b_Chat"},
 {"model_name": "solar:10.7b-instruct-v1-q5_K_M", "display_name" : "Solar_107b_Instruct"},
-{"model_name": "starcoder2:3b", "display_name" : "StarCoder2_3b"},
-{"model_name": "starcoder2:7b", "display_name" : "StarCoder2_7b"},
-{"model_name": "starcoder2:15b", "display_name" : "StarCoder2_15b"},
+# {"model_name": "starcoder2:3b", "display_name" : "StarCoder2_3b"},
+# {"model_name": "starcoder2:7b", "display_name" : "StarCoder2_7b"},
+# {"model_name": "starcoder2:15b", "display_name" : "StarCoder2_15b"},
 {"model_name": "yi:34b-chat-q4_K_M", "display_name" : "Yi_34b_Chat_Q4"},
 ]

group_chat_terminate.py

+143

@@ -0,0 +1,143 @@
# Testing TERMINATE after a group chat - Local LLM
# Based on: (NEW)

import os
import autogen
import datetime
import sys  # Redirecting standard output to a file instead

# Duplicate output to file and screen
class Tee:
    def __init__(self, file_name, mode='w'):
        self.file = open(file_name, mode)
        self.stdout = sys.stdout

    def __enter__(self):
        sys.stdout = self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    def write(self, data):
        self.file.write(data)
        self.stdout.write(data)

    def flush(self):
        self.file.flush()

    def close(self):
        if self.file:
            self.file.close()
        sys.stdout = self.stdout

ollama_models = [
    {"model_name": "codellama:7b-python", "display_name": "CodeLlama_7b_Python"},
    {"model_name": "codellama:13b-python", "display_name": "CodeLlama_13b_Python"},
    {"model_name": "codellama:34b-instruct", "display_name": "CodeLlama_34b_Instruct"},
    {"model_name": "codellama:34b-python", "display_name": "CodeLlama_34b_Python"},
    {"model_name": "deepseek-coder:6.7b-instruct-q6_K", "display_name": "DeepSeek_Coder"},
    {"model_name": "llama2:13b-chat", "display_name": "Llama2_13b_Chat"},
    {"model_name": "llama2:7b-chat-q6_K", "display_name": "Llama2_7b_Chat"},
    {"model_name": "mistral:7b-instruct-v0.2-q6_K", "display_name": "Mistral_7b_Instruct_v2"},
    {"model_name": "mixtralq4", "display_name": "Mixtral_8x7b_Q4"},
    {"model_name": "mixtralq5", "display_name": "Mixtral_8x7b_Q5"},
    {"model_name": "neural-chat:7b-v3.3-q6_K", "display_name": "Neural_Chat_7b"},
    {"model_name": "nexusraven", "display_name": "Nexus_Raven"},
    {"model_name": "openhermes:7b-mistral-v2.5-q6_K", "display_name": "OpenHermes_7b_Mistral_v25"},
    {"model_name": "orca2:13b-q5_K_S", "display_name": "Orca2_13b"},
    {"model_name": "phi", "display_name": "Phi"},
    {"model_name": "phind-codellama:34b-v2", "display_name": "Phind_CodeLlama_34b_v2"},
    {"model_name": "qwen:14b-chat-q6_K", "display_name": "Qwen_14b_Chat"},
    {"model_name": "solar:10.7b-instruct-v1-q5_K_M", "display_name": "Solar_107b_Instruct"},
    # {"model_name": "starcoder2:3b", "display_name": "StarCoder2_3b"},
    # {"model_name": "starcoder2:7b", "display_name": "StarCoder2_7b"},
    # {"model_name": "starcoder2:15b", "display_name": "StarCoder2_15b"},
    {"model_name": "yi:34b-chat-q4_K_M", "display_name": "Yi_34b_Chat_Q4"},
]


# The word we're looking for the LLM to return to terminate the chat.
# Our custom termination message function is a "contains", not an "equals", so the word can be anywhere
# in the LLM response. It is, however, case-sensitive.
# If we use "TERMINATE" then it will utilise the underlying termination check as well, although
# that requires the whole response to be just: TERMINATE
termination_words = ["DONESKI", "BAZINGA!", "AUTOGENDONE", "ACBDEGFHIKJL", "TERMINATE"]

test_prefix = "Term"

# Our termination function (reads the global termination_word set in the loops below)
def termination_msg_function(x):

    # Output whether it was a whole/part match so we've got it in the output file.
    if isinstance(x, dict) and str(x.get("content")) == termination_word:
        print(f"<-- Termination word '{termination_word}' matches WHOLE response -->")
    elif isinstance(x, dict) and f"{termination_word}" in str(x.get("content")):
        print(f"<-- Termination word '{termination_word}' in PART OF response -->")

    is_termination = isinstance(x, dict) and f"{termination_word}" in str(x.get("content"))
    return is_termination

for ollama_model in ollama_models:

    model_name = ollama_model["model_name"]
    display_name = ollama_model["display_name"]

    # Loop through the termination words
    for termination_word in termination_words:

        # Two iterations per model.
        for iteration in range(1, 3):

            output_file = f"/home/autogen/autogen/ms_working/results/{test_prefix}_{display_name}_{termination_word}_i{iteration}.txt"

            if not os.path.exists(output_file):

                # Clear the terminal (Unix/Linux/MacOS)
                os.system('clear')

                with Tee(output_file, 'w'):

                    # Set the config as our local model
                    llm_config = {
                        "config_list": [{"model": model_name, "api_key": "NotRequired", "base_url": "http://192.168.0.115:11434/v1"}],
                        "cache_seed": None,
                    }  ## CRITICAL - ENSURE THERE'S NO CACHING FOR TESTING

                    user_proxy = autogen.UserProxyAgent(
                        name="User_proxy",
                        system_message="A human admin.",
                        is_termination_msg=termination_msg_function,  # Here's our termination function
                        code_execution_config=False,
                        human_input_mode="NEVER",
                    )
                    storywriter = autogen.AssistantAgent(
                        name="Story_writer",
                        system_message="An ideas person, loves coming up with ideas for kids books.",
                        llm_config=llm_config,
                    )
                    pm = autogen.AssistantAgent(
                        name="Product_manager",
                        system_message=f"""Great in evaluating story ideas from your writers and determining whether they would be unique and interesting for kids.
                        Reply with suggested improvements if they aren't good enough, otherwise
                        reply `{termination_word}` at the end when you're satisfied there's one good story idea.""",
                        llm_config=llm_config,
                    )
                    groupchat = autogen.GroupChat(agents=[user_proxy, storywriter, pm], messages=[], max_round=8,
                                                  # For the purposes of the test we'll go round-robin; we're not testing speaker selection.
                                                  speaker_selection_method='round_robin')
                    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

                    # Run the group chat for this model, termination word, and iteration.

                    today = datetime.datetime.now().strftime("%Y-%m-%d")

                    print(f"-----\n\n{display_name}\n\n{today}\n\nIteration {iteration}\n\nTerminating on '{termination_word}'\n\n-----\n")

                    chat_result = user_proxy.initiate_chat(
                        manager, message="Come up with 3 story ideas for Grade 3 kids."
                    )

            else:
                print(f"{output_file} already exists, ignoring.")

    # break
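
A quick, hypothetical sanity check of the "contains" matching above (not part of the commit); it assumes `termination_msg_function` and the global `termination_word` from `group_chat_terminate.py` are in scope:

```python
# Hypothetical check of the "contains" termination logic.
termination_word = "BAZINGA!"
assert termination_msg_function({"content": "Great ideas. BAZINGA!"}) is True
assert termination_msg_function({"content": "Needs more work."}) is False
assert termination_msg_function({"content": "bazinga!"}) is False  # matching is case-sensitive
```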
