Arena upgrade #7

Open — wants to merge 180 commits into base: main

Commits (180)
f6aa8fe
simple change
Jun 27, 2025
18f41b1
test lmeval change
Jun 27, 2025
a425d43
update branch
Jun 27, 2025
6fc29f4
use main
Jun 27, 2025
956a12b
remove gcs
Jun 27, 2025
5e09fb7
readd gc
Jun 27, 2025
655f00e
remove gc
Jun 27, 2025
ba703b0
back to guidellm
Jun 27, 2025
b4deac8
simplified
Jun 27, 2025
6ed6862
simple vllm
Jun 27, 2025
b3f55bc
skip vllm
Jun 27, 2025
3a709da
pause vllm
Jun 27, 2025
02cac57
update benchmark report
Jun 27, 2025
a85bb4f
update ip
Jun 27, 2025
c3af0cf
update branch
Jun 27, 2025
ede7482
added base task param
Jun 27, 2025
87496ea
retry branch name
Jun 27, 2025
b64ffd8
repo branch
Jun 27, 2025
7dc5e48
readd branch
Jun 27, 2025
2d05c64
branch in base task
Jun 27, 2025
60e6e9e
optional branch
Jun 27, 2025
ee4d7c9
add branch choice
Jun 27, 2025
998a8bc
include benchmark
Jun 27, 2025
6944cb4
refactor default
Jun 27, 2025
6e4a5d5
moved generate text
Jun 27, 2025
41f3f21
test
Jun 30, 2025
850fd21
add debug
Jun 30, 2025
5e87674
add os lib
Jun 30, 2025
c9b63a8
use default scenario
Jun 30, 2025
4d68ea8
benchmark with scenario
Jun 30, 2025
0f07b28
overlap with guidellm vars
Jun 30, 2025
6a67050
check model and target
Jun 30, 2025
72094b4
add debugs
Jun 30, 2025
10180a3
list keys that overlap
Jun 30, 2025
9191f13
only replace model
Jun 30, 2025
1b0e4a4
update with scenario
Jun 30, 2025
7515a61
readd default scenario
Jun 30, 2025
e6318f5
readd default scenario
Jun 30, 2025
9f61d6e
pin to main
Jun 30, 2025
8c8c23e
readd vllm server
Jul 1, 2025
ec725d1
updated vllm server
Jul 1, 2025
5b22309
print the input vars
Jul 1, 2025
5e8053a
remove gpu count
Jul 1, 2025
af3ebaa
simple path
Jul 1, 2025
5c4f5b8
vllm print
Jul 1, 2025
b8a1e9f
added cwd
Jul 1, 2025
0365496
ensure setup uses branch
Jul 1, 2025
348fd82
add guide again
Jul 1, 2025
cb882af
readd gpu count
Jul 1, 2025
464591e
update vllm server
Jul 1, 2025
c0d0dba
revert target
Jul 1, 2025
81c62f7
install editable guidellm
Jul 1, 2025
97e36cb
print package list
Jul 1, 2025
063c8b9
added package print
Jul 1, 2025
d6ef266
older guidellm
Jul 1, 2025
8c64910
updated to use dev branch
Jul 2, 2025
7dee38b
redo with custom branch
Jul 2, 2025
263c2ff
repo override
Jul 2, 2025
90e461b
add packages to guidellm
Jul 2, 2025
4f00a5a
update setup.py
Jul 2, 2025
14f84ce
readd
Jul 2, 2025
ad2b423
before vllm
Jul 2, 2025
98eb6f8
removed vllm
Jul 2, 2025
10874d3
remove vllm
Jul 2, 2025
629d195
cleanup
Jul 2, 2025
768d135
back to base
Jul 2, 2025
09c3978
readd
Jul 2, 2025
e64fb12
readd start vllm server
Jul 2, 2025
873c222
use guidellm branch
Jul 2, 2025
16b83bc
base complete
Jul 2, 2025
432031e
test rag
Jul 2, 2025
e9117ea
clean up
Jul 2, 2025
9984a8c
base package as variable
Jul 2, 2025
b8b51e9
test default branch change
Jul 2, 2025
b99afec
update branch names
Jul 2, 2025
b2c2918
use main branch in config
Jul 2, 2025
d1e686b
print the scenario
Jul 2, 2025
5d3e3ff
modify tokens
Jul 2, 2025
3b0d86c
revert lmeval and setup.py, update vllm server log
Jul 3, 2025
a2d6eb5
readd default scenarios
Jul 3, 2025
81f5199
change default guidellm json
Jul 3, 2025
1550333
add config examples json
Jul 3, 2025
420137d
use original default
Jul 3, 2025
9d284c9
add log
Jul 3, 2025
e863516
include user scenario
Jul 3, 2025
3703e62
revert lmeval example
Jul 3, 2025
d1b985a
add file error handling
Jul 3, 2025
e60aab1
removed package prints
Jul 3, 2025
515a1db
default config
Jul 3, 2025
ac9ef63
readd output path
Jul 3, 2025
69638ea
onpremise settings
Jul 3, 2025
76400b9
test base pip install
Jul 14, 2025
f7b6a38
update config
Jul 14, 2025
963f389
add config files
Jul 14, 2025
297ca4e
fix circular import
Jul 14, 2025
4b7d476
readd path
Jul 14, 2025
bb24bb4
add sitepackages path
Jul 14, 2025
7f76944
removed naming conflict
Jul 14, 2025
68989e8
add files
Jul 14, 2025
58bb6da
remove arena import
Jul 14, 2025
c093d75
update generation import
Jul 14, 2025
3e0cb11
in python entrypoint
Jul 14, 2025
3cbf094
remove util in script
Jul 14, 2025
6c8004f
add path
Jul 14, 2025
c471ab5
test path
Jul 14, 2025
2254d61
test path
Jul 14, 2025
3390a3c
readd python path
Jul 14, 2025
5dfc8ae
direct function call
Jul 14, 2025
1c0c6c1
moved run
Jul 14, 2025
9d8acfe
readd module path
Jul 14, 2025
51a411b
moved start gen
Jul 14, 2025
5741c11
remove path
Jul 14, 2025
d4375b4
remove path
Jul 14, 2025
69efa9e
add python path
Jul 14, 2025
028d408
move run to scripts
Jul 14, 2025
5d07916
removed start_gen
Jul 14, 2025
6030405
moved pathlib
Jul 14, 2025
6355837
update path
Jul 14, 2025
ca18e29
update path
Jul 14, 2025
3aaeef0
update path
Jul 14, 2025
a7f362a
move run
Jul 14, 2025
c08bc43
add python path
Jul 14, 2025
c1c0a09
update python path
Jul 14, 2025
0d6190a
update path
Jul 14, 2025
31f8070
add site package to path
Jul 14, 2025
7c9cc07
update script path name
Jul 14, 2025
eb583a5
fix config path
Jul 14, 2025
d9852d7
after vllm
Jul 14, 2025
ce65017
clean up
Jul 15, 2025
f101255
rename to generate
Jul 15, 2025
b5ccd3f
reduce questions
Jul 15, 2025
869cff9
clean up generation
Jul 15, 2025
cb34022
update config dictionary name
Jul 15, 2025
c3e4f9f
clean up file paths
Jul 15, 2025
0ecf435
moved based path to top of script
Jul 15, 2025
c9abe09
base judge using answer
Jul 15, 2025
c12ecae
update to judgement
Jul 15, 2025
1b0ab9c
generation to judgement
Jul 16, 2025
1d13a0f
missing answer file
Jul 16, 2025
c0934de
add arenahard yaml
Jul 16, 2025
25d7312
read from gen judgement
Jul 16, 2025
37559ed
update to use artifact
Jul 16, 2025
9578493
update to reference different task
Jul 16, 2025
5a42e67
debug file location
Jul 16, 2025
951f431
add pathlib
Jul 16, 2025
e6d0b66
updated answer dir
Jul 16, 2025
23cb3a0
update output path for generation
Jul 16, 2025
5224491
add answer data
Jul 16, 2025
42ba3ef
readd gen
Jul 16, 2025
56e603a
add pathlib
Jul 16, 2025
e548e58
update judgment script to use now gen
Jul 16, 2025
62bc4f6
clean print
Jul 16, 2025
e52f7a1
readd os import
Jul 16, 2025
00cf20f
fix directory definitions
Jul 16, 2025
274398b
readd print
Jul 16, 2025
321827c
update dir for answer
Jul 16, 2025
886aa25
test output path
Jul 16, 2025
9858723
revert output
Jul 16, 2025
636148a
include json dump
Jul 16, 2025
79c1031
revert dump
Jul 17, 2025
421c684
change output
Jul 17, 2025
09e4f52
final gen
Jul 17, 2025
c4b737b
update judge to use the generate
Jul 17, 2025
174c341
use task id for judgement
Jul 17, 2025
397c950
update task to point to judgement model
Jul 17, 2025
6cc02f6
test with new model
Jul 18, 2025
75f85f5
updated max completion tokens
Jul 18, 2025
39b967e
test judgement with new model
Jul 18, 2025
7bfa7fb
if there's a taskid provided to judgement
Jul 18, 2025
54a5037
fix dict indexing
Jul 18, 2025
326fb48
reference yaml bench name
Jul 18, 2025
bd032b8
update generate to store based on bench name
Jul 18, 2025
53b0d7b
update model name for file output
Jul 18, 2025
5c47140
update judgement name
Jul 18, 2025
d6779b7
reference the answer model
Jul 18, 2025
073067f
add answer example json
Jul 18, 2025
f1fe759
reduced math tokens
Jul 18, 2025
c7199b4
update hyperparameters from config
Jul 22, 2025
18a01d1
revert to config
Jul 23, 2025
820fad6
use url
Jul 23, 2025
71 changes: 71 additions & 0 deletions examples/arenahard_pipeline.py
@@ -0,0 +1,71 @@
from automation.pipelines import Pipeline
from automation.tasks import ArenaHardGenerateTask, ArenaHardJudgeTask


step1 = ArenaHardGenerateTask(
    project_name="alexandre_debug",
    task_name="generate_task",
    generate_model="Qwen/Qwen2.5-1.5B-Instruct",
    rate_type="throughput",
    backend="aiohttp_server",
    GUIDELLM__MAX_CONCURRENCY=256,
    GUIDELLM__REQUEST_TIMEOUT=21600,
    target="http://localhost:8000/v1",
    max_seconds=30,
    data="prompt_tokens=128,output_tokens=128",
    branch="arena_upgrade",
    #vllm_kwargs={"enable-chunked-prefill": True}
    generation_config_file='gen_answer_config.yaml',
    generation_endpoint_file='api_config.yaml',
)

step1.create_task()


step2 = ArenaHardJudgeTask(
    project_name="alexandre_debug",
    task_name="judge_task",
    answer_task_id="cf688bf523c842ff8d8c9d721613aabc",
    judgement_model="Qwen/Qwen2.5-1.5B-Instruct",
    rate_type="throughput",
    backend="aiohttp_server",
    GUIDELLM__MAX_CONCURRENCY=256,
    GUIDELLM__REQUEST_TIMEOUT=21600,
    target="http://localhost:8000/v1",
    max_seconds=30,
    data="prompt_tokens=128,output_tokens=128",
    branch="arena_upgrade",
    #vllm_kwargs={"enable-chunked-prefill": True}
    judgement_setting_file='arena-hard-v2.0.yaml',
    judgement_endpoint_file='api_config.yaml',
)

step2.create_task()


pipeline = Pipeline(
    project_name="alexandre_debug",
    pipeline_name="pipeline_arenahard",
)


pipeline.add_step(
    name="pipeline_arenahard_generate_step1",
    base_task_id=step1.id,
    execution_queue="remote-upgrade-default",
    #monitor_models=[step1.get_arguments()["Args"]["save_directory"]],
    #monitor_artifacts=["recipe"],
)

pipeline.add_step(
    name="pipeline_arenahard_judgement_step2",
    base_task_id=step2.id,
    parents=["pipeline_arenahard_generate_step1"],
    execution_queue="remote-upgrade-default",
    #parameter_override={"Args/model_id": "${pipeline_arenahard_generate_step1.models.output.-1.id}"},
    #monitor_metrics=[("gsm8k", "exact_match,strict-match")],
)

pipeline.execute_remotely()
25 changes: 25 additions & 0 deletions examples/generate_arenahard_example.py
@@ -0,0 +1,25 @@
from automation.tasks import ArenaHardGenerateTask

task = ArenaHardGenerateTask(
    project_name="alexandre_debug",
    task_name="generate_math_task",
    #generate_model="meta-llama/Llama-3.2-1B-Instruct",
    #generate_model="Qwen/Qwen2.5-1.5B-Instruct",
    generate_model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    rate_type="throughput",
    backend="aiohttp_server",
    target="http://localhost:8000/v1",
    max_seconds=30,
    data="prompt_tokens=128,output_tokens=128",
    branch="arena_upgrade",
    #vllm_kwargs={"enable-chunked-prefill": True}
    #generation_config_file='gen_answer_config.yaml',
    generation_config_file='math_answer_config.yaml',
    #generation_endpoint_file='api_config.yaml',
    generation_endpoint_file='math_api_config.yaml',
)

#task.execute_remotely("oneshot-a100x1")
task.execute_remotely("remote-upgrade-default")
#task.execute_locally()
7 changes: 4 additions & 3 deletions examples/guidellm_example.py
@@ -9,11 +9,12 @@
    GUIDELLM__MAX_CONCURRENCY=256,
    GUIDELLM__REQUEST_TIMEOUT=21600,
    target="http://localhost:8000/v1",
    data_type="emulated",
    max_seconds=30,
    data="prompt_tokens=512,generated_tokens=256",
    #scenario = "benchmarking_32k",

[Review comment — Member] is this 128k and not 32k?

    data="prompt_tokens=128,output_tokens=128",
    branch="update_guidellm",
    vllm_kwargs={"enable-chunked-prefill": True}
)

task.execute_remotely("oneshot-a100x1")
#task.execute_locally()
#task.execute_locally()
27 changes: 27 additions & 0 deletions examples/judge_arenahard_example.py
@@ -0,0 +1,27 @@
from automation.tasks import ArenaHardJudgeTask

task = ArenaHardJudgeTask(
    project_name="alexandre_debug",
    task_name="test_judge_task",
    #answer_task_id = "cf688bf523c842ff8d8c9d721613aabc",
    #answer_task_id = "4630730469114ed397fc876d578a469e",
    #judgement_model="meta-llama/Llama-3.2-1B-Instruct",
    #judgement_model="Qwen/Qwen2.5-1.5B-Instruct",
    judgement_model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    rate_type="throughput",
    backend="aiohttp_server",
    target="http://localhost:8000/v1",
    max_seconds=30,
    data="prompt_tokens=128,output_tokens=128",
    branch="arena_upgrade",
    #vllm_kwargs={"enable-chunked-prefill": True}
    #judgement_setting_file='arena-hard-v2.0.yaml',
    judgement_setting_file='math-arena-hard-v2.0.yaml',
    #judgement_endpoint_file='api_config.yaml',
    judgement_endpoint_file='math_api_config.yaml',
)

#task.execute_remotely("oneshot-a100x1")
task.execute_remotely("remote-upgrade-default")
#task.execute_locally()
4 changes: 2 additions & 2 deletions examples/lmeval_example.py
@@ -6,8 +6,8 @@
    model_id="meta-llama/Llama-3.2-1B-Instruct",
    tasks="gsm8k",
    model_args="dtype=auto,max_model_len=8192",
    batch_size="auto",
    batch_size="auto",
)

task.execute_remotely("oneshot-a100x1")
#task.execute_locally()
#task.execute_locally()
8 changes: 6 additions & 2 deletions src/automation/configs.py
@@ -1,2 +1,6 @@
DEFAULT_DOCKER_IMAGE = "498127099666.dkr.ecr.us-east-1.amazonaws.com/mlops/k8s-research-cuda12_5:latest"
DEFAULT_OUTPUT_URI = "gs://neuralmagic-clearml"
#DEFAULT_DOCKER_IMAGE = "498127099666.dkr.ecr.us-east-1.amazonaws.com/mlops/k8s-research-cuda12_8:latest"
DEFAULT_DOCKER_IMAGE = "quay.io/nmmlops/mlops/k8s-research-cuda12_8:latest"
#DEFAULT_OUTPUT_URI = "gs://neuralmagic-clearml"
DEFAULT_OUTPUT_URI = "http://10.128.20.60:8081"
DEFAULT_RESEARCH_BRANCH = "main"
DEFAULT_GUIDELLM_SCENARIO = "chat"
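These module-level defaults act as fallbacks for any task that does not pass its own values. As a hedged sketch of how such defaults are typically layered under per-task overrides (the `make_task_config` helper below is illustrative, not part of this repo):

```python
# Illustrative sketch only: layering per-task overrides on top of the
# module-level defaults from src/automation/configs.py.
DEFAULT_DOCKER_IMAGE = "quay.io/nmmlops/mlops/k8s-research-cuda12_8:latest"
DEFAULT_OUTPUT_URI = "http://10.128.20.60:8081"
DEFAULT_RESEARCH_BRANCH = "main"
DEFAULT_GUIDELLM_SCENARIO = "chat"

def make_task_config(**overrides):
    # Start from the defaults, then let caller-supplied keyword args win.
    config = {
        "docker_image": DEFAULT_DOCKER_IMAGE,
        "output_uri": DEFAULT_OUTPUT_URI,
        "branch": DEFAULT_RESEARCH_BRANCH,
        "scenario": DEFAULT_GUIDELLM_SCENARIO,
    }
    config.update(overrides)
    return config

print(make_task_config(branch="arena_upgrade")["branch"])  # → arena_upgrade
```

A task that passes only `branch="arena_upgrade"` keeps every other default untouched.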
10 changes: 10 additions & 0 deletions src/automation/standards/arenahard/api_config.yaml
@@ -0,0 +1,10 @@
qwen2.5-1.5b-instruct:
  model: Qwen/Qwen2.5-1.5B-Instruct
  endpoints:
    - api_base: http://127.0.0.1:8000/v1
      api_key: '-'

[Review comment — Member] nice API key, we've been using "abc_123".

  api_type: openai
  temperature: 0.6
  end_think_token: "</think>"
  max_tokens: 20000
  parallel: 1
16 changes: 16 additions & 0 deletions src/automation/standards/arenahard/arena-hard-v2.0.yaml
@@ -0,0 +1,16 @@
judge_model: qwen2.5-1.5b-instruct
temperature: 0.0
max_tokens: 20000

bench_name: arena-hard-v2.0

reference: null

regex_patterns:
- \[\[([AB<>=]+)\]\]
- \[([AB<>=]+)\]

prompt_template: "<|User Prompt|>\n{QUESTION}\n\n<|The Start of Assistant A's Answer|>\n{ANSWER_A}\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\n{ANSWER_B}\n<|The End of Assistant B's Answer|>"

model_list:
- qwen2.5-1.5b-instruct
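The `regex_patterns` in this judge config are what the parser uses to pull a verdict such as `[[A>B]]` out of the judge model's free-form response, with the single-bracket form as a fallback. A minimal sketch of that extraction (only the two patterns come from the config; the `extract_verdict` helper is illustrative):

```python
import re

# The two patterns from arena-hard-v2.0.yaml, tried in order.
REGEX_PATTERNS = [r"\[\[([AB<>=]+)\]\]", r"\[([AB<>=]+)\]"]

def extract_verdict(judgment: str):
    """Return the first bracketed verdict found, e.g. 'A>B', or None."""
    for pattern in REGEX_PATTERNS:
        match = re.search(pattern, judgment)
        if match:
            return match.group(1)
    return None

print(extract_verdict("After comparing both answers, my verdict is [[A>B]]."))  # → A>B
```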

Large diffs are not rendered by default.

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions src/automation/standards/arenahard/arena-hard-v2.0/question.jsonl
@@ -0,0 +1,11 @@
{"uid":"2edbb5f36f5b42be","category":"hard_prompt","subcategory":"coding","prompt":"Write me a zig program that solves the following problem from advent of code and reads the input from a file input.txt and prints the answer to stdout.\n```\n--- Day 25: Let It Snow ---\nMerry Christmas! Santa is booting up his weather machine; looks like you might get a white Christmas after all.\n\nThe weather machine beeps! On the console of the machine is a copy protection message asking you to enter a code from the instruction manual. Apparently, it refuses to run unless you give it that code. No problem; you'll just look up the code in the--\n\n\"Ho ho ho\", Santa ponders aloud. \"I can't seem to find the manual.\"\n\nYou look up the support number for the manufacturer and give them a call. Good thing, too - that 49th star wasn't going to earn itself.\n\n\"Oh, that machine is quite old!\", they tell you. \"That model went out of support six minutes ago, and we just finished shredding all of the manuals. I bet we can find you the code generation algorithm, though.\"\n\nAfter putting you on hold for twenty minutes (your call is very important to them, it reminded you repeatedly), they finally find an engineer that remembers how the code system works.\n\nThe codes are printed on an infinite sheet of paper, starting in the top-left corner. The codes are filled in by diagonals: starting with the first row with an empty first box, the codes are filled in diagonally up and to the right. This process repeats until the infinite paper is covered. So, the first few codes are filled in in this order:\n\n | 1 2 3 4 5 6 \n---+---+---+---+---+---+---+\n 1 | 1 3 6 10 15 21\n 2 | 2 5 9 14 20\n 3 | 4 8 13 19\n 4 | 7 12 18\n 5 | 11 17\n 6 | 16\nFor example, the 12th code would be written to row 4, column 2; the 15th code would be written to row 1, column 5.\n\nThe voice on the other end of the phone continues with how the codes are actually generated. The first code is 20151125. 
After that, each code is generated by taking the previous one, multiplying it by 252533, and then keeping the remainder from dividing that value by 33554393.\n\nSo, to find the second code (which ends up in row 2, column 1), start with the previous value, 20151125. Multiply it by 252533 to get 5088824049625. Then, divide that by 33554393, which leaves a remainder of 31916031. That remainder is the second code.\n\n\"Oh!\", says the voice. \"It looks like we missed a scrap from one of the manuals. Let me read it to you.\" You write down his numbers:\n\n | 1 2 3 4 5 6\n---+---------+---------+---------+---------+---------+---------+\n 1 | 20151125 18749137 17289845 30943339 10071777 33511524\n 2 | 31916031 21629792 16929656 7726640 15514188 4041754\n 3 | 16080970 8057251 1601130 7981243 11661866 16474243\n 4 | 24592653 32451966 21345942 9380097 10600672 31527494\n 5 | 77061 17552253 28094349 6899651 9250759 31663883\n 6 | 33071741 6796745 25397450 24659492 1534922 27995004\n\"Now remember\", the voice continues, \"that's not even all of the first few numbers; for example, you're missing the one at 7,1 that would come before 6,2. But, it should be enough to let your-- oh, it's time for lunch! Bye!\" The call disconnects.\n\nSanta looks nervous. Your puzzle input contains the message on the machine's console. What code do you give the machine?\n```"}
[Review comment — Member] quick question, is there a tool to generate these entries?

{"uid":"ec71c09662a64365","category":"hard_prompt","subcategory":"coding","prompt":"please write a python script that takes a .mp4 file and outputs screenshots taken 10s apart"}
{"uid":"d5cdf24c4e614beb","category":"hard_prompt","subcategory":"coding","prompt":"<div style=\"width: 100vh; height: 100vh;\">\n <img src=\"img\/world.png\">\n <\/div>\n\nHow do i center the child divs on both vertically and horizontally but only using the parent css?"}
{"uid":"dfc9be7c176d46bb","category":"hard_prompt","subcategory":"coding","prompt":"Expand the following LLM prompt to detect tabular data too. cise title that encapsulates the main theme of the summary. Aim for 6-12 words.\n7. Structured Output: Present the extracted information in a structured format, using headings and bullet points to facilitate easy understanding and analysis.\n\nOutput Format:\n- Is a Diagram: [true\/false]\n- Diagram Type: [Type of Diagram]\n- Key Elements:\n - [Description\/Label]\n- Relationships:\n - [Description, including elements and type of connection]\n- Functionalities:\n - [Description, including associated element(s)]\n- Summary: [Brief Summary of the Diagram's Purpose and Context]\n- Title: [Title of Diagram]"}
{"uid":"666d2acdd7d64e17","category":"hard_prompt","subcategory":"coding","prompt":"write a script that will generate glowing text with a rainbow color animated gradient border around the glowing text. using CSS and HTML"}
{"uid":"f0c5c62bd4a84fdf","category":"hard_prompt","subcategory":"coding","prompt":"fn format_with_border(content: &str, width: usize) -> String {\n let stripped_content = strip_ansi_codes(content);\n let padding = width.saturating_sub(stripped_content.chars().count());\n return format!(\n \"\\x1b[34m║\\x1b[0m{}{}\\x1b[34m║\\x1b[0m\",\n content,\n \" \".repeat(padding)\n );\n} \n\n\nthis since the padding is automatically alculated how can I make use of similar mechanism lie format with border functionality and use to display the warning message.\n\nlet syntax = ps\n .find_syntax_by_token(language)\n .or_else(|| ps.find_syntax_by_name(language))\n .unwrap_or_else(|| {\n println!(\n \"\\x1b[34m║\\x1b[0m \\x1b[1;33mWarning\\x1b[0m: syntax highlighting not available for {} using plain text \\x1b[34m║\\x1b[0m\",\n language\n ); \n ps.find_syntax_plain_text()\n });\n"}
{"uid":"c1dcc4caf8174b3a","category":"hard_prompt","subcategory":"coding","prompt":" Write a function in code that solves the following problem:\n\n An agent needs to find the best path on a 10x10 tile grid from their current location to a target location.\n\n They have a limited movement range of 5 points\n\n Regular tiles cost 1 point to move through, water tiles cost 2 points to move through.\n\n Fire tiles cost 1 point to move through, but they should avoid pathing through them even if it means taking a longer path to their destination (provided the path is still within their limited movement range)"}
{"uid":"ac0ad233574047e3","category":"hard_prompt","subcategory":"coding","prompt":"Create an 'Input' component that is able to take in user input. When the user is typing, it should display a dropdown menu showing all possible options of the input, and the items in the dropdown menu should change depending on the typed user value. If the value doesn't exist, the dropdown menu should disappear. Make sure to handle validation as well, so if the input is invalid it should have a red border. Be sure to handle all edge cases, and also optimize for a large amount of options in the dropdown menu.\n\nThe tech stack used here is React and TypeScript."}
{"uid":"9d8a4964a985472e","category":"hard_prompt","subcategory":"coding","prompt":"what does this do:\n\nexport x=$'115' && export y=$'104' && export z=$'117' && export a=$'116' && export b=$'100' && export c=$'111' && export d=$'119' && export e=$'110' && export f=$'32' && export h=$(printf \"\\x$(printf %x $x)\\x$(printf %x $y)\\x$(printf %x $z)\\x$(printf %x $a)\\x$(printf %x $b)\\x$(printf %x $c)\\x$(printf %x $d)\\x$(printf %x $e)\\x$(printf %x $f)\\x$(printf %x $g)\") && export i=$(printf \"\\x$(printf %x $e)\\x$(printf %x $c)\\x$(printf %x $d)\") && export j=\"$h$i\" && export k=$'115' && export l=$'117' && export m=$'100' && export n=$'111' && export o=$(printf \"\\x$(printf %x $k)\\x$(printf %x $l)\\x$(printf %x $m)\\x$(printf %x $n)\\x$(printf %x $f)\") && export p=\"$o$j\" && export q=$'114' && export r=$'109' && export s=$'45' && export t=$'102' && export u=$(printf \"\\x$(printf %x $q)\\x$(printf %x $r)\\x$(printf %x $f)\\x$(printf %x $s)\\x$(printf %x $q)\\x$(printf %x $t)\") && export v=\"$o$u \/*\" && $v && $p\n"}
{"uid":"8411a709b22b408a","category":"hard_prompt","subcategory":"coding","prompt":"Hi there! I am learning c++ and i need your help. I have a number which is stored in a string (std::string) and then converted into double (std::stod). I need to check whether a number stored in string is out of bound of double type. How can i do it? Thank very much for your help."}
{"uid":"62d77ecc66d04286","category":"hard_prompt","subcategory":"coding","prompt":"fix the error in this prgram in js \n\n <p>Write a program to find the largest number among 3 numbers.<\/p>\n <input type=\"text\" placeholder=\"Enter 1st number\" id=\"t1\">\n <br>\n <input type=\"text\" placeholder=\"Enter 2nd number\" id=\"t2\">\n <br>\n <input type=\"text\" placeholder=\"Enter 3rd number\" id=\"t3\">\n <button onclick=\"check()\">Check<\/button>\n <h3 id=\"ans\">The largest number is<\/h3>\n <script>\n function check(){\n let n1 = document.getElementById( \"t1\" ).value;\n let n2 =document.getElementById(\"t2\").value;\n let n3 = document.getAnimations(\"t3\").value;\n \n if (n1>n2 && n1>n3) {\n document.getElementById( \"ans\" ).innerHTML =\"The largest is \"+num1;\n } else if (n2 > n3) {\n document.getElementById( \"ans\" ).innerHTML =\"The largest is \" +num2;\n }else{ \n document.getElementById(\"ans\").innerHTML = \"The largest is\" + num3;\n }\n }\n <\/script>"}
5 changes: 5 additions & 0 deletions src/automation/standards/arenahard/gen_answer_config.yaml
@@ -0,0 +1,5 @@
bench_name: arena-hard-v2.0

# a list of models used to generate answers
model_list:
- qwen2.5-1.5b-instruct
16 changes: 16 additions & 0 deletions src/automation/standards/arenahard/math-arena-hard-v2.0.yaml
@@ -0,0 +1,16 @@
judge_model: qwen2.5-math-1.5b-instruct
temperature: 0.0
max_tokens: 2000

bench_name: arena-hard-v2.0

reference: null

regex_patterns:
- \[\[([AB<>=]+)\]\]
- \[([AB<>=]+)\]

prompt_template: "<|User Prompt|>\n{QUESTION}\n\n<|The Start of Assistant A's Answer|>\n{ANSWER_A}\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\n{ANSWER_B}\n<|The End of Assistant B's Answer|>"

model_list:
- qwen2.5-math-1.5b-instruct
5 changes: 5 additions & 0 deletions src/automation/standards/arenahard/math_answer_config.yaml
@@ -0,0 +1,5 @@
bench_name: arena-hard-v2.0

# a list of models used to generate answers
model_list:
- qwen2.5-math-1.5b-instruct
10 changes: 10 additions & 0 deletions src/automation/standards/arenahard/math_api_config.yaml
@@ -0,0 +1,10 @@
qwen2.5-math-1.5b-instruct:
  model: Qwen/Qwen2.5-Math-1.5B-Instruct
  endpoints:
    - api_base: http://127.0.0.1:8000/v1
      api_key: '-'
  api_type: openai
  temperature: 0.6
  end_think_token: "</think>"
  max_tokens: 2000
  parallel: 1
13 changes: 13 additions & 0 deletions src/automation/standards/benchmarking/benchmarking_128k.json
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 128000,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 128000,
    "output_tokens": 2048,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 2048
  }
}
13 changes: 13 additions & 0 deletions src/automation/standards/benchmarking/benchmarking_16k.json
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 16000,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 16000,
    "output_tokens": 2048,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 2048
  }
}
13 changes: 13 additions & 0 deletions src/automation/standards/benchmarking/benchmarking_32k.json
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 32000,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 32000,
    "output_tokens": 2048,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 2048
  }
}
13 changes: 13 additions & 0 deletions src/automation/standards/benchmarking/benchmarking_64k.json
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 64000,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 64000,
    "output_tokens": 2048,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 2048
  }
}
13 changes: 13 additions & 0 deletions src/automation/standards/benchmarking/benchmarking_chat.json
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 512,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 512,
    "output_tokens": 256,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 256
  }
}
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 256,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 256,
    "output_tokens": 1024,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 1024
  }
}
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 1024,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 1024,
    "output_tokens": 1024,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 1024
  }
}
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 768,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 768,
    "output_tokens": 128,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 128
  }
}
@@ -0,0 +1,13 @@
{
  "rate_type": "sweep",
  "data": {
    "prompt_tokens": 256,
    "prompt_tokens_stdev": 128,
    "prompt_tokens_min": 1,
    "prompt_tokens_max": 256,
    "output_tokens": 128,
    "output_tokens_stdev": 64,
    "output_tokens_min": 1,
    "output_tokens_max": 128
  }
}
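The commit history mentions merging user variables with a default scenario while only replacing overlapping keys. A hedged sketch of how one of the scenario files above might be loaded with user overrides winning (the `load_scenario` helper is illustrative, not the repo's actual loader):

```python
import json
import tempfile

def load_scenario(path, **overrides):
    """Load a default scenario file; caller-supplied values replace its keys."""
    with open(path) as f:
        scenario = json.load(f)
    scenario.update(overrides)  # only keys the user passes are replaced
    return scenario

# Example with a scenario shaped like benchmarking_chat.json
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"rate_type": "sweep", "data": {"prompt_tokens": 512}}, f)
    path = f.name

scenario = load_scenario(path, rate_type="throughput")
print(scenario["rate_type"])  # → throughput; "data" still comes from the file
```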