2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -34,6 +34,6 @@ Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysi

Changed the tool-calling syntax to be compatible with the [OpenAI](https://platform.openai.com/docs/guides/function-calling) and [Anthropic](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use) function-calling formats.

* Switched tools (view, rewrite, pdb, listdir, eval) to a function-call API with explicit arguments and environment injection.
* Switched tools (view, edit, pdb, listdir, eval) to a function-call API with explicit arguments and environment injection.
* Overhauled LLM interfaces to define, parse, and format function calls, and updated agents to consume `ToolCall` objects.
* Removed the old conversational-prompt flag from configs.
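
For illustration, a tool call in the OpenAI-compatible function-calling format might look like the minimal sketch below. The argument name `command` and all values are assumptions made for this example, not the documented schema.

```python
# Minimal sketch of an OpenAI-style function call for the `pdb` tool.
# The argument name "command" and all values are illustrative only.
tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "pdb",
        "arguments": '{"command": "b main.py:42"}',
    },
}
```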
10 changes: 5 additions & 5 deletions README.md
@@ -61,7 +61,7 @@ debug_gym
└── llms
```

`debug_gym.gym` is a simulation environment. Given a code repository, an agent can iteratively interact with a set of tools, such as `pdb`, that are designed to investigate the code. Once it has gathered enough information, the agent can propose a patch that rewrites certain lines of the code. The terminal will subsequently execute the new code against a set of test cases.
`debug_gym.gym` is a simulation environment. Given a code repository, an agent can iteratively interact with a set of tools, such as `pdb`, that are designed to investigate the code. Once it has gathered enough information, the agent can propose a patch that edits certain lines of the code. The terminal will subsequently execute the new code against a set of test cases.

`debug_gym.agents` are LLM-based debugging agents that use `debug_gym.gym` to interact with code repositories, seeking the information needed to fix potential bugs. At each interaction step, the agent receives a text observation describing the environment and tool states and is expected to generate a command; the environment then returns a new text observation describing the state change caused by that command.

@@ -85,7 +85,7 @@ One of the core designs of `debug-gym` is the notion of tools. Users can dynamic
| `eval` | It runs the current code repository using the provided entrypoint (e.g., pytest), and returns the terminal's output (e.g., error message). |
| `pdb` | Interactive debugger wrapping the [Python pdb tool](https://docs.python.org/3/library/pdb.html). In addition, users can choose to maintain a set of persistent breakpoints (as in some programming IDEs), which are not reset after every eval. With this feature, a new pdb debugging session is activated automatically, with all the breakpoints restored. Note that such breakpoints can be cleared by pdb commands such as `cl`. |
| `grep` | Search for patterns in files within the repository. Supports both literal string matching and regular expressions. Can search in specific files, directories, or the entire repository. Useful for finding code patterns, function definitions, variable usage, or identifying files containing specific text. |
| `rewrite` | It can be used to rewrite a certain piece of code to fix the bug. The inputs of this tool call include the file path, the start and end line numbers, and the new code. |
| `edit` | It can be used to edit a certain piece of code to fix the bug. The inputs of this tool call include the file path, the start and end line numbers, and the new code. |
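
As a concrete illustration, the arguments of an `edit` call might look like the sketch below; the key names follow the description above (file path, start and end line numbers, new code) but are assumptions rather than the tool's confirmed schema.

```python
# Hypothetical argument set for an `edit` tool call; key names are illustrative.
edit_args = {
    "path": "src/example.py",                               # file to modify
    "start": 42,                                            # first line to replace
    "end": 45,                                              # last line to replace
    "new_code": "    return total / max(len(items), 1)\n",  # replacement code
}
```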

Upon importing a tool, its action space and observation space will be automatically merged into `debug-gym`'s action space and observation space; its instruction will also be merged into the overall instruction provided to the agent (e.g., as system prompt).
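
As a rough sketch of how an agent can query the merged toolset, the `has_tool`/`get_tool` accessors used in `froggy_agent.py` (see the diff below) could be wrapped as follows; the list of tool names is illustrative.

```python
# Illustrative only: report which of a few known tools the environment exposes.
def list_available_tools(env) -> list[str]:
    names = []
    for name in ("view", "edit", "pdb", "listdir", "eval", "grep"):
        if env.has_tool(name):  # accessor as used in froggy_agent.py
            names.append(name)
    return names
```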

@@ -99,9 +99,9 @@ We provide the below LLM-based agents, they all have minimal design and serve th

| Agent name | Available Tools | Description |
| :-: | :-: | :----- |
| `froggy_agent` | `grep`, `pdb`, `view`, `rewrite`, `eval` (configurable) | Primary debugging agent. Adjust prompts and tool lists in YAML to mimic rewrite-only, grep-heavy, or other workflows. |
| `froggy_agent` | `grep`, `pdb`, `view`, `edit`, `eval` (configurable) | Primary debugging agent. Adjust prompts and tool lists in YAML to mimic edit-only, grep-heavy, or other workflows. |
| `solution_agent` | `pdb`, `eval` | An oracle agent that applies a gold patch (only works with `swebench` and `swesmith` benchmarks for now). The agent checks that tests are failing before applying the patch, and passing after. It also checks that the `pdb` tool can be used as expected. |
| `swe_agent` | `bash`, `rewrite`, `submit` | Baseline agent tailored for the SWE-bench setting that executes bash commands in addition to rewrites. |
| `swe_agent` | `bash`, `edit`, `submit` | Baseline agent tailored for the SWE-bench setting that executes bash commands in addition to edits. |

---

@@ -115,7 +115,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
| `swebench`| [https://github.com/princeton-nlp/SWE-bench](https://github.com/princeton-nlp/SWE-bench) |
| `swesmith`| [https://github.com/SWE-bench/SWE-smith](https://github.com/SWE-bench/SWE-smith) |
| `r2egym`| [https://github.com/R2E-Gym/R2E-Gym](https://github.com/R2E-Gym/R2E-Gym) |
| `mini_nightmare` | A set of 10 hand-crafted minimal buggy code snippets that rewrite-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |
| `mini_nightmare` | A set of 10 hand-crafted minimal buggy code snippets that edit-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

> [!NOTE]
> Since debug-gym focuses on debugging tasks with the use of a debugger, we provide a customized version of `swebench`, called `swebench-debug`, where each problem's codebase already has the gold test patch applied. This allows us to better simulate real-world debugging scenarios, where the buggy code is expected to have failing tests and we can set the debugger's entrypoint accordingly. To use `swebench-debug`, set `benchmark: "swebench-debug"` in your config file (see [Running Baselines](#3-running-baselines)).
8 changes: 4 additions & 4 deletions analysis/json_log_viewer/templates/index.html
@@ -902,12 +902,12 @@ <h3><i class="fas fa-search-plus"></i> Step ${stepId}</h3>
</div>

<div class="detail-section">
<div class="detail-header" onclick="toggleSection('rewrite-section')">
<span><i class="fas fa-edit"></i> Rewrite Consumed</span>
<div class="detail-header" onclick="toggleSection('edit-section')">
<span><i class="fas fa-edit"></i> Edit Consumed</span>
<i class="fas fa-chevron-down expand-icon"></i>
</div>
<div class="detail-content" id="rewrite-section">
<div class="json-viewer">${formatEmptyValue(data.rewrite_consumed)}</div>
<div class="detail-content" id="edit-section">
<div class="json-viewer">${formatEmptyValue(data.edit_consumed)}</div>
</div>
</div>

8 changes: 4 additions & 4 deletions analysis/tool_use/episode_length.py
@@ -27,7 +27,7 @@ def analyze_froggy_results(model_name):
Analyzes froggy.jsonl files for a given model to extract success rates and episode lengths.

Args:
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/rewrite_4o_0')
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/edit_4o_0')

Returns:
pd.DataFrame: DataFrame containing results by task
@@ -70,7 +70,7 @@ def analyze_froggy_results_with_seeds(base_model_name, seeds=[0, 1, 2]):
Analyzes and averages results across different seeds for a base model name

Args:
base_model_name (str): Base path without seed (e.g. '../exps/swe-bench/rewrite_o3-mini')
base_model_name (str): Base path without seed (e.g. '../exps/swe-bench/edit_o3-mini')
seeds (list): List of seeds to average over

Returns:
@@ -94,7 +94,7 @@ def analyze_froggy_results_with_seeds(base_model_name, seeds=[0, 1, 2]):

def plot_episode_length(df_dict, model_paths, figsize=(12, 7)):
"""
Creates a grouped bar chart showing episode lengths for multiple models, grouped by agent types (rewrite, debug), each bar is averaged over seeds with error bars.
Creates a grouped bar chart showing episode lengths for multiple models, grouped by agent types (edit, debug), each bar is averaged over seeds with error bars.
Args:
df_dict (dict): Dictionary mapping model names to their DataFrames with averaged results
model_paths (list): List of model paths for custom x-tick labels
@@ -108,7 +108,7 @@ def plot_episode_length(df_dict, model_paths, figsize=(12, 7)):
# ignore the data points where the agent failed
if ONLY_SUCCESS:
df = df[df["success"]]
for agent in ["rewrite", "debug"]:
for agent in ["edit", "debug"]:
if agent not in model_name:
continue
episode_length_mean = df["episode_length"].mean()
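
A hedged usage sketch for the helpers above (experiment paths and seeds are placeholders; the signatures follow the docstrings in this file):

```python
# Illustrative driver: average results over seeds, then plot episode lengths.
model_paths = [
    "../exps/swe-bench/edit_o3-mini",   # placeholder path
    "../exps/swe-bench/debug_o3-mini",  # placeholder path
]
df_dict = {
    path: analyze_froggy_results_with_seeds(path, seeds=[0, 1, 2])
    for path in model_paths
}
plot_episode_length(df_dict, model_paths, figsize=(12, 7))
```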
12 changes: 6 additions & 6 deletions analysis/tool_use/incorrect_arguments.py
@@ -24,10 +24,10 @@

def analyze_froggy_results(model_name):
"""
Analyzes froggy.jsonl files for a given model to extract success rates and rewrite counts.
Analyzes froggy.jsonl files for a given model to extract success rates and edit counts.

Args:
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/rewrite_4o_0')
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/edit_4o_0')

Returns:
pd.DataFrame: DataFrame containing results by task
@@ -46,7 +46,7 @@ def analyze_froggy_results(model_name):
# Extract success status
success = data.get("success", False)

# Count rewrite commands
# Count incorrect command usage
total_incorrect_arguments = 0
episode_length = 0
for step in data.get("log", []):
@@ -77,7 +77,7 @@ def analyze_froggy_results_with_seeds(base_model_name, seeds=[0, 1, 2]):
Analyzes and averages results across different seeds for a base model name

Args:
base_model_name (str): Base path without seed (e.g. '../exps/swe-bench/rewrite_o3-mini')
base_model_name (str): Base path without seed (e.g. '../exps/swe-bench/edit_o3-mini')
seeds (list): List of seeds to average over

Returns:
@@ -102,7 +102,7 @@ def analyze_froggy_results_with_seeds(base_model_name, seeds=[0, 1, 2]):

def plot_incorrect_arguments(df_dict, model_paths, figsize=(12, 7)):
"""
Creates a grouped bar chart showing incorrect-argument counts for multiple models, grouped by agent types (rewrite, pdb, seq); each bar is averaged over seeds (0, 1, 2) with error bars
Creates a grouped bar chart showing incorrect-argument counts for multiple models, grouped by agent types (edit, pdb, seq); each bar is averaged over seeds (0, 1, 2) with error bars
Args:
df_dict (dict): Dictionary mapping model names to their DataFrames with averaged results
model_paths (list): List of model paths for custom x-tick labels
@@ -116,7 +116,7 @@ def plot_incorrect_arguments(df_dict, model_paths, figsize=(12, 7)):
# ignore the data points where the agent failed
if ONLY_SUCCESS:
df = df[df["success"]]
for agent in ["rewrite", "debug"]:
for agent in ["edit", "debug"]:
if agent not in model_name:
continue
incorrect_arguments_mean = df["incorrect_arguments"].mean()
8 changes: 4 additions & 4 deletions analysis/tool_use/response_tokens.py
@@ -27,7 +27,7 @@ def analyze_froggy_results(model_name):
Analyzes froggy.jsonl files for a given model to extract success rates and token usage.

Args:
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/rewrite_4o_0')
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/edit_4o_0')

Returns:
pd.DataFrame: DataFrame containing results by task
@@ -93,7 +93,7 @@ def analyze_froggy_results_with_seeds(base_model_name, seeds=[0, 1, 2]):
Analyzes and averages results across different seeds for a base model name

Args:
base_model_name (str): Base path without seed (e.g. '../exps/swe-bench/rewrite_o3-mini')
base_model_name (str): Base path without seed (e.g. '../exps/swe-bench/edit_o3-mini')
seeds (list): List of seeds to average over

Returns:
@@ -117,7 +117,7 @@ def plot_episode_response_tokens(df_dict, model_paths, figsize=(12, 7)):

def plot_episode_response_tokens(df_dict, model_paths, figsize=(12, 7)):
"""
Creates a grouped bar chart showing response tokens per step for multiple models, grouped by agent types (rewrite, debug), each bar is averaged over seeds with error bars.
Creates a grouped bar chart showing response tokens per step for multiple models, grouped by agent types (edit, debug), each bar is averaged over seeds with error bars.
Args:
df_dict (dict): Dictionary mapping model names to their DataFrames with averaged results
model_paths (list): List of model paths for custom x-tick labels
@@ -130,7 +130,7 @@ def plot_episode_response_tokens(df_dict, model_paths, figsize=(12, 7)):
# ignore the data points where the agent failed
if ONLY_SUCCESS:
df = df[df["success"]]
for agent in ["rewrite", "debug"]:
for agent in ["edit", "debug"]:
if agent not in model_name:
continue
response_tokens_mean = df["response_tokens"].mean()
20 changes: 10 additions & 10 deletions analysis/tool_use/tool_use_categories.py
@@ -23,9 +23,9 @@

def analyze_froggy_results(model_name):
"""
Analyzes froggy.jsonl files for a given model to extract success rates and rewrite counts.
Analyzes froggy.jsonl files for a given model to extract success rates and edit counts.
Args:
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/rewrite_4o_0')
model_name (str): Path to the model directory (e.g. 'exps/swe-bench/edit_4o_0')

Returns:
pd.DataFrame: DataFrame containing results by task
@@ -45,14 +45,14 @@ def analyze_froggy_results(model_name):
# Extract success status
success = data.get("success", False)

# Count rewrite commands
# Count tool usage
episode_length = 0

tool_counter = {
"view": 0,
"listdir": 0,
"pdb": 0,
"rewrite": 0,
"edit": 0,
"eval": 0,
"other": 0,
}
@@ -85,7 +85,7 @@ def analyze_froggy_results_with_seeds(base_model_name, seeds=[0, 1, 2]):
Analyzes and averages results across different seeds for a base model name

Args:
base_model_name (str): Base path without seed (e.g. '../exps/may22/rewrite_o3-mini')
base_model_name (str): Base path without seed (e.g. '../exps/may22/edit_o3-mini')
seeds (list): List of seeds to average over

Returns:
@@ -125,7 +125,7 @@ def plot_tool_use_categories(df_dict, model_paths, figsize=(12, 7)):
"view": 0,
"listdir": 0,
"pdb": 0,
"rewrite": 0,
"edit": 0,
"eval": 0,
"other": 0,
}
@@ -147,7 +147,7 @@ def plot_tool_use_categories(df_dict, model_paths, figsize=(12, 7)):
tool_category_per_model["view"],
tool_category_per_model["listdir"],
tool_category_per_model["pdb"],
tool_category_per_model["rewrite"],
tool_category_per_model["edit"],
tool_category_per_model["eval"],
tool_category_per_model["other"],
]
@@ -156,15 +156,15 @@ def plot_tool_use_categories(df_dict, model_paths, figsize=(12, 7)):
# convert to DataFrame
all_data = pd.DataFrame(
all_data,
columns=["name", "model", "view", "listdir", "pdb", "rewrite", "eval", "other"],
columns=["name", "model", "view", "listdir", "pdb", "edit", "eval", "other"],
)
# nice palette
palette = sns.color_palette("Set2")
# set color
sns.set_palette(palette)
# stacked bar plot showing the distribution of PDB command categories for each model
all_data.set_index("name")[
["view", "listdir", "pdb", "rewrite", "eval", "other"]
["view", "listdir", "pdb", "edit", "eval", "other"]
].plot(kind="bar", stacked=True, figsize=figsize)
plt.xlabel("Backbone LLM")
plt.ylabel("Percentage")
@@ -173,7 +173,7 @@ def plot_tool_use_categories(df_dict, model_paths, figsize=(12, 7)):
plt.xticks(
np.arange(len(all_data)),
[
item.split("/")[-1].replace("rewrite_", "rw ").replace("debug_", "dbg ")
item.split("/")[-1].replace("edit_", "ed ").replace("debug_", "dbg ")
for item in model_paths
],
)
10 changes: 1 addition & 9 deletions debug_gym/agents/froggy_agent.py
@@ -11,7 +11,6 @@

@dataclass
class FroggyAgentArgs(AgentArgs):
max_rewrite_steps: int = -1
show_directory_tree: int = 0
show_current_breakpoints: bool = False

@@ -22,13 +22,6 @@ class FroggyAgent(BaseAgent):
args_class = FroggyAgentArgs
system_prompt: str = "{{ agent._default_system_prompt(info) }}"

def should_stop(self, step: int, info: EnvInfo):
should_stop, reason = super().should_stop(step, info)
if info.rewrite_counter > self.args.max_rewrite_steps:
should_stop = True
reason = "max_rewrite_steps reached"
return should_stop, reason

def shortcut_features(self):
features = []
if self.env.has_tool("pdb"):
@@ -39,7 +31,7 @@ def shortcut_features(self):
if self.env.get_tool("pdb").persistent_breakpoints:
features.append(
"The environment will automatically restore existing breakpoints "
"when a new PDB session is started (e.g., after a rewrite)."
"when a new PDB session is started (e.g., after an edit)."
)
if self.env.get_tool("pdb").auto_list:
features.append(
2 changes: 0 additions & 2 deletions debug_gym/agents/history_tracker.py
@@ -65,7 +65,6 @@ def json(self, game_step: int | None = None):
"content": None,
"action": None, # env reset
"obs": self.env_initial_observation.step_observation.observation,
"rewrite_consumed": 0,
"prompt_response_pairs": None,
"system_message": self.system_message,
"problem_message": self.problem_message,
@@ -77,7 +76,6 @@
"reasoning": self.env_observations[game_step].action_reasoning,
"action": asdict(self.env_observations[game_step].action_tool_call),
"obs": self.env_observations[game_step].step_observation.observation,
"rewrite_consumed": self.env_observations[game_step].rewrite_counter,
}
# prompt_response_pairs could be empty for the initial state
if self.llm_responses[game_step]:
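
For orientation, one per-step record produced by this `json` method might look roughly like the sketch below; only the keys visible in the diff are assumed, and all values are invented.

```python
# Illustrative shape of one per-step record; values are made up.
step_record = {
    "reasoning": "The failing test suggests an off-by-one in parse().",
    "action": {"name": "pdb", "arguments": {"command": "b parser.py:42"}},
    "obs": "Breakpoint 1 set at parser.py:42",
    "prompt_response_pairs": None,  # may be empty for the initial state
}
```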