
# R1 Computer Use

Applying the ideas of DeepSeek-R1 and Open R1 to computer use.

## Overview

`r1-computer-use` is an experimental project that applies large-scale reinforcement learning techniques, similar to those behind DeepSeek-R1, to computer-use scenarios. The primary goal is to train an agent to interact with a computer environment (e.g., file system, web browser, command line), using a neural reward model to validate the correctness of the agent's actions and to reason about intermediate steps.

## Architecture

DeepSeek-R1 has shown that large language models can develop powerful reasoning skills through iterative reward optimization. Traditionally, such projects rely on hard verifiers or rule-based scripts to determine correctness in tasks like math or coding. However, these verifiers are difficult to reproduce at scale for general computer use.

We aim to replace hard-coded verifiers with a neural reward model that itself reasons about whether the agent's actions are correct and helpful.

Both the actor and reward models follow a three-step cycle, which can be seen as an extension of ReAct into reinforcement learning.

*(Architecture diagram)*

### Agent

observation = "Current directory contains: setup.py requirements.txt"
reasoning = """
1. Project appears to be a Python package
2. No virtual environment detected
3. Should create venv before proceeding
"""
action = "python -m venv .venv"

### Reward Model

analysis = """
1. Correctly identified project type
2. Appropriate prerequisite check
3. Standard venv location chosen
"""
reward = 0.85
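
Putting the two sides together, one environment step looks roughly like the loop below. This is a minimal sketch of the observation → reasoning → action → reward cycle; the `agent`, `reward_model`, and `env` interfaces shown here are illustrative assumptions rather than the project's actual API.

```python
def rollout_step(agent, reward_model, env):
    # Observe the current state of the computer environment
    # (e.g. directory listing, browser contents, shell output).
    observation = env.observe()

    # The actor produces a reasoning trace plus a concrete action.
    reasoning, action = agent.act(observation)

    # Execute the action (shell command, click, edit, ...).
    env.execute(action)

    # The reward model reasons about the step and returns a scalar score.
    reward = reward_model.score(
        observation=observation,
        reasoning=reasoning,
        action=action,
    )
    return reasoning, action, reward
```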

## Usage (in progress)

```python
from r1_computer_use import Agent, RewardModel

agent = Agent()
reward_model = RewardModel()

result = agent.run(
    task="Set up Python development environment",
    observe_reasoning=True
)

feedback = reward_model.evaluate(
    actions=result.actions,
    reasoning=result.reasoning
)
```

## Training Pipeline

The training pipeline consists of multiple stages:

1. Cold Start
   - Expert demonstrations with reasoning traces
   - Initial reward model training
   - Base model fine-tuning
2. Reasoning-Focused GRPO (a minimal loss sketch follows after this list)
   - Group-based sampling from the current policy
   - Reward model evaluates each group
   - Compute advantages within groups
   - Policy updates with clipped probability ratios
   - KL divergence constraint against a reference policy
3. Rejection Sampling Stage (see the filtering sketch after this list)
   - Filter top-k solutions based on reward model scores
   - Create a new training dataset from the best examples
   - Fine-tune the base model on the filtered data
4. General Preference Alignment
   - Apply RL to the full task distribution
   - Use reward models for general preferences
   - Focus on helpfulness and safety
   - Evaluate complete responses
5. Evaluation
   - Task completion metrics
   - Reasoning quality assessment
   - Safety verification
   - Distribution shift analysis
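
The GRPO stage can be summarized with a short sketch. This is a simplified illustration, not the project's training code: the tensor shapes, hyperparameters (`clip_eps`, `kl_coef`), and the KL estimator are all assumptions, and per-token details are omitted.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """Sketch of a group-relative policy loss for one group of sampled trajectories.

    logp_new / logp_old / logp_ref: per-sample log-probabilities under the
    current, sampling, and frozen reference policies (shape: [group_size]).
    rewards: scalar scores from the neural reward model (shape: [group_size]).
    """
    # Group-relative advantages: normalize rewards within the group,
    # so no separate value network is needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped probability ratios.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Simple KL penalty keeping the policy near the reference model
    # (a crude stand-in for the estimator used in practice).
    kl_penalty = (logp_new - logp_ref).mean()

    return policy_loss + kl_coef * kl_penalty
```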
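
The rejection sampling stage amounts to scoring sampled trajectories with the reward model and keeping only the best ones for supervised fine-tuning. The sketch below reuses the `RewardModel.evaluate` call from the Usage section; the trajectory and dataset schema is a hypothetical placeholder.

```python
def build_rejection_sampling_dataset(samples_by_task, reward_model, k=4):
    """Keep the k highest-scoring samples per task for the next SFT round.

    samples_by_task: dict mapping a task string to a list of
    (reasoning, actions) pairs sampled from the current policy.
    The schema is illustrative only.
    """
    dataset = []
    for task, samples in samples_by_task.items():
        # Score every candidate with the neural reward model.
        scored = [
            (reward_model.evaluate(actions=actions, reasoning=reasoning), reasoning, actions)
            for reasoning, actions in samples
        ]
        # Sort by reward and keep the top-k candidates.
        scored.sort(key=lambda item: item[0], reverse=True)
        for reward, reasoning, actions in scored[:k]:
            dataset.append({"task": task, "reasoning": reasoning, "actions": actions})
    return dataset
```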

## Roadmap

- Collect cold start and neural reward model data (in progress)
- SFT train the base model
- GRPO RL training
- Rejection sampling
- General preference alignment
- Evaluation

## Research

Current areas of investigation:

- Reward model architectures
- Base model evaluations

## License

MIT

## Citation

```bibtex
@software{r1_computer_use,
  title     = {R1-Computer-Use: Reasoning-First Computer Interaction},
  author    = {Barker, Patrick},
  year      = {2025},
  url       = {https://github.com/agentsea/r1-computer-use},
}
```