Improve reproducibility and reduce manual setup steps with Calkit #54

Open
wants to merge 17 commits into
base: main
3 changes: 3 additions & 0 deletions .dvc/.gitignore
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
Empty file added .dvc/config
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
7 changes: 4 additions & 3 deletions Dockerfile
@@ -111,10 +111,12 @@ RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | b

# Install Pip
COPY requirements.txt .
# Remove last 3 lines from requirements
RUN head -n -3 requirements.txt > requirements-docker.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python3.12 get-pip.py && \
python3.12 -m pip install --upgrade pip && \
python3.12 -m pip install --no-cache-dir -r requirements.txt
python3.12 -m pip install --no-cache-dir -r requirements-docker.txt

# Setup playwright
RUN python3.12 -m playwright install
@@ -154,6 +156,5 @@ EXPOSE 5900
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1 && \
update-alternatives --set python /usr/bin/python3

# Set the entrypoint and default command
ENTRYPOINT ["/bin/bash", "-l", "-c"]
# Set the default command
CMD ["/app/tests/run.sh"]
106 changes: 106 additions & 0 deletions Dockerfile-lock.json
@@ -0,0 +1,106 @@
[
{
"RepoTags": [
"swelancer:latest"
],
"Parent": "",
"Comment": "buildkit.dockerfile.v0",
"Created": "2025-03-04T15:53:57.295897293Z",
"Author": "",
"Config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"5900/tcp": {},
"5901/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"DEBIAN_FRONTEND=noninteractive",
"NVM_DIR=/root/.nvm",
"PYTHONUNBUFFERED=1",
"PYTHONPATH=/app/tests",
"DISPLAY=:99",
"LIBGL_ALWAYS_INDIRECT=1"
],
"Cmd": [
"/app/tests/run.sh"
],
"ArgsEscaped": true,
"Image": "",
"Volumes": null,
"WorkingDir": "/app/expensify",
"Entrypoint": null,
"OnBuild": null,
"Labels": {
"org.opencontainers.image.ref.name": "ubuntu",
"org.opencontainers.image.version": "22.04"
}
},
"Architecture": "arm64",
"Variant": "v8",
"Os": "linux",
"Size": 2960386821,
"GraphDriver": {
"Data": null,
"Name": "overlayfs"
},
"RootFS": {
"Type": "layers",
"Layers": [
"sha256:59b223680ef978509f20faf6bf069866dfc02c422159347face0a7565d698526",
"sha256:083aa0c019e975878fde8816b094ff8fe36f5d7a2cf5d3e8b5e8557042e5bf2f",
"sha256:b44f43c59dbed4ad9bcc8cd2fc1e588dc496c14c7811f89f241138c4daac08d2",
"sha256:a44c0499d62ca38338c0c0083d297b90e2bdf84e2c53f587a32027aaff23cf08",
"sha256:a8c78b622fa0f488ef0c6e054b3fb4917c8f4ceb92bf31eeee34556fbae05a1a",
"sha256:ab6d0bea1782c2849228e432942bde9afae9646e210b792cdf360291b3a01bc0",
"sha256:640b8702ceabe4380be93ae2cf0c19becf9ee79a0a54bd726b5623e1102e24ae",
"sha256:f9ab13561a3f63367abeb6a6c8ef136513d3f1a6413e0e55f6f621f13bd151a7",
"sha256:06eb55395e37706670ea761cb8efccb0492d62d47cbb3b8bf07b05d0d1118b39",
"sha256:0bf426de6db986f1570292457349882972f21f523e441748a4fcda12b2b5fafb",
"sha256:bd3001c6ba1740ed7e132104c7bd30bf8b4fac9bcbf2952ceec6565525ce49ee",
"sha256:a256a031948fdc0421ab89e55ad2dff939789bc332c931b1b5dfe2fc7774c531",
"sha256:a4ea8b5b02fc803c520a1fb52a5499a32d95448c8dd4be7ec4aa275e2ba70f30",
"sha256:209ac431830b7bab9df6bbc92dcfccf8ecb291d426f7eb51fc6d8f31a49a52df",
"sha256:81f1ebeb59a3241f063b448b0fe248a73d38c114f24214260357386afcfcd69a",
"sha256:1a2a255ec65bb298c3d46cd854712409482371425bbd31e73e1fa5d4ab7a853c",
"sha256:29f6abe9d355814e383e7cd118ed3f97d855a584400a5bc2ff4fa24d2e93fba6",
"sha256:5f702e5b01fcb1e0a49f4ac8f95071c2fd601926763785c0b1bcfc6cafb7c10a",
"sha256:fafac95f98b562e6e891e7d586e5fdd943da1e02fddc6759a930c33a55b39a4d",
"sha256:250e64681bcaec4bd31822c22801e2d38411df9c647f87785c1b2dac3d80be89",
"sha256:99cf2ec7e87cb967d89ba6672810ea5e25d9d9614f0bbe7a666943b4712c92d0",
"sha256:531341ea76521c6da0f9791eace2b9483dce23b6401630ed0f5abbbf663450ff",
"sha256:26306bd1ec3853c9efe6e9bfebf30e5eb7573503a9c9381023107e2ee008f051",
"sha256:0b6977b28c4f8caea8a37cd1c15648b355bde04d6d1ba269f93ca339b6bbbac0",
"sha256:f6cd2afef10a5ea50ff0ba96b7ff2bfba3ccbe90e7e6bd797a8f9e9fe27cce2d",
"sha256:bdad66da59f1c5d06fc974dcd701ea390d2416bc39f33121cff0d3face9b400b",
"sha256:ce08e38d61f0beadad461775318fc5ec0e01000a7e5899fe022ad36d30150033",
"sha256:1d17d081e83b77d38ea94297efc5c4b4193a70e1ad8b50cf509aa0f84e6ab7f0",
"sha256:69064f25bd1b7f83fd3c3238ce815c8e8a7babbbbfe1d4cdbc9d577cfcc31266",
"sha256:89d7e24d6f254d73d75069b00af89a320024469293416220a337afaa7665eda7",
"sha256:501133d870f7ed3b9fb816a957a2fce25bf99fbcf8d8f36ff40be757f1e24404",
"sha256:3342c56211fff5684d94a9ad066a6e173a3254e1e4f6da2d2030f96dc0f3c858",
"sha256:1ec5da3c6a2e25c2965200abfa4fa41355b5f3f7e602f865d2a62c5cd7a55c09",
"sha256:97453ee52650094c6bc3c11074acb349afffefa86e54d9727e8488ea8159daa7",
"sha256:d41808063f49484df2e8c70c852d2a03a9a30eb899ee34f6ece005de2890da4a",
"sha256:bd8b16ca9689bbd1b84bdc60fd7588bb38891ca797350a87bb4a7694380617a3",
"sha256:a2a17adb3d7484dfbd3508d5580a2aca3947e1d23e0065cd07c46629dfe07e73",
"sha256:2f42d1a2b63718c4d6bb71504c07cbf788b0d53a4632d2a285c40ce124ff629b",
"sha256:97ff5503f11d2cce1429388e34f2dfce4ba3c75ce3446e06fbe646315255dc47",
"sha256:21936328edc12c012b48a937986abc8b4da46cc460ff4e174e09b3061bcf6ae4",
"sha256:29863ae308c6f452cee1e0a5d9b3bb513ec7607211bf635b7a1c9f5ba5e99ad2",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
"sha256:d326c74f970e5d7371d6e28b7fa8636f1c1c70a02a8dc998498575ec2b699324"
]
},
"DockerfileMD5": "58bf7b42ced92bab16d8c45b0ed105ab",
"DepsMD5s": {}
}
]
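The `DockerfileMD5` field in the lock file above suggests that the image is tied to a hash of the Dockerfile it was built from. As a rough sketch of that idea only, the Python below recomputes the hash and compares it with the recorded value. It assumes the field holds the hex MD5 digest of the Dockerfile's bytes; the helper names `file_md5` and `needs_rebuild` are hypothetical and are not Calkit's API.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of the file's raw bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def needs_rebuild(dockerfile_path, lock_path):
    """Guess whether the image is stale relative to its Dockerfile.

    Assumption: the lock file is a JSON list whose first entry stores the
    Dockerfile's MD5 under "DockerfileMD5", as in the lock files in this PR.
    """
    lock_file = Path(lock_path)
    if not lock_file.exists():
        return True  # no lock recorded yet, so nothing to compare against
    entries = json.loads(lock_file.read_text())
    if not entries:
        return True
    return entries[0].get("DockerfileMD5") != file_md5(dockerfile_path)
```

Under these assumptions, `needs_rebuild("Dockerfile", "Dockerfile-lock.json")` would return `False` until the Dockerfile is edited.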
7 changes: 4 additions & 3 deletions Dockerfile_x86
@@ -112,10 +112,12 @@ RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | b

# Install Pip
COPY requirements.txt .
# Remove last 3 lines from requirements
RUN head -n -3 requirements.txt > requirements-docker.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python3.12 get-pip.py && \
python3.12 -m pip install --upgrade pip && \
python3.12 -m pip install --no-cache-dir -r requirements.txt
python3.12 -m pip install --no-cache-dir -r requirements-docker.txt

# Setup playwright
RUN python3.12 -m playwright install
@@ -155,6 +157,5 @@ EXPOSE 5900
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1 && \
update-alternatives --set python /usr/bin/python3

# Set the entrypoint and default command
ENTRYPOINT ["/bin/bash", "-l", "-c"]
# Set the default command
CMD ["/app/tests/run.sh"]
105 changes: 105 additions & 0 deletions Dockerfile_x86-lock.json
@@ -0,0 +1,105 @@
[
{
"RepoTags": [
"swelancer:latest"
],
"Parent": "",
"Comment": "buildkit.dockerfile.v0",
"Created": "2025-03-04T17:15:52.472497763Z",
"Author": "",
"Config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"5900/tcp": {},
"5901/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"DEBIAN_FRONTEND=noninteractive",
"NVM_DIR=/root/.nvm",
"PYTHONUNBUFFERED=1",
"PYTHONPATH=/app/tests",
"DISPLAY=:99",
"LIBGL_ALWAYS_INDIRECT=1"
],
"Cmd": [
"/app/tests/run.sh"
],
"ArgsEscaped": true,
"Image": "",
"Volumes": null,
"WorkingDir": "/app/expensify",
"Entrypoint": null,
"OnBuild": null,
"Labels": {
"org.opencontainers.image.ref.name": "ubuntu",
"org.opencontainers.image.version": "22.04"
}
},
"Architecture": "amd64",
"Os": "linux",
"Size": 3184716912,
"GraphDriver": {
"Data": null,
"Name": "overlayfs"
},
"RootFS": {
"Type": "layers",
"Layers": [
"sha256:270a1170e7e398434ff1b31e17e233f7d7b71aa99a40473615860068e86720af",
"sha256:d35aa8d40890310cbaf405ee75073e852c3f23d70269e4ce5b1a31db07e3b9e2",
"sha256:bb1f51c99ebc6dd30fa2ab310db897b81238580052e94d90e8bcb2914581c2cb",
"sha256:3e6a0ceb7560b2a9122c9d7a98a950a0f1ee861e597ffe8f3853efb9a49b83d0",
"sha256:2e195a7e4421232af7c2f346bbbb4df3ca906c98bbc048555a088f012d4b6fcb",
"sha256:96635f3e156a188263f7b626e43f125ca9073d709f9abc43edb199f94085de99",
"sha256:835c9e4beb20f19fc35bab85a8e5b0a80df429c54a53b6acee7a799e07994b3b",
"sha256:c9f9cf27b7333f0587e2109ae6993c796c85d83473106940217caa007b033d00",
"sha256:f368708d9584c02bd5e89c5e7877f2cb65af6db531b6d5fc13ac50da6bb8fe28",
"sha256:d226c9641cd5533f48804045d2496b5852de4af31c21e028bc47331de1c05aa2",
"sha256:ae3819167bc515e9b1e69c825c0556200c7c77bd9cfdf8faf523363801c4802e",
"sha256:bda3c31835d47f9a4a2ceaffc109f1d11ab808568052e2a87b19ffe212bdb44b",
"sha256:b6a1210dc1b6a5855f7d8f278060e87a835a641d6132d693abe2425b838b2a67",
"sha256:7c49216c1c1a09a794082e3b7f28c3501b60a4c2a5079147657876b98163f95c",
"sha256:5f50dca27de4300d5ceee76f8be1822d85d1518529cc0912adbf1bc1b2bb6c03",
"sha256:f7e30acde38a5c9fb2937593a62f80438b1505027b6d33ddb64d7dd6859a9d92",
"sha256:cfc0087ced7d41daec431faa53775d49f45962bb3fce41b288cbd255d8c46c96",
"sha256:6b4e6bdc626d39d619b957df809aeb46ab2a8d7214f37edfa4ccee5ef3e9f61e",
"sha256:a5a7c767394c1d0bc555710a5ad83ca78eab5cd2d57a5fe204cfd61c075a7905",
"sha256:d0cf426d6630141584009ada303fe4a5ef37e7e57693d8133e90d2af587f5ed8",
"sha256:e7f7e587c4246ff4876631947d6507fabab47167e47fcb453457ea4657d74abf",
"sha256:3dabe8c6e16a7a2e0812d5654201c47ad8e581174a55303926979069cbd6707c",
"sha256:503098980e18460c4225a6bf8bb8351400f385d74808a2b773d2ed58d3cb66a4",
"sha256:78b98b6d49e6830fbf2bcfc1465b50c5300f5c5812dbfffabba7ef3a549cc901",
"sha256:51039e81c0cf704b270fabe53cd3829dab2d6d4bdae103208576b3c840279352",
"sha256:5351f8cded28c9225f233449b3caa86eca39bda8fe19385ccb4422645f0e7dc0",
"sha256:143f963bcb6f205884852a012360444b17eef3ec611a77103b8ac348693335b0",
"sha256:773edb05fd9ed5ebc5253dd5fefc4d24ec5e12e2c7e08247a1cc6ac30b292ba6",
"sha256:da1ff9bc4114f79cbab5d12007f6c5140f1270acdda5c247cc50be0ce5cac02d",
"sha256:985c81504a4fdc8bb1aa0729a1187f4bd5d9009ccaa5b9b4fded29ea26f80fe4",
"sha256:9d3e4720f1fce0d710e2ed9c89f18c3c89790483d4886509cff9923b41ca3cf2",
"sha256:ac60199dfe7586fdcdb067a903ca4a0fd9e2666d9888c329750919647c567db6",
"sha256:aba8bc27c6ff7dc0cd226b232505c59f34bc56704720c6e904f32600e487831f",
"sha256:9750aa67015e8cc539a5f61d45c242390ed6edcbd9ee2e69cb62bfe8ead3b5b2",
"sha256:f933f63abfba210442168399eae3cd7194350318a387fb1c26f53fc3625a119e",
"sha256:a7e65b54195e7580a3a5bd44901df1bded1b0f8520949bd2d6d783be1b4134c0",
"sha256:5de40fc0b1266d57a47ee9fed4952da0a449006e0d1c93fcb98ab1eb64ad213e",
"sha256:5042676ad08ab22c104ac822e461c8ffe95602897dec332bcbf17e3381935b43",
"sha256:f80b256946e144617cea7b66b2147f53c093ad8ba30cea017f68ff18396da8be",
"sha256:59af2a126b027b2abe92cfcd7bf750df91a681463b9d49b75723760452e65eba",
"sha256:0bd5ba3499159c49a7a62428f3bf02290563680c2d2db1c385a4ffb68832c2d5",
"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
"sha256:e50e456b8f1820475b861069c789439595f4096ef57113ba938e3429a361522b"
]
},
"DockerfileMD5": "b9c97dc4c4b1ca8c537d605dc294fe2c",
"DepsMD5s": {}
}
]
82 changes: 22 additions & 60 deletions README.md
@@ -6,89 +6,50 @@ This repo contains the dataset and code for the paper ["SWE-Lancer: Can Frontier

Thank you so much for checking out our benchmark! If you have questions, run into issues, or want to contribute, please open an issue or pull request. You can also reach us at [email protected] and [email protected] at any time.

We will continue to update this repository with the latest tasks, updates to the scaffolding, and improvements to the codebase
We will continue to update this repository with the latest tasks, updates to the scaffolding, and improvements to the codebase

- If you'd like to use the latest version, please use the `main` branch.

- If you'd like to use the version of the dataset from the paper and codebase at time of paper release, please check out the `paper` branch. Note that the performance outlined in our paper is on our internal scaffold. We've aimed to open-source as much of it as possible, but the open-source agent and harness may not be exactly the same.

- If you'd like to use the version of the dataset from the paper and codebase at time of paper release, please check out the `paper` branch. Note that the performance outlined in our paper is on our internal scaffold. We've aimed to open-source as much of it as possible, but the open-source agent and harness may not be exactly the same.

---

**Step 1: Package Management and Requirements**

Python 3.11 is the most stable version to use with SWE-Lancer.

For package management, this repo comes with a pre-existing virtualenv or you can build one from scratch.
[Calkit](https://github.com/calkit/calkit) is used to
manage the necessary [Docker](https://docker.com) and
[uv](https://github.com/astral-sh/uv) environments,
so all three of these tools must be installed.

We recommend using the pre-built virtualenv with [uv](https://github.com/astral-sh/uv), a lightweight OSS package manager. To do this, run:

```bash
uv sync
source .venv/bin/activate
for proj in nanoeval alcatraz nanoeval_alcatraz; do
uv pip install -e project/"$proj"
done
```
**Step 2: Run the Docker Container**

To use your own virtualenv, without uv, run:
Run the Docker container, building if necessary, by executing:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
for proj in nanoeval alcatraz nanoeval_alcatraz; do
pip install -e project/"$proj"
done
calkit xenv -n docker-arm64 -- ISSUE_ID=1 bash /app/tests/run.sh
```

**Step 2: Build the Docker Image**

Please run the command that corresponds to your computer's architecture.
If you are running on an AMD64 (x86) platform, replace `docker-arm64` with
`docker-amd64` in the command above.

For Apple Silicon (or other ARM64 systems):

```bash
docker buildx build \
-f Dockerfile \
--ssh default=$SSH_AUTH_SOCK \
-t swelancer \
.
```
**Step 2: Check Environmental Variables**

For Intel-based Mac (or other x86_64 systems):
To ensure environmental variables are set properly, execute:

```bash
docker buildx build \
-f Dockerfile_x86 \
--platform linux/amd64 \
--ssh default=$SSH_AUTH_SOCK \
-t swelancer \
.
```

After the command completes, run the Docker container.

**Step 3: Configure Environment Variables**

Ensure you have an OpenAI API key and username set on your machine.

Locate the `sample.env` file in the root directory. This file contains template environment variables needed for the application:

```plaintext
# sample.env contents example:
PUSHER_APP_ID=your-app-id
# ... other variables
calkit check env-vars
```

Create a new file named `.env` and copy the contents from `sample.env`.
You will be prompted for any missing environmental variables,
e.g., `OPENAI_API_KEY`,
and these will be added to a `.env` file, which will be ignored by Git.
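
The check described above can be approximated by hand. The following is a minimal sketch, not Calkit's implementation: it treats a variable as defined if it appears in the process environment or as a `NAME=value` line in `.env`, and appends any value you supply to that file. The function names `missing_env_vars` and `append_env_var` are hypothetical.

```python
from pathlib import Path


def missing_env_vars(required, environ, dotenv_path):
    """Return names from `required` found in neither `environ` nor the .env file."""
    defined = set(environ)
    path = Path(dotenv_path)
    if path.exists():
        for line in path.read_text().splitlines():
            line = line.strip()
            # Skip blank lines and comments; keep only NAME=value entries.
            if line and not line.startswith("#") and "=" in line:
                defined.add(line.split("=", 1)[0].strip())
    return [name for name in required if name not in defined]


def append_env_var(dotenv_path, name, value):
    """Append one NAME=value line to the .env file, creating it if needed."""
    with open(dotenv_path, "a") as f:
        f.write(f"{name}={value}\n")
```

For example, `missing_env_vars(["OPENAI_API_KEY"], os.environ, ".env")` (after `import os`) lists what still needs a value before the eval is run.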

**Step 4: Running SWE-Lancer**
**Step 3: Running SWE-Lancer**

You are now ready to run the eval with:

```bash
uv run python run_swelancer.py
calkit run
```

You should immediately see logging output as the container gets set up and the tasks are loaded, which may take several minutes. You can adjust the model, concurrency, recording, and other parameters in `run_swelancer.py`.
@@ -172,14 +133,15 @@ For a complete example of a ComputerInterface implementation, you can refer to t
- Handle network issues gracefully

## Citation

```
@misc{miserendino2025swelancerfrontierllmsearn,
title={SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?},
title={SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?},
author={Samuel Miserendino and Michele Wang and Tejal Patwardhan and Johannes Heidecke},
year={2025},
eprint={2502.12115},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.12115},
url={https://arxiv.org/abs/2502.12115},
}
```