@ilan-gold ilan-gold commented Oct 28, 2025

Description

Great suite here!

I added an example for annbatch although I'm not exactly sure that's where it should live.

I also tinkered with a few things, some of which are in the TODOs:

  1. annbatch.ZarrSparseDataset is not based on torch.utils.data.DataLoader, so num_workers doesn't really apply - the loader is threaded, but I think that is different (I added a comment about this), so I'm not sure how to handle it. That said, there are cases where we might want to use DataLoader (small block_size), but we wouldn't use all workers - maybe only a quarter of the available threads, or something on that order of magnitude. For now I just let the benchmark do whatever. I guess we'll do a grid search with different values of num_workers anyway, so it shouldn't matter.
  2. I think our data loader is faster with chunk_size=1 (i.e., perfect randomness) when wrapped in torch.utils.data.DataLoader, but this isn't a requirement. What is interesting is that the create_collection function does a lot of unnecessary work in that case: it shuffles the data, so in theory we could special-case dataset creation here by just writing the data to disk as zarr v3 anndata. I noticed in the other scripts that SingleCellMemMapDataset takes in a pre-computed format rather than creating one on disk. Should we do that as well? I am a little confused about what the factory function does with the dataset argument.
  3. I have noticed you can't vstack a sparse torch.Tensor. I wrote something to get around this using cupy, but maybe y'all have suggestions? It seems like a universal limitation, but maybe not. I'll take this as a 2.0 problem for us and use cupy when needed; what I committed appears to work. It would be great if torch solved this. I also noticed you can't pin memory for a sparse tensor - that also seems like it would be good to have.
  4. Speaking of cupy, I added it as a dependency, but I'm not sure that makes sense. It gives ZarrSparseDataset a performance boost without relying on torch (i.e., the loader could be used with jax), but it also requires a GPU on the installing machine. Is that a safe assumption? Probably not, but I added it anyway to make this clear; I can change it to an extra.
  5. Is there interest in "fake memory pressure"? This would let us "fake" big data by pre-allocating a buffer that takes up some predefined percentage of memory. It would make the small-dataset case a bit more informative and also give nice metrics like the available_memory / dataset_size ratio as another axis. The 25K dataset is probably too small for this to make sense - it is about 245MB in memory, and I could see allocating amount_of_ram - 245MB bytes and then trying to do anything as flaky - but we could do this on a ~1GB or ~2GB dataset and mock this "big data" behavior. In general I've had trouble reasoning accurately about available physical RAM using psutil, so this might be something to hardcode, i.e., allocate N bytes because we know there are physically N bytes available. I'd be interested in this because we support O_DIRECT reads, having noticed thrashing on some linux machines when the page cache is full: https://annbatch.readthedocs.io/en/latest/zarr-configuration.html#zarrs-performance
  6. Where to dump the data? I think I understood correctly that the on-disk dataset creator, i.e., shared_dataset_factory, should write to disk, but where? I went with Path(input).parent, but maybe that was not right. I couldn't quite tell from the scdataset example, since it includes a pre-computed scdl file, no? I think this is a rehash of point 2. In general I'm not exactly clear on how to create an on-disk dataset - is a new script needed? Are we benchmarking how long it takes to make a dataset?
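To illustrate the workaround in point 3, here is a minimal CPU sketch using scipy.sparse as a stand-in for the GPU path (cupyx.scipy.sparse exposes a matching vstack); the shapes and density are made up for illustration:

```python
import scipy.sparse as sp

# torch.vstack does not support sparse tensors, so mini-batches are
# stacked via the scipy/cupy sparse API instead; on the GPU,
# cupyx.scipy.sparse.vstack mirrors this call on device memory.
a = sp.random(2, 5, density=0.5, format="csr", random_state=0)
b = sp.random(3, 5, density=0.5, format="csr", random_state=1)
stacked = sp.vstack([a, b], format="csr")
print(stacked.shape)  # (5, 5)
```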

Marking as a draft since (a) annbatch is not 0.0.1 yet (so I don't want to assume anything about performance quite yet, since things might change a bit) and (b) I am not sure what else, if anything, is needed (tests? The PR checklist mentions them, but I don't think they're applicable here, although I'd be happy to add some).

In general, ready to go!

Usage

Usage is shown in the added example, following the annbatch API.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe): Benchmarking

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

copy-pr-bot bot commented Oct 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ilan-gold (Author)

Ok, we released 0.0.1 and I pinned it. I noticed that scDataset has released 0.2.0, which breaks the tests here - maybe it's best to pin that as well while it's unstable?

ilan-gold and others added 13 commits October 31, 2025 17:28
Signed-off-by: ilan-gold <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
### Description

Adds the codonFM recipe

Does not add top-level README edits.

---------

Signed-off-by: Jonathan Mitchell <[email protected]>
Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: Cory Ye <[email protected]>
Co-authored-by: Peter St. John <[email protected]>
Co-authored-by: Timur Rvachov <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
### Description

root README changes to announce CodonFM

Signed-off-by: Timur Rvachov <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
---------

Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
### Description

rebase on
https://github.com/NVIDIA-Digital-Bio/CodonFM/blob/main/notebooks/3-Zero-Shot-Mutation-Variant-Clinvar-Synonymous.ipynb

### Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] Documentation update
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
…ks (NVIDIA#1283)

### Description

* Adding accuracy analysis table to README.md of evo2 submodule.
* Adding `evo2/40b-1m-fp8-bf16:1.0` resource to `load` and
`download_bionemo_data`

#### Usage
On the CLI:
```bash
CKPT_PATH=$(download_bionemo_data evo2/40b-1m-fp8-bf16:1.0)
```

In code:
```python
from bionemo.core.data.load import load
ckpt_path = load("evo2/40b-1m-fp8-bf16:1.0")
```

#### Verification
1. Manually and temporarily replace `nvidia` with `nvstaging` in the ngc
path in the evo2.yaml since the link is not yet public:

```yaml
- tag: 40b-1m-fp8-bf16:1.0
  ngc: nvstaging/clara/evo2-40b-1m-fp8-bf16-nemo2:1.0
```
2. Run the download command and check that it succeeds (this verifies most of
the URL, as well as MD5 sums, etc.):
```bash
CKPT_PATH=$(download_bionemo_data evo2/40b-1m-fp8-bf16:1.0)
```
Returns:
```bash
Downloading data from 'nvstaging/clara/evo2-40b-1m-fp8-bf16-nemo2:1.0' to file '/home/ubuntu/.cache/bionemo/544b47e033d1fb0261b686a53f7c4fe240cd290253187d31e8c99dea9e35a680-evo2_40b_bf16_finetune_wandb_Ji2IRcrz_step_119.tar.gz'.
{
    "download_end": "2025-10-27 23:00:34",
    "download_start": "2025-10-27 22:40:22",
    "download_time": "20m 12s",
    "files_downloaded": 1,
    "local_path": "/home/ubuntu/.cache/bionemo/tmp9tdgbowq/evo2-40b-1m-fp8-bf16-nemo2_v1.0",
    "size_downloaded": "59.31 GB",
    "status": "COMPLETED"
}
Untarring contents of '/home/ubuntu/.cache/bionemo/544b47e033d1fb0261b686a53f7c4fe240cd290253187d31e8c99dea9e35a680-evo2_40b_bf16_finetune_wandb_Ji2IRcrz_step_119.tar.gz' to '/home/ubuntu/.cache/bionemo/544b47e033d1fb0261b686a53f7c4fe240cd290253187d31e8c99dea9e35a680-evo2_40b_bf16_finetune_wandb_Ji2IRcrz_step_119.tar.gz.untar'
```

### Type of changes

- [x] New feature (non-breaking change which adds functionality)
- [x] Documentation update

---------

Signed-off-by: John St John <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
### Type of changes

- [x] Documentation update
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
…adme (NVIDIA#1298)

### Description

Inside tasks.py the following line exists for creating folders.
```python
if not os.path.exists(out_dir):
    os.makedirs(out_dir)
```
However, when multiple processes start concurrently (e.g. on a multi-node run), the following race can occur:
```
Process 0 checks os.path.exists(out_dir) → returns False
Process 1 checks os.path.exists(out_dir) → returns False
Process 0 calls os.makedirs(out_dir) → succeeds
Process 1 calls os.makedirs(out_dir) → fails with FileExistsError
```
The fix is to use `os.makedirs(out_dir, exist_ok=True)`.

---------

Signed-off-by: Jonathan Mitchell <[email protected]>
Signed-off-by: ilan-gold <[email protected]>
@ilan-gold ilan-gold requested a review from tshimko-nv as a code owner October 31, 2025 16:30
@polinabinder1 (Collaborator) left a comment

Point 1. That's a good catch.
Point 2. There is the option of converting it each time vs. just using a created dataset. So for the creation of an SCDL dataset, we have a `create_scdl_dataset_and_loader_factory`. But we could also split it up into `create_scdl_from_anndata` (which would create an SCDL dataset on disk from AnnData) and `create_scdl_dataloader_factory` (which would create a dataloader from that dataset).
Points 3/4. The issue here is that we don't benchmark on GPU currently. It doesn't look like `ZarrSparseDataset` should explicitly depend on cupy; there should be a fallback option that works on CPU.

Point 5. That’s a good idea - I have experimented with adding RAM pressure + running benchmarking. It could be interesting to allow for that pressure within the framework.

Could you share some of the benchmarking results?

I also want to sanity check that the results look reasonable to you. I really appreciate this PR.

@polinabinder1

/ok to test 90d44de

@ilan-gold

> There is the option of converting it each time vs. just using a created dataset. So for the creation of an SCDL dataset, we have a `create_scdl_dataset_and_loader_factory`. But we could also split it up into `create_scdl_from_anndata` (which would create an SCDL dataset on disk from AnnData) and `create_scdl_dataloader_factory` (which would create a dataloader from that dataset).

I looked into this and was a little confused: it looks like there is no "conversion" in the other examples, but rather a `scdl-path` passed in, and then the benchmark measures the time needed to instantiate the class but not to do the conversion, i.e., `dataset = SingleCellMemMapDataset(data_path)` does no actual on-disk writing, no? So I separated that, but maybe I'm wrong.

> The issue here is that we don't benchmark on GPU currently. It doesn't look like `ZarrSparseDataset` should explicitly depend on cupy; there should be a fallback option that works on CPU.

Ok, and my colleague just noticed we assume you have cupy installed by default (i.e., the default settings in `ZarrSparseDataset` use it). So we'll fix that, but it's explicitly set to `False` now. Ideally, when run on a GPU, we would have it installed.

> Point 5. That's a good idea - I have experimented with adding RAM pressure + running benchmarking. It could be interesting to allow for that pressure within the framework.

Yea, this is the only way for me to develop, since I don't have the time/disk space on my Linux machine to get the full dataset, but I'm going to change that soon!

I added cache pressure because, like I said, I've only really looked at it with cache pressure. As you can see, the dataset is 18 GB on disk, so roughly ~100 GB uncompressed, and I only gave myself 10 GB of free RAM. I didn't turn on `direct_io`, but I also don't think it matters much with such a small chunk size (in fact, it's probably harmful); my hunch is that it only becomes useful once the `block_size` gets really big, like 512 (which is probably not a good idea unless your data is shuffled on-disk 😉). Also, I'm very interested in the `num_workers` grid search. So grid searching over both of those would be super cool, i.e., `use_direct_io` on and off, and then `num_workers`.

Here are the results, which broadly make sense to me:
annbatch_benchmark_20251104_121014_detailed_breakdown.csv

@polinabinder1

Here's an example of data conversion: https://github.com/NVIDIA/bionemo-framework/blob/pbinder/benchmark_conversion_example/sub-packages/bionemo-scspeedtest/examples/scdl_conversion_example.py

Could you add the specs of your machine and share a comparison to SCDL? Also, could you share the dataset that this is on? We have seen a lot of variability in our benchmarking work based on the machine, so I would be excited to play with this.

@ilan-gold

ilan-gold commented Nov 5, 2025

`lscpu` gives:

```
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      16
  On-line CPU(s) list:       0-15
Vendor ID:                   AuthenticAMD
  Model name:                AMD EPYC-Rome Processor
    CPU family:              23
    Model:                   49
    Thread(s) per core:      1
    Core(s) per socket:      1
    Socket(s):               16
    Stepping:                0
    BogoMIPS:                5988.74
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl
                              cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lah
                             f_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi
                             2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr wbnoinvd arat umip rdpid arch_capabilities
Virtualization features:
  Hypervisor vendor:         KVM
  Virtualization type:       full
Caches (sum of all):
  L1d:                       512 KiB (16 instances)
  L1i:                       512 KiB (16 instances)
  L2:                        8 MiB (16 instances)
  L3:                        256 MiB (16 instances)
NUMA:
  NUMA node(s):              1
  NUMA node0 CPU(s):         0-15
Vulnerabilities:
  Gather data sampling:      Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Not affected
  Mds:                       Not affected
  Meltdown:                  Not affected
  Mmio stale data:           Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Mitigation; untrained return thunk; SMT disabled
  Spec rstack overflow:      Mitigation; SMT disabled
  Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:                Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                     Not affected
  Tsa:                       Not affected
  Tsx async abort:           Not affected
```

I don't know about the SSD this is run on. I could enquire, but it's just a provisioned machine from https://cloud.denbi.de/wiki/, so it shouldn't be anything crazy. That being said, it appears faster than the average SageMaker instance. So I definitely understand your point about hardware differences, but we haven't seen any relative change in performance, i.e., one method being 10X faster than another on one machine but only 5X on another. Maybe that could change; I would be interested to find out!

The data is a 6-million-cell subset of Tahoe. I could try generating an SCDL dataset, but I don't have a script. If you were to move that script into main, I would be happy to run it.

I added a `measure-collection-creation-time` option to the CLI instead of making a separate script. Hope that's ok. I get what seem to be reasonable numbers for dataset creation time when it is enabled.
