Add new project proposal to describe nvlink + topology aware scheduling #211

Merged
ecolternv merged 4 commits into main from ecolter/nvlink-project-design
Jan 27, 2026

@ecolternv
Contributor

Description

Add project design document for nvlink + topology aware scheduling support

Issue #206

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ecolternv ecolternv requested a review from a team January 8, 2026 21:36
@ecolternv ecolternv force-pushed the ecolter/nvlink-project-design branch from 8ba0fce to 8ff847e January 14, 2026 20:29
@RyaliNvidia
Contributor

This is a more general question regarding the feature, but once the feature is there, how will you verify that NVLink is performing as expected? What kind of tests can you do to verify that NVLink is actually being used and whether it is performing as well as expected? It could be that the implementation is fine and NVLink itself is simply not performing well, but it would be nice to be able to tell.

@RyaliNvidia
Contributor

RyaliNvidia commented Jan 15, 2026

Is NVLink a feature we want users to opt into, so that even when the feature is available they can choose not to use it? They might want to build confidence that NVLink is improving their performance, so they would want to run side-by-side tests with and without NVLink.

This way, we could get a KPI on how many people are switching to NVLink and on the usefulness of this feature once deployed.

@RyaliNvidia RyaliNvidia reopened this Jan 15, 2026
@RyaliNvidia
Contributor

Another thing to consider is whether we can change the resources list to show the zones, racks, and so on visually, so users know why their workflow isn't scheduling even though there are resources available.

@github-actions

github-actions bot commented Jan 26, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-01-27 18:39 UTC

@ecolternv
Contributor Author

This is a more general question regarding the feature, but once the feature is there, how will you verify that NVLink is performing as expected? What kind of tests can you do to verify that NVLink is actually being used and whether it is performing as well as expected? It could be that the implementation is fine and NVLink itself is simply not performing well, but it would be nice to be able to tell.

Yeah, we need a more comprehensive test workflow that is sensitive to network speed. Perhaps we can use https://github.com/NVIDIA/nccl-tests instead of doing model inference/training.

I added an open item for this.
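A minimal sketch of what such a check could look like, assuming the benchmark is one of the nccl-tests binaries (for example `all_reduce_perf`) and that its report ends with an `# Avg bus bandwidth` summary line. The helper names, the threshold, and the sample report below are illustrative assumptions, not part of the design doc or of nccl-tests itself.

```python
import re

# Assumption: the nccl-tests report ends with a summary line of the form
# "# Avg bus bandwidth    : <value in GB/s>".
_AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_avg_bus_bandwidth(report: str) -> float:
    """Return the average bus bandwidth (GB/s) found in a benchmark report."""
    match = _AVG_BUSBW.search(report)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found in report")
    return float(match.group(1))


def meets_expectation(report: str, min_gbps: float) -> bool:
    """True if the measured bandwidth reaches a site-specific threshold."""
    return parse_avg_bus_bandwidth(report) >= min_gbps


# Illustrative report tail; the numbers are made up, not measurements.
SAMPLE_REPORT = """\
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 412.5
"""

if __name__ == "__main__":
    print(parse_avg_bus_bandwidth(SAMPLE_REPORT))
    print(meets_expectation(SAMPLE_REPORT, min_gbps=400.0))
```

The threshold would have to come from the expected NVLink bandwidth of the hardware in question; a result far below it would suggest traffic is not going over NVLink, or that NVLink is underperforming.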

@ecolternv
Contributor Author

Another thing to consider is whether we can change the resources list to show the zones, racks, and so on visually, so users know why their workflow isn't scheduling even though there are resources available.

Good idea, but it will take some design to do this right. I've added an open item for this in the topology-aware-scheduling doc.
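To illustrate why a workflow can stay pending even though GPUs are free, here is a rough sketch that groups free GPUs by rack. It assumes, purely for illustration, that a topology-constrained workflow needs all of its GPUs within a single rack; the `Node` type, the rack labels, and the numbers are hypothetical and not taken from the design doc.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Node(NamedTuple):
    name: str
    rack: str        # hypothetical topology label
    free_gpus: int


def free_gpus_by_rack(nodes: List[Node]) -> Dict[str, int]:
    """Sum the free GPUs of all nodes that share a rack."""
    totals: Dict[str, int] = defaultdict(int)
    for node in nodes:
        totals[node.rack] += node.free_gpus
    return dict(totals)


def explain(nodes: List[Node], gpus_needed: int) -> str:
    """Describe whether a single-rack request can be placed, and why not."""
    by_rack = free_gpus_by_rack(nodes)
    total = sum(by_rack.values())
    fitting = sorted(rack for rack, free in by_rack.items() if free >= gpus_needed)
    if fitting:
        return f"schedulable in: {', '.join(fitting)}"
    breakdown = ", ".join(f"{rack}={free}" for rack, free in sorted(by_rack.items()))
    if total >= gpus_needed:
        return (f"{total} GPUs free overall, but no single rack has "
                f"{gpus_needed} ({breakdown})")
    return f"only {total} GPUs free overall, {gpus_needed} requested ({breakdown})"


if __name__ == "__main__":
    nodes = [Node("a", "rack-1", 4), Node("b", "rack-1", 2), Node("c", "rack-2", 6)]
    print(explain(nodes, gpus_needed=8))
```

A per-rack breakdown like this is the kind of information a resources view could surface, whatever visual form it ends up taking.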

@ecolternv
Contributor Author

Is NVLink a feature we want users to opt into, so that even when the feature is available they can choose not to use it? They might want to build confidence that NVLink is improving their performance, so they would want to run side-by-side tests with and without NVLink.

This way, we could get a KPI on how many people are switching to NVLink and on the usefulness of this feature once deployed.

You can opt out by setting NCCL_MNNVL_ENABLE=0. In that case the compute domain just won't be used, which shouldn't cause any problems as far as I'm aware.
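A side-by-side comparison could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the "enabled" run simply leaves `NCCL_MNNVL_ENABLE` unset so NCCL uses its default behaviour, and the command shown is a stand-in that only echoes the variable rather than a real benchmark.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional, Sequence


def nccl_env(use_mnnvl: bool, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment; set NCCL_MNNVL_ENABLE=0 to opt out, else leave it unset."""
    env = dict(os.environ if base is None else base)
    if use_mnnvl:
        env.pop("NCCL_MNNVL_ENABLE", None)  # fall back to NCCL's default
    else:
        env["NCCL_MNNVL_ENABLE"] = "0"
    return env


def run_with_and_without_mnnvl(cmd: Sequence[str]) -> Dict[bool, str]:
    """Run cmd once per setting and return its stdout keyed by use_mnnvl."""
    outputs: Dict[bool, str] = {}
    for use_mnnvl in (True, False):
        result = subprocess.run(
            list(cmd),
            env=nccl_env(use_mnnvl),
            capture_output=True,
            text=True,
            check=True,
        )
        outputs[use_mnnvl] = result.stdout
    return outputs


if __name__ == "__main__":
    # Stand-in command; replace it with the real benchmark or training command.
    echo = [sys.executable, "-c",
            "import os; print(os.environ.get('NCCL_MNNVL_ENABLE', 'unset'))"]
    print(run_with_and_without_mnnvl(echo))
```

Comparing the two outputs (for example, bandwidth figures from a network-sensitive benchmark) would give users the side-by-side evidence asked about above.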

@ecolternv ecolternv merged commit af91da4 into main Jan 27, 2026
6 checks passed
@ecolternv ecolternv deleted the ecolter/nvlink-project-design branch January 27, 2026 18:38
RyaliNvidia added a commit that referenced this pull request Jan 28, 2026
* allow flexible squid proxy replicas (#241)

* allow flexible squid proxy replicas

* fix

* Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167)

* Improving Performance for Uploading Workflow Artifacts in Worker Jobs

* Cleanup

* Add progress writing after upload

* Add dependency in Bazel BUILD

* Add type to mypy requirements

* Update mypy requirements

* Add to mypy_cli BUILD

* Fix lint

* Comment

* Use constant to define semaphore and storage client executor count

* #244 - Use last login url if url is not specified (#245)

* Use last login url if url is not specified

* print message

* Cannot select any text inside modals or slideouts (#248)

* Video HTML element not changing when selecting different video files in the UI for OSMO dataset (#249)

* sync-feature-branches: fix no conflict case, allow single branch to be synced (#252)

* Fix sync-feature-branches with no merge conflicts

* Allow a single branch to be specified for sync-feature-branches

* Perform operations as OSMO CI Bot

* Add external label when the PR is created

* extract issue number

* add test cases (#247)

* Allow PR checks to run on release branches (#264)

* Database Pooling in Postgres Singleton Across Services (#251)

* Initial commit for database pooling

* Update set_session

* Fix lint

* Update PostgresConnector to have semaphore to control connections

* Lint fix

* Fix number of maxconn for test

* Address comments

* Add Go Postgres utils (#272)

* #148 - Auth Project Design Documents (#165)

* add args to postgres (#282)

* #267 - cloud deployment scripts (#268)

* script to create azure resources and deploy

* Remove auto-generated values files from tracking

- Added .gitignore to ignore values/, *.env files
- Removed values/*.yaml files from git (auto-generated during deployment)

* add aws script

* add aws script

* add copyright

* update copyright

* Support for Azure workload identity in AKS and Arc clusters (#141)

* feat(src): add Azure service account and extra pod labels configuration

- implement service account creation with customizable name and annotations
- enhance service templates to support extra pod labels for various services
- update Azure backend to utilize DefaultAzureCredential for authentication
- add tests for Azure credential extraction and client creation

* feat(src): extract account key from connection string for Azure Blob Storage

- add function to extract AccountKey from connection string
- update AzureBlobStorageClient to handle different credential types

* feat(test): add tests for account key extraction from Azure connection strings

* chore: clean up linting issues for tests

* refactor(src): update data credential types in PostgresConnector and TaskGroup

- change StaticDataCredential to DataCredential in get_all_data_creds method
- update fetch_creds function signature to use DataCredential

* feat(src): update Azure client creation to include storage account and account URL

- remove deprecated storage account extraction function
- modify create_client to accept storage_account and account_url parameters
- update AzureBlobStorageClientFactory to use new parameters
- adjust tests to reflect changes in client creation

🔒 - Generated by Copilot

* refactor(src): mark storage_account parameter as unused in create_client function

🔧 - Generated by Copilot

* refactor(src): remove unused storage_account parameter from client creation

🔧 - Generated by Copilot

* Add new project proposal to describe nvlink + topology aware scheduling (#211)

* Add new project proposal to describe nvlink + topology aware scheduling

* Split design into two docs

* Finish docs and add some updates from feedback

* Add some open items

* OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (#315)

* add redis utils, update postgres utils (#313)

* add redis utils, update postgres utils

* add deps

* Fix missing separator in the test runner roles (#320)

* fix

* remove

* fix

---------

Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: patclarknvidia <patc@nvidia.com>
Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: xutongNV <xutongr@nvidia.com>
Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com>
Co-authored-by: ecolternv <ecolter@nvidia.com>
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
fernandol-nvidia pushed a commit that referenced this pull request Jan 29, 2026
…ng (#211)

* Add new project proposal to describe nvlink + topology aware scheduling

* Split design into two docs

* Finish docs and add some updates from feedback

* Add some open items
xutongNV added a commit that referenced this pull request Feb 3, 2026
* Update the wording re: creating feature branches (#204)

* Add a link back to OSMO from the brev launchable (#205)

* Improve styling for badges in the brev launchable readme (#207)

* Fix osmo config pool update payload in backend installation docs (#210)

* Fix osmo config pool update payload in practical guide (#213)

* #147 - backend operator redesign doc (#149)

* backend operator redesign doc

* 195 - Bump quick-start version due to updated dependencies (#217)

* Perform Client Side Data Auth Check In the Event of Environment Based Auth (#177)

* Data/Dataset Auth Check CLIs

* Remove auth check from data service

* Use auth check CLIs in ctrl

* Add exit code to docs

* Fix build issues

* Fix lint

* Ctrl to use user config when validating data auth

* Use the correct CLI argument type

* Fix lint

* Use profile when looking up data credential from config

* Update quick start installation to always install latest version (#218)

* Add workflow to label external issues and pull requests (#222)

* Add workflow to label external issues and pull requests

* pin to allowed action version

* add reopened event

* allow flexible squid proxy replicas (#241)

* allow flexible squid proxy replicas

* fix

* Efficient Workflow Cleanup through Using Async Operations for Log Migration (#167)

* Improving Performance for Uploading Workflow Artifacts in Worker Jobs

* Cleanup

* Add progress writing after upload

* Add dependency in Bazel BUILD

* Add type to mypy requirements

* Update mypy requirements

* Add to mypy_cli BUILD

* Fix lint

* Comment

* Use constant to define semaphore and storage client executor count

* #244 - Use last login url if url is not specified (#245)

* Use last login url if url is not specified

* print message

* Cannot select any text inside modals or slideouts (#248)

* Video HTML element not changing when selecting different video files in the UI for OSMO dataset (#249)

* sync-feature-branches: fix no conflict case, allow single branch to be synced (#252)

* Fix sync-feature-branches with no merge conflicts

* Allow a single branch to be specified for sync-feature-branches

* Perform operations as OSMO CI Bot

* Add external label when the PR is created

* extract issue number

* add test cases (#247)

* Allow PR checks to run on release branches (#264)

* Database Pooling in Postgres Singleton Across Services (#251)

* Initial commit for database pooling

* Update set_session

* Fix lint

* Update PostgresConnector to have semaphore to control connections

* Lint fix

* Fix number of maxconn for test

* Address comments

* Add Go Postgres utils (#272)

* #148 - Auth Project Design Documents (#165)

* add args to postgres (#282)

* #267 - cloud deployment scripts (#268)

* script to create azure resources and deploy

* Remove auto-generated values files from tracking

- Added .gitignore to ignore values/, *.env files
- Removed values/*.yaml files from git (auto-generated during deployment)

* add aws script

* add aws script

* add copyright

* update copyright

* Support for Azure workload identity in AKS and Arc clusters (#141)

* feat(src): add Azure service account and extra pod labels configuration

- implement service account creation with customizable name and annotations
- enhance service templates to support extra pod labels for various services
- update Azure backend to utilize DefaultAzureCredential for authentication
- add tests for Azure credential extraction and client creation

* feat(src): extract account key from connection string for Azure Blob Storage

- add function to extract AccountKey from connection string
- update AzureBlobStorageClient to handle different credential types

* feat(test): add tests for account key extraction from Azure connection strings

* chore: clean up linting issues for tests

* refactor(src): update data credential types in PostgresConnector and TaskGroup

- change StaticDataCredential to DataCredential in get_all_data_creds method
- update fetch_creds function signature to use DataCredential

* feat(src): update Azure client creation to include storage account and account URL

- remove deprecated storage account extraction function
- modify create_client to accept storage_account and account_url parameters
- update AzureBlobStorageClientFactory to use new parameters
- adjust tests to reflect changes in client creation

🔒 - Generated by Copilot

* refactor(src): mark storage_account parameter as unused in create_client function

🔧 - Generated by Copilot

* refactor(src): remove unused storage_account parameter from client creation

🔧 - Generated by Copilot

* Add new project proposal to describe nvlink + topology aware scheduling (#211)

* Add new project proposal to describe nvlink + topology aware scheduling

* Split design into two docs

* Finish docs and add some updates from feedback

* Add some open items

* OSMO-6044: Application error when closing Task Details after switching Events view from Task to Workflow (#315)

* add redis utils, update postgres utils (#313)

* add redis utils, update postgres utils

* add deps

* Fix missing separator in the test runner roles (#320)

* show backend name in scheduler validation error message (#323)

* #220 - Design documentation for dynamic subpool (#221)

* Initial design spike for dynamic subpool

* Add more context to design

* Address feedback

* resolve conflict

* fix ui

---------

Co-authored-by: Ethan Look-Potts <elookpotts@nvidia.com>
Co-authored-by: xutongNV <xutongr@nvidia.com>
Co-authored-by: Fernando L <fernandol@nvidia.com>
Co-authored-by: Vivian Pan <vivianp@nvidia.com>
Co-authored-by: ethany-nv <ethany@nvidia.com>
Co-authored-by: RyaliNvidia <ryali@nvidia.com>
Co-authored-by: patclarknvidia <patc@nvidia.com>
Co-authored-by: Allen Greaves <111466195+agreaves-ms@users.noreply.github.com>
Co-authored-by: ecolternv <ecolter@nvidia.com>
Co-authored-by: tdewanNvidia <tdewan@nvidia.com>
RyaliNvidia added a commit that referenced this pull request Feb 18, 2026