@mathomp4
Member

Tests at NAS show that single-node jobs with MPT are very slow at writing restarts (sometimes 20 minutes at c24!).

The issue is that MPT has trouble with many consecutive MPI_GatherV calls on a single node. One solution is to add MPI_Barrier calls, but that means going deep into MAPL.

A "simpler" solution (no code change needed) is to use our "write restart by oserver" functionality, which doesn't use MPI_GatherV.

So this PR says: if you are a low-res job running with MPT, just set `WRITE_RESTART_BY_OSERVER: YES`. It's not perfect, but it's faster than fixing up MAPL for now.
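
For the Python setup, the rule could look roughly like the sketch below. This is only an illustration, not the code in this PR: `needs_oserver_restarts` and `set_rc_key` are hypothetical helper names, the `low_res` flag is left to the caller because the exact resolution cutoff is not stated here, and the assumption is that the setting lives in a plain `KEY: VALUE` resource file.

```python
import re


def needs_oserver_restarts(low_res: bool, mpi_stack: str) -> bool:
    """Hypothetical check: low-res job running with MPT.

    What counts as "low-res" is decided by the caller; no cutoff is assumed here.
    """
    return low_res and mpi_stack.strip().lower() == "mpt"


def set_rc_key(rc_text: str, key: str, value: str) -> str:
    """Set `key: value` in resource-file text, appending the line if it is absent.

    Commented-out lines (e.g. '# KEY: NO') are left untouched because the
    pattern only matches lines that start with the key itself.
    """
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}[ \t]*:.*$", re.MULTILINE)
    line = f"{key}: {value}"
    if pattern.search(rc_text):
        # Use a function replacement so `line` is never parsed as a regex template.
        return pattern.sub(lambda _match: line, rc_text)
    sep = "" if (not rc_text or rc_text.endswith("\n")) else "\n"
    return f"{rc_text}{sep}{line}\n"


if __name__ == "__main__":
    rc = "NX: 4\nWRITE_RESTART_BY_OSERVER: NO\n"
    if needs_oserver_restarts(low_res=True, mpi_stack="mpt"):
        rc = set_rc_key(rc, "WRITE_RESTART_BY_OSERVER", "YES")
    print(rc, end="")
```

Keeping the decision (`needs_oserver_restarts`) separate from the text edit (`set_rc_key`) means the condition can be tightened later, for example to single-node jobs only, without touching the file handling.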


NOTE: @sshakoor1 this will probably need to be added to the python setup.

@mathomp4 mathomp4 self-assigned this Oct 31, 2025
@mathomp4 mathomp4 requested a review from a team as a code owner October 31, 2025 17:38
@mathomp4 mathomp4 added the 0 diff The changes in this pull request have verified to be zero-diff with the target branch. label Oct 31, 2025
@sdrabenh sdrabenh merged commit 5ca3876 into feature/sdrabenh/gcm_v12 Nov 24, 2025
11 of 13 checks passed
@sdrabenh sdrabenh deleted the feature/v12-mpt-single-node branch November 24, 2025 17:51
