Results diff across different MPI numbers #10

Open
jtruesdal opened this issue Jan 23, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@jtruesdal
Collaborator

What happened?

In an initial test of FKESSLER with 128 and 256 MPI tasks, the global answers differ at roundoff level after about 20 steps. BFB_FLAG is set in the namelist.
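As background for the symptom above, here is a minimal sketch of one generic way a change in task count can alter results at roundoff level: floating-point addition is not associative, so regrouping a global sum into a different number of partial sums can change the last bits. This is an illustration only, not a diagnosis of this issue; the data and the helper names `naive_sum` and `decomposed_sum` are hypothetical and nothing here comes from CAM or HOMME.

```python
def naive_sum(values):
    """Left-to-right floating-point sum (explicit loop, no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def decomposed_sum(values, nranks):
    """Sum `values` as if split into `nranks` contiguous chunks.

    Each chunk stands in for one MPI task's local partial sum; the
    partial sums are then combined in rank order.
    """
    n = len(values)
    partials = []
    for rank in range(nranks):
        lo = rank * n // nranks
        hi = (rank + 1) * n // nranks
        partials.append(naive_sum(values[lo:hi]))
    return naive_sum(partials)


# Hypothetical data: 1536 values of mixed magnitude (not model output).
values = [1.0 / (i + 1) ** 2 + 1.0e-3 * (i % 7) for i in range(1536)]

s128 = decomposed_sum(values, 128)
s256 = decomposed_sum(values, 256)
print(f"128 chunks: {s128!r}")
print(f"256 chunks: {s256!r}")
print(f"difference: {s128 - s256!r}")
```

Any difference printed here is at most a few units in the last place, which is the kind of divergence that can then grow over time steps in a nonlinear model.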

What are the steps to reproduce the bug?

Standard FKESSLER for NH setup as outlined on StormSPEED project page.

What CAM tag were you using?

nh branch

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/sunjian/cam7/case/FKESSLER_StormSPEED.01/

Will you be addressing this bug yourself?

Yes

Extra info

case directories
/glade/derecho/scratch/sunjian/cam7/case/FKESSLER_StormSPEED.01/
/glade/derecho/scratch/sunjian/cam7/case/FKESSLER_StormSPEED.02/

run directories
/glade/derecho/scratch/sunjian/FKESSLER_StormSPEED.01/run/
/glade/derecho/scratch/sunjian/FKESSLER_StormSPEED.02/run/

@jtruesdal jtruesdal added the bug Something isn't working label Jan 23, 2025
@jtruesdal
Collaborator Author

@sjsprecious could you run these same 2 tests with DEBUG set to TRUE in env_build.xml? This flag should turn off all optimization (i.e. -O0). If it doesn't already, could you make sure the Intel flags for DEBUG add the following options:

-fp-model precise -O0 -g -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created -init=snan,arrays
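One way to confirm that a build picked up every option in the list above is to compare it against the flag line reported in the build log. The sketch below is a minimal, assumed helper: `split_flags`, `missing_flags`, and the sample `build_flags` string are hypothetical, and it only understands the `-fp-model VALUE` and `-check VALUE` spellings used in this thread, not other spellings of the same options.

```python
REQUIRED_DEBUG_FLAGS = (
    "-fp-model precise -O0 -g -check uninit -check bounds -check pointers "
    "-fpe0 -check noarg_temp_created -init=snan,arrays"
)

# Options in this thread whose value is written as a separate word.
TAKES_VALUE = {"-fp-model", "-check"}


def split_flags(flag_string):
    """Split a flag string into options, keeping '-check bounds' style pairs together."""
    tokens = flag_string.split()
    flags = []
    i = 0
    while i < len(tokens):
        if tokens[i] in TAKES_VALUE and i + 1 < len(tokens):
            flags.append(f"{tokens[i]} {tokens[i + 1]}")
            i += 2
        else:
            flags.append(tokens[i])
            i += 1
    return flags


def missing_flags(required, actual):
    """Return the required options that do not appear in `actual`, in order."""
    present = set(split_flags(actual))
    return [f for f in split_flags(required) if f not in present]


# Hypothetical flag line, as might be copied from a build log.
build_flags = "-O2 -g -fp-model precise -check bounds -traceback"
print(missing_flags(REQUIRED_DEBUG_FLAGS, build_flags))
```

An empty list means every requested option was found; anything listed still needs to be added to the DEBUG flags.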

We also need to check whether standalone HOMME passes this test. Could you run standalone HOMME FKESSLER (#4 (comment)) with 128 and 256 MPI tasks with normal optimization? If that passes, we have our answer. If it is not BFB, could you run the same test with optimization turned off, using the above set of flags to compile the code?

@sjsprecious
Collaborator

Thanks @jtruesdal.

I have run two tests with DEBUG set to TRUE and the suggested flags. The new runs are still not BFB with each other. The output can be found on Derecho at:

  • Debug run with 128 MPI ranks: /glade/derecho/scratch/sunjian/stormspeed/run/FKESSLER.ne16_g37.derecho.intel.debug.128
  • Debug run with 256 MPI ranks: /glade/derecho/scratch/sunjian/stormspeed/run/FKESSLER.ne16_g37.derecho.intel.debug.256

I will update the standalone homme results once they are available.

@jtruesdal
Collaborator Author

Thanks Jian. I will start checking the usual suspects.

@sjsprecious
Collaborator

Hi @jtruesdal, here is the output of the standalone HOMME theta-l dycore on Derecho with the Intel compiler and the following compiler flags (the default optimization level is -O3):

Fortran Flags =  -assume byterecl -fp-model fast -ftz -diag-disable 8291 -O3 -g -qopenmp -traceback
C Flags =  -fp-model fast -ftz -g -fiopenmp
CXX Flags =  -fp-model fast -ftz -g -g -fiopenmp

  • 128 MPI ranks: /glade/derecho/scratch/sunjian/e3sm_test/homme/dcmip_tests/dcmip2016_test1_baroclinic_wave/theta-l/movies/r100_dcmip2016_test11_128mpiranks.nc
  • 256 MPI ranks: /glade/derecho/scratch/sunjian/e3sm_test/homme/dcmip_tests/dcmip2016_test1_baroclinic_wave/theta-l/movies/r100_dcmip2016_test11_256mpiranks.nc

The results are BFB to each other.
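For reference, "BFB" here means bit-for-bit identical, which is stricter than agreeing within a tolerance. A minimal sketch of such a check follows, assuming each field has already been read into a flat list of 64-bit floats (reading the NetCDF files is left out). The helper names and the sample values are hypothetical and do not come from the runs above.

```python
import struct


def float_bits(x):
    """Return the 64-bit IEEE-754 pattern of a Python float as an int."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def first_bitwise_mismatch(a, b):
    """Return the index of the first element whose bits differ, or None if all match."""
    if len(a) != len(b):
        raise ValueError("fields have different lengths")
    for i, (x, y) in enumerate(zip(a, b)):
        if float_bits(x) != float_bits(y):
            return i
    return None


# Hypothetical field values standing in for one variable from each run.
run_128 = [101325.0, 0.1 + 0.2, 287.04]
run_256 = [101325.0, 0.3, 287.04]

print(first_bitwise_mismatch(run_128, run_128))  # identical input
print(first_bitwise_mismatch(run_128, run_256))  # 0.1 + 0.2 vs 0.3
```

Comparing bit patterns rather than using `==` also distinguishes `0.0` from `-0.0` and treats identical NaN payloads as equal, which matches what a byte-level file comparison would report.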

I am not familiar with HOMME, so I hope I changed the number of MPI tasks correctly. I changed it by revising the TOTAL_NUM_MPI_TASKS variable and the PBS resources in the /glade/u/home/sunjian/e3sm/run_script/derecho/homme/job_submission.sh script for the 128- and 256-MPI runs.
