
EAMxx: Option to run SHOC with no SGS variability (1.5 closure) #7188


Merged
tcclevenger merged 37 commits into master from bogensch/shoc_1p5tke
May 19, 2025

Conversation

Contributor

@bogensch bogensch commented Mar 28, 2025

This PR adds an option to run SHOC with no SGS variability. This essentially reduces SHOC to a 1.5 TKE closure in the vertical.

If this option is activated (shoc_1p5tke = true):

  • The scalar variances and covariances are set to zero.
  • The third moment of vertical velocity is set to zero.
  • The above two assumptions will reduce the assumed PDF to an all-or-nothing closure.
  • Because an all-or-nothing closure is used, SHOC’s buoyancy flux parameterization becomes ill-posed: the liquid water flux (w'ql') is always zero. The buoyancy flux is needed to close the buoyant production term in the TKE budget, so in this case we close the buoyant production of TKE using the local moist Brunt-Väisälä frequency, as many 1.5 TKE closures do.
  • Modifies the definition of the eddy diffusivities to align better with formulations used in a 1.5 scheme.
  • Modifies the length scale definition to be exactly that of SAM when a 1.5 TKE closure is activated.
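
As a rough illustration of the buoyant-production closure in the list above, here is a minimal sketch assuming the common 1.5-order form B = -K_h N^2. The function name and signature are hypothetical and are not taken from the SHOC source; `tkh` is assumed to be the eddy diffusivity for heat and `brunt` the moist Brunt-Väisälä frequency squared.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the SHOC implementation: buoyant production of TKE
// in a 1.5-order closure, B = -K_h * N^2, where K_h is the eddy diffusivity
// for heat [m^2/s] and N^2 is the moist Brunt-Vaisala frequency squared [1/s^2].
std::vector<double> buoyant_production(const std::vector<double>& tkh,
                                       const std::vector<double>& brunt) {
  assert(tkh.size() == brunt.size());
  std::vector<double> prod(tkh.size());
  for (std::size_t k = 0; k < tkh.size(); ++k) {
    // Stable layers (N^2 > 0) destroy TKE; unstable layers (N^2 < 0) produce it.
    prod[k] = -tkh[k] * brunt[k];
  }
  return prod;
}
```

The point of this form is that it needs only local stability and a diffusivity, so it stays well-posed even when the liquid water flux from the assumed PDF is identically zero.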

I have verified in several cases, and in one ne30 simulation, that when this option is activated the cloud fraction is always either 0 or 1 in instantaneous output.
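
For readers unfamiliar with the term, an all-or-nothing closure can be sketched as follows. This is a hypothetical illustration, not SHOC's assumed-PDF code: `CloudState` and `all_or_nothing` are made-up names, and `qsat` is assumed to be the saturation mixing ratio at the grid box's actual temperature.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical illustration of an all-or-nothing (zero-variance) condensation
// scheme: the grid box is either fully cloudy or fully clear.
struct CloudState {
  double cloud_frac;  // 0 or 1, never fractional
  double ql;          // liquid water mixing ratio [kg/kg]
};

CloudState all_or_nothing(double qt, double qsat) {
  // With no subgrid variability, the whole box condenses once total water
  // exceeds saturation; otherwise there is no cloud at all.
  const bool saturated = qt > qsat;
  return CloudState{saturated ? 1.0 : 0.0, std::max(0.0, qt - qsat)};
}
```

Under these assumptions a fractional cloud cover can never appear in instantaneous output, which is the property checked above.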

Since this option defaults to false, this should be a b4b PR.

@bogensch bogensch added the BFB (PR leaves answers BFB) and EAMxx (Issues related to EAMxx) labels Mar 28, 2025
Contributor

@mahf708 mahf708 left a comment


Minor formatting issues, otherwise looks good. Preliminary approval to get the CI to run.

Contributor

@AaronDonahue AaronDonahue left a comment


Looks good to me. I had a question about one of the functions.

@tcclevenger tcclevenger assigned tcclevenger and unassigned bartgol Apr 1, 2025
@bogensch bogensch added the wip (work in progress) label Apr 9, 2025
@tcclevenger
Contributor

Shoc unit tests failing on both gpu and cpu. CIME cases seem fine.

@bogensch
Contributor Author

Total bummer summer.

It looks like (from what I can tell) the SHOC property tests are passing but the b4b tests are failing for the routines I modified. For example, they all fail with this error:

-------------------------------------------------------------------------------
diag_second_moments_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:364
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:364: FAILED:
due to unexpected exception with message:
  
 FAIL:
  nread == sz
  /home/runner/_work/E3SM/E3SM/externals/ekat/src/ekat/util/ekat_file_utils.
  hpp:24
  read: nread = 389631 sz = 389632

Probably related to my adding the shoc_1p5tke bool to the data structure for these tests? Note that this was the part of my changes I was least confident about, so if someone could double check this for me...

@AaronDonahue
Contributor

@tcclevenger would you be able to check on the test that Peter B. posts about up above?

@tcclevenger
Contributor

@bogensch @AaronDonahue Yeah, unit test related, should be a simple fix. I'll take a look.

Comment on lines 287 to 290
DiagSecondMomentsData(36, 72, 73, 2),
DiagSecondMomentsData(72, 72, 73, 2),
DiagSecondMomentsData(128, 72, 73, 2),
DiagSecondMomentsData(256, 72, 73, 2),
Contributor


@bogensch Isn't this a bool? Why is it being set to 2?
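
A small sketch of why a literal `2` can slip through where a bool is expected. `DemoData` is a hypothetical stand-in, not the real `DiagSecondMomentsData`, and its member names are invented for illustration; the only claim is the standard C++ rule that a nonzero integer converts implicitly to `true` in a constructor call.

```cpp
#include <cassert>

// Hypothetical stand-in for a test-data struct whose last constructor
// argument is a bool flag; this is not the real DiagSecondMomentsData.
struct DemoData {
  int n0, n1, n2;    // placeholder dimension arguments
  bool shoc_1p5tke;  // the new flag
  DemoData(int n0_, int n1_, int n2_, bool flag)
      : n0(n0_), n1(n1_), n2(n2_), shoc_1p5tke(flag) {}
};

// Passing an int where a bool is expected compiles: any nonzero value
// converts implicitly to true.
inline bool flag_from_int_literal() {
  DemoData d(36, 72, 73, 2);
  return d.shoc_1p5tke;
}
```

So a call like this would silently enable the flag rather than fail to compile, which is why the literal is worth questioning in review.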

@bogensch bogensch requested a review from tcclevenger May 14, 2025 17:34
@bogensch
Contributor Author

Thanks @tcclevenger for your help. This issue should be addressed now.

@tcclevenger
Contributor

So I have triggered the fails on my machine, and they are purely due to adding a data input in the shoc unit tests, so loading baselines complains that the size of the file data doesn't match the test data struct (which is expected). We can then merge and bless baselines and we are good.

The issue is that it could be that this PR is non-BFB, but because we fail before comparing baselines we miss it, and we bless non-bfb changes. So then the thing to do would be remove shoc_1p5tke from the testing data structs, add it manually to the _host() test functions, verify BFB, merge, then new PR which moves that to the testing structs.

That's way overkill in my opinion because all the CIME cases are passing, so I vote that I merge and bless. I'll do this if I can get @AaronDonahue or @bartgol to agree. (Or they can point out a flaw in my logic)
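
The size mismatch described in this comment can be reproduced in isolation. The sketch below is a standalone illustration, not EKAT or EAMxx code: `read_back` is a hypothetical helper that writes some doubles to a temporary file and then asks `fread` for a possibly larger count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Writes n_written doubles to a temporary file, then tries to read
// n_requested doubles back. Returns the count fread actually delivered.
std::size_t read_back(std::size_t n_written, std::size_t n_requested) {
  std::FILE* f = std::tmpfile();
  assert(f != nullptr);
  std::vector<double> out(n_written, 1.0);
  std::fwrite(out.data(), sizeof(double), out.size(), f);
  std::rewind(f);

  std::vector<double> in(n_requested, 0.0);
  // fread returns the number of complete elements read, which is smaller
  // than the request when the file ends early.
  const std::size_t nread = std::fread(in.data(), sizeof(double), in.size(), f);
  std::fclose(f);
  return nread;
}
```

With `read_back(4, 5)` the return value is 4, one short of the request. That has the same shape as `nread = 389631 sz = 389632` in the log earlier in the thread: the reader expects one more element than the baseline file contains.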

@mahf708
Contributor

mahf708 commented May 14, 2025

> So I have triggered the fails on my machine, and they are purely due to adding a data input in the shoc unit tests, so loading baselines complains that the size of the file data doesn't match the test data struct (which is expected). We can then merge and bless baselines and we are good.
>
> The issue is that it could be that this PR is non-BFB, but because we fail before comparing baselines we miss it, and we bless non-bfb changes. So then the thing to do would be remove shoc_1p5tke from the testing data structs, add it manually to the _host() test functions, verify BFB, merge, then new PR which moves that to the testing structs.
>
> That's way overkill in my opinion because all the CIME cases are passing, so I vote that I merge and bless. I'll do this if I can get @AaronDonahue or @bartgol to agree. (Or they can point out a flaw in my logic)

I'd vote against merging without fixing the underlying issue, and I'd vote to change the cmp such that it is only comparing content that's supposed to be compared. I ran into similarly annoying stuff in p3 unit tests, but there, the size of the data struct is a parameter in one of the hpp files. This obviously will come up again and again, right?

@mahf708
Contributor

mahf708 commented May 14, 2025

I now see the offending code is here: https://github.com/E3SM-Project/EKAT/blob/78bdbc996838363b97adcb0427af21603d15b5e1/src/ekat/util/ekat_file_utils.hpp#L21-L25

I would entirely remove the check after fread:

template<typename T>
void read (T* v, size_t sz, const FILEPtr& fid) {
  size_t nread = fread(v, sizeof(T), sz, fid.get());
- EKAT_REQUIRE_MSG(nread == sz, "read: nread = " << nread << " sz = " << sz);
}

and then replace sz in the call in SHOC with something like 2x itself, or much larger

I think fread already guarantees that nread <= sz

...

but since this is an ekat problem, I don't know how to proceed...

@tcclevenger
Contributor

Simply commenting out the check for file size being equal doesn't allow me to test. Next week I plan to manually set this new param outside of the BFB tests so that tests will pass. Later we can think about passing it to the test data structures if we want to test both true and false.

@mahf708
Contributor

mahf708 commented May 16, 2025

> Simply commenting out the check for file size being equal doesn't allow me to test. Next week I plan to manually set this new param outside of the BFB tests so that tests will pass. Later we can think about passing it to the test data structures if we want to test both true and false.

are you sure? it passes locally for me (pm-gpu opt)

@mahf708
Contributor

mahf708 commented May 16, 2025

@tcclevenger, check this out: https://github.com/E3SM-Project/E3SM/actions/runs/15059769481 (testing this single commit 8cbcc38 on top of @bogensch's branch)

Results: you see actual DIFFs in comparison, and not just the fread stuff, and I am not sure these are expected (examples below):

example 1

-------------------------------------------------------------------------------
shoc_length_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_length_tests.cpp:337
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_length_tests.cpp:311: FAILED:
  REQUIRE( d_baseline.brunt[k] == d_cxx.brunt[k] )
with expansion:
  -0.0f == -16.89397f

-------------------------------------------------------------------------------
shoc_mix_length_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_mix_length_tests.cpp:246
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_mix_length_tests.cpp:221: FAILED:
  REQUIRE( d_baseline.shoc_mix[k] == d_cxx.shoc_mix[k] )
with expansion:
  -3791530171772424091704779341824.0f
  ==
  13.2003f

-------------------------------------------------------------------------------
shoc_tke_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_tests.cpp:410
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_tests.cpp:382: FAILED:
  REQUIRE( d_baseline.tke[k] == d_cxx.tke[k] )
with expansion:
  0.0f == 0.0004f

-------------------------------------------------------------------------------
shoc_tke_adv_sgs_tke_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_adv_sgs_tke_tests.cpp:323
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_adv_sgs_tke_tests.cpp:297: FAILED:
  REQUIRE( d_baseline.tke[k] == d_cxx.tke[k] )
with expansion:
  14853409233705726201873563648.0f == 0.0004f

-------------------------------------------------------------------------------
shoc_tke_eddy_diffusivities_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_eddy_diffusivities_tests.cpp:429
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_eddy_diffusivities_tests.cpp:403: FAILED:
  REQUIRE( d_baseline.tkh[k] == d_cxx.tkh[k] )
with expansion:
  0.0f == 0.03741f

-------------------------------------------------------------------------------
shoc_diag_third_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_third_tests.cpp:302
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_third_tests.cpp:277: FAILED:
  REQUIRE( d_baseline.w3[k] == d_cxx.w3[k] )
with expansion:
  -0.0f == 0.0f

-------------------------------------------------------------------------------
shoc_comp_diag_third_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_compute_diag_third_tests.cpp:286
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_compute_diag_third_tests.cpp:261: FAILED:
  REQUIRE( d_baseline.w3[k] == d_cxx.w3[k] )
with expansion:
  -0.0f == 0.0f

example 2

WARNING: Tl1_1 has 7 values <= allowable value.  Resetting to minimum value.
WARNING: Tl1_2 has 7 values <= allowable value.  Resetting to minimum value.
-------------------------------------------------------------------------------
diag_second_moments_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:364
...............................................................................
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:328: FAILED:
  REQUIRE( d_baseline.w_sec[k] == d_cxx.w_sec[k] )
with expansion:
  0.0 == 0.5888247862
-------------------------------------------------------------------------------
diag_second_shoc_moments_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_shoc_moments_tests.cpp:377
...............................................................................
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_shoc_moments_tests.cpp:341: FAILED:
  REQUIRE( d_baseline.w_sec[k] == d_cxx.w_sec[k] )
with expansion:
  -0.0 == 0.5145583301
===============================================================================
test cases:    110 |    101 passed | 9 failed
assertions: 222245 | 222236 passed | 9 failed
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[55269,1],0]
  Exit code:    1
--------------------------------------------------------------------------

@tcclevenger
Contributor

> @tcclevenger, check this out: https://github.com/E3SM-Project/E3SM/actions/runs/15059769481 (testing this single commit 8cbcc38 on top of @bogensch's branch)
>
> Results: you see actual DIFFs in comparison, and not just the fread stuff, and I am not sure these are expected (examples below):
>
> example 1
> example 2

The example I was testing looked more like nonsense, so I didn't know if the file comparison was off line because of the different sizes, but yeah these look like we are just setting to 0. I'll manually add the new param in one of the tests you posted to see what happens.

@bartgol
Contributor

bartgol commented May 19, 2025

@tcclevenger if you are sure that the failure is due to the new param (that is not found in the baselines file), then I would just merge and bless.

> Simply commenting out the check for file size being equal doesn't allow me to test.

I am not sure I understand this comment though. What happens when you comment the check?

@tcclevenger
Contributor

> @tcclevenger if you are sure that the failure is due to the new param (that is not found in the baselines file), then I would just merge and bless.

I'm not sure, I would have guessed that it was due to the new param, but it's not clear to me that is the case. So I want to do at least a quick check.

When I comment out the check I get a baseline diff which contains wildly different values. This doesn't make sense to me since the CIME cases are BFB. Not sure if it's unit test specific, or if the files being of mismatched sizes means that we are not comparing the correct values.

@tcclevenger
Contributor

@mahf708 Manually setting new param to false and removing from the testing data struct causes both the file size comparison and the bfb array comparisons to pass. I think the errors come from comparing files of different sizes, and not BFB issues in the source code. I'm going to test with the 2 examples you posted to be sure.

If we want to go a more rigorous route, I could make a quick PR that only adds the new param to the testing data, merge that and bless baselines, then have this PR actually use the new param.

@tcclevenger
Contributor

Yeah, tested the examples and same thing, I'm confident this is BFB. Merging and blessing.

tcclevenger added a commit that referenced this pull request May 19, 2025
This PR adds an option to run SHOC with no SGS variability. This
essentially reduces SHOC to a 1.5 TKE closure in the vertical.

If this option is activated (shoc_1p5tke = true):
  - The scalar variances and covariances are set to zero.
  - The third moment of vertical velocity is set to zero.
  - The above two assumptions will reduce the assumed PDF to an
    all-or-nothing closure.
  - Since an all-or-nothing closure is used this means SHOC’s
    buoyancy flux parameterization is now ill-posed due to the
    liquid water flux (w'ql') always being zero. The buoyancy
    flux is needed to close the buoyant production term in the
    TKE budget. In this case we close the buoyant production of
    TKE using the local moist Brunt-Väisälä frequency, which
    follows many 1.5 TKE closures.
  - Modifies the definition of the eddy diffusivities to align
    better with formulations used in a 1.5 scheme.
  - Modifies the length scale definition to be exactly that of
    SAM when a 1.5 TKE closure is activated.

I have verified in several cases, and one ne30 simulation, that if
this option is activated the cloud fraction will always be either
0 or 1 using instantaneous output.

[BFB]
@tcclevenger tcclevenger merged commit beaa26a into master May 19, 2025
11 of 33 checks passed
@tcclevenger tcclevenger deleted the bogensch/shoc_1p5tke branch May 19, 2025 17:26
@bartgol
Contributor

bartgol commented May 19, 2025

@tcclevenger Yeah, the check against baselines is quite "raw": we parse the file in binary form, and compare against expected values. Since the branch has an extra scalar, it probably gobbled down the scalars "shifted by one", causing wildly different values.
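
The "shifted by one" effect can be sketched with a flat buffer standing in for the raw baseline file. Everything here is hypothetical: the `Reader` class, the two layouts, and the position of the extra scalar are illustrative assumptions, not the actual EAMxx test structures. The numeric values are borrowed from the logs above only as sample data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reader over a flat baseline buffer, standing in
// for a raw binary file that stores values back to back with no field names.
class Reader {
 public:
  explicit Reader(const std::vector<double>& buf) : buf_(buf), pos_(0) {}
  // Returns the next value, or -1.0 as a sentinel once the buffer is exhausted.
  double next() { return pos_ < buf_.size() ? buf_[pos_++] : -1.0; }

 private:
  const std::vector<double>& buf_;
  std::size_t pos_;
};

struct OldLayout { double tke; double tkh; };
struct NewLayout { double flag; double tke; double tkh; };  // one extra scalar

// Reads the file the way it was written: two scalars in order.
OldLayout read_old(const std::vector<double>& file) {
  Reader r(file);
  OldLayout d;
  d.tke = r.next();
  d.tkh = r.next();
  return d;
}

// Reads the same file with the new layout: every field lands one slot late.
NewLayout read_new(const std::vector<double>& file) {
  Reader r(file);
  NewLayout d;
  d.flag = r.next();  // actually consumes the baseline's tke
  d.tke = r.next();   // actually consumes the baseline's tkh
  d.tkh = r.next();   // nothing left in the file: sentinel
  return d;
}
```

Wherever the extra scalar sits in the real struct, every value read after it is misaligned in the same way, so large apparent diffs do not by themselves indicate a non-BFB source change.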

@mahf708
Contributor

mahf708 commented May 19, 2025

Thanks @tcclevenger 👍

@mahf708
Contributor

mahf708 commented May 20, 2025

Looks like this PR caused some fpe errors: https://my.cdash.org/tests/273678509

@tcclevenger
Contributor

> Looks like this PR caused some fpe errors: https://my.cdash.org/tests/273678509

Investigating.

tcclevenger added a commit that referenced this pull request May 21, 2025
Fixes FPE introduced in #7188 from temporary values being computed
which attempt to take sqrt(some negative value). These values weren't
being used, so not an actual bug.

[BFB]
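
The commit message does not say how the fix was implemented, so the following is only a sketch of the failure mode and one possible guard. Both function names are hypothetical. It relies on the standard `<cfenv>` flag interface; `volatile` is used so the compiler does not fold the `sqrt` call away.

```cpp
#include <algorithm>
#include <cassert>
#include <cfenv>
#include <cmath>

// Returns true if taking the square root of x raises the IEEE "invalid"
// floating-point exception flag (the condition an FPE trap would catch).
bool sqrt_raises_invalid(double x) {
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile double in = x;  // volatile keeps the compiler from folding the call
  volatile double out = std::sqrt(in);
  (void)out;
  return std::fetestexcept(FE_INVALID) != 0;
}

// One hypothetical guard: clamp the argument so temporaries that are never
// used downstream cannot feed a negative value to sqrt.
double guarded_sqrt(double x) {
  return std::sqrt(std::max(0.0, x));
}
```

This shows why a build with FPE trapping can fail even when answers are unchanged: the invalid operation is flagged at the moment the temporary is computed, whether or not its value is ever read.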
Labels
BFB (PR leaves answers BFB), EAMxx (Issues related to EAMxx)

5 participants