EAMxx: Option to run SHOC with no SGS variability (1.5 closure) #7188

bogensch · 2025-03-28T23:27:56Z

This PR adds an option to run SHOC with no SGS variability. This essentially reduces SHOC to a 1.5 TKE closure in the vertical.

If this option is activated (shoc_1p5tke = true):

- The scalar variances and covariances are set to zero.
- The third moment of vertical velocity is set to zero.
- Together, these two assumptions reduce the assumed PDF to an all-or-nothing closure.
- With an all-or-nothing closure, SHOC's buoyancy flux parameterization becomes ill-posed because the liquid water flux (w'ql') is always zero. The buoyancy flux is needed to close the buoyant production term in the TKE budget, so in this case the buoyant production of TKE is closed using the local moist Brunt-Väisälä frequency, as in many 1.5 TKE closures.
- The definition of the eddy diffusivities is modified to align better with formulations used in a 1.5 scheme.
- The length scale definition is modified to be exactly that of SAM when the 1.5 TKE closure is activated.
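The buoyant-production closure described in the list can be sketched roughly as follows. This is a minimal illustration and not the SHOC implementation: the function names, the coefficient `Ck`, and the ratio `kh_over_km` are assumptions made up for this example, and `P_b = -K_h * N^2` is only the generic form used by many 1.5-order TKE schemes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical constants for illustration only (not taken from SHOC or SAM).
constexpr double Ck = 0.1;          // eddy-viscosity coefficient
constexpr double kh_over_km = 1.0;  // assumed ratio of heat to momentum diffusivity

// Eddy viscosity for a 1.5-order closure: K_m = Ck * l * sqrt(TKE).
inline double eddy_viscosity(double length_scale, double tke) {
  assert(length_scale >= 0.0);
  return Ck * length_scale * std::sqrt(std::max(tke, 0.0));
}

// Buoyant production of TKE closed with the local moist Brunt-Vaisala
// frequency squared (brunt2, in s^-2): P_b = -K_h * N^2.
// Stable layers (N^2 > 0) remove TKE; unstable layers (N^2 < 0) produce it.
inline double buoyant_production(double length_scale, double tke, double brunt2) {
  const double tkh = kh_over_km * eddy_viscosity(length_scale, tke);
  return -tkh * brunt2;
}
```

For instance, `buoyant_production(100.0, 1.0, 1e-4)` is negative (stable layer) while the same call with `brunt2 = -1e-4` is positive. No liquid water flux enters the expression, which is why this form stays well-posed when w'ql' is identically zero.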

I have verified in several cases, and in one ne30 simulation, that if this option is activated, cloud fraction is always either 0 or 1 in instantaneous output.
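A check of this kind could look something like the sketch below. The helper name `is_all_or_nothing` and the use of a flat `std::vector<double>` for the cloud-fraction field are assumptions for illustration; the thread does not say how the verification was actually done.

```cpp
#include <algorithm>
#include <vector>

// Returns true when every cloud-fraction value is exactly 0 or 1, which is
// what an all-or-nothing closure should give in instantaneous output.
inline bool is_all_or_nothing(const std::vector<double>& cldfrac) {
  return std::all_of(cldfrac.begin(), cldfrac.end(),
                     [](double c) { return c == 0.0 || c == 1.0; });
}
```

The comparison is exact on purpose: time-averaged output would legitimately contain fractional values, so the check is only meaningful on instantaneous fields.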

Since this option is set to false by default, this should be a b4b PR.

mahf708

Minor formatting issues, otherwise looks good. Preliminary approval to get the CI to run.

AaronDonahue

Looks good to me. I had a question about one of the functions.

components/eamxx/src/physics/shoc/tests/infra/shoc_test_data.cpp

tcclevenger · 2025-05-08T18:38:36Z

Shoc unit tests failing on both gpu and cpu. CIME cases seem fine.

bogensch · 2025-05-12T19:26:14Z

Total bummer summer.

It looks like (from what I can tell) the SHOC property tests are passing but the b4b tests are failing for the routines I modified. For example, they are all failing with this error:

-------------------------------------------------------------------------------
diag_second_moments_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:364
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:364: FAILED:
due to unexpected exception with message:
  
 FAIL:
  nread == sz
  /home/runner/_work/E3SM/E3SM/externals/ekat/src/ekat/util/ekat_file_utils.
  hpp:24
  read: nread = 389631 sz = 389632

Probably related to me adding the shoc_1p5tke bool to the data structure for these tests? Note that this was the part of my changes I was least confident about, so if someone could double-check this for me...
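The `nread == sz` message in the log means the read returned one item fewer than requested (389631 versus 389632). The mechanism can be reproduced with a small sketch; `read_back` is a hypothetical helper written for this example and is not EKAT code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Writes `n_written` doubles to a temporary file, then tries to read
// `n_expected` of them back. Returns the item count reported by fread.
inline std::size_t read_back(std::size_t n_written, std::size_t n_expected) {
  std::FILE* f = std::tmpfile();
  assert(f != nullptr);  // the sketch assumes a temporary file can be created

  std::vector<double> out(n_written, 1.0);
  std::fwrite(out.data(), sizeof(double), out.size(), f);
  std::rewind(f);

  std::vector<double> in(n_expected, 0.0);
  const std::size_t nread = std::fread(in.data(), sizeof(double), in.size(), f);
  std::fclose(f);
  return nread;  // smaller than n_expected when the file is too short
}
```

With `read_back(7, 8)` the result is 7, the same kind of mismatch the baseline reader reports when the test data struct asks for more data than the stored baseline file contains.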

AaronDonahue · 2025-05-12T22:36:54Z

@tcclevenger would you be able to check on the test that Peter B. posts about up above?

tcclevenger · 2025-05-13T14:25:52Z

@bogensch @AaronDonahue Yeah, unit test related, should be a simple fix. I'll take a look.

tcclevenger · 2025-05-13T20:14:08Z

components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp

+      DiagSecondMomentsData(36,  72, 73, 2),
+      DiagSecondMomentsData(72,  72, 73, 2),
+      DiagSecondMomentsData(128, 72, 73, 2),
+      DiagSecondMomentsData(256, 72, 73, 2),


@bogensch Isn't this a bool? Why is it being set to 2?
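For context on why passing 2 compiles at all: in C++ an integer argument converts implicitly to `bool`, and any nonzero value becomes `true`. The struct below is a hypothetical stand-in; the real `DiagSecondMomentsData` constructor signature is not shown in this thread.

```cpp
// Hypothetical stand-in for a test-data struct with a trailing bool flag.
struct ExampleData {
  int shcol, nlev, nlevi;
  bool flag;
  ExampleData(int shcol_, int nlev_, int nlevi_, bool flag_)
    : shcol(shcol_), nlev(nlev_), nlevi(nlevi_), flag(flag_) {}
};

// The literal 2 is silently converted to true by the bool parameter.
inline bool flag_from_two() { return ExampleData(36, 72, 73, 2).flag; }
```

Some compilers may warn about this conversion, but it is not an error, so writing `true` or `false` explicitly makes the intent clearer.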

bogensch · 2025-05-14T17:35:23Z

Thanks @tcclevenger for your help. This issue should be addressed now.

tcclevenger · 2025-05-14T19:27:08Z

So I have triggered the fails on my machine, and they are purely due to adding a data input in the shoc unit tests, so loading baselines complains that the size of the file data doesn't match the test data struct (which is expected). We can then merge and bless baselines and we are good.

The issue is that it could be that this PR is non-BFB, but because we fail before comparing baselines we miss it, and we bless non-bfb changes. So then the thing to do would be remove shoc_1p5tke from the testing data structs, add it manually to the _host() test functions, verify BFB, merge, then new PR which moves that to the testing structs.

That's way overkill in my opinion because all the CIME cases are passing, so I vote that I merge and bless. I'll do this if I can get @AaronDonahue or @bartgol to agree. (Or they can point out a flaw in my logic)

mahf708 · 2025-05-14T21:17:25Z

> So I have triggered the fails on my machine, and they are purely due to adding a data input in the shoc unit tests, so loading baselines complains that the size of the file data doesn't match the test data struct (which is expected). We can then merge and bless baselines and we are good.
>
> The issue is that it could be that this PR is non-BFB, but because we fail before comparing baselines we miss it, and we bless non-bfb changes. So then the thing to do would be remove shoc_1p5tke from the testing data structs, add it manually to the _host() test functions, verify BFB, merge, then new PR which moves that to the testing structs.
>
> That's way overkill in my opinion because all the CIME cases are passing, so I vote that I merge and bless. I'll do this if I can get @AaronDonahue or @bartgol to agree. (Or they can point out a flaw in my logic)

I'd vote against merging without fixing the underlying issue, and I'd vote to change the cmp such that it is only comparing content that's supposed to be compared. I ran into similarly annoying stuff in p3 unit tests, but there, the size of the data struct is a parameter in one of the hpp files. This obviously will come up again and again, right?

mahf708 · 2025-05-14T21:24:20Z

I now see the offending code is here: https://github.com/E3SM-Project/EKAT/blob/78bdbc996838363b97adcb0427af21603d15b5e1/src/ekat/util/ekat_file_utils.hpp#L21-L25

I would entirely remove the check after fread:

template<typename T>
void read (T* v, size_t sz, const FILEPtr& fid) {
  size_t nread = fread(v, sizeof(T), sz, fid.get());
- EKAT_REQUIRE_MSG(nread == sz, "read: nread = " << nread << " sz = " << sz);
}

~~and then replace sz in call in shoc with something like 2x itself or much larger~~

I think fread already guarantees that nread <= sz

...

but since this is an ekat problem, I don't know how to proceed...

tcclevenger · 2025-05-16T00:43:22Z

Simply commenting out the check for file size being equal doesn't allow me to test. Next week I plan to manually set this new param outside of the BFB tests so that tests will pass. Later we can think about passing it to the test data structures if we want to test both true and false.

mahf708 · 2025-05-16T02:36:07Z

> Simply commenting out the check for file size being equal doesn't allow me to test. Next week I plan to manually set this new param outside of the BFB tests so that tests will pass. Later we can think about passing it to the test data structures if we want to test both true and false.

are you sure? it passes locally for me (pm-gpu opt)

mahf708 · 2025-05-16T03:34:26Z

@tcclevenger, check this out: https://github.com/E3SM-Project/E3SM/actions/runs/15059769481 (testing this single commit 8cbcc38 on top of @bogensch's branch)

Results: you see actual DIFFs in comparison, and not just the fread stuff, and I am not sure these are expected (examples below):

example 1

-------------------------------------------------------------------------------
shoc_length_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_length_tests.cpp:337
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_length_tests.cpp:311: FAILED:
  REQUIRE( d_baseline.brunt[k] == d_cxx.brunt[k] )
with expansion:
  -0.0f == -16.89397f

-------------------------------------------------------------------------------
shoc_mix_length_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_mix_length_tests.cpp:246
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_mix_length_tests.cpp:221: FAILED:
  REQUIRE( d_baseline.shoc_mix[k] == d_cxx.shoc_mix[k] )
with expansion:
  -3791530171772424091704779341824.0f
  ==
  13.2003f

-------------------------------------------------------------------------------
shoc_tke_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_tests.cpp:410
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_tests.cpp:382: FAILED:
  REQUIRE( d_baseline.tke[k] == d_cxx.tke[k] )
with expansion:
  0.0f == 0.0004f

-------------------------------------------------------------------------------
shoc_tke_adv_sgs_tke_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_adv_sgs_tke_tests.cpp:323
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_tke_adv_sgs_tke_tests.cpp:297: FAILED:
  REQUIRE( d_baseline.tke[k] == d_cxx.tke[k] )
with expansion:
  14853409233705726201873563648.0f == 0.0004f

-------------------------------------------------------------------------------
shoc_tke_eddy_diffusivities_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_eddy_diffusivities_tests.cpp:429
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_eddy_diffusivities_tests.cpp:403: FAILED:
  REQUIRE( d_baseline.tkh[k] == d_cxx.tkh[k] )
with expansion:
  0.0f == 0.03741f

-------------------------------------------------------------------------------
shoc_diag_third_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_third_tests.cpp:302
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_third_tests.cpp:277: FAILED:
  REQUIRE( d_baseline.w3[k] == d_cxx.w3[k] )
with expansion:
  -0.0f == 0.0f

-------------------------------------------------------------------------------
shoc_comp_diag_third_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_compute_diag_third_tests.cpp:286
...............................................................................

/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_compute_diag_third_tests.cpp:261: FAILED:
  REQUIRE( d_baseline.w3[k] == d_cxx.w3[k] )
with expansion:
  -0.0f == 0.0f

example 2

WARNING: Tl1_1 has 7 values <= allowable value.  Resetting to minimum value.
WARNING: Tl1_2 has 7 values <= allowable value.  Resetting to minimum value.
-------------------------------------------------------------------------------
diag_second_moments_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:364
...............................................................................
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_moments_tests.cpp:328: FAILED:
  REQUIRE( d_baseline.w_sec[k] == d_cxx.w_sec[k] )
with expansion:
  0.0 == 0.5888247862
-------------------------------------------------------------------------------
diag_second_shoc_moments_bfb
-------------------------------------------------------------------------------
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_shoc_moments_tests.cpp:377
...............................................................................
/home/runner/_work/E3SM/E3SM/components/eamxx/src/physics/shoc/tests/shoc_diag_second_shoc_moments_tests.cpp:341: FAILED:
  REQUIRE( d_baseline.w_sec[k] == d_cxx.w_sec[k] )
with expansion:
  -0.0 == 0.5145583301
===============================================================================
test cases:    110 |    101 passed | 9 failed
assertions: 222245 | 222236 passed | 9 failed
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[55269,1],0]
  Exit code:    1
--------------------------------------------------------------------------

tcclevenger · 2025-05-19T13:10:44Z

> @tcclevenger, check this out: https://github.com/E3SM-Project/E3SM/actions/runs/15059769481 (testing this single commit 8cbcc38 on top of @bogensch's branch)
>
> Results: you see actual DIFFs in comparison, and not just the fread stuff, and I am not sure these are expected (examples below):
>
> example 1
> example 2

The example I was testing looked more like nonsense, so I didn't know if the file comparison was off line because of the different sizes, but yeah these look like we are just setting to 0. I'll manually add the new param in one of the tests you posted to see what happens.

bartgol · 2025-05-19T14:26:23Z

@tcclevenger if you are sure that the failure is due to the new param (that is not found in the baselines file), then I would just merge and bless.

> Simply commenting out the check for file size being equal doesn't allow me to test.

I am not sure I understand this comment though. What happens when you comment the check?

tcclevenger · 2025-05-19T14:30:14Z

> @tcclevenger if you are sure that the failure is due to the new param (that is not found in the baselines file), then I would just merge and bless.

I'm not sure, I would have guessed that it was due to the new param, but it's not clear to me that is the case. So I want to do at least a quick check.

When I comment out the check I get a baseline diff which contains wildly different values. This doesn't make sense to me since the CIME cases are BFB. Not sure if it's unit test specific, or if the files being mismatched sizes means that we aren't comparing the correct values.

tcclevenger · 2025-05-19T17:08:15Z

@mahf708 Manually setting new param to false and removing from the testing data struct causes both the file size comparison and the bfb array comparisons to pass. I think the errors come from comparing files of different sizes, and not BFB issues in the source code. I'm going to test with the 2 examples you posted to be sure.

If we want to go a more rigorous route, I could make a quick PR that only adds the new param to the testing data, merge that and bless baselines, then have this PR actually use the new param.

tcclevenger · 2025-05-19T17:19:40Z

Yeah, tested the examples and same thing, I'm confident this is BFB. Merging and blessing.

bartgol · 2025-05-19T18:37:40Z

@tcclevenger Yeah, the check against baselines is quite "raw": we parse the file in binary form, and compare against expected values. Since the branch has an extra scalar, it probably gobbled down the scalars "shifted by one", causing wildly different values.
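The "shifted by one" effect can be illustrated with a small sketch. It assumes, purely for illustration, that a baseline is a flat sequence of scalars read field by field; the `NewLayout` struct and `parse_new_layout` function are hypothetical and do not reflect the actual EAMxx baseline format.

```cpp
#include <cstddef>
#include <vector>

// Reader-side layout after adding one extra leading scalar to the struct.
struct NewLayout {
  double extra, a, b;
  std::size_t nread;  // number of scalars actually consumed from the file
};

// Fills the fields in order from a raw scalar stream. If the stream was
// written with the old layout {a, b}, every field receives its neighbor's value.
inline NewLayout parse_new_layout(const std::vector<double>& file) {
  NewLayout out{0.0, 0.0, 0.0, 0};
  double* fields[] = {&out.extra, &out.a, &out.b};
  for (std::size_t i = 0; i < 3 && i < file.size(); ++i) {
    *fields[i] = file[i];
    ++out.nread;
  }
  return out;
}
```

Parsing an old-layout stream `{10.0, 20.0}` puts 10.0 into `extra`, 20.0 into `a`, leaves `b` unset, and reports `nread == 2` instead of 3. A raw reader has no field names to realign on, so both the size mismatch and the nonsensical comparisons come from the layout change rather than from the physics.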

mahf708 · 2025-05-19T20:17:07Z

Thanks @tcclevenger 👍

mahf708 · 2025-05-20T13:45:39Z

Looks like this PR caused some fpe errors: https://my.cdash.org/tests/273678509

tcclevenger · 2025-05-20T13:58:21Z

> Looks like this PR caused some fpe errors: https://my.cdash.org/tests/273678509

Investigating.

Fixes FPE introduced in #7188 from temporary values being computed which attempt to take sqrt(some negative value). These values weren't being used, so not an actual bug. [BFB]
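The kind of FPE described here can be reproduced in isolation. The sketch below uses the standard `<cfenv>` flags to show that `sqrt` of a negative value raises `FE_INVALID`; the `safe_sqrt` clamp is one common guard and is an assumption for illustration, not necessarily the fix applied in #7368.

```cpp
#include <algorithm>
#include <cfenv>
#include <cmath>

// Returns true if evaluating sqrt(x) raises the FE_INVALID exception flag.
inline bool sqrt_raises_invalid(double x) {
  volatile double in = x;  // volatile keeps the compiler from folding the call
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile double out = std::sqrt(in);
  (void)out;
  return std::fetestexcept(FE_INVALID) != 0;
}

// Hypothetical guard: clamp the argument so a temporary that is never used
// downstream cannot trip floating-point exception trapping.
inline double safe_sqrt(double x) { return std::sqrt(std::max(x, 0.0)); }
```

In builds that trap floating-point exceptions, the invalid operation aborts the run even when the resulting NaN is discarded, which matches the description of an FPE without an actual bug in the answers.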

Peter Bogenschutz and others added 19 commits December 16, 2024 15:52

add 1.5 TKE option to SCREAM

fc122d3

first round of fixes for 1.5 TKE closure

248fd66

modifications to get the code to compile

fd8ebb3

bring shoc1p5 tke option out to namelist

f849dd2

first round of changes for SHOC 1.5 TKE runtime option

46da4cb

fixes for 1p5 TKE closure

1c136d9

final round of fixes to get code to compile

66314dd

add stability correction term to dz based length scale computation fo…

02c71fd

…r 1p5 TKE closure

fixes to get property tests to mostly build

4196410

Merge remote-tracking branch 'origin' into bogensch/shoc_1p5tke

e852143

do not set momentum fluxes to zero for 1.5 TKE as they are not used i…

da60ed3

…nteractively and should be kept for output purposes

remove dz depedent length scale definition if 1.5 TKE is used

dd506af

updates to get property tests to mostly compile

e93e8ae

fixes to finally get standalone tests to compile

f040d0f

change flag from tke_1p5_closure to shoc_nosgs_var

65c441c

whitespace issue

75b81fc

modify some comments and fix some whitespace issues

534e23c

modify comment regarding the buoyant production term

e33a9a0

refactor way w3 is set to zero if no SGS variability is desired

76294b5

bogensch added BFB PR leaves answers BFB EAMxx Issues related to EAMxx labels Mar 28, 2025

bogensch requested review from AaronDonahue, hassanbeydoun and mahf708 March 28, 2025 23:27

mahf708 approved these changes Mar 30, 2025

rljacob assigned bartgol Mar 31, 2025

AaronDonahue reviewed Apr 1, 2025

tcclevenger assigned tcclevenger and unassigned bartgol Apr 1, 2025

bogensch added the wip work in progress label Apr 9, 2025

tcclevenger reviewed May 13, 2025

fix issue with bool being read in to b4b tests

48252dd

bogensch requested a review from tcclevenger May 14, 2025 17:34

tcclevenger approved these changes May 14, 2025

tcclevenger merged commit beaa26a into master May 19, 2025
11 of 33 checks passed

tcclevenger deleted the bogensch/shoc_1p5tke branch May 19, 2025 17:26

tcclevenger mentioned this pull request May 20, 2025

EAMxx: Fix FPE in shoc_compute_shoc_mix_shoc_length_impl.hpp #7368

Merged
