SNLS's trust region dogleg solver + various small bug fixes #61
Open
rcarson3 wants to merge 7 commits into
I largely had Claude drive this port, as the original SNLS trust region dogleg solver was relatively straightforward. Most of it maps over to MFEM's framework and wasn't overly complicated. The big boon from having Claude do this is that I also had it add the necessary mult transpose operators for the nonlinear form integrators. In particular, this was done for the PA forms, as the EA case was trivial and the full matrix version already implemented it. I was also surprised that it was able to do the BBar PA Grad implementation of everything as well. The math behind it was actually more straightforward than I was expecting, and it was cool to see how it could derive all of it using different relationships it had discovered from looking at the full integration form.
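For readers unfamiliar with the method, here is a minimal sketch of a single dogleg step on a dense 2x2 system. It is not the SNLS or ExaConstit implementation and uses no MFEM API; `Vec2`, `Mat2`, `Mult`, `MultTranspose`, `NewtonStep`, and `DoglegStep` are hypothetical names chosen for illustration. It assumes a nonsingular Jacobian `J` and a residual `r`, and it shows one place a transpose action appears in this family of methods: the steepest-descent direction uses the gradient `J^T r` of `0.5 * ||r||^2`.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical dense 2x2 types, for illustration only.
using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;  // row-major: J[i][j]

double Norm(const Vec2& v) { return std::hypot(v[0], v[1]); }

// y = J x
Vec2 Mult(const Mat2& J, const Vec2& x) {
    return {J[0][0] * x[0] + J[0][1] * x[1], J[1][0] * x[0] + J[1][1] * x[1]};
}

// y = J^T x (the transpose action used to form the gradient)
Vec2 MultTranspose(const Mat2& J, const Vec2& x) {
    return {J[0][0] * x[0] + J[1][0] * x[1], J[0][1] * x[0] + J[1][1] * x[1]};
}

// Solve J p = -r with Cramer's rule; assumes J is nonsingular.
Vec2 NewtonStep(const Mat2& J, const Vec2& r) {
    const double det = J[0][0] * J[1][1] - J[0][1] * J[1][0];
    assert(det != 0.0);
    return {-(J[1][1] * r[0] - J[0][1] * r[1]) / det,
            -(-J[1][0] * r[0] + J[0][0] * r[1]) / det};
}

// One dogleg step for r(x) = 0 inside a trust region of radius delta.
Vec2 DoglegStep(const Mat2& J, const Vec2& r, double delta) {
    assert(delta > 0.0);

    // 1. Take the full Newton step if it fits inside the trust region.
    const Vec2 pn = NewtonStep(J, r);
    if (Norm(pn) <= delta) return pn;

    // 2. Cauchy point: minimizer of the linear model along -g, g = J^T r.
    const Vec2 g = MultTranspose(J, r);
    const double gnorm = Norm(g);
    const Vec2 Jg = Mult(J, g);
    const double alpha = (gnorm * gnorm) / (Jg[0] * Jg[0] + Jg[1] * Jg[1]);
    const Vec2 pc = {-alpha * g[0], -alpha * g[1]};

    // If even the Cauchy point is outside, step to the boundary along -g.
    if (Norm(pc) >= delta) {
        return {-delta * g[0] / gnorm, -delta * g[1] / gnorm};
    }

    // 3. Otherwise walk from pc toward pn until the boundary is reached:
    //    solve ||pc + tau * (pn - pc)|| = delta for the positive root tau.
    const Vec2 d = {pn[0] - pc[0], pn[1] - pc[1]};
    const double a = d[0] * d[0] + d[1] * d[1];
    const double b = 2.0 * (pc[0] * d[0] + pc[1] * d[1]);
    const double c = pc[0] * pc[0] + pc[1] * pc[1] - delta * delta;
    const double tau = (-b + std::sqrt(b * b - 4.0 * a * c)) / (2.0 * a);
    return {pc[0] + tau * d[0], pc[1] + tau * d[1]};
}
```

In a real solver the returned step would be accepted or rejected by comparing the actual and predicted residual reduction, with `delta` grown or shrunk accordingly; that outer loop is omitted here.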
… mfem::Array<bool> I was getting some odd GPU failures here at one point and just moved over to std::array<bool, 3> to get rid of them, as I didn't want to deal with the odd memory issues I was hitting. The MFEM stuff was still fine. I think I was just hitting an odd bug in some GPU kernel, but moving to std::array is ultimately the better approach, as it honestly makes more sense for this stuff...
Had Codex help work out what changes were needed to get ExaConstit working with MFEM v4.9+, as I wanted to make sure we could use newer versions of Hypre, which apparently work better with ROCm v7+.
Found a couple of bugs in the post-processing. First, volume-average values were being output on every time step designated for the viz files, and the viz files were being output every time step, so I caught and fixed that. Next, in certain cases there could be segfaults because some models did not have a variable defined and thus had a vdim equal to 0, which was a fun bug to run down... Finally, and this was one I had Codex help dissect and chase down, we had some fun MPI stalls occurring, due to how data was being solved requiring the global communicator rather than a region-defined communicator... I didn't notice this in earlier testing, as our regions were usually on every rank... I'd still blame this one, though, on how MFEM handles the communicators for meshes / par finite element spaces and in turn how this relates to the data collections as well...
Was running into a nasty GPU bug where, on newer versions of MFEM, if we used more than one GPU we got different answers than our CPU runs, as certain terms were just 0.0. I could not for the life of me figure it out, other than that it was likely due to something in the velocity field not getting set near the time the boundary conditions were being applied. I threw Codex at the problem, and over a couple of iterations of debugging it was able to work out the error and find a suitable new MFEM API that fixed the GPU error and kept our answers on the CPU the same.
Port of SNLS's trust region dogleg solver over to ExaConstit. I had Claude largely do this, as it would have been a rather simple port since the math from both codes maps fairly easily, and I needed to test a wide range of new solvers to try to make some headway on some tougher material modelling cases... Claude honestly did a good job and even did a complete port of the BBar method to the PA, so we fully support that over there now as well...
Also, I'm including a number of fun bug fixes that I've accumulated since the last release. Some of these were quite non-trivial, and I spent quite a bit of time trying to nail down what was causing things to go AWOL in some problems. These were largely MPI stalls or new GPU bugs found when moving to new versions of MFEM. For these, I threw Codex at them after figuring out roughly where the bugs might be but not being able to work out the exact locations when they were happening at scale. It was able to automate a lot of the MPI logging and work out where things were stalling or diverging much faster than I would have been able to type. So definitely a nice debugging tool to use going forward...