Improved error reporting for failed tolerance checks #988


Open: wants to merge 12 commits into master

Conversation

@DimAdam-01 (Contributor) commented Aug 18, 2025

User description

Description

Please include a summary of the changes and the related issue(s) if they exist.
Please also include relevant motivation and context.

Improves the tolerance validation system by adding comprehensive error diagnostics when tests fail.

Current behavior:

  • Fails fast on the first variable that exceeds tolerance (either absolute OR relative)
  • Reports only the location of the first failure
  • Limited diagnostic information for debugging numerical issues

Enhanced behavior:

  • Still fails fast on first tolerance violation (maintains performance)
  • After failure, scans ALL files to identify variables that fail tolerance checks
  • Reports both the first failure location AND the maximum errors among failing variables
  • Provides targeted diagnostic information including:
    • Where the maximum absolute error occurs among failing variables (with its relative error)
    • Where the maximum relative error occurs among failing variables (with its absolute error)

Benefits:

  • Better debugging of numerical accuracy issues
  • Identifies worst-case errors across entire dataset
  • Maintains fast-fail performance for passing tests
  • Comprehensive error analysis for failed test cases

This enhancement helps developers quickly identify both where tests first fail and where the most significant numerical errors occur in the simulation results.
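The two-phase flow described above can be sketched as follows. This is a simplified stand-in, not the actual tol.py implementation: packs are plain dicts of lists, and the tolerance test is an absolute-only check rather than tol.py's combined `is_close` test.

```python
import math

def compute_error(c, g):
    """Absolute and relative error between candidate and golden values."""
    absolute = abs(c - g)
    relative = absolute / abs(g) if g != 0 else math.nan
    return absolute, relative

def find_worst(candidate, golden):
    """Phase 2: scan every file and variable for the worst-case errors."""
    max_abs = max_rel = None
    for path, gvals in golden.items():
        cvals = candidate.get(path)
        if cvals is None:
            continue
        for i, (g, c) in enumerate(zip(gvals, cvals)):
            if math.isnan(g) or math.isnan(c):
                continue  # NaNs are not comparable; skip them here
            a, r = compute_error(c, g)
            if max_abs is None or a > max_abs[2]:
                max_abs = (path, i, a)
            if not math.isnan(r) and (max_rel is None or r > max_rel[2]):
                max_rel = (path, i, r)
    return max_abs, max_rel

def compare(candidate, golden, tol):
    """Phase 1: fail fast; the full scan runs only after a failure."""
    for path, gvals in golden.items():
        for i, (g, c) in enumerate(zip(gvals, candidate.get(path, []))):
            a, _ = compute_error(c, g)
            if math.isnan(a) or a > tol:
                return (path, i), find_worst(candidate, golden)
    return None  # everything passed; no extra scanning cost
```

On a passing run `find_worst` never executes, which is the "maintains fast-fail performance" claim above; on a failing run the caller gets both the first failure location and the global maxima.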


Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that they can be identified in profiles
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Enhancement


Description

  • Enhanced tolerance validation with comprehensive error diagnostics

  • Added maximum error scanning across all files after failure

  • Improved error reporting with detailed diagnostic information

  • Maintained fast-fail performance for passing tests


Diagram Walkthrough

flowchart LR
  A["Tolerance Check"] --> B["First Failure Detected"]
  B --> C["Scan ALL Files"]
  C --> D["Find Max Absolute Error"]
  C --> E["Find Max Relative Error"]
  D --> F["Format Diagnostic Report"]
  E --> F
  F --> G["Enhanced Error Message"]

File Walkthrough

Relevant files
Enhancement
tol.py
Enhanced tolerance validation with comprehensive error diagnostics

toolchain/mfc/packer/tol.py

  • Replaced simple error reporting with comprehensive diagnostics
  • Added find_maximum_errors() function to scan all files for worst-case
    errors
  • Added format_diagnostic_message() function for detailed error
    formatting
  • Enhanced error messages to include maximum absolute and relative error
    locations
+76/-7   

@DimAdam-01 DimAdam-01 requested a review from a team as a code owner August 18, 2025 10:50

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

A helper function defined inside the loop calls find_maximum_errors(candidate, golden) on every early-return path, which re-scans all files. This could be expensive, and it also relies on golden.entries.items() ordering; ensure deterministic ordering if messages are compared in tests, and consider computing diagnostics once per failure or memoizing.

            def raise_err_with_max_diagnostics(msg: str):
                # Find maximum errors across ALL files for diagnostics
                max_errors = find_maximum_errors(candidate, golden)
                diagnostic_msg = format_diagnostic_message(max_errors)

                return None, f"""\
Variable n°{valIndex+1} (1-indexed) in {gFilepath} {msg}:
  - Candidate:   {cVal}
  - Golden:      {gVal}
  - Error:       {error}
  - Tolerance:   {tol}{diagnostic_msg}
"""

            if math.isnan(gVal):
                return raise_err_with_max_diagnostics("is NaN in the golden file")
            if math.isnan(cVal):
                return raise_err_with_max_diagnostics("is NaN in the pack file")
            if not is_close(error, tol):
                return raise_err_with_max_diagnostics("is not within tolerance")
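One way to act on the memoization note above is a small closure that caches the expensive scan, so that however many early-return paths request diagnostics, it runs at most once. This is a hedged sketch: `make_diagnostics` and its `scan` parameter are hypothetical names, not part of tol.py.

```python
def make_diagnostics(candidate, golden, scan):
    """Cache the expensive full scan; repeated calls reuse the first result."""
    cache = {}
    def diagnostics():
        if "result" not in cache:
            cache["result"] = scan(candidate, golden)  # runs only on first call
        return cache["result"]
    return diagnostics
```

Equivalently, `functools.cache` on a zero-argument inner function would work, since the packs are captured by the closure rather than passed as (unhashable) arguments.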
Formatting/Consistency

The diagnostic message mixes plain numbers and scientific notation, and the relative error sometimes shows "NaN". Consider consistent formatting for candidate/golden values and guarding against "NaN" strings in the max-relative section to avoid confusing output.

def format_diagnostic_message(max_errors: typing.Tuple[typing.Optional[typing.Tuple[str, int, float, float, float, float]], typing.Optional[typing.Tuple[str, int, float, float, float, float]]]) -> str:
    """Format the diagnostic message showing maximum errors."""
    max_abs_info, max_rel_info = max_errors
    diagnostic_msg = ""

    if max_abs_info:
        filepath, var_idx, golden_val, candidate_val, abs_error, rel_error = max_abs_info
        rel_error_str = f"{rel_error:.2E}" if not math.isnan(rel_error) else "NaN"
        diagnostic_msg += f"\n\nDiagnostics - Maximum absolute error across ALL files:\n" \
                         f" - File: {filepath}\n" \
                         f" - Variable n°{var_idx+1}\n" \
                         f" - Candidate: {candidate_val}\n" \
                         f" - Golden: {golden_val}\n" \
                         f" - Absolute Error: {abs_error:.2E}\n" \
                         f" - Relative Error: {rel_error_str}"

    if max_rel_info:
        filepath, var_idx, golden_val, candidate_val, rel_error, abs_error = max_rel_info
        diagnostic_msg += f"\n\nDiagnostics - Maximum relative error across ALL files:\n" \
                         f" - File: {filepath}\n" \
                         f" - Variable n°{var_idx+1}\n" \
                         f" - Candidate: {candidate_val}\n" \
                         f" - Golden: {golden_val}\n" \
                         f" - Relative Error: {rel_error:.2E}\n" \
                         f" - Absolute Error: {abs_error:.2E}"

    return diagnostic_msg
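A minimal sketch of the consistency fix suggested above: route every value through one formatter so plain and scientific forms never mix, and replace the bare "NaN" token with an explicit note. The name `fmt_val` is hypothetical, not part of tol.py.

```python
import math

def fmt_val(x: float) -> str:
    """Render every number in the same scientific notation; flag NaN explicitly."""
    if math.isnan(x):
        return "NaN (not comparable)"
    return f"{x:.6E}"
```

The candidate/golden lines in the diagnostic block would then use `fmt_val(candidate_val)` and `fmt_val(golden_val)` instead of raw f-string interpolation.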
Edge Cases

find_maximum_errors skips NaNs entirely; if all values are NaN (in either file), both maxima return None and no diagnostics block is produced. Consider explicit messaging when maxima are unavailable, and verify behavior when the candidate is missing entries (currently the scan just continues).

def find_maximum_errors(candidate: Pack, golden: Pack) -> typing.Tuple[typing.Optional[typing.Tuple[str, int, float, float, float, float]], typing.Optional[typing.Tuple[str, int, float, float, float, float]]]:
    """
    Scan all files to find the maximum absolute and relative errors.
    Returns tuple of:
    - max_abs_info: (filepath, var_index, golden_val, candidate_val, absolute_error, relative_error)
    - max_rel_info: (filepath, var_index, golden_val, candidate_val, relative_error, absolute_error)
    """
    max_abs_error = -1.0
    max_abs_info = None

    max_rel_error = -1.0
    max_rel_info = None

    for gFilepath, gEntry in golden.entries.items():
        cEntry = candidate.find(gFilepath)
        if cEntry is None:
            continue

        for valIndex, (gVal, cVal) in enumerate(zip(gEntry.doubles, cEntry.doubles)):
            # Skip NaN values in golden or candidate
            if math.isnan(gVal) or math.isnan(cVal):
                continue

            error = compute_error(cVal, gVal)

            # Track maximum absolute error
            if error.absolute > max_abs_error:
                max_abs_error = error.absolute
                max_abs_info = (gFilepath, valIndex, gVal, cVal, error.absolute, error.relative)

            # Track maximum relative error (only if it's not NaN)
            if not math.isnan(error.relative) and error.relative > max_rel_error:
                max_rel_error = error.relative
                max_rel_info = (gFilepath, valIndex, gVal, cVal, error.relative, error.absolute)

    return max_abs_info, max_rel_info
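The all-NaN edge case flagged above can be demonstrated with a minimal stand-in that mirrors the scan's logic over plain dict packs (the function name and dict representation are illustrative, not tol.py's actual types):

```python
import math

def find_maximum_errors_sketch(candidate, golden):
    """Minimal stand-in for find_maximum_errors over dict-of-lists packs."""
    max_abs = max_rel = None
    best_abs = best_rel = float("-inf")
    for path, gvals in golden.items():
        cvals = candidate.get(path)
        if cvals is None:
            continue
        for i, (g, c) in enumerate(zip(gvals, cvals)):
            if math.isnan(g) or math.isnan(c):
                continue  # skipped entirely, exactly as noted above
            a = abs(c - g)
            r = a / abs(g) if g != 0 else math.nan
            if a > best_abs:
                best_abs, max_abs = a, (path, i)
            if not math.isnan(r) and r > best_rel:
                best_rel, max_rel = r, (path, i)
    return max_abs, max_rel

# When every golden value is NaN, both maxima stay None and the
# diagnostics block silently disappears:
nan = float("nan")
print(find_maximum_errors_sketch({"f": [1.0, 2.0]}, {"f": [nan, nan]}))  # (None, None)
```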


qodo-merge-pro bot commented Aug 18, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Potential O(N) rescans on failure

The new failure path calls find_maximum_errors during the first detected
violation, which rescans all files and all variables, potentially making a
single failure cost O(total dataset) work. If compare is used frequently or on
large packs, this could significantly degrade worst-case performance and violate
the “fails fast” intent. Consider precomputing maxima in the main loop
(amortized single pass) or gating the full scan behind a configurable flag to
preserve performance on large datasets.

Examples:

toolchain/mfc/packer/tol.py [48-66]
            def raise_err_with_max_diagnostics(msg: str):
                # Find maximum errors across ALL files for diagnostics
                max_errors = find_maximum_errors(candidate, golden)
                diagnostic_msg = format_diagnostic_message(max_errors)

                return None, f"""\
Variable n°{valIndex+1} (1-indexed) in {gFilepath} {msg}:
  - Candidate:   {cVal}
  - Golden:      {gVal}
  - Error:       {error}

 ... (clipped 9 lines)

Solution Walkthrough:

Before:

def compare(candidate, golden, tol):
    for gFilepath, gEntry in golden.entries.items():
        for valIndex, (gVal, cVal) in enumerate(zip(gEntry.doubles, cEntry.doubles)):
            # ...
            if not is_close(error, tol):
                # This triggers a full rescan of all files and variables
                max_errors = find_maximum_errors(candidate, golden)
                # ... format and return error
    return avg_err.get(), None

After:

def compare(candidate, golden, tol):
    first_failure = None
    max_abs_info, max_rel_info = None, None

    for gFilepath, gEntry in golden.entries.items():
        for valIndex, (gVal, cVal) in enumerate(zip(gEntry.doubles, cEntry.doubles)):
            # ... compute error
            # ... update max_abs_info and max_rel_info
            if first_failure is None and not is_close(error, tol):
                first_failure = (gFilepath, valIndex, ...)

    if first_failure is not None:
        # ... format and return error using first_failure and max error info
        # This avoids a second full scan.

    return avg_err.get(), None
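The "After" pseudocode above can be fleshed out into a runnable sketch, again using plain dicts in place of Pack objects and an absolute-only tolerance check (all names here are illustrative, not the actual tol.py API):

```python
import math

def compare_single_pass(candidate, golden, tol):
    """One pass: track the first failure and the running maxima together,
    avoiding a second full scan on the failure path."""
    first_failure = None
    max_abs = (float("-inf"), None)  # (error value, (filepath, index))
    max_rel = (float("-inf"), None)
    for path, gvals in golden.items():
        for i, (g, c) in enumerate(zip(gvals, candidate.get(path, []))):
            if math.isnan(g) or math.isnan(c):
                continue
            a = abs(c - g)
            r = a / abs(g) if g != 0 else math.nan
            if a > max_abs[0]:
                max_abs = (a, (path, i))
            if not math.isnan(r) and r > max_rel[0]:
                max_rel = (r, (path, i))
            if first_failure is None and a > tol:
                first_failure = (path, i, a)
    if first_failure is not None:
        return None, (first_failure, max_abs, max_rel)  # error report
    return "ok", None
```

The trade-off the suggestion alludes to: the maxima bookkeeping now runs even on passing tests, so "fail fast" becomes "always one full pass". That is why gating the full scan behind a configurable flag is offered as the alternative.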
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a significant performance degradation in the failure case, where a full rescan of the data is triggered, contradicting the "fail fast" goal for performance.

High
General
Handle empty diagnostics case

Guard against the case where no comparable values are found, returning a clear
note instead of an empty string. This prevents confusing blank diagnostics when
all values are NaN or files are missing.

toolchain/mfc/packer/tol.py [72-98]

 def format_diagnostic_message(max_errors: typing.Tuple[typing.Optional[typing.Tuple[str, int, float, float, float, float]], typing.Optional[typing.Tuple[str, int, float, float, float, float]]]) -> str:
     """Format the diagnostic message showing maximum errors."""
     max_abs_info, max_rel_info = max_errors
     diagnostic_msg = ""
 
     if max_abs_info:
         filepath, var_idx, golden_val, candidate_val, abs_error, rel_error = max_abs_info
         rel_error_str = f"{rel_error:.2E}" if not math.isnan(rel_error) else "NaN"
         diagnostic_msg += f"\n\nDiagnostics - Maximum absolute error across ALL files:\n" \
                          f" - File: {filepath}\n" \
                          f" - Variable n°{var_idx+1}\n" \
                          f" - Candidate: {candidate_val}\n" \
                          f" - Golden: {golden_val}\n" \
                          f" - Absolute Error: {abs_error:.2E}\n" \
                          f" - Relative Error: {rel_error_str}"
 
     if max_rel_info:
         filepath, var_idx, golden_val, candidate_val, rel_error, abs_error = max_rel_info
         diagnostic_msg += f"\n\nDiagnostics - Maximum relative error across ALL files:\n" \
                          f" - File: {filepath}\n" \
                          f" - Variable n°{var_idx+1}\n" \
                          f" - Candidate: {candidate_val}\n" \
                          f" - Golden: {golden_val}\n" \
                          f" - Relative Error: {rel_error:.2E}\n" \
                          f" - Absolute Error: {abs_error:.2E}"
 
+    if not diagnostic_msg:
+        diagnostic_msg = "\n\nDiagnostics: No comparable values found across files."
+
     return diagnostic_msg
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that an empty diagnostic message can be confusing and proposes adding a clear message, which improves the user experience of the error reporting.

Low
Use -inf for initial maxima

Initialize maxima with negative infinity to avoid incorrect comparisons when
actual errors can be negative or when domain assumptions change. This makes the
intent explicit and robust across edge cases.

toolchain/mfc/packer/tol.py [101-122]

 def find_maximum_errors(candidate: Pack, golden: Pack) -> typing.Tuple[typing.Optional[typing.Tuple[str, int, float, float, float, float]], typing.Optional[typing.Tuple[str, int, float, float, float, float]]]:
     ...
-    max_abs_error = -1.0
+    max_abs_error = float("-inf")
     max_abs_info = None
 
-    max_rel_error = -1.0
+    max_rel_error = float("-inf")
     max_rel_info = None
     ...
     for valIndex, (gVal, cVal) in enumerate(zip(gEntry.doubles, cEntry.doubles)):
         # Skip NaN values in golden or candidate
         if math.isnan(gVal) or math.isnan(cVal):
             continue
         ...
Suggestion importance[1-10]: 5


Why: This is a good practice suggestion that improves code robustness by initializing maxima with float("-inf"), which correctly handles all possible float values for errors.

Low

@sbryngelson (Member) commented:

just replace the existing error number with the new one?

@sbryngelson sbryngelson requested a review from Copilot August 18, 2025 12:49
@sbryngelson sbryngelson self-requested a review as a code owner August 18, 2025 12:50
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enhances the tolerance validation system in the MFC test suite by adding comprehensive error diagnostics when tolerance checks fail. The enhancement maintains the existing fast-fail behavior for performance while providing detailed analysis of numerical errors across all test files when failures occur.

Key changes:

  • Enhanced error reporting with maximum error analysis across all files
  • Added comprehensive diagnostic information including worst-case absolute and relative errors
  • Maintained fast-fail performance for passing tests


# Return the average relative error
return avg_err.get(), None


def format_diagnostic_message(max_errors: typing.Tuple[typing.Optional[typing.Tuple[str, int, float, float, float, float]], typing.Optional[typing.Tuple[str, int, float, float, float, float]]]) -> str:
Copilot AI commented Aug 18, 2025

The function signature uses deeply nested tuple types that are difficult to read and maintain. Consider defining named tuple types or dataclasses for the error information to improve code clarity.


Scan all files to find the maximum absolute and relative errors.
Returns tuple of:
- max_abs_info: (filepath, var_index, golden_val, candidate_val, absolute_error, relative_error)
- max_rel_info: (filepath, var_index, golden_val, candidate_val, relative_error, absolute_error)
Copilot AI commented Aug 18, 2025

The return type annotation uses the same deeply nested tuple structure as format_diagnostic_message. This complex type signature makes the code harder to understand and maintain. Consider using named tuples or dataclasses.

Suggested change

-    - max_rel_info: (filepath, var_index, golden_val, candidate_val, relative_error, absolute_error)

def format_diagnostic_message(max_errors: Tuple[Optional[ErrorInfo], Optional[ErrorInfo]]) -> str:
    """Format the diagnostic message showing maximum errors."""
    max_abs_info, max_rel_info = max_errors
    diagnostic_msg = ""

    if max_abs_info:
        rel_error_str = f"{max_abs_info.relative_error:.2E}" if not math.isnan(max_abs_info.relative_error) else "NaN"
        diagnostic_msg += f"\n\nDiagnostics - Maximum absolute error across ALL files:\n" \
                          f" - File: {max_abs_info.filepath}\n" \
                          f" - Variable n°{max_abs_info.var_index+1}\n" \
                          f" - Candidate: {max_abs_info.candidate_val}\n" \
                          f" - Golden: {max_abs_info.golden_val}\n" \
                          f" - Absolute Error: {max_abs_info.absolute_error:.2E}\n" \
                          f" - Relative Error: {rel_error_str}"

    if max_rel_info:
        diagnostic_msg += f"\n\nDiagnostics - Maximum relative error across ALL files:\n" \
                          f" - File: {max_rel_info.filepath}\n" \
                          f" - Variable n°{max_rel_info.var_index+1}\n" \
                          f" - Candidate: {max_rel_info.candidate_val}\n" \
                          f" - Golden: {max_rel_info.golden_val}\n" \
                          f" - Relative Error: {max_rel_info.relative_error:.2E}\n" \
                          f" - Absolute Error: {max_rel_info.absolute_error:.2E}"

    return diagnostic_msg

def find_maximum_errors(candidate: Pack, golden: Pack) -> Tuple[Optional[ErrorInfo], Optional[ErrorInfo]]:
    """
    Scan all files to find the maximum absolute and relative errors.
    Returns tuple of:
    - max_abs_info: ErrorInfo or None
    - max_rel_info: ErrorInfo or None

max_abs_error = -1.0
max_abs_info = None

max_rel_error = -1.0
Copilot AI commented Aug 18, 2025
Using -1.0 as an initial value for maximum error tracking is unclear and could be problematic if all errors are actually negative (though unlikely in this context). Consider using float('-inf') or None with explicit checks for better clarity.

Suggested change

-    max_abs_error = -1.0
+    max_abs_error = float('-inf')
     max_abs_info = None

-    max_rel_error = -1.0
+    max_rel_error = float('-inf')

max_abs_error = -1.0
max_abs_info = None

max_rel_error = -1.0
Copilot AI commented Aug 18, 2025

Same issue as with max_abs_error - using -1.0 as initial value is unclear. Consider using float('-inf') or None with explicit checks for better clarity.

Suggested change

-    max_abs_error = -1.0
+    max_abs_error = float('-inf')
     max_abs_info = None

-    max_rel_error = -1.0
+    max_rel_error = float('-inf')



codecov bot commented Aug 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.93%. Comparing base (88c0a11) to head (34c618a).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #988   +/-   ##
=======================================
  Coverage   40.93%   40.93%           
=======================================
  Files          70       70           
  Lines       20288    20288           
  Branches     2517     2517           
=======================================
  Hits         8305     8305           
  Misses      10447    10447           
  Partials     1536     1536           
