
feat: handle masked forces in test #4893


Open · caic99 wants to merge 3 commits into devel

Conversation

@caic99 (Member) commented Aug 15, 2025

Summary by CodeRabbit

  • New Features

    • Added per-atom weighting (masking) for force error metrics in evaluation. When provided, weighted MAE/RMSE are computed and reported, reflecting only the selected atoms. Handles zero-weight cases gracefully (a small sketch of this logic follows this list).
  • Tests

    • Introduced end-to-end validation ensuring masked-atom force metrics align with evaluator outputs, improving confidence in results.
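
A minimal sketch of the weighted-metric logic summarized above; the function name and shapes are illustrative, not the actual code in deepmd/entrypoints/test.py:

```python
import numpy as np

# Illustrative only: assumes forces and atom_pref are already flattened to
# (nframes, natoms * 3), as a loader with repeat=3 would provide.
def masked_force_errors(pred_f, true_f, atom_pref):
    diff = (pred_f - true_f) * atom_pref   # zero-weight atoms drop out of the residuals
    size_f = np.sum(atom_pref)             # effective (weighted) sample size
    if size_f > 0:
        mae_f = np.sum(np.abs(diff)) / size_f
        rmse_f = np.sqrt(np.sum(diff * diff) / size_f)
    else:                                  # every atom masked: report zeros instead of NaN
        mae_f = rmse_f = 0.0
    return mae_f, rmse_f, size_f
```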

@caic99 caic99 requested a review from anyangml August 15, 2025 08:17

codecov bot commented Aug 15, 2025

Codecov Report

❌ Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.29%. Comparing base (dc99ba2) to head (13d43e2).
⚠️ Report is 1 commit behind head on devel.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/entrypoints/test.py | 87.50% | 2 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #4893   +/-   ##
=======================================
  Coverage   84.29%   84.29%           
=======================================
  Files         702      702           
  Lines       68664    68677   +13     
  Branches     3572     3572           
=======================================
+ Hits        57883    57894   +11     
- Misses       9641     9643    +2     
  Partials     1140     1140           

☔ View full report in Codecov by Sentry.

@caic99 caic99 marked this pull request as ready for review August 18, 2025 02:26
@Copilot Copilot AI review requested due to automatic review settings August 18, 2025 02:26
Copilot's earlier review comment was marked as outdated.

@coderabbitai bot (Contributor) commented Aug 18, 2025

📝 Walkthrough

Adds optional per-atom weighting for force error metrics in deepmd/entrypoints/test.py using atom_pref and find_atom_pref. Updates returned MAE/RMSE tuples to reflect weighted sample size. Introduces a new PyTorch test validating masked-atom force metrics end-to-end against dp_test_ener.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Weighted force metrics in dp_test_ener<br>deepmd/entrypoints/test.py | Adds atom_pref data field and find_atom_pref flag. Computes weighted force residuals, MAE, RMSE, and effective sample size (size_f). Updates returned metric tuples to use size_f; falls back to unweighted when disabled. |
| Force-mask test workflow<br>source/tests/pt/test_dp_test.py | Adds TestDPTestForceMask with setup/teardown. Prepares masked system (atom_pref), runs model eval with DeepEval/DeepmdData, compares computed masked MAE/RMSE with dp_test_ener outputs. Saves/loads scripted model for the test. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Tester as dp_test_ener
  participant Data as test_data
  participant Metrics as MetricCalc

  Tester->>Data: get(force, test_forces, atom_pref, find_atom_pref)
  alt find_atom_pref == 1
    Tester->>Metrics: compute weighted diff_f = (force - test_forces) * atom_pref
    Metrics-->>Tester: mae_f, rmse_f, size_f = sum(atom_pref)
  else
    Tester->>Metrics: compute unweighted diff_f = (force - test_forces)
    Metrics-->>Tester: mae_f, rmse_f, size_f = force.size
  end
  Tester-->>Tester: package results with (value, size_f)
```

```mermaid
sequenceDiagram
  participant Test as TestDPTestForceMask
  participant Prep as _prepare_masked_system
  participant Trainer as Trainer/Script
  participant Eval as DeepEval
  participant Loader as DeepmdData

  Test->>Prep: duplicate data, set atom_pref mask
  Test->>Trainer: train & script model, save/load
  Test->>Loader: load masked test set
  Test->>Eval: run dp_test_ener on test data
  Eval-->>Test: returns mae_f, rmse_f with size_f
  Test-->>Test: independently compute masked MAE/RMSE
  Test-->>Test: assert equality with dp_test_ener
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • iProzd
  • njzjz


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
deepmd/entrypoints/test.py (1)

421-435: Weighted force MAE/RMSE look correct; add a shape guard for robustness

The weighting logic (sum of abs residuals / sum of weights and sqrt(weighted MSE)) is sound, and size_f=sum(weights) makes the system-level aggregation consistent. To tolerate atom_pref provided as (nframes, natoms) without depending on repeat=3 at load time, add a defensive broadcast/validation.

```diff
-    if find_atom_pref == 1:
-        atom_pref = test_data["atom_pref"][:numb_test]
-        diff_f = diff_f * atom_pref
+    if find_atom_pref == 1:
+        atom_pref = test_data["atom_pref"][:numb_test]
+        # Accept both (nframes, natoms) and (nframes, natoms*3)
+        if atom_pref.shape != diff_f.shape:
+            if atom_pref.shape[1] * 3 == diff_f.shape[1]:
+                atom_pref = np.repeat(atom_pref, 3, axis=1)
+            else:
+                raise ValueError(
+                    f"atom_pref shape {atom_pref.shape} is incompatible with force shape {diff_f.shape}"
+                )
+        diff_f = diff_f * atom_pref
         size_f = np.sum(atom_pref)
         if size_f > 0:
             mae_f = np.sum(np.abs(diff_f)) / size_f
             rmse_f = np.sqrt(np.sum(diff_f * diff_f) / size_f)
         else:
             mae_f = 0.0
             rmse_f = 0.0
```

If desired, we can also factor this into small helpers (weighted_mae/weighted_rmse) to keep this block concise.
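
One possible shape for those helpers, sketched here under the assumption of non-negative weights; for a 0/1 mask they reproduce the block above exactly, while for general weights they use the standard weighted definitions:

```python
import numpy as np

def weighted_mae(diff, weights):
    # Weighted mean absolute error; returns 0.0 when all weights are zero.
    s = np.sum(weights)
    return np.sum(np.abs(diff) * weights) / s if s > 0 else 0.0

def weighted_rmse(diff, weights):
    # Weighted root-mean-square error with the same zero-weight fallback.
    s = np.sum(weights)
    return np.sqrt(np.sum(diff * diff * weights) / s) if s > 0 else 0.0
```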

source/tests/pt/test_dp_test.py (2)

168-181: Mask preparation is correct; tiny robustness note about copytree target

Copying the sample system into a mkdtemp dir and writing atom_pref.npy as (nframes, natoms) is consistent with repeat=3 in the loader. Setting the last atom’s weight to 0 validates masking well. One minor robustness nit: shutil.copytree(..., dirs_exist_ok=True) is fine here, but if src ever contains symlinks or permissions need preserving, consider copy_function/copy2. Not required for this test data.
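
A hedged sketch of the masked-system preparation described here; paths and sizes are placeholders, not the test's actual data:

```python
import shutil
import tempfile
from pathlib import Path

import numpy as np

src = Path("path/to/sample_system")   # placeholder for the copied source system
dst = Path(tempfile.mkdtemp())        # scratch copy, removed in tearDown
shutil.copytree(src, dst, dirs_exist_ok=True)

nframes, natoms = 1, 6                # assumed sizes for the sketch
atom_pref = np.ones((nframes, natoms))
atom_pref[:, -1] = 0.0                # zero weight: the last atom is excluded from force metrics
np.save(dst / "set.000" / "atom_pref.npy", atom_pref)
```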


182-230: Good end-to-end assertions; add tolerances and verify the reported sample size

The comparison path is sound: run dp_test_ener, independently compute the masked MAE/RMSE from dp.eval outputs, and compare. To avoid flaky failures from float32 arithmetic and to verify the new size_f behavior, consider:

  • Use explicit tolerances in assert_allclose.
  • Assert that the returned sizes equal the mask sum.
```diff
-        np.testing.assert_allclose(err["mae_f"][0], mae_expected)
-        np.testing.assert_allclose(err["rmse_f"][0], rmse_expected)
+        np.testing.assert_allclose(err["mae_f"][0], mae_expected, rtol=1e-6, atol=1e-8)
+        np.testing.assert_allclose(err["rmse_f"][0], rmse_expected, rtol=1e-6, atol=1e-8)
+        # Also verify the effective sample size matches the mask sum
+        np.testing.assert_equal(err["mae_f"][1], denom)
+        np.testing.assert_equal(err["rmse_f"][1], denom)
```
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between dc99ba2 and 13d43e2.

📒 Files selected for processing (2)
  • deepmd/entrypoints/test.py (4 hunks)
  • source/tests/pt/test_dp_test.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
deepmd/entrypoints/test.py (4)
deepmd/utils/data_system.py (1)
  • add (342-395)
deepmd/utils/data.py (1)
  • add (136-189)
deepmd/pt/utils/stat.py (1)
  • rmse (525-526)
deepmd/dpmodel/output_def.py (1)
  • size (230-231)
🪛 Ruff (0.12.2)
source/tests/pt/test_dp_test.py

187-187: Use a context manager for opening files

(SIM115)

🔇 Additional comments (6)
deepmd/entrypoints/test.py (3)

294-295: Correctly declares atom-level preference mask aligned with force components

Adding atom_pref as an atomic item with repeat=3 matches the flattened (natoms*3) force layout. Keeping must=False ensures backward compatibility (feature enabled only when the file is present).
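
A hedged sketch of what that declaration could look like on the DeepmdData instance used by the tester; the system path is hypothetical and the exact keyword set in the patch may differ:

```python
from deepmd.utils.data import DeepmdData

data = DeepmdData("path/to/system")  # hypothetical system path
data.add(
    "atom_pref",   # per-atom weight, read from atom_pref.npy when present
    1,             # one value per atom ...
    atomic=True,
    must=False,    # optional: weighting only applies when the file exists
    repeat=3,      # ... repeated 3x to match the flattened (natoms * 3) force layout
)
```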


317-317: Presence flag retrieval is consistent with existing pattern

Using test_data.get("find_atom_pref") mirrors the existing find_* flags and safely disables weighting when absent.


470-471: Return sizes reflect the effective weighted sample size

Reporting (value, size_f) for mae_f/rmse_f is the right choice for weighted aggregation across systems. No change needed.

source/tests/pt/test_dp_test.py (3)

18-21: Imports for dp_test_ener/DeepEval/DeepmdData are appropriate

These imports are minimal and scoped to the new test’s needs.

Also applies to: 28-30


150-167: Solid setup for masked-force test system

Bootstrapping a tiny training config and pointing both training/validation to the prepared system keeps the test self-contained. No issues spotted.


231-234: Clean teardown

Delegating to the base teardown and removing the temp system directory prevents test artifacts from leaking. Looks good.

@anyangml anyangml requested a review from iProzd August 18, 2025 03:00
@anyangml (Collaborator) left a comment

LGTM

@caic99 caic99 requested a review from Copilot August 19, 2025 05:42
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for masked forces in testing by implementing per-atom weighting for force error metrics. When atom preferences (masks) are provided, the evaluation computes weighted MAE/RMSE that only consider selected atoms, with graceful handling of zero-weight cases.

Key changes:

  • Modified force error calculation to apply atom-level masking when atom_pref data is available
  • Added comprehensive test coverage for masked force evaluation scenarios
  • Updated error metric reporting to reflect the actual number of atoms considered in calculations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| deepmd/entrypoints/test.py | Implements masked force error calculation with weighted MAE/RMSE computation |
| source/tests/pt/test_dp_test.py | Adds end-to-end test validating masked force metrics against expected calculations |


@njzjz (Member) left a comment

This PR assumes that atom_pref is a mask, but it may not be, since it is not limited to 0 and 1. One may set it to something like 1 1 1 1 5 1 1 1 to give some atoms larger weights.

```python
mae_f = mae(diff_f)
rmse_f = rmse(diff_f)
if find_atom_pref == 1:
    atom_pref = test_data["atom_pref"][:numb_test]
```
@caic99 (Member, Author) replied:

Suggested change:

```diff
-atom_pref = test_data["atom_pref"][:numb_test]
+atom_pref = test_data["atom_pref"][:numb_test].astype(bool)
```

@njzjz Is this what you mean?

@njzjz (Member) replied:

No, I mean that atom_pref was not designed as a mask (when it was proposed in #41). One may use it for other purposes and would not expect to see a masked result.

@njzjz (Member) added:

Perhaps the solution is to provide both the original results and the results multiplied by atom_pref; a sketch of that idea follows.
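
A minimal sketch of that idea; the helper name force_errors and the *_pref keys are hypothetical, not part of this PR:

```python
import numpy as np

def force_errors(diff_f, atom_pref=None):
    # Always report the plain, unweighted metrics.
    out = {
        "mae_f": np.mean(np.abs(diff_f)),
        "rmse_f": np.sqrt(np.mean(diff_f * diff_f)),
    }
    # Additionally report atom_pref-weighted metrics when weights are provided,
    # so non-binary weights still leave an unweighted view available.
    if atom_pref is not None:
        w = np.asarray(atom_pref, dtype=float)
        s = w.sum()
        out["mae_f_pref"] = np.sum(np.abs(diff_f) * w) / s if s > 0 else 0.0
        out["rmse_f_pref"] = np.sqrt(np.sum(diff_f * diff_f * w) / s) if s > 0 else 0.0
    return out
```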

@anyangml (Collaborator) replied:

I don’t see any reason why the current implementation cannot handle arbitrary weights. The only question is whether the weights should be considered when calculating the error; my intuition is that they should not.

@anyangml (Collaborator) added:

And in that case, we can simply convert them to bool type.

@caic99 caic99 requested a review from njzjz August 22, 2025 03:04