
fix(c++): fix NULL type in custom op #4889


Draft: wants to merge 4 commits into base: devel

Conversation

@iProzd (Collaborator) commented Aug 14, 2025

Replaces the lmp_list send/recv arrays with new vectors whose indices are remapped through fwd_map, and synchronizes receive counts via MPI. Tensor construction is updated to use these new vectors, improving correctness and flexibility in distributed communication.
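
For orientation, a minimal sketch of the remapping idea, not the PR's literal code: the flat layout of lmp_list.sendlist is inferred from the blob-based construction quoted in the review below, and the convention that fwd_map marks dropped atoms with -1 is an assumption.

// Hypothetical sketch: rebuild per-swap send counts and a dense send list,
// remapping each legacy index through fwd_map and dropping unmapped ones.
std::vector<int> sendnum_new(nswap, 0);
std::vector<int> sendlist_new;
int pos = 0;  // running offset into the flat legacy send list (assumed layout)
for (int s = 0; s < nswap; ++s) {
  for (int k = 0; k < lmp_list.sendnum[s]; ++k, ++pos) {
    const int old_idx = lmp_list.sendlist[pos];
    if (old_idx >= 0 && old_idx < static_cast<int>(fwd_map.size()) &&
        fwd_map[old_idx] != -1) {  // -1 marking dropped atoms is assumed
      sendlist_new.push_back(fwd_map[old_idx]);
      ++sendnum_new[s];
    }
  }
}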

Summary by CodeRabbit

  • Refactor
    • Reworked MPI-backed message passing for distributed runs, improving scalability, stability, and consistency without changing the public interface.
  • Bug Fixes
    • Prevented errors from invalid or mismatched send indices by remapping/discarding them and correcting receive counts and ordering.
    • Improved behavior when an MPI world/communicator is unavailable to avoid failures during distributed execution.

@iProzd marked this pull request as draft August 14, 2025 13:26
@github-actions bot added the C++ label Aug 14, 2025
@coderabbitai bot (Contributor) commented Aug 14, 2025

📝 Walkthrough

Implements MPI-gated remapped message-passing in DeepPotPT::compute: introduces new send/recv count and list arrays, maps indices via fwd_map, exchanges recv counts via MPI_Sendrecv (TAG_BASE 0x7a31) when world exists, computes prefix sums, rebuilds Torch tensors from new arrays, updates comm_dict, and conditionally includes MPI headers.

Changes

Cohort: MPI remapped communication path
File(s): source/api_cc/src/DeepPotPT.cc
Summary:
- Conditionally includes mpi.h and mpi-ext.h under USE_MPI/MPI_FOUND.
- Adds remapped arrays: sendnum_new, sendlist_new, recvnum_new, firstrecv_new built via fwd_map and compaction.
- Uses MPI_Sendrecv (TAG_BASE=0x7a31) to obtain recvnum_new when lmp_list.world exists; mirrors send counts otherwise.
- Computes firstrecv_new as prefix sum of recvnum_new.
- Rebuilds tensors: firstrecv_tensor, recvnum_tensor, sendnum_tensor, sendlist_tensor from new arrays; uses static_cast<long> for sizes.
- Populates comm_dict with updated tensors: "send_list", "send_proc", "recv_proc", "send_num", "recv_num", "communicator" (a sketch of this step follows the list).
- Comments out prior usage of lmp_list.firstrecv/recvnum/sendnum and blob-based sendlist construction.
- No public API signature changes; minor structural/formatting updates.
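
As an illustration of the comm_dict update listed above, a hedged sketch using the TorchScript c10::Dict API; the tensor variable names are assumed from the surrounding context, and this is a fragment, not the PR's literal code.

// Sketch only: populate the communication dictionary with the remapped tensors.
c10::Dict<std::string, torch::Tensor> comm_dict;
comm_dict.insert("send_list", sendlist_tensor);
comm_dict.insert("send_proc", sendproc_tensor);
comm_dict.insert("recv_proc", recvproc_tensor);
comm_dict.insert("send_num", sendnum_tensor);
comm_dict.insert("recv_num", recvnum_tensor);
comm_dict.insert("communicator", communicator_tensor);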

Sequence Diagram(s)

sequenceDiagram
  participant DP as DeepPotPT::compute
  participant MP as MPI World (lmp_list.world)
  participant Map as fwd_map
  participant T as Torch Tensors

  DP->>Map: Map old send indices -> forwarded indices
  Map-->>DP: sendlist_new, sendnum_new (invalids dropped)

  alt world exists
    DP->>MP: MPI_Sendrecv(sendnum_new) [TAG_BASE=0x7a31]
    MP-->>DP: recvnum_new
  else no world
    DP-->>DP: recvnum_new = sendnum_new
  end

  DP-->>DP: firstrecv_new = prefix_sum(recvnum_new)
  DP->>T: Build tensors (firstrecv, recvnum, sendnum, sendlist)
  DP-->>DP: Update comm_dict with new tensors

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes




@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
source/api_cc/src/DeepPotPT.cc (2)

251-257: Remove stale commented-out code

Dead commented code obscures the current data path and makes maintenance harder.

Apply this diff:

-//      torch::Tensor firstrecv_tensor =
-//          torch::from_blob(lmp_list.firstrecv, {nswap}, int32_option);
-//      torch::Tensor recvnum_tensor =
-//          torch::from_blob(lmp_list.recvnum, {nswap}, int32_option);
-//      torch::Tensor sendnum_tensor =
-//          torch::from_blob(lmp_list.sendnum, {nswap}, int32_option);

266-269: Remove redundant commented-out legacy code

Same reasoning; commented legacy path is preserved in git history.

Apply this diff:

-//      int total_send =
-//          std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-//      torch::Tensor sendlist_tensor =
-//          torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between accc331 and f99ad4d.

📒 Files selected for processing (1)
  • source/api_cc/src/DeepPotPT.cc (3 hunks)
🔇 Additional comments (2)
source/api_cc/src/DeepPotPT.cc (2)

185-204: Remapping logic LGTM

Correctly rebuilds per-swap send counts and a dense send list using fwd_map with bounds checks and filtering. Reserving capacity via the accumulated legacy counts is a good optimization.
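
The reservation it praises presumably looks something like this sketch; the legacy counts give an upper bound on the filtered list size.

// Sketch: reserve capacity from the legacy (pre-filter) total; needs <numeric>.
const int total_send_legacy =
    std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
sendlist_new.reserve(total_send_legacy);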


226-232: firstrecv_new is unused and not required — original comment is incorrect

Short: deepmd/pt/model/descriptor/repflows.py builds comm_dict and calls torch.ops.deepmd.border_op with send_list, send_proc, recv_proc, send_num, recv_num, communicator (no first_recv). The computed firstrecv_new/firstrecv_tensor in the PT wrappers is dead code — remove it or document why it is kept.

Files to update:

  • source/api_cc/src/DeepPotPT.cc
    • Remove the firstrecv_new prefix-sum computation (around lines 226–231) and the unused firstrecv_tensor creation (around line 238).
  • source/api_cc/src/DeepSpinPT.cc
    • Same pattern: firstrecv_tensor is created around lines 187–191 but never used/inserted.

Suggested change (remove unused code) — example diff for DeepPotPT.cc:
@@
-    std::vector<int> firstrecv_new(nswap, 0);
-    int acc = 0;
-    for (int s = 0; s < nswap; ++s) {
-      firstrecv_new[s] = acc;
-      acc += recvnum_new[s];
-    }
+    /* firstrecv computation removed — not used by border_op */
@@
-    torch::Tensor firstrecv_tensor =
-        torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
+    /* firstrecv tensor omitted — border_op expects recv_num, not first_recv */

If you prefer to keep the computation for clarity, add a short comment explaining it's intentionally unused.

Likely an incorrect or invalid review comment.

Comment on lines 13 to 18
#ifdef USE_MPI
#include <mpi.h>
#ifdef OMPI_MPI_H
#include <mpi-ext.h>
#endif
#endif
Contributor

🛠️ Refactor suggestion

Use a single MPI compile guard and include the right headers consistently

You include MPI headers under USE_MPI, but use MPI_FOUND to gate MPI calls later. Pick one macro to avoid configuration skew; MPI_FOUND matches the usage below.

Apply this diff to unify guards:

-#ifdef USE_MPI
+#ifdef MPI_FOUND
 #include <mpi.h>
 #ifdef OMPI_MPI_H
 #include <mpi-ext.h>
 #endif
 #endif

Additionally, since this file uses std::accumulate, ensure <numeric> is included (outside this hunk):

// near the other standard headers
#include <numeric>
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPT.cc around lines 13 to 18, the MPI headers are
currently guarded by USE_MPI while the rest of the file uses MPI_FOUND; change
the compile guard to MPI_FOUND so the same macro is used everywhere (wrap
#include <mpi.h> and conditional #include <mpi-ext.h> with #ifdef MPI_FOUND ...
#endif), and make sure to include the standard header <numeric> near the other
standard includes (outside this hunk) so std::accumulate is available.

Comment on lines 205 to 225
std::vector<int> recvnum_new(nswap, 0);
#ifdef MPI_FOUND
if (lmp_list.world) {
  MPI_Comm comm = *static_cast<MPI_Comm*>(lmp_list.world);
  const int TAG_BASE = 0x7a31;
  for (int s = 0; s < nswap; ++s) {
    const int send_to = lmp_list.sendproc[s];
    const int recv_from = lmp_list.recvproc[s];
    int send_cnt = sendnum_new[s];
    int recv_cnt = 0;
    MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to, TAG_BASE + s,
                 &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
                 comm, MPI_STATUS_IGNORE);
    recvnum_new[s] = recv_cnt;
  }
} else
#endif
{
  for (int s = 0; s < nswap; ++s) recvnum_new[s] = sendnum_new[s];
}

Contributor

🛠️ Refactor suggestion

Check MPI_Sendrecv return codes and handle MPI_PROC_NULL safely

Currently the MPI calls ignore return codes; if a peer is mismatched you’ll silently compute wrong recv counts. Also, peers can legally be MPI_PROC_NULL; guard for that to avoid unnecessary calls.

Apply this diff to add error checking and skip PROC_NULL peers:

-            for (int s = 0; s < nswap; ++s) {
-              const int send_to   = lmp_list.sendproc[s];
-              const int recv_from = lmp_list.recvproc[s];
-              int send_cnt = sendnum_new[s];
-              int recv_cnt = 0;
-              MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to,   TAG_BASE + s,
-                           &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
-                           comm, MPI_STATUS_IGNORE);
-              recvnum_new[s] = recv_cnt;
-            }
+            for (int s = 0; s < nswap; ++s) {
+              const int send_to   = lmp_list.sendproc[s];
+              const int recv_from = lmp_list.recvproc[s];
+              int send_cnt = sendnum_new[s];
+              int recv_cnt = 0;
+              // MPI_PROC_NULL is allowed and acts as a no-op
+              int mpi_err = MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to,   TAG_BASE + s,
+                                         &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
+                                         comm, MPI_STATUS_IGNORE);
+              if (mpi_err != MPI_SUCCESS) {
+                char errstr[MPI_MAX_ERROR_STRING];
+                int  len = 0;
+                MPI_Error_string(mpi_err, errstr, &len);
+                throw deepmd::deepmd_exception(
+                    std::string("MPI_Sendrecv(send_cnt/recv_cnt) failed: ")
+                    + std::string(errstr, len));
+              }
+              recvnum_new[s] = recv_cnt;
+            }
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPT.cc around lines 205-225, the MPI_Sendrecv calls
neither check return codes nor handle MPI_PROC_NULL peers; update the loop to
first check if send_to==MPI_PROC_NULL or recv_from==MPI_PROC_NULL and in that
case set recvnum_new[s]=0 and skip the MPI call, otherwise call MPI_Sendrecv and
capture its return value, compare against MPI_SUCCESS and on non-success log or
propagate an error and set recvnum_new[s]=0 (or handle per project error policy)
to avoid using uninitialized/incorrect counts; use MPI constants (MPI_PROC_NULL,
MPI_SUCCESS) and ensure the TAG_BASE logic is unchanged.
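
For concreteness, a sketch of the guarded loop the prompt describes, zeroing the count on skip or error per the prompt's stated policy; this is not the committed code, and the variable names follow the hunk above.

for (int s = 0; s < nswap; ++s) {
  const int send_to = lmp_list.sendproc[s];
  const int recv_from = lmp_list.recvproc[s];
  // MPI_PROC_NULL peers are legal no-ops, but skipping avoids the call entirely.
  if (send_to == MPI_PROC_NULL || recv_from == MPI_PROC_NULL) {
    recvnum_new[s] = 0;
    continue;
  }
  int send_cnt = sendnum_new[s];
  int recv_cnt = 0;
  const int err = MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to, TAG_BASE + s,
                               &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
                               comm, MPI_STATUS_IGNORE);
  recvnum_new[s] = (err == MPI_SUCCESS) ? recv_cnt : 0;  // project policy may differ
}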

Comment on lines 238 to 249
torch::Tensor firstrecv_tensor =
    torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
torch::Tensor recvnum_tensor =
    torch::from_blob(recvnum_new.data(), {nswap}, int32_option).clone();
torch::Tensor sendnum_tensor =
    torch::from_blob(sendnum_new.data(), {nswap}, int32_option).clone();

torch::Tensor sendlist_tensor =
    torch::from_blob(sendlist_new.data(),
                     {static_cast<long>(sendlist_new.size())},
                     int32_option).clone();

Contributor

🛠️ Refactor suggestion

Use 64-bit shape type for Torch sizes and ensure consistency

Shapes in from_blob/view are int64_t; casting to long can truncate on LLP64 platforms (Windows).

Apply this diff:

-  torch::Tensor firstrecv_tensor =
-      torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
-  torch::Tensor recvnum_tensor  =
-      torch::from_blob(recvnum_new.data(),  {nswap}, int32_option).clone();
-  torch::Tensor sendnum_tensor  =
-      torch::from_blob(sendnum_new.data(),  {nswap}, int32_option).clone();
+  torch::Tensor firstrecv_tensor =
+      torch::from_blob(firstrecv_new.data(), {static_cast<std::int64_t>(nswap)}, int32_option).clone();
+  torch::Tensor recvnum_tensor  =
+      torch::from_blob(recvnum_new.data(),  {static_cast<std::int64_t>(nswap)}, int32_option).clone();
+  torch::Tensor sendnum_tensor  =
+      torch::from_blob(sendnum_new.data(),  {static_cast<std::int64_t>(nswap)}, int32_option).clone();
@@
   torch::Tensor sendlist_tensor =
       torch::from_blob(sendlist_new.data(),
-                       { static_cast<long>(sendlist_new.size()) },
+                       { static_cast<std::int64_t>(sendlist_new.size()) },
                        int32_option).clone();

Optional safety (outside this hunk): for consistency with the cloned new arrays, consider cloning sendproc/recvproc tensors as well, since they are built from external buffers:

torch::Tensor sendproc_tensor = torch::from_blob(lmp_list.sendproc, {static_cast<std::int64_t>(nswap)}, int32_option).clone();
torch::Tensor recvproc_tensor = torch::from_blob(lmp_list.recvproc, {static_cast<std::int64_t>(nswap)}, int32_option).clone();
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPT.cc around lines 238 to 249, the torch::from_blob
shape arguments use casts to long which can truncate on LLP64 platforms; change
all shape casts to static_cast<std::int64_t> (e.g.,
static_cast<std::int64_t>(nswap) and
static_cast<std::int64_t>(sendlist_new.size())) when constructing tensors from
blobs, so sizes match Torch's int64_t expectation; optionally also construct and
clone sendproc/recvproc tensors from their buffers using the same std::int64_t
casts to ensure they are safe copies of external memory.

@@ -10,6 +10,13 @@
#include "device.h"
#include "errors.h"

#ifdef USE_MPI
Member

This needs to be given in the CMake file. See op

target_compile_definitions(deepmd_op_pt PRIVATE USE_MPI)

Member

MPI also needs to be linked, if used.

Member

target_link_libraries(deepmd_op_pt PRIVATE MPI::MPI_CXX)
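
Combined, the suggested CMake wiring would presumably be as follows; this is a sketch, and only the deepmd_op_pt target name is taken from the comments above.

# Sketch: find MPI, define the guard, and link the MPI C++ target.
find_package(MPI REQUIRED)
target_compile_definitions(deepmd_op_pt PRIVATE USE_MPI)
target_link_libraries(deepmd_op_pt PRIVATE MPI::MPI_CXX)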

#ifdef USE_MPI
#include <mpi.h>
#ifdef OMPI_MPI_H
#include <mpi-ext.h>
Member

Not needed.

@caic99 linked an issue Aug 22, 2025 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

NULL atom type can not be used with deepmd 3.1.0a0 pth
2 participants