fix(c++): fix NULL type in custom op #4889
base: devel
Conversation
Replaces usage of lmp_list send/recv arrays with new vectors that map indices using fwd_map and synchronize counts via MPI. Updates tensor construction to use these new vectors, improving correctness and flexibility in distributed communication.
for more information, see https://pre-commit.ci
📝 Walkthrough
Implements MPI-gated, remapped message passing in DeepPotPT::compute: introduces new send/recv count and list arrays, maps indices via fwd_map, exchanges recv counts via MPI_Sendrecv (TAG_BASE 0x7a31) when a world communicator exists, computes prefix sums, rebuilds Torch tensors from the new arrays, updates comm_dict, and conditionally includes MPI headers.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant DP as DeepPotPT::compute
    participant MP as MPI World (lmp_list.world)
    participant Map as fwd_map
    participant T as Torch Tensors
    DP->>Map: Map old send indices -> forwarded indices
    Map-->>DP: sendlist_new, sendnum_new (invalids dropped)
    alt world exists
        DP->>MP: MPI_Sendrecv(sendnum_new) [TAG_BASE=0x7a31]
        MP-->>DP: recvnum_new
    else no world
        DP-->>DP: recvnum_new = sendnum_new
    end
    DP-->>DP: firstrecv_new = prefix_sum(recvnum_new)
    DP->>T: Build tensors (firstrecv, recvnum, sendnum, sendlist)
    DP-->>DP: Update comm_dict with new tensors
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 3
🧹 Nitpick comments (2)
source/api_cc/src/DeepPotPT.cc (2)
251-257: Remove stale commented-out code. Dead commented code obscures the current data path and makes maintenance harder.
Apply this diff:
```diff
-// torch::Tensor firstrecv_tensor =
-//     torch::from_blob(lmp_list.firstrecv, {nswap}, int32_option);
-// torch::Tensor recvnum_tensor =
-//     torch::from_blob(lmp_list.recvnum, {nswap}, int32_option);
-// torch::Tensor sendnum_tensor =
-//     torch::from_blob(lmp_list.sendnum, {nswap}, int32_option);
```
266-269: Remove redundant commented-out legacy code. Same reasoning; the commented legacy path is preserved in git history.
Apply this diff:
```diff
-// int total_send =
-//     std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
-// torch::Tensor sendlist_tensor =
-//     torch::from_blob(lmp_list.sendlist, {total_send}, int32_option);
```
📒 Files selected for processing (1): source/api_cc/src/DeepPotPT.cc (3 hunks)
🔇 Additional comments (2)
source/api_cc/src/DeepPotPT.cc (2)
185-204: Remapping logic LGTM. Correctly rebuilds per-swap send counts and a dense send list using fwd_map, with bounds checks and filtering. Reserving capacity via the accumulated legacy counts is a good optimization.
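For readers without the full diff at hand, here is a minimal sketch of what such a remapping loop could look like. It assumes the legacy send list is a flat int array (as the legacy from_blob call in the nitpick above treats it) and that fwd_map marks dropped atoms with negative values; the names mirror the review, not the committed code.

```cpp
// Sketch only: the lmp_list layout and fwd_map semantics are assumptions.
// Requires <numeric> for std::accumulate and <vector> for std::vector.
std::vector<int> sendnum_new(nswap, 0);
std::vector<int> sendlist_new;
const int total_send =
    std::accumulate(lmp_list.sendnum, lmp_list.sendnum + nswap, 0);
sendlist_new.reserve(total_send);  // capacity from the legacy counts
int offset = 0;
for (int s = 0; s < nswap; ++s) {
  for (int k = 0; k < lmp_list.sendnum[s]; ++k) {
    const int old_idx = lmp_list.sendlist[offset + k];
    // Bounds-check against fwd_map and drop invalid (negative) mappings.
    if (old_idx >= 0 && old_idx < static_cast<int>(fwd_map.size()) &&
        fwd_map[old_idx] >= 0) {
      sendlist_new.push_back(fwd_map[old_idx]);
      ++sendnum_new[s];
    }
  }
  offset += lmp_list.sendnum[s];
}
```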
226-232: firstrecv_new is unused and not required; the original comment is incorrect. In short, deepmd/pt/model/descriptor/repflows.py builds comm_dict and calls torch.ops.deepmd.border_op with send_list, send_proc, recv_proc, send_num, recv_num, and communicator (no first_recv). The computed firstrecv_new/firstrecv_tensor in the PT wrappers is dead code; remove it or document why it is kept.
Files to update:
- source/api_cc/src/DeepPotPT.cc
- Remove the firstrecv_new prefix-sum computation (around lines 226–231) and the unused firstrecv_tensor creation (around line 238).
- source/api_cc/src/DeepSpinPT.cc
- Same pattern: firstrecv_tensor is created around lines 187–191 but never used/inserted.
Suggested change (remove unused code) — example diff for DeepPotPT.cc:
```diff
@@
-  std::vector<int> firstrecv_new(nswap, 0);
-  int acc = 0;
-  for (int s = 0; s < nswap; ++s) {
-    firstrecv_new[s] = acc;
-    acc += recvnum_new[s];
-  }
+  /* firstrecv computation removed — not used by border_op */
@@
-  torch::Tensor firstrecv_tensor =
-      torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
+  /* firstrecv tensor omitted — border_op expects recv_num, not first_recv */
```
If you prefer to keep the computation for clarity, add a short comment explaining it's intentionally unused.
Likely an incorrect or invalid review comment.
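To make the dead-code argument concrete, here is an illustrative sketch of a comm_dict populated with only the keys the comment says border_op consumes. The c10::Dict instantiation and the *_tensor variable names are assumptions for illustration, not code taken from the PR.

```cpp
// Hypothetical illustration: only the keys border_op reads are inserted.
c10::Dict<std::string, torch::Tensor> comm_dict;
comm_dict.insert("send_list", sendlist_tensor);
comm_dict.insert("send_proc", sendproc_tensor);
comm_dict.insert("recv_proc", recvproc_tensor);
comm_dict.insert("send_num", sendnum_tensor);
comm_dict.insert("recv_num", recvnum_tensor);
comm_dict.insert("communicator", communicator_tensor);
// No "first_recv" key is inserted, which is why firstrecv_tensor is dead code.
```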
source/api_cc/src/DeepPotPT.cc
Outdated
```cpp
#ifdef USE_MPI
#include <mpi.h>
#ifdef OMPI_MPI_H
#include <mpi-ext.h>
#endif
#endif
```
🛠️ Refactor suggestion
Use a single MPI compile guard and include the right headers consistently
You include MPI headers under USE_MPI, but use MPI_FOUND to gate MPI calls later. Pick one macro to avoid configuration skew; MPI_FOUND matches the usage below.
Apply this diff to unify guards:
```diff
-#ifdef USE_MPI
+#ifdef MPI_FOUND
 #include <mpi.h>
 #ifdef OMPI_MPI_H
 #include <mpi-ext.h>
 #endif
 #endif
```
Additionally, since this file uses std::accumulate, ensure `<numeric>` is included (outside this hunk):
```cpp
// near the other standard headers
#include <numeric>
```
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPT.cc around lines 13 to 18, the MPI headers are
currently guarded by USE_MPI while the rest of the file uses MPI_FOUND; change
the compile guard to MPI_FOUND so the same macro is used everywhere (wrap
#include <mpi.h> and conditional #include <mpi-ext.h> with #ifdef MPI_FOUND ...
#endif), and make sure to include the standard header <numeric> near the other
standard includes (outside this hunk) so std::accumulate is available.
source/api_cc/src/DeepPotPT.cc
Outdated
```cpp
  std::vector<int> recvnum_new(nswap, 0);
#ifdef MPI_FOUND
  if (lmp_list.world) {
    MPI_Comm comm = *static_cast<MPI_Comm*>(lmp_list.world);
    const int TAG_BASE = 0x7a31;
    for (int s = 0; s < nswap; ++s) {
      const int send_to = lmp_list.sendproc[s];
      const int recv_from = lmp_list.recvproc[s];
      int send_cnt = sendnum_new[s];
      int recv_cnt = 0;
      MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to, TAG_BASE + s,
                   &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
                   comm, MPI_STATUS_IGNORE);
      recvnum_new[s] = recv_cnt;
    }
  } else
#endif
  {
    for (int s = 0; s < nswap; ++s) recvnum_new[s] = sendnum_new[s];
  }
```
🛠️ Refactor suggestion
Check MPI_Sendrecv return codes and handle MPI_PROC_NULL safely
Currently the MPI calls ignore return codes; if a peer is mismatched you’ll silently compute wrong recv counts. Also, peers can legally be MPI_PROC_NULL; guard for that to avoid unnecessary calls.
Apply this diff to add error checking and skip PROC_NULL peers:
```diff
-    for (int s = 0; s < nswap; ++s) {
-      const int send_to = lmp_list.sendproc[s];
-      const int recv_from = lmp_list.recvproc[s];
-      int send_cnt = sendnum_new[s];
-      int recv_cnt = 0;
-      MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to, TAG_BASE + s,
-                   &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
-                   comm, MPI_STATUS_IGNORE);
-      recvnum_new[s] = recv_cnt;
-    }
+    for (int s = 0; s < nswap; ++s) {
+      const int send_to = lmp_list.sendproc[s];
+      const int recv_from = lmp_list.recvproc[s];
+      int send_cnt = sendnum_new[s];
+      int recv_cnt = 0;
+      // MPI_PROC_NULL is allowed and acts as a no-op
+      int mpi_err = MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to, TAG_BASE + s,
+                                 &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
+                                 comm, MPI_STATUS_IGNORE);
+      if (mpi_err != MPI_SUCCESS) {
+        char errstr[MPI_MAX_ERROR_STRING];
+        int len = 0;
+        MPI_Error_string(mpi_err, errstr, &len);
+        throw deepmd::deepmd_exception(
+            std::string("MPI_Sendrecv(send_cnt/recv_cnt) failed: ") +
+            std::string(errstr, len));
+      }
+      recvnum_new[s] = recv_cnt;
+    }
```
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPT.cc around lines 205-225, the MPI_Sendrecv calls
neither check return codes nor handle MPI_PROC_NULL peers; update the loop to
first check if send_to==MPI_PROC_NULL or recv_from==MPI_PROC_NULL and in that
case set recvnum_new[s]=0 and skip the MPI call, otherwise call MPI_Sendrecv and
capture its return value, compare against MPI_SUCCESS and on non-success log or
propagate an error and set recvnum_new[s]=0 (or handle per project error policy)
to avoid using uninitialized/incorrect counts; use MPI constants (MPI_PROC_NULL,
MPI_SUCCESS) and ensure the TAG_BASE logic is unchanged.
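Following that prompt, a hedged sketch of a MPI_PROC_NULL-aware variant of the exchange loop is shown below. It reuses the names from the diff above (nswap, lmp_list, sendnum_new, recvnum_new, TAG_BASE, comm) and zeroes the count on error rather than throwing, which is only one possible policy.

```cpp
// Sketch, not the committed code: skip the call entirely when both peers
// are MPI_PROC_NULL, and fall back to a zero count on MPI errors.
for (int s = 0; s < nswap; ++s) {
  const int send_to = lmp_list.sendproc[s];
  const int recv_from = lmp_list.recvproc[s];
  if (send_to == MPI_PROC_NULL && recv_from == MPI_PROC_NULL) {
    recvnum_new[s] = 0;  // nothing to exchange for this swap
    continue;
  }
  int send_cnt = sendnum_new[s];
  int recv_cnt = 0;
  const int mpi_err =
      MPI_Sendrecv(&send_cnt, 1, MPI_INT, send_to, TAG_BASE + s,
                   &recv_cnt, 1, MPI_INT, recv_from, TAG_BASE + s,
                   comm, MPI_STATUS_IGNORE);
  if (mpi_err != MPI_SUCCESS) {
    recvnum_new[s] = 0;  // or propagate, per the project's error policy
    continue;
  }
  recvnum_new[s] = recv_cnt;
}
```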
source/api_cc/src/DeepPotPT.cc
Outdated
```cpp
  torch::Tensor firstrecv_tensor =
      torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
  torch::Tensor recvnum_tensor =
      torch::from_blob(recvnum_new.data(), {nswap}, int32_option).clone();
  torch::Tensor sendnum_tensor =
      torch::from_blob(sendnum_new.data(), {nswap}, int32_option).clone();
  // ...
  torch::Tensor sendlist_tensor =
      torch::from_blob(sendlist_new.data(),
                       {static_cast<long>(sendlist_new.size())},
                       int32_option).clone();
```
🛠️ Refactor suggestion
Use 64-bit shape type for Torch sizes and ensure consistency
Shapes in from_blob/view are int64_t; casting to long can truncate on LLP64 platforms (Windows).
Apply this diff:
```diff
-  torch::Tensor firstrecv_tensor =
-      torch::from_blob(firstrecv_new.data(), {nswap}, int32_option).clone();
-  torch::Tensor recvnum_tensor =
-      torch::from_blob(recvnum_new.data(), {nswap}, int32_option).clone();
-  torch::Tensor sendnum_tensor =
-      torch::from_blob(sendnum_new.data(), {nswap}, int32_option).clone();
+  torch::Tensor firstrecv_tensor =
+      torch::from_blob(firstrecv_new.data(), {static_cast<std::int64_t>(nswap)}, int32_option).clone();
+  torch::Tensor recvnum_tensor =
+      torch::from_blob(recvnum_new.data(), {static_cast<std::int64_t>(nswap)}, int32_option).clone();
+  torch::Tensor sendnum_tensor =
+      torch::from_blob(sendnum_new.data(), {static_cast<std::int64_t>(nswap)}, int32_option).clone();
@@
   torch::Tensor sendlist_tensor =
       torch::from_blob(sendlist_new.data(),
-                       { static_cast<long>(sendlist_new.size()) },
+                       { static_cast<std::int64_t>(sendlist_new.size()) },
                        int32_option).clone();
```
Optional safety (outside this hunk): for consistency with the cloned new arrays, consider cloning sendproc/recvproc tensors as well, since they are built from external buffers:
```cpp
torch::Tensor sendproc_tensor =
    torch::from_blob(lmp_list.sendproc, {static_cast<std::int64_t>(nswap)}, int32_option).clone();
torch::Tensor recvproc_tensor =
    torch::from_blob(lmp_list.recvproc, {static_cast<std::int64_t>(nswap)}, int32_option).clone();
```
🤖 Prompt for AI Agents
In source/api_cc/src/DeepPotPT.cc around lines 238 to 249, the torch::from_blob
shape arguments use casts to long which can truncate on LLP64 platforms; change
all shape casts to static_cast<std::int64_t> (e.g.,
static_cast<std::int64_t>(nswap) and
static_cast<std::int64_t>(sendlist_new.size())) when constructing tensors from
blobs, so sizes match Torch's int64_t expectation; optionally also construct and
clone sendproc/recvproc tensors from their buffers using the same std::int64_t
casts to ensure they are safe copies of external memory.
```diff
@@ -10,6 +10,13 @@
 #include "device.h"
 #include "errors.h"
 
+#ifdef USE_MPI
```
This needs to be given in the CMake file. See deepmd-kit/source/op/pt/CMakeLists.txt, line 34 in ba8f52e:
```cmake
target_compile_definitions(deepmd_op_pt PRIVATE USE_MPI)
```
MPI also needs to be linked, if used.
deepmd-kit/source/op/pt/CMakeLists.txt, line 33 in ba8f52e:
```cmake
target_link_libraries(deepmd_op_pt PRIVATE MPI::MPI_CXX)
```
source/api_cc/src/DeepPotPT.cc
Outdated
```cpp
#ifdef USE_MPI
#include <mpi.h>
#ifdef OMPI_MPI_H
#include <mpi-ext.h>
```
Not needed.
Summary by CodeRabbit