Fix cuda overflow #19487

Open

lucylq wants to merge 1 commit into main from security20-cuda

Conversation

@lucylq
Contributor

@lucylq lucylq commented May 11, 2026

Update copy_ method to check:

  • this->sizes() == other.sizes()
  • this->dim() == other.dim()

Use overflow-safe arithmetic.

@pytorch-bot

pytorch-bot Bot commented May 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19487

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 Pending, 1 Unclassified Failure

As of commit ce6fc1b with merge base 23a91d5:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

  • Check Labels / Check labels (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
    RuntimeError: GraphQL query

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@lucylq lucylq changed the title overflow Fix cuda overflow May 11, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) May 11, 2026
@lucylq lucylq force-pushed the security20-cuda branch 2 times, most recently from e5dfd0f to df170ce Compare May 11, 2026 23:20
@lucylq lucylq requested a review from Gasoonjia May 11, 2026 23:20
@lucylq
Contributor Author

lucylq commented May 11, 2026

@claude review

@claude

claude Bot commented May 11, 2026

Claude finished @lucylq's task in 1m 6s


Code Review: Fix cuda overflow

  • Gather context and read the changed file
  • Analyze the diff for correctness, security, and style
  • Post review feedback

Overall: This is a solid security hardening PR. The changes add proper dimension/size validation and replace unchecked arithmetic with overflow-safe operations in the non-contiguous copy_ path. The approach is correct and well-structured.

Strengths

  1. Dimension and size checks (lines 443-449): Good defense-in-depth. Checking dim() and sizes() equality before the existing numel() check catches shape mismatches that happen to have the same total element count (e.g., [2,3] vs [3,2]).

  2. Overflow-safe offset computation (lines 474-484): Using c10::mul_overflows and c10::add_overflows with early-exit on overflow is the right pattern. The accumulation loop correctly builds both src and dst offsets simultaneously.

  3. Byte offset overflow check (lines 488-498): Good catch — the element-to-byte conversion (offset * elem_size) is another multiplication that can overflow, and the non-negativity assertion on src_offset/dst_offset prevents signed-to-unsigned conversion issues.

Minor observations

  1. Redundant numel() check: With dim() and sizes() equality now enforced, the numel() check on line 451 is logically redundant (equal sizes implies equal numel). It's fine to keep as a belt-and-suspenders assertion, but worth noting it can never fire if the prior checks pass.

  2. compute_numel in size_util.h is still unchecked (slim/util/size_util.h:23-29): The refresh_numel() path calls compute_numel which does n *= s without overflow protection. If an attacker can control sizes passed to set_sizes_and_strides, the numel could overflow before copy_ is ever called. This is out of scope for this PR but worth a follow-up.

  3. nbytes() on line 101: numel() * itemsize() is also a potential overflow site for the contiguous fast path (storage_->copy_(this->data_ptr(), other.data_ptr(), other.nbytes(), ...)). Again out of scope, but relevant to the broader hardening effort.

Verdict

LGTM — the changes are correct, focused, and improve the security posture of the non-contiguous copy path. The new checks are properly ordered (cheapest first) and the overflow detection uses battle-tested compiler builtins via c10::mul_overflows/c10::add_overflows.


@lucylq lucylq marked this pull request as ready for review May 11, 2026 23:22
Copilot AI review requested due to automatic review settings May 11, 2026 23:22
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens SlimTensor::copy_ in the AOTI slim runtime by enforcing stricter shape compatibility and making strided offset/byte computations overflow-safe (particularly relevant for CUDA pointer arithmetic).

Changes:

  • Added precondition checks that dim() and sizes() match before copying.
  • Reworked strided element-wise copy offset computation to use c10::{mul,add}_overflows and to compute byte offsets explicitly.
  • Updated CPU/CUDA memcpy call sites to use the computed byte offsets.


Comment on lines +490 to +498
    ET_CHECK_MSG(
        src_offset >= 0 && dst_offset >= 0 &&
            !c10::mul_overflows(
                static_cast<size_t>(src_offset),
                elem_size,
                &src_byte_offset) &&
            !c10::mul_overflows(
                static_cast<size_t>(dst_offset), elem_size, &dst_byte_offset),
        "copy_: byte offset overflow");
Comment on lines +443 to +449
    ET_CHECK_MSG(
        this->dim() == other.dim(),
        "copy_: dim of tensors must match (%zu vs %zu)",
        this->dim(),
        other.dim());
    ET_CHECK_MSG(
        this->sizes() == other.sizes(), "copy_: sizes of tensors must match");
Comment thread on backends/aoti/slim/core/slim_tensor.h (Outdated)
    int64_t src_term = 0;
    int64_t dst_term = 0;
    ET_CHECK_MSG(
        !c10::mul_overflows(counter[d], other.stride(d), &src_term) &&
Contributor


I feel like we only need two overflow calculations: one for src = counter[d] * this->stride(d), the other for dst_offset = dst_offset + src. Thoughts?

Contributor Author


yeah - thank you!

 */
SlimTensor& copy_(const SlimTensor& other) {
  ET_CHECK_MSG(
      this->dim() == other.dim(),
Contributor


I'm not a huge fan of adding this new limitation. Two tensors should be copyable as long as their numel is the same. Maybe we can remove this assertion?

Contributor Author


I think we'd also need to update the loop below if we want tensors with different dim/sizes to be copyable.

@Gasoonjia
Contributor

Thanks for adding the overflow check! Please fix the broken CUDA CIs as well.

@lucylq lucylq force-pushed the security20-cuda branch from df170ce to 9a1980d Compare May 12, 2026 16:13
 * @return Reference to this tensor
 */
SlimTensor& copy_(const SlimTensor& other) {
  ET_CHECK_MSG(
Contributor


This check also enforces two tensors having same shape -- can we remove it?

Copilot AI review requested due to automatic review settings May 12, 2026 20:57
@lucylq lucylq force-pushed the security20-cuda branch from 9a1980d to 02e0ad0 Compare May 12, 2026 20:57
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment on lines +485 to +490
            !::c10::mul_overflows(
                static_cast<size_t>(src_offset),
                elem_size,
                &src_byte_offset) &&
            !::c10::mul_overflows(
                static_cast<size_t>(dst_offset), elem_size, &dst_byte_offset),
Comment on lines +483 to +485
    ET_CHECK_MSG(
        src_offset >= 0 && dst_offset >= 0 &&
            !::c10::mul_overflows(
Comment on lines +443 to 447
    ET_CHECK_MSG(
        this->sizes() == other.sizes(), "copy_: sizes of tensors must match");
    ET_CHECK_MSG(
        this->numel() == other.numel(), "copy_: numel of tensors must match");
    ET_CHECK_MSG(this->dtype() == other.dtype(), "copy_: dtype must match");
@lucylq lucylq force-pushed the security20-cuda branch from 02e0ad0 to b78a69d Compare May 12, 2026 22:10
Copilot AI review requested due to automatic review settings May 12, 2026 22:13
@lucylq lucylq force-pushed the security20-cuda branch from b78a69d to bddf2a8 Compare May 12, 2026 22:13
@lucylq lucylq force-pushed the security20-cuda branch from bddf2a8 to ebc9978 Compare May 12, 2026 22:16
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines 501 to 505
      // Copy elem_size bytes from src to dst
      if (this->device().is_cpu() && other.device().is_cpu()) {
        std::memcpy(
-           dst_data + dst_offset * elem_size,
-           src_data + src_offset * elem_size,
-           elem_size);
+           dst_data + dst_byte_offset, src_data + src_byte_offset, elem_size);
      } else if (this->device().is_cuda() || other.device().is_cuda()) {
Comment on lines 442 to +446
  SlimTensor& copy_(const SlimTensor& other) {
    ET_CHECK_MSG(
        this->dim() == other.dim(),
        "copy_: dim of tensors must match (%zu vs %zu)",
        this->dim(),
@Gasoonjia
Contributor

@claude review the PR

@claude

claude Bot commented May 12, 2026

Claude finished @Gasoonjia's task in 1m 49s


Code Review: Fix cuda overflow

  • Gather context and read the changed file
  • Analyze the diff for correctness and security
  • Review feedback from other reviewers
  • Post review feedback

Overall: Good security hardening. The overflow-safe arithmetic is correct and uses the right primitives. There are a few open design questions and minor issues worth addressing.

Correctness & Security

  1. Overflow checks are sound: c10::mul_overflows/c10::add_overflows wrap compiler builtins (__builtin_mul_overflow etc.) and correctly detect overflow before the result is used. The pattern of computing into temporary terms and then accumulating is correct.

  2. static_cast<size_t>(src_offset) after >= 0 check (slim_tensor.h:492-498) — The non-negativity check ensures the cast doesn't reinterpret a negative value as a large unsigned number. On 64-bit targets (which all CUDA platforms are), size_t == uint64_t, so no truncation occurs. On a hypothetical 32-bit build, values > UINT32_MAX would truncate before the overflow check. This is extremely unlikely in practice for SlimTensor's use cases, but if you want to be rigorous, add a static_assert(sizeof(size_t) >= sizeof(int64_t)) or a range check.

  3. Negative strides (slim_tensor.h:492) — The src_offset >= 0 && dst_offset >= 0 assertion rejects tensors whose strided traversal produces negative offsets (e.g., torch.flip). This is fine for SlimTensor (which likely never produces negative strides), but should be called out explicitly since it's a behavioral constraint that isn't obvious from the function signature.

Design Feedback

  1. dim()/sizes() equality restriction (slim_tensor.h:443-449) — Agreeing with @Gasoonjia's feedback: standard PyTorch copy_ only requires equal numel(), allowing e.g. [6].copy_([2,3]). The current strided loop assumes equal dims (uses this->dim() to iterate both src and dst), so enforcing this is correct for the current implementation. If relaxing this is desired later, the loop would need to iterate dimensions independently per tensor. For now, I'd suggest keeping the check but updating the doc comment to document this restriction explicitly.

  2. Redundant numel() check (slim_tensor.h:450-451) — With equal sizes() enforced, numel() equality is mathematically guaranteed. Fine to keep as defense-in-depth, but if you'd like to slim down, it can be removed.

  3. Stale doc comment (slim_tensor.h:437) — Says "Currently only supports CPU-to-CPU copy (contiguous tensors only)" but the implementation handles CUDA and non-contiguous. Should be updated to reflect reality.

Follow-up Items (out of scope but worth tracking)

  1. compute_numel in size_util.h:23-28 uses unchecked n *= s. If sizes are attacker-controlled, numel can overflow before reaching copy_. This is a broader issue than this PR but worth a follow-up.

  2. nbytes() at slim_tensor.h:101 (numel() * itemsize()) is also unchecked and used in the contiguous fast path at line 461. Same follow-up category.

  3. CI failures: @Gasoonjia mentioned broken CUDA CIs. These should be investigated separately — the overflow logic itself looks correct for both CPU and CUDA paths.

Verdict

The core security fix (overflow-safe offset arithmetic) is correct and well-implemented. The main discussion point is the dim()/sizes() restriction (addressed in the review thread). Updating the stale doc comment and considering the 32-bit cast edge case would be nice cleanups.


Comment thread on backends/aoti/slim/core/slim_tensor.h (Outdated)
Contributor


Update the doc as well to show that the two tensors should have the same sizes.
And can you do me a favor and remove the "Currently only supports CPU-to-CPU copy (contiguous tensors only)." line? It's out of date loll 😛

@lucylq lucylq force-pushed the security20-cuda branch from ebc9978 to ce6fc1b Compare May 12, 2026 22:25
@lucylq
Contributor Author

lucylq commented May 13, 2026

@claude review

@claude

claude Bot commented May 13, 2026

Claude finished @lucylq's task in 1m 47s


Code Review: Fix cuda overflow

  • Gather context and read the current state of the changed file
  • Analyze the diff against merge base
  • Review against prior feedback (Gasoonjia's comments, copilot suggestions)
  • Post review feedback

Overall: The security hardening looks correct and well-structured. The overflow-safe arithmetic uses the right primitives, and the doc comment has been properly updated. The PR addresses the prior review feedback well (stale comment removed, sizes restriction documented). A few remaining items:

Looks Good

  1. Overflow detection pattern (lines 479-484): Using c10::mul_overflows/c10::add_overflows with short-circuit evaluation is correct — if any operation overflows, the check fires before the corrupted value propagates.

  2. Byte offset overflow check (lines 490-498): The non-negativity guard before static_cast<size_t> prevents signed-to-unsigned reinterpretation. On all CUDA platforms (64-bit), sizeof(size_t) == sizeof(int64_t), so no truncation.

  3. Doc comment (lines 433-439): Updated to match the new preconditions. Stale "CPU-to-CPU" reference removed per @Gasoonjia's request.

Remaining Items

  1. data_ptr() has unchecked arithmetic (slim_tensor.h:212):

    return static_cast<char*>(storage_->data()) + storage_offset_ * itemsize();

    This multiplication can overflow for the same reason the strided offsets could. It's called in the contiguous fast path (line 459-460) — a malicious storage_offset_ could produce an OOB pointer before the new overflow checks even run. Worth a follow-up PR.

  2. nbytes() unchecked (slim_tensor.h:100-101): numel() * itemsize() feeds directly into the contiguous storage_->copy_() call at line 460. Same category as above.

  3. compute_numel unchecked (size_util.h:23-28): Uses n *= s without overflow detection. If sizes are attacker-controlled via set_sizes_and_strides, numel can silently wrap before copy_ is called, potentially causing the numel() check (line 450) to pass incorrectly.

  4. CI failures: @Gasoonjia mentioned broken CUDA CIs — these should be investigated separately as the logic here looks correct for both CPU and CUDA paths.

Items 1-3 are pre-existing issues and out of scope for this PR, but worth tracking as follow-ups to complete the hardening effort.

Verdict

LGTM — the core fix is correct and focused. The new checks are ordered cheaply (dim/sizes comparison before the per-element loop) and the overflow detection uses well-tested compiler builtins. The remaining unchecked paths (data_ptr, nbytes, compute_numel) are pre-existing and appropriate for follow-up work.



Labels

ciflow/cuda · CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)


3 participants