Add CudaSampler class for GPU-based token sampling #16387

larryliu0820 · 2025-12-24T01:46:57Z

Add CudaSampler class that provides a high-level interface for GPU sampling:

- cuda_sampler.h: Class declaration with sample_argmax() method. Pre-allocates GPU memory to avoid allocation in the hot path.
- cuda_sampler.cu: Implementation using the default CUDA stream (nullptr) for implicit synchronization with the CUDA backend's stream.

The default stream approach ensures proper ordering between decoder
output and argmax without requiring explicit cross-stream synchronization
or access to the backend's internal stream.
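A minimal sketch of the pattern described above, not the code in this PR: the class name `SketchSampler`, the kernels `argmax_kernel` and `fake_decoder`, the helper `demo_default_stream`, and the `sample_argmax` signature shown here are all hypothetical. It assumes the producer of the logits runs on the legacy default stream or on a blocking stream, and that the program is not compiled with per-thread default streams. Error handling is reduced to keep it short.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>

constexpr int kThreads = 256;  // block size; must be a power of two

// Hypothetical single-block argmax kernel (not the kernel from this PR stack).
// Ties resolve to the lowest index.
__global__ void argmax_kernel(const float* logits, int vocab_size, int* out_token) {
  __shared__ float best_val[kThreads];
  __shared__ int best_idx[kThreads];
  const int tid = threadIdx.x;
  float v = -INFINITY;
  int idx = 0;
  // Each thread scans a strided slice of the vocabulary.
  for (int i = tid; i < vocab_size; i += blockDim.x) {
    if (logits[i] > v) { v = logits[i]; idx = i; }
  }
  best_val[tid] = v;
  best_idx[tid] = idx;
  __syncthreads();
  // Tree reduction over the per-thread candidates in shared memory.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      const float ov = best_val[tid + s];
      const int oi = best_idx[tid + s];
      if (ov > best_val[tid] || (ov == best_val[tid] && oi < best_idx[tid])) {
        best_val[tid] = ov;
        best_idx[tid] = oi;
      }
    }
    __syncthreads();
  }
  if (tid == 0) *out_token = best_idx[0];
}

// Stand-in for the decoder: writes one-hot logits on the default stream.
__global__ void fake_decoder(float* logits, int vocab_size, int hot) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < vocab_size) logits[i] = (i == hot) ? 1.0f : 0.0f;
}

class SketchSampler {
 public:
  // The output slot is allocated once, so the hot path never calls cudaMalloc.
  SketchSampler() { cudaMalloc(&d_token_, sizeof(int)); }
  ~SketchSampler() { cudaFree(d_token_); }
  SketchSampler(const SketchSampler&) = delete;
  SketchSampler& operator=(const SketchSampler&) = delete;

  // Launches on the default stream (nullptr), so the kernel is ordered after
  // earlier default-stream work. Returns -1 if the copy back fails.
  int sample_argmax(const float* d_logits, int vocab_size) {
    assert(vocab_size > 0);
    argmax_kernel<<<1, kThreads, 0, nullptr>>>(d_logits, vocab_size, d_token_);
    int token = -1;
    // Synchronous copy: ordered after the kernel, blocks the host until done.
    cudaMemcpy(&token, d_token_, sizeof(int), cudaMemcpyDeviceToHost);
    return token;
  }

 private:
  int* d_token_ = nullptr;
};

// Producer and sampler both use the default stream; no explicit sync between them.
int demo_default_stream(int vocab_size, int hot) {
  float* d_logits = nullptr;
  if (cudaMalloc(&d_logits, vocab_size * sizeof(float)) != cudaSuccess) return -1;
  SketchSampler sampler;
  fake_decoder<<<(vocab_size + 255) / 256, 256>>>(d_logits, vocab_size, hot);
  const int token = sampler.sample_argmax(d_logits, vocab_size);
  cudaFree(d_logits);
  return token;
}
```

Calling `demo_default_stream(32000, 1234)` should return the hot index on a machine with a CUDA device. The implicit ordering relied on here does not extend to streams created with `cudaStreamNonBlocking`, which would need an explicit event or synchronization.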

[ghstack-poisoned]

larryliu0820 · 2025-12-24T01:46:58Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-12-24T01:47:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16387

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Unrelated Failures

As of commit 8573c9a with merge base c5d66a5 ():

NEW FAILURES - The following jobs have failed:

- Lint / lintrunner / linux-job (gh)
  `>>> Lint for extension/llm/sampler/cuda_sampler.h:`
- pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
  `RuntimeError: Command docker exec -t d2dcb92572f560fa9697db3ec65a5397b8a1c4c6c3f2a132774c6ccdce50dc96 /exec failed with exit code 139`
- Test Metal Backend / test-executorch-metal-build / macos-job (gh)
  `RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1`

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

- pull / test-samsung-models-linux / linux-job (gh) (trunk failure)
  `##[error]The operation was canceled.`
- pull / unittest / macos / macos-job (gh) (trunk failure)
  `export/tests/test_target_recipes.py::TestTargetRecipes::test_mv3_model`

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

- pull / android / run-emulator (gh) (#16137)
  `Timeout waiting for emulator to boot.`

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Gasoonjia · 2025-12-25T06:46:17Z

extension/llm/sampler/cuda_sampler.cu

+//    - The argmax kernel will wait for the decoder to finish writing logits
+//    - No explicit cudaDeviceSynchronize() or cross-stream synchronization needed
+//
+// 4. Trade-off: Using the default stream prevents concurrent execution between


I'm wondering whether the sampler and the CUDA backend really need to share the same CUDA stream. Sampling and decoding should be able to run in parallel: the argmax over logits_{i} can overlap with the decoder generating logits_{i+1}, since there is no dependency between them, and that overlap may not happen if argmax and the decoder share one CUDA stream.
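One way the alternative raised in this comment could look, as a hedged sketch rather than anything in the PR: the decoder and the sampler get separate non-blocking streams, events order each argmax after the logits it reads, and two logits buffers keep step i+1 from overwriting step i while it is still being sampled. The names `fake_decoder`, `naive_argmax`, and `demo_two_streams` are hypothetical, the argmax is deliberately naive, and error handling is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Stand-in for the decoder: writes one-hot logits.
__global__ void fake_decoder(float* logits, int n, int hot) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) logits[i] = (i == hot) ? 1.0f : 0.0f;
}

// Deliberately naive single-thread argmax, to keep the sketch short.
__global__ void naive_argmax(const float* logits, int n, int* out) {
  int best = 0;
  for (int i = 1; i < n; ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  *out = best;
}

// Returns how many steps produced the expected token.
int demo_two_streams(int vocab_size, int steps) {
  assert(vocab_size > 0 && steps > 0);
  cudaStream_t decode_stream, sample_stream;
  cudaStreamCreateWithFlags(&decode_stream, cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&sample_stream, cudaStreamNonBlocking);

  float* d_logits[2] = {nullptr, nullptr};  // double buffer
  cudaEvent_t ready[2], consumed[2];
  for (int b = 0; b < 2; ++b) {
    cudaMalloc(&d_logits[b], vocab_size * sizeof(float));
    cudaEventCreateWithFlags(&ready[b], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&consumed[b], cudaEventDisableTiming);
  }
  int* d_tokens = nullptr;
  cudaMalloc(&d_tokens, steps * sizeof(int));

  const int blocks = (vocab_size + 255) / 256;
  for (int i = 0; i < steps; ++i) {
    const int b = i % 2;
    // Do not overwrite buffer b until the argmax of step i-2 has read it.
    if (i >= 2) cudaStreamWaitEvent(decode_stream, consumed[b], 0);
    fake_decoder<<<blocks, 256, 0, decode_stream>>>(
        d_logits[b], vocab_size, (i * 7) % vocab_size);
    cudaEventRecord(ready[b], decode_stream);

    // The sampler waits only for the logits it needs; the decoder stream is
    // free to start step i+1 on the other buffer in the meantime.
    cudaStreamWaitEvent(sample_stream, ready[b], 0);
    naive_argmax<<<1, 1, 0, sample_stream>>>(d_logits[b], vocab_size, d_tokens + i);
    cudaEventRecord(consumed[b], sample_stream);
  }

  // Non-blocking streams are not ordered against the default stream, so
  // synchronize explicitly before copying the results back.
  cudaStreamSynchronize(sample_stream);
  std::vector<int> tokens(steps, -1);
  cudaMemcpy(tokens.data(), d_tokens, steps * sizeof(int), cudaMemcpyDeviceToHost);

  int correct = 0;
  for (int i = 0; i < steps; ++i) {
    if (tokens[i] == (i * 7) % vocab_size) ++correct;
  }

  cudaFree(d_tokens);
  for (int b = 0; b < 2; ++b) {
    cudaFree(d_logits[b]);
    cudaEventDestroy(ready[b]);
    cudaEventDestroy(consumed[b]);
  }
  cudaStreamDestroy(decode_stream);
  cudaStreamDestroy(sample_stream);
  return correct;
}
```

With a CUDA device available, `demo_two_streams(1000, 8)` should report all 8 steps correct. Whether this overlap is achievable in practice depends on the next decoder step not needing the sampled token as input, and it requires the sampler to have access to the backend's stream or an event from it, which is the dependency the default-stream design avoids.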

Update

8573c9a

[ghstack-poisoned]

larryliu0820 requested a review from mergennachin as a code owner December 24, 2025 01:46

This was referenced Dec 24, 2025

Add CUDA argmax kernel for LLM sampler #16386

Open

Integrate CUDA sampler into ASR runner and enable skip_copy for decoder #16388

Open

meta-cla bot added the CLA Signed label Dec 24, 2025

Gasoonjia approved these changes Dec 25, 2025


3 participants
