Unify Android JNI to single IRunner, wire prefill to runner by kirklandsign · Pull Request #17756 · pytorch/executorch

kirklandsign · 2026-02-27T07:43:39Z

Summary

Replace the dual-runner pattern (runner_ + multi_modal_runner_) with a single IRunner* that holds either TextLLMRunner or MultimodalRunner, leveraging MultimodalRunner's new IRunner inheritance from #17741.

Each prefill method (text, images, audio) now immediately calls IRunner::prefill(vector<MultimodalInput>) instead of buffering inputs for later consumption by generate(). A needs_bos_ flag tracks whether the next prefill should apply BOS tokens. MultimodalRunner also guards this internally via pos_==0, but TextLLMRunner trusts the caller.

generate(), stop(), load(), and reset() no longer branch on model_type_category_; all dispatch through the unified runner_.

Rename all JNI native methods from append* to prefill* to match the existing Java public API naming.
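
The single-runner dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the stub classes, the `LlmFacade` wrapper, the `make_runner` helper, the two-value `Error` enum, and the category constants' values are all assumptions made for the example, and the real IRunner interface has a richer signature.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, simplified stand-ins for the real ExecuTorch types.
enum class Error { Ok, InvalidState };

constexpr int kCategoryLlm = 1;         // assumed value
constexpr int kCategoryMultimodal = 2;  // assumed value

struct IRunner {
  virtual ~IRunner() = default;
  virtual Error generate(const std::string& prompt, std::string& out) = 0;
};

struct TextRunnerStub : IRunner {
  Error generate(const std::string& prompt, std::string& out) override {
    out = "text:" + prompt;
    return Error::Ok;
  }
};

struct MultimodalRunnerStub : IRunner {
  Error generate(const std::string& prompt, std::string& out) override {
    out = "mm:" + prompt;
    return Error::Ok;
  }
};

// The category is consulted once, at creation; nullptr means "unsupported".
inline std::unique_ptr<IRunner> make_runner(int category) {
  if (category == kCategoryLlm) return std::make_unique<TextRunnerStub>();
  if (category == kCategoryMultimodal) return std::make_unique<MultimodalRunnerStub>();
  return nullptr;
}

class LlmFacade {
 public:
  explicit LlmFacade(int category) : runner_(make_runner(category)) {}

  // No branching on the category here: everything goes through runner_.
  Error generate(const std::string& prompt, std::string& out) {
    if (!runner_) return Error::InvalidState;
    return runner_->generate(prompt, out);
  }

 private:
  std::unique_ptr<IRunner> runner_;
};

// Small self-check of the dispatch behaviour.
inline void dispatch_demo() {
  std::string out;
  assert(LlmFacade(kCategoryLlm).generate("hi", out) == Error::Ok);
  assert(out == "text:hi");
  assert(LlmFacade(0).generate("hi", out) == Error::InvalidState);
}
```

The point of the shape is that the model category only matters when the runner is created; every later call is a plain virtual dispatch with a single null check.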

Test plan

CI


pytorch-bot · 2026-02-27T07:43:43Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17756

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit 0a3a5e4 with merge base 67bc28b:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / test-models-linux (add_mul, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
pull / test-models-linux-basic (mv3, portable, cmake, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
Test CUDA Builds / export-model-cuda-artifact (nvidia, parakeet-tdt, quantized-int4-tile-packed) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (mistralai, Voxtral-Mini-3B-2507, non-quantized) / windows-job (gh) (trunk failure)
Process completed with exit code 1.
Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (mistralai, Voxtral-Mini-3B-2507, quantized-int4-weight-only) / windows-job (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-02-27T07:44:25Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Copilot

Pull request overview

This pull request unifies the Android JNI layer to use a single IRunner* interface for both text-only and multimodal models, eliminating the dual-runner pattern. It changes the prefill workflow from buffering inputs to immediately populating the KV cache, and renames all JNI methods from append* to prefill* to match the public Java API.

Changes:

Replaced dual-runner pattern (runner_ + multi_modal_runner_) with unified IRunner* runner_ leveraging MultimodalRunner's new IRunner inheritance from PR #17741
Changed prefill methods to immediately call IRunner::prefill() instead of buffering inputs in prefill_inputs_ vector for later consumption
Added needs_bos_ flag to track whether the next prefill should apply BOS tokens, set to true on construction and after reset_context(), false after prefill operations

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| extension/android/jni/jni_layer_llama.cpp | Unified runner implementation, removed `multi_modal_runner_` and `prefill_inputs_` members, refactored all methods to use single `runner_` interface, renamed native methods from `append` to `prefill` |
| extension/android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.java | Updated native method declarations from `append` to `prefill` and revised documentation to reflect immediate KV cache population behavior |


Copilot · 2026-02-27T07:53:25Z

extension/android/jni/jni_layer_llama.cpp

+    if (!runner_) {
+      return 0;
    }
+    needs_bos_ = false;


Setting needs_bos_ to false before calling runner_->generate() is problematic because if generate() fails (e.g., due to encoding errors or other issues), needs_bos_ will remain false even though no actual generation occurred. This corrupts the state and means subsequent prefill calls won't apply BOS tokens even though they should. The needs_bos_ flag should only be set to false after verifying that generate() succeeded. However, since generate() currently doesn't check the return value (see separate comment), this should be fixed together with proper error handling.

Copilot · 2026-02-27T07:53:25Z

extension/android/jni/jni_layer_llama.cpp

+    runner_->generate(
+        prompt->toStdString(),
+        config,
+        token_callback,
+        [callback](const llm::Stats& result) { callback->onStats(result); });


The generate method does not check the return value of runner_->generate(), which can return Error::InvalidArgument or other errors. If generate() fails, the error is silently ignored and 0 (success) is returned to the Java layer. This can lead to incorrect behavior and makes debugging difficult. The method should check the return value and propagate errors appropriately, similar to how the prefill_* methods handle errors.

Suggested change

-    runner_->generate(
-        prompt->toStdString(),
-        config,
-        token_callback,
-        [callback](const llm::Stats& result) { callback->onStats(result); });
+    auto result = runner_->generate(
+        prompt->toStdString(),
+        config,
+        token_callback,
+        [callback](const llm::Stats& result) { callback->onStats(result); });
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
+    }

Use num_bos_ (constructor) as the single BOS source in generate(), matching what prefill methods already use. This prevents inconsistency when the per-call num_bos parameter differs from the constructor value. Add block comment above prefill methods documenting the eager execution model, BOS tracking via needs_bos_, and the expected prefill+generate interaction (including echo behavior).


Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.


Copilot · 2026-02-27T20:14:46Z

extension/android/jni/jni_layer_llama.cpp

+    int32_t bos = needs_bos_ ? num_bos_ : 0;
+    needs_bos_ = false;
+    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
+    }


needs_bos_ is cleared before checking whether runner_->prefill(...) succeeded. If prefill fails, subsequent prefill/generate calls won’t add BOS even though no successful prefill occurred. Only set needs_bos_ = false after result.ok() (or restore it on error).
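
The ordering problem raised in this comment can be reproduced in isolation. The sketch below uses a hypothetical `FlakyRunner` that fails on its first call and a hypothetical `PrefillState` wrapper; neither exists in the PR, and the `Error` enum is reduced to two illustrative values. It only contrasts clearing the flag before the call with clearing it after a successful call.

```cpp
#include <cassert>

// Hypothetical, reduced error type for the example.
enum class Error { Ok, Internal };

// Fake runner: the first prefill fails, later ones succeed and record
// how many BOS tokens they were asked to apply.
struct FlakyRunner {
  int calls = 0;
  int bos_applied = 0;

  Error prefill(int num_bos) {
    if (calls++ == 0) return Error::Internal;
    bos_applied += num_bos;
    return Error::Ok;
  }
};

struct PrefillState {
  FlakyRunner runner;
  int num_bos = 1;
  bool needs_bos = true;

  // Ordering shown in the diff: the flag is cleared before the result is known.
  Error prefill_clear_first() {
    const int bos = needs_bos ? num_bos : 0;
    needs_bos = false;
    return runner.prefill(bos);
  }

  // Ordering proposed in review: clear the flag only after success.
  Error prefill_clear_on_success() {
    const int bos = needs_bos ? num_bos : 0;
    const Error err = runner.prefill(bos);
    if (err == Error::Ok) needs_bos = false;
    return err;
  }
};

// Retry after one failure under each ordering.
inline void bos_ordering_demo() {
  PrefillState early;
  early.prefill_clear_first();           // fails, but the flag is already cleared
  early.prefill_clear_first();           // succeeds with num_bos == 0
  assert(early.runner.bos_applied == 0); // BOS was never applied

  PrefillState late;
  late.prefill_clear_on_success();       // fails, flag stays set
  late.prefill_clear_on_success();       // succeeds with num_bos == 1
  assert(late.runner.bos_applied == 1);
}
```

With the first ordering, a retry after a failed prefill silently skips BOS; with the second, the retry behaves like a first call.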

Copilot · 2026-02-27T20:14:47Z

extension/android/jni/jni_layer_llama.cpp

+    int32_t bos = needs_bos_ ? num_bos_ : 0;
+    needs_bos_ = false;
+    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
    }


needs_bos_ is cleared before checking whether runner_->prefill(...) succeeded. If prefill fails, subsequent prefill/generate calls won’t add BOS even though no successful prefill occurred. Only set needs_bos_ = false after result.ok() (or restore it on error).

Copilot · 2026-02-27T20:14:47Z

extension/android/jni/jni_layer_llama.cpp

+    int32_t bos = needs_bos_ ? num_bos_ : 0;
+    needs_bos_ = false;
+    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
    }


needs_bos_ is cleared before checking whether runner_->prefill(...) succeeded. If prefill fails, subsequent prefill/generate calls won’t add BOS even though no successful prefill occurred. Only set needs_bos_ = false after result.ok() (or restore it on error).

Copilot · 2026-02-27T20:14:47Z

extension/android/jni/jni_layer_llama.cpp

+    int32_t bos = needs_bos_ ? num_bos_ : 0;
+    needs_bos_ = false;
+    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
    }


needs_bos_ is cleared before checking whether runner_->prefill(...) succeeded. If prefill fails, subsequent prefill/generate calls won’t add BOS even though no successful prefill occurred. Only set needs_bos_ = false after result.ok() (or restore it on error).

Copilot · 2026-02-27T20:14:47Z

extension/android/jni/jni_layer_llama.cpp

+        .echo = static_cast<bool>(echo),
+        .seq_len = seq_len,
+        .temperature = effective_temperature,
+        .num_bos = needs_bos_ ? num_bos_ : 0,


generate() ignores the num_bos argument and instead uses the constructor default num_bos_ when needs_bos_ is true. This changes the behavior of LlmGenerationConfig.getNumBos() / the JNI signature and makes per-call BOS control ineffective. Use the num_bos parameter for this call (or remove it from the JNI/Java API if it’s intentionally fixed at init).

Suggested change

-        .num_bos = needs_bos_ ? num_bos_ : 0,
+        .num_bos = needs_bos_ ? num_bos : 0,
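
To make the two options in this thread concrete, here is a small sketch of both BOS policies side by side. `BosSource` and its member names are hypothetical; the PR computes the value inline when filling GenerationConfig, and which policy is preferable depends on whether per-call BOS control is meant to remain part of the Java API.

```cpp
#include <cassert>

// Hypothetical helper contrasting the two sources of the BOS count.
struct BosSource {
  int ctor_num_bos;  // value fixed at construction (num_bos_ in the diff)
  bool needs_bos;    // true until a prefill or generate has consumed BOS

  // Policy in the PR: the constructor value is the only source.
  int from_constructor() const { return needs_bos ? ctor_num_bos : 0; }

  // Policy in the suggested change: honour the per-call argument.
  int from_call(int per_call_num_bos) const {
    return needs_bos ? per_call_num_bos : 0;
  }
};

// The policies differ only when the per-call value differs from the
// constructor value and BOS is still pending.
inline void bos_source_demo() {
  const BosSource fresh{1, true};
  assert(fresh.from_constructor() == 1);
  assert(fresh.from_call(2) == 2);

  const BosSource after_prefill{1, false};
  assert(after_prefill.from_constructor() == 0);
  assert(after_prefill.from_call(2) == 0);
}
```

Either way, once a prefill has consumed BOS both policies yield zero, so the double-BOS fix is preserved under both.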

Copilot · 2026-02-27T20:14:47Z

extension/android/jni/jni_layer_llama.cpp

      ss << "Invalid model type category: " << model_type_category_
         << ". Valid values are: " << MODEL_TYPE_CATEGORY_LLM << " or "
         << MODEL_TYPE_CATEGORY_MULTIMODAL;


When runner_ is null, load() always throws an exception saying the model type category is invalid. But create_text_llm_runner(...) / create_multimodal_runner(...) can also return nullptr on initialization failures (e.g., tokenizer not loaded, metadata read failure), which would make this message misleading. Consider distinguishing “unsupported model_type_category_” from “failed to create runner” and reporting the latter accurately.

Suggested change

-      ss << "Invalid model type category: " << model_type_category_
-         << ". Valid values are: " << MODEL_TYPE_CATEGORY_LLM << " or "
-         << MODEL_TYPE_CATEGORY_MULTIMODAL;
+      if (model_type_category_ != MODEL_TYPE_CATEGORY_LLM &&
+          model_type_category_ != MODEL_TYPE_CATEGORY_MULTIMODAL) {
+        ss << "Invalid model type category: " << model_type_category_
+           << ". Valid values are: " << MODEL_TYPE_CATEGORY_LLM << " or "
+           << MODEL_TYPE_CATEGORY_MULTIMODAL;
+      } else {
+        ss << "Failed to create runner for model type category: "
+           << model_type_category_;
+      }

Copilot · 2026-02-27T20:14:48Z

...android/executorch_android/src/main/java/org/pytorch/executorch/extension/llm/LlmModule.java

  /**
-   * Prefill a multimodal Module with the given images input.
+   * Prefill the KV cache with the given image input.
   *
   * @param image Input image as a byte array
   * @param width Input image width
   * @param height Input image height
   * @param channels Input image number of channels
-   * @return 0, as the updated starting position in KV cache of the input in the LLM is no longer
-   *     exposed to user.
+   * @return 0 on success


Javadoc says the image parameter is a “byte array”, but the API type is int[]. Update the parameter description to match the actual type (and expected value range) so callers don’t pass the wrong format.

Copilot · 2026-02-27T20:14:48Z

extension/android/jni/jni_layer_llama.cpp

+    int32_t bos = needs_bos_ ? num_bos_ : 0;
+    needs_bos_ = false;
+    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
    }


needs_bos_ is cleared before checking whether runner_->prefill(...) succeeded. If prefill fails, subsequent prefill/generate calls won’t add BOS even though no successful prefill occurred. Only set needs_bos_ = false after result.ok() (or restore it on error).

Copilot · 2026-02-27T20:14:48Z

extension/android/jni/jni_layer_llama.cpp

+    needs_bos_ = false;
+    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
    }


needs_bos_ is cleared before checking whether runner_->prefill(...) succeeded. If prefill fails, subsequent prefill/generate calls won’t add BOS even though no successful prefill occurred. Only set needs_bos_ = false after result.ok() (or restore it on error).

Suggested change

-    needs_bos_ = false;
-    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
-    if (!result.ok()) {
-      return static_cast<jint>(result.error());
-    }
+    auto result = runner_->prefill(inputs, bos, /*num_eos=*/0);
+    if (!result.ok()) {
+      return static_cast<jint>(result.error());
+    }
+    needs_bos_ = false;

Copilot · 2026-02-27T20:14:48Z

extension/android/jni/jni_layer_llama.cpp

+    needs_bos_ = false;
+    auto err = runner_->generate(
+        prompt->toStdString(),
+        config,
+        token_callback,
+        [callback](const llm::Stats& result) { callback->onStats(result); });
+    return static_cast<jint>(err);


needs_bos_ is set to false before verifying that runner_->generate(...) succeeded, and the return value from generate() is ignored. If generation fails, the JNI method still returns success (0) and future calls won’t prepend BOS. Capture and return/throw on the runtime::Error from runner_->generate(...), and only clear needs_bos_ on success.

Suggested change

-    needs_bos_ = false;
-    auto err = runner_->generate(
-        prompt->toStdString(),
-        config,
-        token_callback,
-        [callback](const llm::Stats& result) { callback->onStats(result); });
-    return static_cast<jint>(err);
+    auto err = runner_->generate(
+        prompt->toStdString(),
+        config,
+        token_callback,
+        [callback](const llm::Stats& result) { callback->onStats(result); });
+    if (err != Error::Ok) {
+      return static_cast<jint>(err);
+    }
+    needs_bos_ = false;
+    return static_cast<jint>(Error::Ok);

meta-cla bot added the CLA Signed label Feb 27, 2026

Update prefill javadoc to reflect unified runner semantics

6af3dc2

Remove "multimodal Module" wording since prefill methods now work through the unified IRunner for both text-only and multimodal models. Simplify return value docs.

kirklandsign marked this pull request as ready for review February 27, 2026 07:46

Copilot AI review requested due to automatic review settings February 27, 2026 07:46

Copilot started reviewing on behalf of kirklandsign February 27, 2026 07:47

kirklandsign added 3 commits February 26, 2026 23:48

Linter

af0403b

Linter

dfe6bd6

Fix generate() error handling and double-BOS in JNI layer

7eed30b

Return Error::InvalidState when runner_ is null instead of silently returning 0 (success). Use needs_bos_ to gate num_bos in GenerationConfig so that a prior prefill() call prevents generate() from adding BOS a second time.

Copilot AI reviewed Feb 27, 2026


kirklandsign added 3 commits February 26, 2026 23:54

Remove unnecessary comment block above prefill methods

f2f5444

Include text_llm_runner.h to fix unique_ptr conversion

d74c289

The compiler needs the full TextLLMRunner definition (not just the forward declaration from llm_runner_helper.h) to verify the inheritance from IRunner and allow unique_ptr<TextLLMRunner> to convert to unique_ptr<IRunner>.
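
The language rule behind this commit can be shown with a self-contained sketch. The class names below are simplified stand-ins rather than the real ExecuTorch classes, and `ForwardOnly` is a hypothetical type that is only ever forward-declared, to mimic what a translation unit sees when the defining header is not included.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Simplified stand-in for the real interface.
struct IRunner {
  virtual ~IRunner() = default;
  virtual int id() const = 0;
};

// Forward declaration only: the compiler cannot see any base classes.
struct ForwardOnly;

// Full definition visible: the inheritance from IRunner is known.
struct TextLLMRunner : IRunner {
  int id() const override { return 7; }
};

// With the complete type, the converting constructor of unique_ptr applies.
static_assert(
    std::is_convertible<std::unique_ptr<TextLLMRunner>,
                        std::unique_ptr<IRunner>>::value,
    "derived-to-base unique_ptr conversion needs the full definition");

// With only a forward declaration, no pointer conversion is available.
static_assert(!std::is_convertible<ForwardOnly*, IRunner*>::value,
              "an incomplete type has no known base classes");

inline std::unique_ptr<IRunner> make_text_runner() {
  std::unique_ptr<TextLLMRunner> concrete = std::make_unique<TextLLMRunner>();
  std::unique_ptr<IRunner> base = std::move(concrete);  // derived-to-base move
  return base;
}

// The converted pointer still dispatches to the derived implementation.
inline void conversion_demo() {
  assert(make_text_runner()->id() == 7);
}
```

The second `static_assert` is the situation the commit message describes: without the header that defines the class, the same assignment would not compile.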

Copilot AI review requested due to automatic review settings February 27, 2026 20:08

Copilot started reviewing on behalf of kirklandsign February 27, 2026 20:10

Propagate generate() error code to Java layer

0a3a5e4

runner_->generate() return value was silently ignored, always returning 0 (success). Now propagates the error code back to Java, consistent with how the prefill methods handle errors.

Copilot AI reviewed Feb 27, 2026
