
[RFC] Move Transducer (RNN-T/TDT) support to extension/asr/runner/ #17686

@kirklandsign


πŸš€ The feature, motivation and pitch

Motivation

extension/asr/runner/ currently provides AsrRunner, which only supports Seq2Seq (encoder-decoder) models like Whisper. The decode loop assumes a standard autoregressive pattern: encoder β†’ text_decoder(input_ids, encoder_output, cache_position) β†’ logits β†’ sample β†’ next_token.

Transducer-based ASR models (RNN-T, TDT, HAT) use a fundamentally different decode paradigm β€” frame-by-frame scanning with a joint network β€” and cannot reuse AsrRunner. As a result, the Parakeet TDT runner (examples/models/parakeet/main.cpp) implements the entire decode algorithm inline (~200 lines of greedy decode + LSTM state management), making it hard to reuse for other transducer models.
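For concreteness, the transducer decode loop looks roughly like the following. This is a minimal, self-contained sketch with a stubbed joint network; `greedy_decode`, `kBlankId`, and the stub are illustrative only, not the ExecuTorch API or the exact Parakeet logic:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative greedy transducer decode (not the ExecuTorch API).
// The joint network is stubbed: it "emits" each non-blank frame's label
// once and then predicts blank, which is roughly how a trained joint behaves.
constexpr int64_t kBlankId = 0;
constexpr int kMaxSymbolsPerStep = 10;

std::vector<int64_t> greedy_decode(const std::vector<int64_t>& encoder_frames) {
  std::vector<int64_t> tokens;
  for (int64_t frame : encoder_frames) {  // scan encoder output frame by frame
    for (int s = 0; s < kMaxSymbolsPerStep; ++s) {
      // Stub for joint(encoder_frame, predictor_state): emit once, then blank.
      int64_t pred = (s == 0) ? frame : kBlankId;
      if (pred == kBlankId) {
        break;  // blank: advance to the next frame, predictor state unchanged
      }
      tokens.push_back(pred);  // non-blank: emit and stay on the same frame
      // A real runner would update the prediction-network (LSTM) state here.
    }
  }
  return tokens;
}
```

Note the inner loop: unlike the seq2seq pattern, the model may emit zero or several tokens per encoder frame, which is why the loop structure cannot be expressed through `AsrRunner`'s one-token-per-step decode.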

Proposal

Restructure extension/asr/runner/ to support both architectures:

  1. Rename AsrRunner β†’ Seq2SeqRunner to clarify that it's Seq2Seq-specific
  2. Add TransducerRunner for RNN-T/TDT models, extracting the core decode logic from Parakeet's main.cpp
  3. Keep both in the same flat directory (no subdirectories)

Proposed file layout

```
extension/asr/runner/
├── CMakeLists.txt
├── seq2seq_runner.h         # renamed from runner.h
├── seq2seq_runner.cpp       # renamed from runner.cpp
├── transducer_runner.h      # new
└── transducer_runner.cpp    # new
```

TransducerRunner sketch

```cpp
namespace executorch::extension::asr {

struct TransducerConfig {
  int64_t blank_id = 0;
  int64_t num_rnn_layers = 2;
  int64_t pred_hidden = 640;
  int64_t max_symbols_per_step = 10;
  // TDT duration values; empty = standard RNN-T (duration always 1)
  std::vector<int> durations = {};
};

class TransducerRunner {
 public:
  TransducerRunner(
      const std::string& module_path,
      const std::string& tokenizer_path,
      TransducerConfig config);

  Error load();

  // Returns decoded token IDs with frame offsets
  Result<std::vector<Token>> transcribe(
      TensorPtr preprocessed_features,
      std::function<void(const std::string&)> token_callback = {});
};

}  // namespace executorch::extension::asr
```

Expected module methods: encoder, decoder_step, joint (+ optional preprocessor).
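To illustrate how `durations` would be consumed: in TDT, the joint additionally predicts a duration index, and the frame pointer advances by the corresponding duration instead of always by 1. The sketch below is a hedged stand-in for how `transcribe` might drive the decode; `tdt_decode` and the `joint` callable are hypothetical placeholders for the module's `joint`/`decoder_step` calls, not a proposed signature:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Illustrative TDT decode skeleton (not the ExecuTorch API). The joint
// stand-in returns {token_id, duration_index}; `durations` maps that index
// to a frame advance. An empty `durations` degrades to standard RNN-T,
// where blank advances exactly one frame.
std::vector<int64_t> tdt_decode(
    size_t num_frames,
    const std::vector<int>& durations,  // e.g. {0, 1, 2, 3, 4} for TDT
    const std::function<std::pair<int64_t, int>(size_t)>& joint,  // stub
    int64_t blank_id = 0,
    int max_symbols_per_step = 10) {
  std::vector<int64_t> tokens;
  size_t t = 0;
  while (t < num_frames) {
    int emitted = 0;
    int advance = 1;  // RNN-T default: move one frame on blank
    while (emitted < max_symbols_per_step) {
      auto [token, dur_idx] = joint(t);
      if (!durations.empty()) {
        advance = durations[dur_idx];  // TDT: joint also predicts a skip
      }
      if (token == blank_id) break;
      tokens.push_back(token);
      ++emitted;
      if (advance > 0) break;  // nonzero duration ends this frame's emissions
    }
    t += static_cast<size_t>(advance > 0 ? advance : 1);
  }
  return tokens;
}
```

Keeping this frame-advance logic behind `TransducerConfig::durations` is what lets one runner serve both standard RNN-T and TDT checkpoints.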

What stays in examples/models/parakeet/

Model-specific post-processing (timestamp computation at token/word/segment level) remains in the example β€” it's not general enough for a shared runner.

Migration

  • Whisper main.cpp: AsrRunner β†’ Seq2SeqRunner (one-line rename)
  • Parakeet main.cpp: replace inline decode with TransducerRunner::transcribe()
  • Downstream consumers of AsrRunner: update include path and class name

Alternatives

No response

Additional context

No response

RFC (Optional)

No response

cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng
