Description
🚀 The feature, motivation and pitch
Context
Previously (late 2023) we kicked off our LLM workstream by enabling llama2 on ET. Both the export and runner code live under `examples/models/llama2`. As we support more and more models (llama3, phi-3, llava, etc.), we created the `extension/llm` folder to avoid duplicating code across these models.
Later (June 2024) we started torchchat, and due to its urgency we decided to duplicate code instead of reusing code from ET.
Now (Oct 2024) is a good time to consolidate the APIs we want to expose from ET and let torchchat reuse them. This work also makes it easier for external users to use ET for LLMs.
Problems
Looking at the LLM-related code inside ET and in torchchat, I can see several problems:
Export flow:
- A lot of the features eligible for sharing still live in `examples/models/llama/export_llama_lib.py`.
- ET users are still writing their own export flows, which may not be fully optimized.
- Missing features such as multiple entry points, for example exporting a single `.pte` file that contains both a vision encoder and a text decoder for multimodality.
- Option names are not descriptive or clear enough, and docstrings and help text are lacking.
Runner:
- Inside the ET examples, multiple runner implementations exist. This causes issues when we integrate these runners into demo apps, since we have to write a JNI layer for each runner. E.g., MediaTek has its own runner.
- Distribution channels need to be consolidated. We should prebuild the runner code using the iOS and Android toolchains and distribute the libraries/artifacts to users. For users with special toolchains, we should provide good support so they can build their own runner from source.
- Other API changes: make the runner resemble Hugging Face's transformers API (see the sketch below for the rough API shape meant here).
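
For reference, the Hugging Face transformers API shape the runner could mirror looks roughly like the following. This is standard transformers usage, shown only to illustrate the target API shape, not existing ET runner code.

```python
# Standard Hugging Face transformers usage, shown only as the API shape the
# ET runner could mirror (tokenize -> generate -> decode).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```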
Alternatives
The alternative is to do nothing, which means `extension/llm` will not be used by external users because its API and documentation are not good enough.
Additional context
No response
RFC (Optional)
What should we offer, and what should the APIs look like? Breaking this down into several categories:
Model definition
- Our LLM library should not hold any model definitions, given that the intention is for users to apply the export flow and runner to a generic transformer-based LLM. This implies the export utils and runner should work on most LLMs (e.g., from torchtune or Hugging Face). In the ET examples folder, we can showcase some of the models working with our LLM library.
- However, we will provide some example module definitions, such as SDPA. This is because we use source transformation to replace these modules with custom ops that provide better performance on ET. Since these custom ops are tightly coupled with the example modules, it makes sense to provide sample implementations. See more in the source transformation section.
  - Proposal: a `modules/` directory under `extension/llm` for hosting these special modules. They will work with source transformations and custom ops (see the sketch after this list for a rough illustration).
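
To make the idea concrete, here is a minimal sketch of what such a module and the source transformation that swaps it in could look like. All class and function names here (`StandardSDPA`, `CustomOpSDPA`, `replace_sdpa`) are hypothetical illustrations, not existing code in `extension/llm`.

```python
import torch
from torch import nn

class StandardSDPA(nn.Module):
    """Stand-in for an attention block written with plain PyTorch ops."""
    def forward(self, q, k, v):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

class CustomOpSDPA(nn.Module):
    """Example module that would live under extension/llm/modules/.

    A real implementation would dispatch to an ET custom op (registered via
    torch.ops); the plain fallback here just keeps the sketch runnable.
    """
    def forward(self, q, k, v):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

def replace_sdpa(module: nn.Module) -> nn.Module:
    """Source transformation: swap every StandardSDPA child for CustomOpSDPA."""
    for name, child in module.named_children():
        if isinstance(child, StandardSDPA):
            setattr(module, name, CustomOpSDPA())
        else:
            replace_sdpa(child)
    return module
```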
Export flow
- `LLMEdgeManager` as our main `.pte` export helper class. It takes a `torch.nn.Module` along with other configs and provides APIs to quantize, lower to edge dialect, lower to different backends, and eventually lower to ExecuTorch.
  - Proposal: add a top-level entry point (a function like `executorch.extension.llm.export()`, to follow the PyTorch convention). The function will return an `LLMEdgeManager`, and users will call `quantize()`, `source_transformation()`, etc. (see the sketch after this list).
- Source transformations. These files are currently sitting under `examples/llama/source_transformation`, but they can be applied to other models as well.
  - Proposal: move them to `extension/llm/export/source_transformation`. Ideally, source transformations should not target customized `torch.nn.Module`s and should only target standard PyTorch `torch.nn.Module`s (the sketch under Model definition above shows the kind of swap meant here).
- Quantization: currently we have quantizers defined in `extension/llm/export/quantizer_lib.py`; we should keep them there. There is quantization code in the source transformations as well; we should figure out whether we can migrate it to torchao's `quantize_()` API (a sketch follows this list).
- Partitioner: currently we have partitioners defined in `extension/llm/export/partitioner_lib.py`; we should keep them there.
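
A rough sketch of how the proposed top-level entry point could be used. `export()` returning an `LLMEdgeManager` and the `source_transformation()`/`quantize()` calls follow the proposal above, but every signature and the lowering steps shown are assumptions about a not-yet-final API, not the current implementation.

```python
# Illustrative only: the proposed user-facing flow, with assumed names/signatures.
import torch
from executorch.extension.llm import export  # proposed top-level entry point

def load_my_llm() -> torch.nn.Module:
    """Placeholder: any generic transformer-based LLM (torchtune, Hugging Face, ...)."""
    raise NotImplementedError

model = load_my_llm()

manager = export(model)                         # returns an LLMEdgeManager
manager = manager.source_transformation([...])  # e.g. swap SDPA for custom-op modules
manager = manager.quantize(...)                 # e.g. with a quantizer from quantizer_lib
# ...followed by assumed lowering steps (edge dialect, backend delegation) and
# serialization of the resulting ExecuTorch program to a .pte file.
```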
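For the torchao migration mentioned above, the `quantize_()` API applies a quantization config to a model in place. A minimal sketch, assuming a weight-only int8 config purely as an example (the actual configs we would use are still to be decided):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

def apply_torchao_quant(model: torch.nn.Module) -> torch.nn.Module:
    # quantize_() mutates the model in place, replacing eligible linear layers'
    # weights with quantized representations; int8 weight-only is one example config.
    quantize_(model, int8_weight_only())
    return model
```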
C++ Runner & Tokenizer
- `Sampler` class
  - Proposal: take `temperature` as an argument to the `sample()` method instead of an argument to the constructor.
- `Tokenizer` base class, with `BPETokenizer` and `Tiktoken` extending it.
  - Proposal: merge with torchchat's tokenizer implementations. Absorb `SPTokenizer` from torchchat. This means torchchat will start to use the tokenizer from ET.
- `Runner` and other components such as `TextPrefiller`, `ImagePrefiller`, `TextDecoder`, and `TextTokenGenerator`
  - Proposal: put them into the iOS SwiftPM package for iOS developers to use.
  - Proposal: migrate the existing runners in the ET examples to use the `Runner` base class so the JNI layer stays simple.