Description
🚀 The feature, motivation and pitch
Context
Previously (late 2023) we kicked off our LLM workstream by enabling llama2 on ET. Both the export and runner code live under `examples/models/llama2`. As we support more and more models (llama3, phi-3, llava, etc.), we created the `extension/llm` folder to avoid duplicating code across these models.
Later (June 2024) we started torchchat, and due to its urgency we decided to duplicate code instead of reusing code from ET.
Now (Oct 2024) is a good time to consolidate the APIs we want to expose from ET and let torchchat reuse them. This work also makes it easier for external users to use ET for LLMs.
Problems
Looking at the LLM-related code inside ET and in torchchat, I can see several problems:
Export flow:
- A lot of the features eligible for sharing still live in `examples/models/llama/export_llama_lib.py`.
- ET users are still writing their own export flows, which may not be fully optimized.
- Missing features such as multiple entry points, for example exporting a single `.pte` file that contains both a vision encoder and a text decoder for multimodality.
- Option names are not descriptive or clear enough, and docstrings and help text are lacking.
Runner:
- Inside the ET examples, multiple runner implementations exist. This causes issues when we integrate these runners into demo apps, since we have to write a JNI layer for each runner. E.g., MediaTek has its own runner.
- Distribution channels need to be consolidated. We should prebuild the runner code using the iOS and Android toolchains and distribute the libraries/artifacts to users. For users with special toolchains, we should provide good support so they can build their own runner from source.
- Other API changes: make the runner resemble Hugging Face's transformers API (see the sketch below for the rough API shape meant here).
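
For reference, the Hugging Face transformers API shape the runner could mirror looks roughly like the following. This is standard transformers usage, shown only to illustrate the target API shape, not existing ET runner code.

```python
# Standard Hugging Face transformers usage, shown only as the API shape the
# ET runner could mirror (tokenize -> generate -> decode).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```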
Alternatives
The alternative is to do nothing, which means `extension/llm` will not be used by external users because its API and documentation are not good enough.
Additional context
No response
RFC (Optional)
What should we offer, and what should the APIs look like? Breaking this down into several categories:
Model definition
- Our LLM library should not hold any model definitions, given that the intention is for users to apply the export flow and runner to a generic transformer-based LLM. This implies the export utils and runner should work on most LLMs (e.g., from torchtune or Hugging Face). In the ET examples folder, we can showcase some of the models working with our LLM library.
- However, we will provide some example module definitions, such as SDPA. This is because we use source transformation to replace these modules with custom ops that provide better performance on ET. Since these custom ops are tightly coupled with the example modules, it makes sense to provide sample implementations. See more in the source transformation section.
  - Proposal: a `modules/` directory under `extension/llm` for hosting these special modules. They will work with source transformations and custom ops (see the sketch after this list for a rough illustration).
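
To make the idea concrete, here is a minimal sketch of what such a module and the source transformation that swaps it in could look like. All class and function names here (`StandardSDPA`, `CustomOpSDPA`, `replace_sdpa`) are hypothetical illustrations, not existing code in `extension/llm`.

```python
import torch
from torch import nn

class StandardSDPA(nn.Module):
    """Stand-in for an attention block written with plain PyTorch ops."""
    def forward(self, q, k, v):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

class CustomOpSDPA(nn.Module):
    """Example module that would live under extension/llm/modules/.

    A real implementation would dispatch to an ET custom op (registered via
    torch.ops); the plain fallback here just keeps the sketch runnable.
    """
    def forward(self, q, k, v):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

def replace_sdpa(module: nn.Module) -> nn.Module:
    """Source transformation: swap every StandardSDPA child for CustomOpSDPA."""
    for name, child in module.named_children():
        if isinstance(child, StandardSDPA):
            setattr(module, name, CustomOpSDPA())
        else:
            replace_sdpa(child)
    return module
```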
Export flow
- `LLMEdgeManager` as our main `.pte` export helper class. It takes a `torch.nn.Module` along with other configs and provides APIs to quantize, lower to edge dialect, lower to different backends, and eventually lower to ExecuTorch.
  - Proposal: add a top-level entry point (a function like `executorch.extension.llm.export()`, to follow the PyTorch convention). The function will return an `LLMEdgeManager`, and users will call `quantize()`, `source_transformation()`, etc. (see the sketch after this list).
- Source transformations. These files are currently sitting under `examples/llama/source_transformation`, but they can be applied to other models as well.
  - Proposal: move them to `extension/llm/export/source_transformation`. Ideally, source transformations should not target customized `torch.nn.Module`s and should only target standard PyTorch `torch.nn.Module`s (the sketch under Model definition above shows the kind of swap meant here).
- Quantization: currently we have quantizers defined in `extension/llm/export/quantizer_lib.py`; we should keep them there. There is quantization code in the source transformations as well; we should figure out whether we can migrate it to torchao's `quantize_()` API (a sketch follows this list).
- Partitioner: currently we have partitioners defined in `extension/llm/export/partitioner_lib.py`; we should keep them there.
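
A rough sketch of how the proposed top-level entry point could be used. `export()` returning an `LLMEdgeManager` and the `source_transformation()`/`quantize()` calls follow the proposal above, but every signature and the lowering steps shown are assumptions about a not-yet-final API, not the current implementation.

```python
# Illustrative only: the proposed user-facing flow, with assumed names/signatures.
import torch
from executorch.extension.llm import export  # proposed top-level entry point

def load_my_llm() -> torch.nn.Module:
    """Placeholder: any generic transformer-based LLM (torchtune, Hugging Face, ...)."""
    raise NotImplementedError

model = load_my_llm()

manager = export(model)                         # returns an LLMEdgeManager
manager = manager.source_transformation([...])  # e.g. swap SDPA for custom-op modules
manager = manager.quantize(...)                 # e.g. with a quantizer from quantizer_lib
# ...followed by assumed lowering steps (edge dialect, backend delegation) and
# serialization of the resulting ExecuTorch program to a .pte file.
```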
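For the torchao migration mentioned above, the `quantize_()` API applies a quantization config to a model in place. A minimal sketch, assuming a weight-only int8 config purely as an example (the actual configs we would use are still to be decided):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

def apply_torchao_quant(model: torch.nn.Module) -> torch.nn.Module:
    # quantize_() mutates the model in place, replacing eligible linear layers'
    # weights with quantized representations; int8 weight-only is one example config.
    quantize_(model, int8_weight_only())
    return model
```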
C++ Runner & Tokenizer
- `Sampler` class
  - Proposal: take `temperature` as an argument to the `sample()` method instead of an argument to the constructor.
- `Tokenizer` base class, with `BPETokenizer` and `Tiktoken` extending it.
  - Proposal: merge with torchchat's tokenizer implementations. Absorb `SPTokenizer` from torchchat. This means torchchat will start to use the tokenizer from ET.
- `Runner` and other components such as `TextPrefiller`, `ImagePrefiller`, `TextDecoder`, and `TextTokenGenerator`
  - Proposal: put them into the iOS SwiftPM package for iOS developers to use.
  - Proposal: migrate the existing runners in the ET examples to use the `Runner` base class so the JNI layer stays simple.