# Arm Ethos-U NPU Backend Tutorial

<!----This will show a grid card on the page----->
::::{grid} 2

:::{grid-item-card} Tutorials we recommend you complete before this:
:class-card: card-prerequisites
* [Introduction to ExecuTorch](intro-how-it-works.md)
* [Getting Started](getting-started.md)
* [Building ExecuTorch with CMake](using-executorch-building-from-source.md)
:::

:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-prerequisites
In this tutorial you will learn how to export a simple PyTorch model for the ExecuTorch Ethos-U backend.
:::

::::

```{warning}
This delegate is under active development; for best results, please use a recent version.
The TOSA and Ethos-U backend support is reasonably mature and used in production by some users.
You may still encounter rough edges, and some features may be documented or planned but not yet implemented. Please refer to the in-tree documentation for the latest status of features.
```

```{tip}
If you are already familiar with this delegate, you may want to jump directly to the examples:
* [Examples in the ExecuTorch repository](https://github.com/pytorch/executorch/tree/main/examples/arm)
* [A command-line compiler for example models](https://github.com/pytorch/executorch/blob/main/examples/arm/aot_arm_compiler.py)
```

This tutorial serves as an introduction to using ExecuTorch to deploy PyTorch models on Arm® Ethos™-U targets. It is based on `ethos_u_minimal_example.ipynb`, provided in Arm's examples folder.

## Prerequisites

### Hardware

To successfully complete this tutorial, you will need a Linux machine with an aarch64 or x86_64 processor architecture, or a macOS™ machine with Apple® Silicon.

To enable development without a specific development board, we will be using a [Fixed Virtual Platform (FVP)](https://www.arm.com/products/development-tools/simulation/fixed-virtual-platforms), simulating the [Arm® Corstone™-300](https://developer.arm.com/Processors/Corstone-300) (cs300) and [Arm® Corstone™-320](https://developer.arm.com/Processors/Corstone-320) (cs320) systems. Think of it as virtual hardware.

### Software

First, you will need to install ExecuTorch. Please follow the recommended tutorials to set up a working ExecuTorch development environment.

In addition to this, you need to install a number of SDK dependencies for generating Ethos-U command streams. Scripts to automate this are available in the main [ExecuTorch repository](https://github.com/pytorch/executorch/tree/main/examples/arm/).
To install the Ethos-U dependencies, run
```bash
./examples/arm/setup.sh --i-agree-to-the-contained-eula
```
This will install:
- [TOSA Serialization Library](https://www.mlplatform.org/tosa/software.html) for serializing the Exir IR graph into TOSA IR.
- [Ethos-U Vela graph compiler](https://pypi.org/project/ethos-u-vela/) for compiling TOSA flatbuffers into an Ethos-U command stream.
- [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) for cross-compilation.
- [Corstone SSE-300 FVP](https://developer.arm.com/documentation/100966/1128/Arm--Corstone-SSE-300-FVP) for testing on the Ethos-U55 reference design.
- [Corstone SSE-320 FVP](https://developer.arm.com/documentation/109760/0000/SSE-320-FVP) for testing on the Ethos-U85 reference design.

## Set Up the Developer Environment

The `setup.sh` script generates a `setup_path.sh` script that you need to source whenever you restart your shell. Run:

```bash
source examples/arm/ethos-u-scratch/setup_path.sh
```

As a simple check that your environment is set up correctly, run `which FVP_Corstone_SSE-320` and make sure that the executable is located where you expect, in the `examples/arm` tree.

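As a further sanity check, the short Python sketch below looks up the installed tools on your `PATH`. The exact tool names are assumptions based on the components listed above; adjust them to match your installation.

```python
# Hedged sanity check: confirm that the tools installed by setup.sh are on
# PATH after sourcing setup_path.sh. The tool names below are assumptions
# based on the components listed earlier; adjust them to your installation.
import shutil

tools = ("vela", "arm-none-eabi-gcc", "FVP_Corstone_SSE-300", "FVP_Corstone_SSE-320")
for tool in tools:
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND - did you source setup_path.sh?'}")
```

If any tool prints `NOT FOUND`, re-run `setup.sh` and source `setup_path.sh` again before continuing.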
## Build

### Ahead-of-Time (AOT) components

The ExecuTorch Ahead-of-Time (AOT) pipeline takes a PyTorch model (a `torch.nn.Module`) and produces a `.pte` binary file, which is then consumed by the ExecuTorch Runtime. This [document](getting-started-architecture.md) goes into much more depth about the ExecuTorch software stack for both the AOT flow and the Runtime.

The example below shows how to quantize a model consisting of a single addition, and export it through the AOT flow using the Ethos-U backend. For more details, see `examples/arm/ethos_u_minimal_example.ipynb`.

```python
import torch

class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return x + y

example_inputs = (torch.ones(1, 1, 1, 1), torch.ones(1, 1, 1, 1))

model = Add()
model = model.eval()
exported_program = torch.export.export(model, example_inputs)
graph_module = exported_program.module()


from executorch.backends.arm.ethosu import EthosUCompileSpec
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Create a compilation spec describing the target, used to configure the quantizer.
# Some args are used by the Arm Vela graph compiler later in the example. Refer to the Arm Vela
# documentation for an explanation of its flags:
# https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/OPTIONS.md
compile_spec = EthosUCompileSpec(
    target="ethos-u55-128",
    system_config="Ethos_U55_High_End_Embedded",
    memory_mode="Shared_Sram",
    extra_flags=["--output-format=raw", "--debug-force-regor"],
)

# Create and configure the quantizer to use a symmetric quantization config globally on all nodes
quantizer = EthosUQuantizer(compile_spec)
operator_config = get_symmetric_quantization_config()
quantizer.set_global(operator_config)

# Post-training quantization
quantized_graph_module = prepare_pt2e(graph_module, quantizer)
quantized_graph_module(*example_inputs)  # Calibrate the graph module with the example inputs
quantized_graph_module = convert_pt2e(quantized_graph_module)


# Create a new exported program using the quantized_graph_module
quantized_exported_program = torch.export.export(quantized_graph_module, example_inputs)

from executorch.backends.arm.ethosu import EthosUPartitioner
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program

# Create the partitioner from the compile spec
partitioner = EthosUPartitioner(compile_spec)

# Lower the exported program to the Ethos-U backend
edge_program_manager = to_edge_transform_and_lower(
    quantized_exported_program,
    partitioner=[partitioner],
    compile_config=EdgeCompileConfig(
        _check_ir_validity=False,
    ),
)

# Convert the edge program to an ExecuTorch program
executorch_program_manager = edge_program_manager.to_executorch(
    config=ExecutorchBackendConfig(extract_delegate_segments=False)
)


# Save the pte file
save_pte_program(executorch_program_manager, "ethos_u_minimal_example.pte")
```

```{tip}
For a quick start, you can use the script `examples/arm/aot_arm_compiler.py` to produce the pte file.
To produce a pte file equivalent to the one above, run
`python -m examples.arm.aot_arm_compiler --model_name=add --delegate --quantize --output=ethos_u_minimal_example.pte`
```

### Runtime

After the AOT compilation flow is done, the runtime can be cross-compiled and linked to the produced `.pte` file using the Arm cross-compilation toolchain. This is done in two steps:

First, build and install the ExecuTorch libraries and the EthosUDelegate:
```bash
# In the ExecuTorch top-level directory, with setup_path.sh sourced
cmake -DCMAKE_BUILD_TYPE=Release --preset arm-baremetal -B cmake-out-arm .
cmake --build cmake-out-arm --target install -j$(nproc)
```
Second, build and link the `arm_executor_runner` and generate kernel bindings for any non-delegated ops. This is the actual program that will run on target.

```bash
# In the ExecuTorch top-level directory, with setup_path.sh sourced
cmake -DCMAKE_TOOLCHAIN_FILE=`pwd`/examples/arm/ethos-u-setup/arm-none-eabi-gcc.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DET_PTE_FILE_PATH=ethos_u_minimal_example.pte \
    -DTARGET_CPU=cortex-m55 \
    -DETHOSU_TARGET_NPU_CONFIG=ethos-u55-128 \
    -DMEMORY_MODE=Shared_Sram \
    -DSYSTEM_CONFIG=Ethos_U55_High_End_Embedded \
    -Bethos_u_minimal_example \
    examples/arm/executor_runner
cmake --build ethos_u_minimal_example -j$(nproc) -- arm_executor_runner
```

```{tip}
For a quick start, you can use the script `backends/arm/scripts/build_executor_runner.sh` to build the runner.
To build a runner equivalent to the one above, run
`./backends/arm/scripts/build_executor_runner.sh --pte=ethos_u_minimal_example.pte`
```

The block diagram below shows, at a high level, how the various build artifacts are generated and linked together to produce the final bare-metal executable.

![](arm-delegate-runtime-build.svg)

## Running on Corstone FVP Platforms

Finally, use the `backends/arm/scripts/run_fvp.sh` utility script to run the `.elf` file on simulated Arm hardware.
```bash
backends/arm/scripts/run_fvp.sh --elf=$(find ethos_u_minimal_example -name arm_executor_runner) --target=ethos-u55-128
```
The example application is by default built with inputs of ones, so the expected result of the quantized addition should be close to 2.

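To see why the result is approximately rather than exactly 2, the sketch below mimics the integer arithmetic that a symmetric int8 quantization scheme performs for this addition. The scales used here are illustrative assumptions, not the values the quantizer actually calibrates.

```python
# Illustrative sketch of symmetric int8 quantization arithmetic. The scales
# are assumed for this example and are not taken from the real quantizer.
INT8_MIN, INT8_MAX = -128, 127

def quantize(x: float, scale: float) -> int:
    """Map a float to an int8 value using a symmetric scale."""
    return max(INT8_MIN, min(INT8_MAX, round(x / scale)))

def dequantize(q: int, scale: float) -> float:
    return q * scale

in_scale = 1.0 / 127.0   # assumed input scale after calibrating on ones
out_scale = 2.0 / 127.0  # assumed output scale after calibrating on the sum

x_q = quantize(1.0, in_scale)
y_q = quantize(1.0, in_scale)
# The NPU adds in a wide integer accumulator, then requantizes the result
# to the output scale and clamps it back into int8 range.
acc = x_q + y_q
out_q = max(INT8_MIN, min(INT8_MAX, round(acc * in_scale / out_scale)))
result = dequantize(out_q, out_scale)
print(result)  # within one quantization step of 2.0
```

Any error relative to the float result is bounded by the output quantization step, which is why the FVP run above is expected to print a value close to, but not necessarily exactly, 2.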
## Takeaways

In this tutorial you have learned how to use ExecuTorch to export a PyTorch model to an executable that can run on an embedded target, and then run that executable on simulated hardware.
To learn more, check out these learning paths:

* <https://learn.arm.com/learning-paths/embedded-and-microcontrollers/rpi-llama3/>
* <https://learn.arm.com/learning-paths/embedded-and-microcontrollers/visualizing-ethos-u-performance/>

## FAQs

If you encounter any bugs or issues following this tutorial, please file an issue on [GitHub](https://github.com/pytorch/executorch/issues/new).

```
Arm is a registered trademark of Arm Limited (or its subsidiaries or affiliates).
```