🚀 The feature, motivation and pitch
Measuring inference and load-time memory consumption is a common task when looking to use ExecuTorch. It's useful for determining whether a model can actually run on consumer hardware and for checking parity against other inference solutions. Users can profile their entire app, but it takes some work to figure out which memory is owned by ExecuTorch (ET), and whole-app profiling doesn't give a detailed breakdown of where the memory usage comes from.
We have performance profiling in the ExecuTorch devtools. It would be good to also have memory profiling support. What exactly is captured at runtime needs some discussion, but as a user, I'd ideally want to know the following:
- What is the peak memory used by the framework (including delegates) during inference?
- How much is owned by ET? (One rough way to approximate this today is sketched after this list.)
- How much is owned by delegates?
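For the ET-owned portion specifically, the best workaround I'm aware of today is to wrap the allocators handed to the runtime and record a high-water mark. Below is a minimal sketch, assuming `MemoryAllocator::allocate()` is virtual and takes `(size, alignment)` as in `runtime/core/memory_allocator.h`; the namespace and exact signature may differ across releases, and `TrackingAllocator` is a hypothetical name, not an existing devtools API.

```cpp
// Hypothetical sketch: record a high-water mark on a MemoryAllocator.
// Assumes allocate() is virtual with (size, alignment) parameters; verify
// against the MemoryAllocator in your ExecuTorch version.
#include <algorithm>
#include <cstddef>
#include <cstdint>

#include <executorch/runtime/core/memory_allocator.h>

class TrackingAllocator : public executorch::runtime::MemoryAllocator {
 public:
  TrackingAllocator(uint32_t size, uint8_t* base_address)
      : executorch::runtime::MemoryAllocator(size, base_address) {}

  void* allocate(size_t size, size_t alignment) override {
    void* ptr = MemoryAllocator::allocate(size, alignment);
    if (ptr != nullptr) {
      // Ignores alignment padding, so this slightly under-counts.
      used_bytes_ += size;
      peak_bytes_ = std::max(peak_bytes_, used_bytes_);
    }
    return ptr;
  }

  size_t peak_bytes() const { return peak_bytes_; }

 private:
  size_t used_bytes_ = 0;
  size_t peak_bytes_ = 0;
};
```

Something like this only sees allocations routed through the wrapped allocator; it says nothing about memory that delegates or kernels allocate internally, which is exactly the gap this request is about.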
Some nice-to-haves:
- How much (peak) memory did we use during model load?
- Can we take RSS delta measurements on supported systems? (A minimal Linux sketch follows this list.)
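As a baseline for the RSS-delta idea, here is a minimal Linux sketch using `getrusage()`. The `peak_rss_bytes()` helper is hypothetical (not part of ExecuTorch), and the model-load call site is elided.

```cpp
// Minimal Linux sketch of an RSS delta around model load. ru_maxrss is the
// process's peak RSS in KiB on Linux (bytes on macOS); reading
// /proc/self/statm would give current RSS instead of the high-water mark.
#include <sys/resource.h>

#include <cstdio>

static long peak_rss_bytes() {
  struct rusage usage{};
  if (getrusage(RUSAGE_SELF, &usage) != 0) {
    return -1;
  }
  return usage.ru_maxrss * 1024L;  // KiB -> bytes on Linux.
}

int main() {
  const long before = peak_rss_bytes();

  // ... load the .pte and initialize the method here ...

  const long after = peak_rss_bytes();
  std::printf("load raised peak RSS by %ld bytes\n", after - before);
  return 0;
}
```

The caveat is that RSS deltas include everything in the process (allocator caches, other threads), so they're a sanity check rather than a precise attribution, which is why built-in devtools support would still help.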
There is some overlap between this and memory planning visualization. In my opinion, memory profiling tooling fills an important role that static memory planning visualization can't: it covers delegates that allocate memory dynamically, as well as other dynamic allocations (such as from kernels or tensors with unbound dynamic shapes).
Alternatives
No response
Additional context
No response
RFC (Optional)
No response
cc @Jack-Khuu