Request for clarifications regarding hardware capabilities for shared mappings #7

lukewagner opened this issue Dec 17, 2024 · 8 comments


@lukewagner

In the target platforms doc, there is a section explaining that the E-SIG feels comfortable assuming an MPU or MMU for efficient implementations. This is very helpful since it gives us a baseline of memory-related hardware capabilities that we can assume when considering implementation strategies for various features.

Just to state some assumptions, to see if I'm understanding these terms correctly (folks can correct me if I'm wrong):

  • an MMU includes some sort of page table for mapping virtual addresses to physical addresses, allowing page-granularity control over both access rights and sharing (where multiple virtual addresses map to the same physical address);
  • an MPU includes a finite set of protection registers that can be used to configure, at runtime, access rights for a finite set of address ranges (see the schematic sketch below).
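
For illustration only, here's how that contrast might be sketched in C; the struct fields are invented for this sketch and don't correspond to any particular architecture's register layout:

```c
#include <stdint.h>
#include <stddef.h>

/* Schematic only: field names are invented, not any real architecture. */

/* MMU: a page table (effectively unbounded, held in RAM) maps virtual
 * pages to physical frames, so many virtual pages can alias one frame. */
struct page_table_entry {
    uintptr_t physical_frame;  /* translation target */
    uint32_t  readable   : 1;
    uint32_t  writable   : 1;
    uint32_t  executable : 1;
};

/* MPU: a small fixed set of region registers grants access rights over
 * address ranges; there is no translation step at all. */
struct mpu_region {
    uintptr_t base;            /* start of the protected range */
    size_t    size;            /* extent of the range */
    uint32_t  readable   : 1;
    uint32_t  writable   : 1;
    uint32_t  executable : 1;
};
struct mpu_region mpu_regions[8];  /* hardware-limited count, e.g. 8 or 16 */
```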

Based on these assumptions, one corollary is that any feature that requires fine-grained/page-level control over an unbounded set of memory ranges wouldn't work on an MPU and thus would not work across all the target platforms.

But one thing that's unclear to me is whether an MPU allows mapping a finite set of distinct "virtual" addresses to the same "physical" page of RAM. Hypothetically, it seems like an MPU could do this without blowing its limited hardware budget by including an "offset" field in its per-region configuration state that is added as part of address translation. With this hardware capability, we could, e.g., efficiently support a small finite number of "mmap()s" (or features that want to be implemented in terms of mmap()). However, scanning the docs of the popular ARM Cortex-M MPU, it seems like at least this one popular MPU can't do such shared mappings. That said, I've also heard that various extensions outside the official ARM Cortex-M MPU extension might allow this? And I have no idea about the wider world of MPUs.
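
To make that hypothetical concrete, here is a sketch of what such an offset-translating MPU region might look like; every field and address here is invented for the illustration, and no shipping MPU is claimed to work this way:

```c
#include <stdint.h>
#include <stddef.h>

#define SHARED_PHYS 0x20008000u  /* made-up physical page for the example */

/* Hypothetical MPU region with an added translation offset. Entirely
 * invented: the ARM Cortex-M MPU, for one, has no such field. */
struct hypothetical_mpu_region {
    uintptr_t base;    /* start of the "virtual" range */
    size_t    size;
    intptr_t  offset;  /* added during translation: phys = addr + offset */
    uint32_t  rights;  /* R/W/X bits */
};

/* Two regions aliasing the same physical page -- a bounded emulation
 * of mmap()-style sharing, costing only two region slots. */
struct hypothetical_mpu_region regions[2] = {
    { 0x60000000u, 0x1000, (intptr_t)(SHARED_PHYS - 0x60000000u), 0x3 /*RW*/ },
    { 0x70000000u, 0x1000, (intptr_t)(SHARED_PHYS - 0x70000000u), 0x3 /*RW*/ },
};
```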

So, what I'd love to get a clarification on (here and, ideally, in the doc) is whether the E-SIG believes that such shared mappings are indeed possible across the target platforms' MPUs or perhaps whether we want to raise the baseline to assume an MMU.

One discussion where this is concretely relevant is in the WebAssembly CG memory-control proposal (and its collection of sub-proposals), e.g., memory-control/#19, which does seem to assume an MMU.

@lum1n0us

IMO, some RTOS and bare-metal applications need memory protection but don't require full virtual memory support; therefore, having an MMU is optional.

In scenarios without an MMU, a few customers opt to extend the sandbox to cover the entire memory, or in other words, disable the boundary check.

Another interesting case is another customer's product, where a Wasm application and its native libraries use virtual memory and DMA to access two different address spaces.

@woodsmc
Contributor

woodsmc commented Dec 20, 2024

Thanks for raising this; you're right, @lukewagner, we could do more to clarify this. This is a good issue to bounce off the E-SIG mailing list as well, and I'll do that. It would also be useful to bring this up on Zulip, just to get additional eyeballs on it.

E-SIG Hardware Platforms and Memory Control Hardware

The E-SIG stated that the devices we should target would come with a minimum of 512 KB of RAM and 1 MB of storage. Finding devices with this much RAM without any form of hardware memory protection was hard: if a device has more than a handful of KB of RAM, it will undoubtedly come with either an MPU or an MMU. Based on the E-SIG's hardware list (see below), of the 5 platforms we've selected, 3 come with an MMU and 2 with an MPU.

What does this mean in practice?

So, what I'd love to get a clarification on (here and, ideally, in the doc) is whether the E-SIG believes that such shared mappings are indeed possible across the target platforms' MPUs or perhaps whether we want to raise the baseline to assume an MMU.

I think it is probably not possible to assume the presence of an MMU, and we should account for either an MMU or an MPU. We do not need to account for a world in which neither is present, as these devices are just too limited even for the E-SIG. However, I'd be delighted to open this up to debate and see what the other stakeholders in the E-SIG think - email promotion of this issue is coming ;)

There is an argument that the growing capabilities of MPUs and the growing prevalence of MMUs mean the need to support limited MPUs is shrinking - hence I'd like to open the discussion a little wider.

But, until others chip in, I'd assume that we should aim to support both an MMU and an MPU.

Picking Hardware for An Application

In my view, @lum1n0us's two examples represent the two extremes of the embedded world. The real truth is that the choice of hardware is application specific: if we've got an application that needs virtual memory, then selecting hardware with an MMU is preferable.

Impact on Wasm - Functional Equivalence, Not Identical Functionality (Similar to Linux's Approach?)

So, if we can't assume that there is something MMU-like on a device, what can we do? Well, looking at other projects, we can take a cue from Linux, which provides userland source, if not binary, compatibility.

Today it is possible to run Linux on devices without an MMU (uClinux, for example), and in those situations Linux can provide the running application functional compatibility, but without the same side effects. For instance, we can use mmap in a function that looks like it is going to memory-map access to a peripheral. If the device has an MMU, that is exactly what happens: the mapping is implemented via hardware. On a device without an MMU, the real physical address of the peripheral is simply returned, as there is no mapping to do. On devices with an MPU, this function can be used to reconfigure the MPU to allow the running application to see the physical address, even though no hardware mapping occurs. The kernel.org page here provides a pretty detailed explanation of what Linux is doing.
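
As a hedged illustration of that, a userland mapping of a peripheral through /dev/mem looks identical in source on both classes of device (the peripheral base address below is made up for the example):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const off_t  PERIPH_BASE = 0x40000000;  /* made-up peripheral address */
    const size_t PERIPH_LEN  = 0x1000;

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* With an MMU this installs a real page-table mapping; on a no-MMU
     * (uClinux-style) kernel the same call can simply hand back the
     * physical address, since there is nothing to map. */
    volatile uint32_t *regs = mmap(NULL, PERIPH_LEN, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, PERIPH_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t status = regs[0];  /* identical source either way */
    (void)status;

    munmap((void *)regs, PERIPH_LEN);
    close(fd);
    return 0;
}
```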

As @lum1n0us suggests, something similar is needed in the embedded Wasm world. The same code should function as expected, but it is acceptable to encounter performance impacts from manual copies of memory between locations, or perhaps, alternatively, to allow a weak sandbox where memory protection isn't present. But these are trade-offs that can be made based on the application use case and the hardware's cost point.

The important aspect is that the same code will operate successfully.

To aid in this discussion I've updated the selected hardware table (see below) with hardware management features.

ISA        Board / Purchasable Hardware   Memory Management Features
Xtensa     ESP32-S3-BOX-3                 MPU present (Tensilica Xtensa LX7 Processor Datasheet)
RISC-V 32  N/A - suggestions welcome      -
ARM 32     STM32H747AG                    MPU present (link)
RISC-V 64  PINE64 Ox64                    MMU present (RISC-V Ox64 BL808 SBC: Sv39 Memory Management Unit)
ARM 64     Raspberry Pi 3                 MMU present
x86-64     Lattepanda-v1                  MMU present

Hypothetically, it seems like an MPU could do this without blowing its limited hardware budget by including an "offset" field in its per-region configuration state that is added as part of address translation. With this hardware capability, we could, e.g., efficiently support a small finite number of "mmap()s" (or features that want to be implemented in terms of mmap()). However, scanning the docs of the popular ARM Cortex-M MPU, it seems like at least this one popular MPU can't do such shared mappings.

This is an interesting idea. Perhaps Stephen from Aytm would have a view on this? I think Aytm is mainly focusing on the ARM market at the moment.

@lukewagner
Author

(Thanks for all the info so far and looking forward to seeing all the other responses.)

@squillace

Per "The same code should function as expected, but it is acceptable to encounter performance impacts from manual copies of memory between locations, or perhaps, alternatively, to allow a weak sandbox where memory protection isn't present" and "in other words, disable the boundary check": I'm assuming this breaks the core spec, yes? Because understanding how to navigate the fact that it's not core-spec compatible is something we should tackle head-on.

For example, if:

  • we cannot run on a required hardware spec without breaking the core spec's security guarantees, and
  • we cannot figure out a path to modify things such that it can be done, then
  • it's possible we are creating the boundary "branding" between wasm and something else that is almost wasm but doesn't have the same security guarantees. It wouldn't be able to claim wasm runtime guarantees, though it could claim to execute wasm code.

There really are lots of angles to explore before we get to that point, but it certainly remains possible.

That issue aside -- because it may be a no-op if there are in fact ways to finesse this as the component model evolves -- the hardware specificity we can tackle together is a great result of this document. Without it, we can't really change the spec as needed to address the issues. I love that the SIG is in fact digging into where things work, where they don't, and what the limits may actually be in the future ("We do not need to account for a world in which neither is present, as these devices are just too limited even for the E-SIG" is definitely a limit! :-) )

@woodsmc
Contributor

woodsmc commented Dec 21, 2024

Per "The same code should function as expected, but it is acceptable to encounter performance impacts from manual copies of memory between locations, or perhaps, alternatively, to allow a weak sandbox where memory protection isn't present" and "in other words, disable the boundary check": I'm assuming this breaks the core spec, yes? Because understanding how to navigate the fact that it's not core-spec compatible is something we should tackle head-on.

I was thinking more along the lines of what Linux is doing, using it as a bit of inspiration: there are two behaviours for the same function call. I'll explain the thought exercise here, but it could equally well live in the discussion over on the structured memory issue @lukewagner pointed to.

Considering Linux:

  1. On an MMU - The mmap call on an MMU device does map the memory range into the calling process's memory space.
  2. On an MPU - The mmap call on an MPU device returns the raw memory address (no mapping) but could, and should, where possible, provide memory protection around accesses to that memory.

The result from the caller's perspective is the same: they have access to the memory address they require. But the implementation is, of course, different.

Thinking about Wasm - Memory Protection for the Pointer Returned by mmap on MPU Devices
This is of course how native, non-wasm code behaves. In a Wasm runtime we'd need to somehow track and accommodate that mmap pointer; there would be some housekeeping around it. This is worth a bit of brainstorming here to really explore the options. But it could be the case that the MPU is limited and cannot provide guarded access to the raw pointer returned by the call to mmap. Here (again, this is the thought experiment) we could do one of two things:

  1. Drop the memory protection - this is roughly what @lum1n0us indicated happens today for some WAMR users, and hey, this is alright because the hardware is application specific and the risks are known and understood. But I do see potential branding issues around claiming this as a core-wasm / WASI concept.
  2. Enforce software bounds checking for memory accesses through this pointer to host memory - this isn't possible today without a core WASM change, but it could be achieved with a fat pointer, which would include a flag saying it is a native pointer (not an offset inside a memory index) along with the bounds-checking information: the permitted range. When these pointers are dereferenced, they could then be verified (see the sketch below). This is what CHERI is doing in HW. Yes, there would be a performance hit, and that would be acceptable; if you want the performance hit to go away, then use an MMU.
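
Here's a minimal sketch of that fat-pointer idea, with invented names and a hand-rolled software check; real CHERI capabilities are hardware-enforced and considerably richer than this:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Invented fat-pointer representation for the thought experiment. */
typedef struct {
    uintptr_t addr;      /* native address returned by mmap */
    uintptr_t base;      /* start of the permitted range */
    size_t    length;    /* extent of the permitted range */
    bool      is_native; /* native host pointer vs. wasm memory offset */
} fat_ptr;

/* Software bounds check on every dereference (a runtime would dispatch
 * on is_native first; overflow edge cases are ignored for brevity). */
static uint8_t fat_load_u8(fat_ptr p, size_t offset) {
    uintptr_t target = p.addr + offset;
    if (target < p.base || target >= p.base + p.length) {
        abort();  /* out-of-bounds: trap instead of touching host memory */
    }
    return *(const uint8_t *)target;
}
```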

It is also this line of thinking (2) that pushes us toward the concept of just having a second memory, mapped with mmap.

Note this is just brainstorming - all thought experiments - and I can see avenues in which support from core-wasm would really help. That aside, I hope this brainstorming helps explain how some folks are approaching the constraints. Certainly, the performance hit is totally acceptable: in the embedded world the hardware is chosen for a specific application; it's not a general compute device. Therefore, if you really need to do a lot of memory sharing, there is an argument that says you should stump up and pay for hardware with an MMU, or accept the limitation.

@srberard

srberard commented Jan 7, 2025

A couple of comments here.

First, regarding @squillace's comment about it breaking the core spec: in my opinion, if the necessary HW is not provided, it would be acceptable to have a performance impact, provided the same code functions as expected and we maintain compliance with the core spec. I would also add that, optionally, the core spec requirements could be relaxed, but this cannot be the default behavior of the runtime. This would enable someone to make an explicit decision on their platform about relaxing such requirements and whether or not that is appropriate for their specific use case. Again, this needs to be an explicit option and not default behavior.

Second, regarding RISC-V32: we have just started working with the ESP32-C6 series chips. These are dual-core 32-bit RISC-V CPUs and come in a number of configurations; we're currently using the DevKitC model. From what I can tell, this does not have an MMU; instead, it has a physical memory protection (PMP) feature which appears to provide similar, but more limited, support for protecting memory, much like an MPU.

Links:
ESP32-C6 Series Datasheet
CPU Technical Reference Manual
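
For comparison with an MPU, here's a rough sketch of programming one PMP entry on RV32, following the RISC-V privileged spec's NAPOT encoding; treat the details as illustrative and check the C6's technical reference manual before relying on them:

```c
#include <stdint.h>

/* pmpcfg permission/mode bits from the RISC-V privileged spec. */
#define PMP_R     (1u << 0)
#define PMP_W     (1u << 1)
#define PMP_X     (1u << 2)
#define PMP_NAPOT (3u << 3)   /* naturally aligned power-of-two region */

/* Grant R/W (no execute) over a power-of-two-sized, aligned region.
 * NAPOT encoding: pmpaddr = (base | (size/2 - 1)) >> 2.
 * Must run in M-mode; writing all of pmpcfg0 also clobbers entries 1-3,
 * which a real system would preserve. */
static void pmp_allow_rw(uintptr_t base, uintptr_t size) {
    uintptr_t addr = (base | (size / 2 - 1)) >> 2;
    __asm__ volatile("csrw pmpaddr0, %0" :: "r"(addr));
    __asm__ volatile("csrw pmpcfg0, %0"
                     :: "r"((uintptr_t)(PMP_R | PMP_W | PMP_NAPOT)));
}
```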

@woodsmc
Contributor

woodsmc commented Jan 14, 2025

First, regarding @squillace's comment about it breaking the core spec: in my opinion, if the necessary HW is not provided, it would be acceptable to have a performance impact, provided the same code functions as expected and we maintain compliance with the core spec. I would also add that, optionally, the core spec requirements could be relaxed, but this cannot be the default behavior of the runtime. This would enable someone to make an explicit decision on their platform about relaxing such requirements and whether or not that is appropriate for their specific use case. Again, this needs to be an explicit option and not default behavior.

Agreed. This makes sense.

Second, regarding RISC-V32: we have just started working with the ESP32-C6 series chips. These are dual-core 32-bit RISC-V CPUs and come in a number of configurations; we're currently using the DevKitC model. From what I can tell, this does not have an MMU; instead, it has a physical memory protection (PMP) feature which appears to provide similar, but more limited, support for protecting memory, much like an MPU.

Ok, great! - I'll get an update to the document!

@lukewagner
Author

(Back from holidays and catching up) Thanks for the replies!

So, summarizing what I'm seeing above to see if this makes sense to everyone (let me know if not):

  • We (still) do not want to assume MMU hardware.
  • We cannot assume that MPU hardware allows even limited ability to alias two virtual addresses to the same physical RAM.
  • We want default configurations to respect the "wasm" brand which means (however implemented) sandboxing to ensure that, by default, an errant core wasm load/store cannot poke random host bytes.
  • We accept that, when hardware support is absent, there will be perf overhead to enforce said sandboxing.

A follow-up question: when we say we'll accept "perf overhead" for sandboxing without hardware support, it seems like there are (at least) two potential degrees of perf overhead worth considering:

  • the modest overhead of, say, an extra branch or ALU op or two, as we see with the static protection subproposal (in the absence of MMU support) or this shared-heap WAMR feature (roughly the check sketched below); and
  • the significant overhead of, say, a software-emulated page table.
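
For a sense of scale, the "modest" case is roughly one compare-and-branch per access; here is a sketch (with invented engine-internal names) of what a runtime might do inline when it can't lean on MMU guard pages:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Invented engine-internal view of a wasm linear memory. */
typedef struct {
    uint8_t *base;
    size_t   length;
} wasm_memory;

/* Without an MMU (no guard pages), each load pays one extra
 * compare-and-branch; with an MMU, engines typically reserve guard
 * pages and let the hardware fault out-of-bounds accesses instead. */
static inline uint32_t wasm_load_i32(const wasm_memory *m, uint32_t addr) {
    if ((uint64_t)addr + sizeof(uint32_t) > m->length) {
        abort();  /* trap: out-of-bounds wasm load */
    }
    uint32_t v;
    memcpy(&v, m->base + addr, sizeof v);  /* alignment-safe load */
    return v;
}
```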

If we accept significant overhead, that obviously gives us more flexibility in what we consider an acceptable solution, but my worry is that it'll be so slow as to make default (standards-compliant) wasm a nonviable option in practice. So my impression is that we should only accept modest performance overhead when an MMU is not available. This also speaks to the constraints brought up in the last CG presentation: even outside embedded scenarios where an MMU is not physically present, there are plenty of other wasm execution contexts where an MMU is inaccessible to the wasm engine.

Thoughts?
