Conversation

folkertdev (Contributor)

tracking issue: #146941
acp: rust-lang/libs-team#638

Note that we don't expose prefetch_write_instruction; that one doesn't really make sense in practice.

The implementation is straightforward; the docs can probably use some tweaks, especially for the instruction version, where the wording is a little awkward.

r? @Amanieu

@rustbot added labels on Sep 23, 2025: S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties), T-libs (Relevant to the library team, which will review and decide on the PR/issue).

/// Passing a dangling or invalid pointer is permitted: the memory will not
/// actually be dereferenced, and no faults are raised.
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_instruction<T>(ptr: *const T, locality: Locality) {

Member

Shouldn't this be ptr: unsafe fn() or something, since some platforms have different data and instruction pointer sizes?

Member

On some platforms a function pointer doesn't point directly to the instruction bytes, but rather to a function descriptor, which consists of a pointer to the first instruction and some value that needs to be loaded into a register. On these platforms using unsafe fn() would be incorrect. Itanium is an example, but I know there are more architectures that do this.

Member

OK, but that doesn't mean *const T is correct.

folkertdev (Contributor, author)

Ultimately all you need is an address, so *const T seemed the simplest way of achieving that.

programmerjake (Member), Sep 24, 2025

But *const T may be too small: e.g. on 16-bit x86 in the medium model, a data pointer is 16 bits but an instruction pointer is 32 bits.

There are some AVR CPUs (not currently supported by Rust?) which need >16 bits for instruction addresses but not for data, so they might have the same issue: https://en.wikipedia.org/wiki/Atmel_AVR_instruction_set#:~:text=Rare)%20models%20with,zero%2Dextended%20Z.)

programmerjake (Member), Sep 24, 2025

> Hmm, well in that case the current implementation would not do anything; the LLVM intrinsic accepts a ptr:
>
> https://llvm.org/docs/LangRef.html#llvm-prefetch-intrinsic

The ptr can have different address spaces, e.g. https://github.com/llvm/llvm-project/blob/37de695cb1de65dd16f589cdeae50008d1f91d4d/llvm/test/CodeGen/AMDGPU/llvm.prefetch.ll
Each address space can have pointers of a different size (e.g. AMDGPU), and code pointers can be in a different address space than data pointers.

I'm fine with something like *const Code for now, where Code is the extern type in ACP 589, since we can add a type alias and change the type on platforms where that doesn't work; for details see rust-lang/libs-team#589.

Member

> These are also some fairly niche targets, for which I'm assuming prefetching (generally a performance measure) isn't very relevant.

16-bit x86 can definitely benefit from prefetching -- it can still run on all modern x86 CPUs.

folkertdev (Contributor, author), Sep 25, 2025

Does that ACP actually use the LLVM address spaces? It's not really clear from the design. Also it looks like it was never actually nominated for T-lang?

Member

LLVM address space usage is dictated by the target. That ACP doesn't use non-default address spaces because, for all existing targets, a NonNull<Code> is sufficient for function addresses (AVR just uses 16-bit pointers for both code and data, and AFAIK LLVM doesn't currently support >16-bit pointers). However, the plan is to add a type BikeshedFnAddr and switch to using that whenever we add a target where that's insufficient.
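
To make the shape concrete, a minimal sketch using only what's stated above; extern types are unstable, and the Code and FnAddr names are placeholders from the ACP discussion, not settled API:

#![feature(extern_types)]

extern "C" {
    // Opaque type representing instruction bytes; you can only form pointers to it.
    type Code;
}

// Sufficient for all current targets per the comment above; would become a
// platform-dependent alias (e.g. a wider BikeshedFnAddr) where that's not enough.
type FnAddr = core::ptr::NonNull<Code>;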

Member

AVR does use ptr addrspace(1) for function pointers: https://rust.godbolt.org/z/3hGPfKvfG

Amanieu (Member) commented Sep 24, 2025

After thinking about this for a bit: NonTemporal should be split out into a separate Retention enum, since it is orthogonal to the locality at which to prefetch the data. Specifically:

  • "locality" refers to how soon we are going to need this data. This corresponds to the cache level into which we are prefetching.
  • "retention" refers to how long the data should be kept in cache. A non-temporal access includes a hint to the cache that the line should be evicted before any other cache lines. Non-temporal hints indicate memory that is accessed only once, after which it should not be kept in the cache any more.

So I would rework the API to something like this:

#[non_exhaustive]
pub enum Locality {
    L1,
    L2,
    L3,
}

#[non_exhaustive]
pub enum Retention {
    Normal,
    NonTemporal,
}

pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality, retention: Retention);

Even though not all of these map to the underlying LLVM intrinsic today, they may do so in the future.

/// Passing a dangling or invalid pointer is permitted: the memory will not
/// actually be dereferenced, and no faults are raised.
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_write_data<T>(ptr: *mut T, locality: Locality) {

bjorn3 (Member), Sep 24, 2025

Maybe make Locality a const generic?

folkertdev (Contributor, author)

Enums cannot be const-generic parameters at the moment (on stable, anyway). We model the API here after atomic operations where the ordering parameter behaves similarly.
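
For comparison, this mirrors how the stable atomics API takes its ordering as an ordinary enum value at runtime:

use std::sync::atomic::{AtomicUsize, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn bump() -> usize {
    // `Ordering` is a plain enum argument rather than a const generic;
    // `Locality` in this PR is passed the same way.
    COUNTER.fetch_add(1, Ordering::Relaxed)
}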

folkertdev (Contributor, author)

> Even though not all of these map to the underlying LLVM intrinsic today, they may do so in the future.

Maybe my understanding of NonTemporal is wrong, but I believe it means that the cache hierarchy should be skipped entirely. So then combining that with a Locality is completely meaningless, right?

It can be implemented (we'd just ignore weird/invalid combinations, I guess) but from an API perspective it seems weird.

Amanieu (Member) commented Sep 24, 2025

No, non-temporal is a hint that the data is likely only going to be accessed once. Essentially, if you have data that you're only reading once, you'll want to prefetch it all the way to L1, but then mark that cache line as the first to be evicted if needed, since you know it won't be needed in the future. See https://stackoverflow.com/questions/53270421/difference-between-prefetch-and-prefetchnta-instructions for details of how this works on x86 CPUs.
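
For a concrete illustration of the difference, the platform-specific intrinsics in std::arch already expose both hints on x86-64; this is a sketch of the existing intrinsics, not the API proposed in this PR:

#[cfg(target_arch = "x86_64")]
fn prefetch_reuse(p: *const u8) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // Pull the line into L1 and keep it in the cache hierarchy as usual.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(p as *const i8) };
}

#[cfg(target_arch = "x86_64")]
fn prefetch_use_once(p: *const u8) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_NTA};
    // Pull the line in close to the core, but mark it for early eviction
    // because it will only be read once.
    unsafe { _mm_prefetch::<_MM_HINT_NTA>(p as *const i8) };
}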

folkertdev (Contributor, author)

So that means something like this?

#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality, retention: Retention) {
    match retention {
        Retention::NonTemporal => {
            return intrinsics::prefetch_read_data::<T, { Retention::NonTemporal as i32 }>(ptr);
        }
        Retention::Normal => { /* fall through */ }
    }

    match locality {
        Locality::L3 => intrinsics::prefetch_read_data::<T, { Locality::L3 as i32 }>(ptr),
        Locality::L2 => intrinsics::prefetch_read_data::<T, { Locality::L2 as i32 }>(ptr),
        Locality::L1 => intrinsics::prefetch_read_data::<T, { Locality::L1 as i32 }>(ptr),
    }
}

This is really tricky to document: users basically have to look at the implementation to see what happens exactly. Also, every call getting the additional retention parameter is kind of unfortunate.

Amanieu (Member) commented Sep 26, 2025

My main concern is that the cache level to prefetch into should not be mixed with the retention hint. It should be a separate parameter or a separate function altogether.

folkertdev (Contributor, author)

In that case I think, given current hardware support at least, that a separate function would be better:

#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality) {
    match locality {
        Locality::L3 => intrinsics::prefetch_read_data::<T, { Locality::L3 as i32 }>(ptr),
        Locality::L2 => intrinsics::prefetch_read_data::<T, { Locality::L2 as i32 }>(ptr),
        Locality::L1 => intrinsics::prefetch_read_data::<T, { Locality::L1 as i32 }>(ptr),
    }
}

#[inline(always)]
#[unstable(feature = "hint_prefetch", issue = "146941")]
pub const fn prefetch_read_data_nontemporal<T>(ptr: *const T) {
    return intrinsics::prefetch_read_data::<T, { Retention::NonTemporal as i32 }>(ptr);
}

That does potentially close some doors for weird future hardware designs, but as a user I think separate functions are simpler.
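
To illustrate how that reads at a call site, a sketch using the names above (still unstable behind hint_prefetch; the exact module path isn't settled in this thread):

// Sketch only: assumes `prefetch_read_data`, `prefetch_read_data_nontemporal`
// and `Locality` from the proposal above are in scope.
fn sum(values: &[u64]) -> u64 {
    const AHEAD: usize = 8; // how many elements ahead of the current one to prefetch
    let mut total = 0;
    for (i, v) in values.iter().enumerate() {
        if let Some(next) = values.get(i + AHEAD) {
            // Each element is read exactly once here, so the non-temporal
            // variant applies; `prefetch_read_data(next, Locality::L1)` would
            // be the choice if the data were reused soon after.
            prefetch_read_data_nontemporal(next);
        }
        total += v;
    }
    total
}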

rustbot (Collaborator) commented Oct 4, 2025

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

Amanieu (Member) commented Oct 5, 2025

The reason I argued for a separate argument is that it's possible LLVM will add support for specifying a cache level for non-temporal prefetches in the future. It also makes the API more symmetrical.

Alternatively, we could also decide to only expose prefetch hints with no extra arguments and point people to platform-specific hints in std::arch for more detailed hints.

folkertdev (Contributor, author)

How heavily should we weigh potential future LLVM additions? Apparently no current architecture provides the fine-grained control of picking the cache level for non-temporal reads. So we're trading additional complexity for everyone versus a hypothetical future CPU capability.

Also, we've gotten this far without prefetching at all. I suspect that in practice the vast majority of uses will just be "load into L1", perhaps with some "load into L2". The heavily specialized stuff can probably just be left to stdarch.

The current implementation in this PR is to have:

pub const fn prefetch_read_data<T>(ptr: *const T, locality: Locality);
pub const fn prefetch_read_data_nontemporal<T>(ptr: *const T);

I've left out the non-temporal variants for write and read_instruction for now; from what I can tell they don't actually seem that useful, and can probably be left to stdarch unless someone has an actual use case.

programmerjake (Member) commented Oct 5, 2025

For streaming writes where you're unlikely to access the written data again in the near future, prefetch_write_data_nontemporal seems useful; at least it doesn't have crazy semantics like non-temporal stores do.

programmerjake (Member)

Also, for naming, IMO we should leave out _data since data prefetches are likely way more common than _instruction, so it makes a good default.

Amanieu (Member) commented Oct 5, 2025

> How heavily should we weigh potential future LLVM additions? Apparently no current architecture provides the fine-grained control of picking the cache level for non-temporal reads. So we're trading additional complexity for everyone versus a hypothetical future CPU capability.

AArch64 has this capability, see https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/PRFM--immediate---Prefetch-Memory--immediate--

folkertdev (Contributor, author)

> For streaming writes where you're unlikely to access the written data again in the near future, prefetch_write_data_nontemporal seems useful; at least it doesn't have crazy semantics like non-temporal stores do.

Can't you just do the non-temporal store? What benefit does a prefetch provide here?

> Also, for naming, IMO we should leave out _data since data prefetches are likely way more common than _instruction, so it makes a good default.

Yeah, I had been thinking that too; I'll change that.

> AArch64 has this capability

You can encode it in the instruction; I haven't been able to figure out whether it actually does anything in practice.

We can add the locality argument to the non-temporal function(s), though; I'd be OK with that, given that non-temporal is even more niche than standard prefetches.

programmerjake (Member) commented Oct 5, 2025

> Can't you just do the non-temporal store?

Non-temporal stores break the memory model on x86: see llvm/llvm-project#64521 and #114582.
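
For context, the x86 problem referenced there is that movnt stores are weakly ordered and need an explicit sfence to restore the usual guarantees; a minimal sketch with the stable std::arch intrinsics:

#[cfg(target_arch = "x86_64")]
fn stream_fill(dst: &mut [i32], value: i32) {
    use std::arch::x86_64::{_mm_sfence, _mm_stream_si32};
    for slot in dst.iter_mut() {
        // Non-temporal store: writes around the cache and is weakly ordered.
        unsafe { _mm_stream_si32(slot, value) };
    }
    // Without this fence, a later release store (e.g. a "ready" flag) could
    // become visible to other threads before the streamed data.
    unsafe { _mm_sfence() };
}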

folkertdev (Contributor, author)

And then the idea is that a non-temporal prefetch write hint plus a standard write will in effect create a well-behaved non-temporal store?

programmerjake (Member)

> And then the idea is that a non-temporal prefetch write hint plus a standard write will in effect create a well-behaved non-temporal store?

Maybe, depending on the arch? It at least won't break the memory model.
