Add assembly version of simple operations on aarch64 #459

Open, 1 commit into base: master

tgross35
Contributor

@tgross35 tgross35 commented Jan 23, 2025

For aarch64 and arm64ec with Neon, add assembly versions of the following:

  • ceil
  • ceilf
  • fabs
  • fabsf
  • floor
  • floorf
  • fma
  • fmaf
  • round
  • roundf
  • sqrt
  • sqrtf
  • trunc
  • truncf
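As a rough illustration of what an assembly version of one of these operations could look like, here is a minimal sketch for `sqrt`. The name `sqrt_sketch` and the exact `cfg` gating are assumptions made for this example and are not taken from the PR. `preserves_flags` is left out conservatively, since the instruction may update the floating-point status register.

```rust
/// Hypothetical helper: square root via the AArch64 `fsqrt` instruction.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn sqrt_sketch(mut x: f64) -> f64 {
    use core::arch::asm;
    // SAFETY: `fsqrt` only reads and writes the named register and does not
    // touch memory or the stack.
    unsafe {
        asm!(
            "fsqrt {x:d}, {x:d}",
            x = inout(vreg) x,
            // `preserves_flags` is intentionally omitted (assumption: FP
            // exceptions may be recorded in FPSR).
            options(pure, nomem, nostack),
        );
    }
    x
}

/// Fallback for every other target so the sketch builds anywhere.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
pub fn sqrt_sketch(x: f64) -> f64 {
    x.sqrt()
}
```

The other single-instruction operations in the list would presumably follow the same shape with a different mnemonic.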

If the fp16 target feature is available, which implies neon, also include the following:

  • ceilf16
  • fabsf16
  • floorf16
  • rintf16
  • sqrtf16
  • truncf16

Additionally, replace core::arch versions of the following with handwritten assembly (which avoids issues with aarch64be):

  • rint
  • rintf

Instructions for fmax and fmin are also available but seem to provide different results based on whether NaN inputs are signaling or quiet. Our current implementation does not do this, so omit these for now.

@tgross35 tgross35 force-pushed the aarch64-asm branch 4 times, most recently from 5d0075b to 51718a1 Compare January 23, 2025 02:12
@tgross35
Contributor Author

@Amanieu would you mind double checking the assembly in src/math/arch/aarch64.rs? I am unsure whether preserves_flags should be set; I believe some of these operations may set flags based on the exception control register.

Cc @hanna-kruppe, while I was working on the others I also replaced the rint vector implementation.

Member

@Amanieu Amanieu left a comment

For fmin/fmax you can use the fminnm/fmaxnm instructions which map to IEEE minNum/maxNum.
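A minimal sketch of the `fmaxnm` suggestion, assuming the same inline-assembly style as the rest of the PR. The name `fmax_sketch` is hypothetical, and the fallback relies on `f64::max`, which also returns the non-NaN operand when exactly one input is NaN.

```rust
/// Hypothetical helper: maximum via `fmaxnm`, which returns the numeric
/// operand when the other operand is a quiet NaN.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn fmax_sketch(a: f64, b: f64) -> f64 {
    use core::arch::asm;
    let r: f64;
    // SAFETY: the instruction only touches the named registers.
    unsafe {
        asm!(
            "fmaxnm {r:d}, {a:d}, {b:d}",
            r = lateout(vreg) r,
            a = in(vreg) a,
            b = in(vreg) b,
            options(pure, nomem, nostack),
        );
    }
    r
}

/// Fallback for other targets: `f64::max` ignores a single NaN operand.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
pub fn fmax_sketch(a: f64, b: f64) -> f64 {
    a.max(b)
}
```

With a quiet NaN such as `f64::NAN`, `fmax_sketch(f64::NAN, 1.0)` gives `1.0` on both paths. Signaling NaN inputs are not covered by this sketch.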

Also you will want to make these impls conditional on the fp target feature so that these are not used on soft-float targets.

However I'm then questioning how useful these are on hard-float targets: the standard library will invoke the LLVM intrinsic which will lower to the instruction, so the libm function will never be called. If this is only for compiler-builtins then it might be better to keep libm soft-float only.

pub fn rint(mut x: f64) -> f64 {
unsafe {
asm!(
"frintx {x:d}, {x:d}",

Member

You want either:

  • frintn to use round-to-nearest, ties to even.
  • frinti to use the current rounding mode in fpcr.

Contributor Author

Thanks, rint should follow the rounding mode, so I guess frinti is closer to the C spec. Updated.

Contributor Author

Actually rint may optionally raise FE_INEXACT so I think frintx might have worked? Irrelevant for Rust in any case.

Contributor

If this is only for Rust and should never be picked up by C code, I’d argue for frintn. Rust does not support other rounding modes nor FP exceptions, and if someone ignores that and e.g. causes UB by configuring a non-default rounding mode then it’s better if they get unexpected results immediately than if it appears to work in simple cases.

Contributor

(If the symbols from libm-via-compiler_builtins do get picked up by C code compiled with FENV_ACCESS enabled, then there’s a much bigger problem because none of the Rust code in libm can support that.)
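For reference, a small sketch of the `frintn` option discussed in this thread: round to nearest with ties to even, independent of the rounding mode in fpcr. The name `rint_sketch` is hypothetical, and the fallback assumes a toolchain that provides `f64::round_ties_even` (stable since Rust 1.77).

```rust
/// Hypothetical helper: round to nearest, ties to even, via `frintn`.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn rint_sketch(mut x: f64) -> f64 {
    use core::arch::asm;
    // SAFETY: the instruction only touches the named register.
    unsafe {
        asm!(
            "frintn {x:d}, {x:d}",
            x = inout(vreg) x,
            options(pure, nomem, nostack),
        );
    }
    x
}

/// Fallback for other targets with the same ties-to-even behaviour.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
pub fn rint_sketch(x: f64) -> f64 {
    x.round_ties_even()
}
```

The result only differs from `round` on exact halves: `rint_sketch(2.5)` is `2.0`, while `2.5_f64.round()` is `3.0`.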

@hanna-kruppe
Contributor

hanna-kruppe commented Jan 23, 2025

> However I'm then questioning how useful these are on hard-float targets: the standard library will invoke the LLVM intrinsic which will lower to the instruction, so the libm function will never be called. If this is only for compiler-builtins then it might be better to keep libm soft-float only.

At least some of these are used internally within libm by functions that still need to exist on hard-float targets. For example, floor is used by rem_pio2_large which is needed by many trigonometric functions.

(Plus the benefits for non-compiler-builtins consumers, who are not the main point of this crate but it’s still nice-to-have.)

@Amanieu
Member

Amanieu commented Jan 23, 2025

For the operations that are used internally, the ideal end state that we want is for libm to use the float methods from core, which will then be lowered by LLVM to the appropriate instructions.

@hanna-kruppe
Contributor

Is there any harm in taking the improvement now and revisiting once those methods are actually available in core?

For aarch64 and arm64ec with Neon, add assembly versions of the
following:

* `ceil`
* `ceilf`
* `fabs`
* `fabsf`
* `floor`
* `floorf`
* `fma`
* `fmaf`
* `round`
* `roundf`
* `sqrt`
* `sqrtf`
* `trunc`
* `truncf`

If the `fp16` target feature is available, which implies `neon`, also
include the following:

* `ceilf16`
* `fabsf16`
* `floorf16`
* `rintf16`
* `roundf16`
* `sqrtf16`
* `truncf16`

Additionally, replace `core::arch` versions of the following with
handwritten assembly (which avoids issues with `aarch64be`):

* `rint`
* `rintf`

Instructions for `fmax` and `fmin` are also available but seem to
provide different results based on whether NaN inputs are signaling or
quiet. Our current implementation does not do this, so omit these for
now.
@tgross35
Contributor Author

My only motivation here is fma: some of the incoming CORE-math routines rely on it, and I wanted a more accurate icount comparison without soft fma before mul_add is available in core. Nothing else is important; I just included the other simple ops since they are reasonably trivial.
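To illustrate why a hardware fma matters for routines that depend on it, here is a minimal sketch of a fused multiply-add using the AArch64 `fmadd` instruction. The name `fma_sketch` is hypothetical, and the fallback uses `f64::mul_add` from std, which is an assumption about the build environment and not something this crate can rely on in `core`.

```rust
/// Hypothetical helper: computes `x * y + z` with a single rounding.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub fn fma_sketch(x: f64, y: f64, z: f64) -> f64 {
    use core::arch::asm;
    let r: f64;
    // SAFETY: the instruction only touches the named registers.
    unsafe {
        asm!(
            "fmadd {r:d}, {x:d}, {y:d}, {z:d}",
            r = lateout(vreg) r,
            x = in(vreg) x,
            y = in(vreg) y,
            z = in(vreg) z,
            options(pure, nomem, nostack),
        );
    }
    r
}

/// Fallback for other targets: `mul_add` is also a fused operation.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
pub fn fma_sketch(x: f64, y: f64, z: f64) -> f64 {
    x.mul_add(y, z)
}
```

The single rounding is observable: with `a = 1.0 + f64::EPSILON` and `c = -(1.0 + 2.0 * f64::EPSILON)`, the unfused `a * a + c` is `0.0`, while `fma_sketch(a, a, c)` recovers the low-order term `f64::EPSILON * f64::EPSILON`.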
