cpu/drcbearm64.cpp: Optimised conditional operations using carry flag #13484
Track the state of the native carry flag to avoid unnecessarily manipulating the native NZCV register. If the native carry flag does not correspond to the UML carry flag, test the bit in the flags register for the C and NC conditions. Fixed EXIT with C/NC/A/BE condition not working properly if it doesn't immediately follow a CMP or SUB. Extended reach of conditional EXIT to +/-128MiB (was +/-1MiB for conditions other than U/NU). Moved code to set up skipping conditional instructions to a common function. Use TBZ/TBNZ for short backward jumps with U/NU/C/NC conditions to save one instruction and a temporary register.
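As a rough illustration of the carry tracking described above, here is a minimal C++ sketch. All names (`NativeCarry`, `UmlCond`, `branch_for` and the helper functions) are hypothetical and not taken from drcbearm64.cpp. It assumes the UML C flag means "borrow" after a subtraction and sits in bit 0 of the flags register, and it returns instruction mnemonics as strings rather than emitting code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch; none of these names come from drcbearm64.cpp.

// How the native AArch64 C flag currently relates to the UML C flag.
enum class NativeCarry { Unknown, SameAsUml, InverseOfUml };

enum class UmlCond { C, NC };

// AArch64 ADDS sets C to the unsigned carry-out.
inline bool native_c_after_adds(std::uint32_t a, std::uint32_t b)
{
    return ((std::uint64_t(a) + b) >> 32) != 0;
}

// AArch64 SUBS/CMP sets C when there is NO borrow, i.e. a >= b unsigned.
inline bool native_c_after_subs(std::uint32_t a, std::uint32_t b)
{
    return a >= b;
}

// Assumed UML meaning of C after SUB/CMP: borrow (a < b unsigned).
inline bool uml_c_after_sub(std::uint32_t a, std::uint32_t b)
{
    return a < b;
}

// Pick a branch for a UML C/NC condition given the tracked state.
inline std::string branch_for(UmlCond cond, NativeCarry state)
{
    const bool want_set = (cond == UmlCond::C);
    switch (state)
    {
    case NativeCarry::SameAsUml:    return want_set ? "b.cs" : "b.cc";
    case NativeCarry::InverseOfUml: return want_set ? "b.cc" : "b.cs";
    case NativeCarry::Unknown:      break;
    }
    // Native flag is stale: test the stored flag bit instead (bit 0 is assumed).
    return want_set ? "tbnz x28, #0" : "tbz x28, #0";
}
```

Under these assumptions, a native SUBS leaves the tracker in the `InverseOfUml` state, so a following UML C condition can become a plain `b.cc` without touching NZCV. Any flag-affecting operation or label would reset the tracker to `Unknown`.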
Avoids an unnecessary register copy when one operand is in memory and the other is a small immediate value. Also fixed another unnecessary register copy for SUB[B] when an operand is kept in a host register.
Conditional forms of MOV and FMOV can be efficiently implemented using conditional select instructions when the condition is handled using native NZCV flags, the destination UML register is kept in a host register, and the source is kept in a host register or is a simple immediate value. Immediate integer -1/0/1 can be generated directly using CSINV/CSEL/CSINC with xzr/wzr as a source operand.
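To make the -1/0/1 trick concrete, the following C++ sketch models the three conditional select instructions as plain functions and shows one way a conditional MOV of a small immediate could map onto them. The function names and the `cond_mov_imm` helper are hypothetical; only the instruction semantics (CSEL selects, CSINC adds one to the second operand, CSINV inverts it) come from the AArch64 architecture.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model; the zero register always reads as 0.
constexpr std::uint64_t XZR = 0;

// CSEL  Rd, Rn, Rm, cond : Rd = cond ? Rn : Rm
constexpr std::uint64_t csel(bool cond, std::uint64_t rn, std::uint64_t rm)
{
    return cond ? rn : rm;
}

// CSINC Rd, Rn, Rm, cond : Rd = cond ? Rn : Rm + 1
constexpr std::uint64_t csinc(bool cond, std::uint64_t rn, std::uint64_t rm)
{
    return cond ? rn : rm + 1;
}

// CSINV Rd, Rn, Rm, cond : Rd = cond ? Rn : ~Rm
constexpr std::uint64_t csinv(bool cond, std::uint64_t rn, std::uint64_t rm)
{
    return cond ? rn : ~rm;
}

// Conditional "MOV dst, imm": dst keeps its value unless cond holds.
// Returns nullopt when the immediate needs a different strategy.
inline std::optional<std::uint64_t> cond_mov_imm(bool cond, std::uint64_t dst, std::int64_t imm)
{
    switch (imm)
    {
    case 0:  return csel(cond, XZR, dst);    // csel  dst, xzr, dst, cond
    case 1:  return csinc(!cond, dst, XZR);  // csinc dst, dst, xzr, !cond
    case -1: return csinv(!cond, dst, XZR);  // csinv dst, dst, xzr, !cond
    default: return std::nullopt;
    }
}
```

Note that the 1 and -1 cases use the inverted condition, because CSINC and CSINV modify the operand that is selected when the condition is false.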
I'll get the latest master and add this patch tonight before bed. Just a few things to do atm, will get it done though. |
No problem. There’s no deadline. If I’ve screwed something up, it might take a few iterations to sort out. |
Luckily it was a quick compile
|
Thanks.
Yeah, I’ve been working on keeping dependencies under control – this patch only requires one file to be recompiled. So that’s a 7% improvement for Did you play all three for a bit as a sanity check? |
Was over SSH. I'll play each for a bit and get back to you if I see any issues. |
Here we go, here’s
And this is the equivalent snippet for the E1-16 global register write, courtesy of
(The UML in comments to help you follow along was added in #13472.) That’s much prettier than it was before. |
For something meatier, Toy Fighter (Naomi) went from 170% to 196% on the M3 Max with this change. It now can run multi-second sequences of the attract mode without dropping a frame, which is awesome. |
Yay! Now if I could just work out where that sub-optimal |
fiveside and soldivid are fine. I'll be playing more of soldivid, it's a really decent game. coolmini doesn't start after selecting the tnt game: it says ready, then freezes. This is true for mame0275, current master and this pull request. It does work with -nodrc. |
The coolmini tnt minigame plays OK for me. |
Hmm, if it’s not a regression from this PR, it shouldn’t hold this up, but it will need to be tracked down later. It’s a bit concerning that it’s working on one ARM system and not another, though. I think the issue causing the sub-optimal
To solve this, something needs to mask 32-bit UML instructions’ immediate operands to 32 bits. I’m not going to try tackling that tonight. I still noticed some other fuckery I can do to slightly optimise |
Also put my name on the files - I think I've contributed enough to it now.
I cleared the cfg, hiscore and nvram just to be sure it wasn't an issue in there. It still happens on the Pi 5. Hopefully more people can test. Seems to be working after 1c86e1a for the Pi 5. |
Ah, dammit. I just realised it’s possible to optimise |
Here’s an example of conditional select instructions now being used in PowerPC exception handling:
Previously it looked like:
That should be less of a drain on branch prediction resources. And here it’s getting “clever” generating a constant 1 for a conditional
|
Just want to clarify: it is crashing even after today's latest commit. The other day I didn't skip the how-to-play screen and used the keys, and assumed it was working because the keys were taking input. It's the screen right after the how-to-play screen that crashes, when it says ready. I did record a video, but there is a 25 MB limit on here so I can't post it. |
If you look at the current code generated by the AArch64 recompiler back-end, you see stuff like this (example is the “fastram” checks for a 32-bit read in fiveside):
Notice that it:
Now this is pretty bad for performance. The msr is likely to cause a pipeline stall on the next conditional branch that depends on the condition codes (as opposed to conditional branches that depend on GPRs). Even if it doesn’t cause a pipeline stall, the only thing between the cset and the msr that can be reordered or parallelised is executing/retiring the mrs early. That means the branch will be liable for a five-cycle penalty on misprediction. This really takes the “fast” out of “fastram”.
You see similar patterns in other places; for example, here’s code generated by the Hyperstone E1 CPU core for setting a global register where it’s checking whether it needs to deal with an extended register or the status register:
So this pattern is happening pretty frequently, and often multiple times in succession.
Now it should be possible to avoid this:
- […] CMP. If they follow a CMP (or SUB/SUBB) with no intervening operations that affect flags or labels, the native carry flag will already be in the desired state, so there’s no need to mess with it.
- […] C or NC condition follows an ADD/ADDC with no intervening operations that affect flags or labels, it’s still possible to translate to a native conditional branch just by inverting the condition.
- […] C flag, you can implement the C and NC conditions in terms of the UML C flag stored in x28 and avoid the expensive system register modification.
This PR adds basic inter-instruction optimisation to deal with this. The state of the native carry flag as it corresponds (or doesn’t correspond) to the UML C flag is tracked, and the strategy for conditional operations is chosen accordingly.
There are also some other fixes, optimisations and adjustments:
- […] U/NU conditions has been slightly optimised.
- […] csel/csinc/csinv/fcsel) to implement conditional forms of MOV and FMOV when advantageous.
- […] EXIT would misbehave have been fixed (we never hit these because no CPU front-end uses conditional EXIT anyway).
- […] EXIT has been extended to ±128 MiB (it was ±1 MiB for most conditions, which is a bit short). It would be possible to make it a bit more efficient by checking the actual distance to the EXIT landing pad, but that would mean it couldn’t use the common skip setup function, and I really doubt it would be worthwhile, even if front-ends actually did use conditional EXIT.
- […] ADD/ADDC/SUB/SUBB code generation, avoiding an unnecessary move in some cases and reducing the number of temporary registers used in others.
It’s entirely possible I’ve missed something.
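The ±1 MiB and ±128 MiB figures mentioned above follow from the AArch64 branch encodings: B.cond carries a 19-bit word-scaled offset, B a 26-bit one, and TBZ/TBNZ a 14-bit one (±32 KiB). The sketch below checks a displacement against those widths and picks a branch form. The helper names and the `BranchForm` enum are hypothetical, and reaching further by skipping over an unconditional branch with the inverted condition is an assumption about how the extended reach could be obtained, not a description of the actual code in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names; only the immediate widths are architectural.

// True if a byte displacement fits a signed, word-scaled immediate of `bits` bits.
constexpr bool fits_branch(std::int64_t disp, unsigned bits)
{
    // 2^(bits-1) words of 4 bytes each gives 2^(bits+1) bytes.
    const std::int64_t limit = std::int64_t(1) << (bits + 1);
    return (disp % 4 == 0) && (disp >= -limit) && (disp < limit);
}

enum class BranchForm { TestBit, Conditional, SkipOverUnconditional, OutOfRange };

// single_bit_test: the condition can be decided from one bit of a register.
constexpr BranchForm pick_form(std::int64_t disp, bool single_bit_test)
{
    if (single_bit_test && fits_branch(disp, 14))
        return BranchForm::TestBit;                 // tbz/tbnz, +/-32 KiB
    if (!single_bit_test && fits_branch(disp, 19))
        return BranchForm::Conditional;             // b.cond, +/-1 MiB
    if (fits_branch(disp, 26))
        return BranchForm::SkipOverUnconditional;   // inverted test over b, +/-128 MiB
    return BranchForm::OutOfRange;
}
```

This sketch accepts a test-bit branch in either direction; the commit message only mentions using TBZ/TBNZ for short backward jumps, where the distance is already known when the branch is emitted.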
Can someone with some kind of ARM Cortex-A system set up for testing MAME (maybe @danmons or @grant2258 if you have some time) please check some things for me?
- […] fiveside (PowerPC 403), coolmini (E1-16) and soldivid (SH-2) are still working.
-bench 90 fiveside
-bench 90 coolmini
-bench 90 soldivid
-drc_log_native -bench 1 fiveside
-drc_log_native -bench 1 coolmini
-drc_log_native -bench 1 soldivid
If stuff’s broken, I’ll do my best to work with people to get it sorted out. I really don’t like leaving performance on the table.