Skip to content

Conversation

cuavas
Copy link
Member

@cuavas cuavas commented Mar 16, 2025

If you look at the current code generated by the AArch64 recompiler back-end, you see stuff like this (example is the “fastram” checks for a 32-bit read in fiveside):

    ands w19, w19, 0x7FFFFFFF               ; 737A0072
    nop                                     ; 1F2003D5
    mov w9, w19                             ; E903132A
    mov w10, 0xFF01FFFF                     ; CA1FA012
    cmp w9, w10                             ; 3F010A6B
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    mrs x12, 0xDA10                         ; 0C423BD5
    bfi x12, x28, 0x1D, 1                   ; 8C0363B3
    eor x12, x12, 0x20000000                ; 8C0163D2
    msr 0xDA10, x12                         ; 0C421BD5
    b.hi PC$1                               ; 08000054
    mov w9, w19                             ; E903132A
    mov w10, 0xFF000000                     ; 0AE0BF52
    cmp w9, w10                             ; 3F010A6B
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    mrs x12, 0xDA10                         ; 0C423BD5
    bfi x12, x28, 0x1D, 1                   ; 8C0363B3
    eor x12, x12, 0x20000000                ; 8C0163D2
    msr 0xDA10, x12                         ; 0C421BD5
    b.lo PC$1                               ; 03000054
    nop                                     ; 1F2003D5
    mov x9, 0x555515FA7750                  ; 09EA8ED249BFA2F2A9AACAF2
    ldr w19, [x9, w19]                      ; 336973B8
    ldp x29, x30, [sp], 0x10                ; FD7BC1A8
    ret x30                                 ; C0035FD6
PC$1:

Notice that it:

  • Does a comparison
  • Stashes the complement of the carry flag in x28 bit zero (it does this because it needs to emulate Intel/Motorola-style “arithmetic borrow” to satisfy UML)
  • Loads the NZCV system register into a GPR
  • Inserts the complement of the UML carry flag from x28 bit zero into it
  • Stores the updated value (actually unchanged) into the NZCV system register
  • Does a conditional branch
  • Then does the same thing again

Now this is pretty bad for performance. The msr is likely to cause a pipeline stall on the next conditional branch that depends on the condition codes (as opposed to conditional branches that depend on GPRs). Even if it doesn’t cause a pipeline stall, the only thing between the cset and the msr can be reordered or parallelised is executing/retiring the mrs early. That means the branch will be liable for a five-cycle penalty on misprediction. This really takes the “fast” out of “fastram”.

You see similar patterns in other places, for example here’s code generated by the Hyperstone E1 CPU core for setting a global register where it’s checking whether it needs to deal with an extended register or the status register:

    cmp w9, 0x10                            ; 3F410071
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    mrs x12, 0xDA10                         ; 0C423BD5
    bfi x12, x28, 0x1D, 1                   ; 8C0363B3
    eor x12, x12, 0x20000000                ; 8C0163D2
    msr 0xDA10, x12                         ; 0C421BD5
    b.hs PC$4                               ; 02000054
    mov w9, w23                             ; E903172A
    cmp w9, 1                               ; 3F050071
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    mrs x12, 0xDA10                         ; 0C423BD5
    bfi x12, x28, 0x1D, 1                   ; 8C0363B3
    eor x12, x12, 0x20000000                ; 8C0163D2
    msr 0xDA10, x12                         ; 0C421BD5
    b.hi PC$5                               ; 08000054
    b.eq PC$6                               ; 00000054
    mov w10, 0xFFFFFFFFFFFFFFFE             ; 2A008012
    ands w11, w24, w10                      ; 0B030A6A
    stur w11, [x27, 0xFFFFFFFFFFFFFF00]     ; 6B0310B8
    b PC$7                                  ; 00000014

So this pattern is happening pretty frequently, and often multiple times in succession.

Now it should be possible to avoid this:

  • Most of the time, conditional branches that depend on the carry flag immediately follow a CMP. If they follow a CMP (or SUB/SUBB) with no intervening operations that affect flags or labels, the native carry flag will already be in the desired state, so there’s no need to mess with it.
  • If a branch with the C or NC condition follows an ADD/ADDC with no intervening operations that affect flags or labels, it’s still possible to translate to a native conditional branch just by inverting the condition.
  • Even if the state of the native carry flag doesn’t correspond to the UML C flag, you can implement the C and NC conditions in terms of the UML C flag stored in x28 and avoid the expensive system register modification.

This PR adds basic inter-instruction optimisation to deal with this. The state of the native carry flag as it corresponds (or doesn’t correspond) to the UML C flag is tracked, and the strategy for conditional operations is chosen accordingly.

There are also some other fixes, optimisations an adjustments:

  • Code for emitting conditional skip setup has been refactored into a single function to simplify maintenance. The code generated for U/NU conditions has been slightly optimised.
  • Using conditional select instructions (csel/csinc/csinv/fcssel) to implement conditional forms of MOV and FMOV when advantageous.
  • Some situations where a conditional EXIT would misbehave have been fixed (we never hit these because no CPU front-end uses conditional EXIT anyway).
  • Range for conditional EXIT has been extended to ±128 MiB (it was ±1 MiB for most conditions, which is a bit short). It would be possible to make it a bit more efficient by checking the actual distance to the EXIT landing pad, that would mean it couldn’t use the common skip setup function, and I really doubt it would be worthwhile, even if front-ends actually did use conditional EXIT.
  • Slight optimisation saving one instruction for some short backward conditional jumps.
  • Some more simplification for the ADD/ADDC/SUB/SUBB code generation, avoiding an unnecessary move in some cases and reducing the number of temporary registers used in others.

It’s entirely possible I’ve missed something.

Can someone with some kind of ARM Cortex-A system set up for testing MAME (maybe @danmons or @grant2258 if you have some time) please check some things for me?

  • Check that fiveside (PowerPC 403), coolmini (E1-16) and soldivid (SH-2) are still working.
  • Get before/after scores for:
    • -bench 90 fiveside
    • -bench 90 coolmini
    • -bench 90 soldivid
  • Get me a copy of the logged generated code for:
    • -drc_log_native -bench 1 fiveside
    • -drc_log_native -bench 1 coolmini
    • -drc_log_native -bench 1 soldivid

If stuff’s broken, I’ll do my best to work with people to get it sorted out. I really don’t like leaving performance on the table.

cuavas added 3 commits March 16, 2025 15:54
Track the state of the native carry flag to avoid unnecessarily
manipulating the native NZCV register.  If the native carry flag does
not correspond to the UML carry flag, test the bit in the flags register
for the C and NC conditions.

Fixed EXIT with C/NC/A/BE condition not working properly if it doesn't
immediately follow a CMP or SUB.  Extended reach of conditional EXIT to
+/-128MiB (was +/-1MiB for conditions other than U/NU).

Moved code to set up skipping conditional instructions to a common
function.

Use TBZ/TBNZ for short backward jumps with U/NU/C/NC conditions to save
one instruction and a temporary register.
Avoids an unnecessary register copy when one operand is in memory and
the other is a small immediate value.  Also fixed another unnecessary
register copy for SUB[B] when an operand is kept in a host register.
Conditional forms of MOV and FMOV can be efficiently implemented using
conditional select instructions when the condition is handled using
native NZCV flags, the destination UML register is kept in a host
register, and the source is kept in a host register or is a simple
immediate value.

Immediate integer -1/0/1 can be generated directly using
CSINV/CSEL/CSINC with xzr/wzr as a source operand.
@grant2258
Copy link

Ill get the lastest master and add this patch tonight before bed just a few things to do atm will get it done though.

@cuavas
Copy link
Member Author

cuavas commented Mar 16, 2025

No problem. There’s no deadline. If I’ve screwed something up, it might take a few iterations to sort out.

@grant2258
Copy link

grant2258 commented Mar 16, 2025

Luckily it was a quick compile

./mamemaster -bench 90 fiveside Average speed: 583.06% (89 seconds)
./mame_13484 -bench 90 fiveside Average speed: 625.84% (89 seconds)

./mamemaster -bench 90 coolmini Average speed: 197.77% (89 seconds)
/mame_13484  -bench 90 coolmini Average speed: 209.67% (89 seconds)

./mamemaster -bench 90 soldivid Average speed: 687.40% (89 seconds)
./mame_13484 -bench 90 soldivid Average speed: 886.93% (89 seconds)

drc_13484.zip

@cuavas
Copy link
Member Author

cuavas commented Mar 16, 2025

Thanks.

Luckily it was a quick compile

Yeah, I’ve been working on keeping dependencies under control – this patch only requires one file to be recompiled.

So that’s a 7% improvement for fiveside, 6% for coolmini, and 29% for soldivid. That’s decent I guess.

Did you play all three for a bit as a sanity check?

@grant2258
Copy link

grant2258 commented Mar 16, 2025

was over ssh ill play each for a bit and get back to you if i see any issues

@cuavas
Copy link
Member Author

cuavas commented Mar 16, 2025

Here we go, here’s fiveside doing its “fastram” checks:

    and w19, w19, 0x7FFFFFFF                ; 737A0012                | and     i0,i0,$7FFFFFFF
    nop                                     ; 1F2003D5                | nop
    mov w10, 0xFF01FFFF                     ; CA1FA012                | cmp     i0,$FF01FFFF,ZC
    cmp w19, w10                            ; 7F020A6B
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    b.hi PC$1                               ; 08000054                | jmp     $       1,a
    mov w10, 0xFF000000                     ; 0AE0BF52                | cmp     i0,$FF000000,C
    cmp w19, w10                            ; 7F020A6B
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    b.lo PC$1                               ; 03000054                | jmp     $       1,c
    nop                                     ; 1F2003D5                | nop
    mov x9, 0x5555647C5920                  ; 09248BD2898FACF2A9AACAF2| load    i0,[[$0x5555647c5920]],i0,dword_x1
    ldr w19, [x9, w19]                      ; 336973B8
    ldp x29, x30, [sp], 0x10                ; FD7BC1A8                | ret
    ret x30                                 ; C0035FD6

And this is the equivalent snippet for the E1-16 global register write, courtesy of coolmini:

    cmp w23, 0x10                           ; FF420071                | cmp     i4,$10,C
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    b.hs PC$4                               ; 02000054                | jmp     $       4,nc
    cmp w23, 1                              ; FF060071                | cmp     i4,$1,ZC
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    b.hi PC$5                               ; 08000054                | jmp     $       5,a
    b.eq PC$6                               ; 00000054                | jmp     $       6,z
    mov w10, 0xFFFFFFFFFFFFFFFE             ; 2A008012                | and     [pc],i5,$FFFFFFFE
    and w11, w24, w10                       ; 0B030A0A
    stur w11, [x27, 0xFFFFFFFFFFFFFF00]     ; 6B0310B8
    b PC$7                                  ; 00000014                | jmp     $       7

(The UML in comments to help you follow along was added in #13472.)

That’s much prettier than it was before.

@rb6502
Copy link
Contributor

rb6502 commented Mar 16, 2025

For something meatier, Toy Fighter (Naomi) went from 170% to 196% on the M3 Max with this change. It now can run multi-second sequences of the attract mode without dropping a frame, which is awesome.

@cuavas
Copy link
Member Author

cuavas commented Mar 16, 2025

Yay!

Now if I could just work out where that sub-optimal AND code is coming from, maybe I could sleep in peace…

@grant2258
Copy link

grant2258 commented Mar 16, 2025

fiveside and soldivid are fine ill be playing more of soldivid is a really decent game.

coolmini doesn't start after selecting tnt game says ready then freezes this is true from mame0275, current master and this pull request it does work with -nodrc

@rb6502
Copy link
Contributor

rb6502 commented Mar 16, 2025

The coolmini tnt minigame plays OK for me.

@cuavas
Copy link
Member Author

cuavas commented Mar 16, 2025

Hmm, if it’s not a regression from this PR, it shouldn’t hold this up, but it will need to be tracked down later. It’s a bit concerning that it’s working on one ARM system and not another, though.

I think the issue causing the sub-optimal AND code there is:

  • The front-end does UML_AND(block, DRC_PC, I5, ~1);
  • The immediate ~1 gets extended out to 64 bits, i.e. 0xfffffffffffffffe
  • The back-end goes, “that’s not a valid 32-bit mask immediate, I can’t use the immediate form of and/ands

To solve this, something needs to mask 32-bit UML instructions’ immediate operands to 32-bits. I’m not going to try tackling that tonight.

I still noticed some other fuckery I can do to slightly optimise AND generateion though.

Also put my name on the files - I think I've contributed enough to it
now.
@grant2258
Copy link

grant2258 commented Mar 16, 2025

Hmm, if it’s not a regression from this PR, it shouldn’t hold this up, but it will need to be tracked down later. It’s a bit concerning that it’s working on one ARM system and not another, though.

I cleared the cfg, hiscore and nvram just to be sure it wasnt an issue in there its still happens on the pi5 hopefully more people can test. Seems to be working after 1c86e1a for the pi 5

@cuavas cuavas merged commit d7d7a4c into mamedev:master Mar 16, 2025
5 checks passed
@cuavas cuavas deleted the a64carry branch March 16, 2025 18:02
@cuavas
Copy link
Member Author

cuavas commented Mar 16, 2025

Ah, dammit. I just realised it’s possible to optimise AND <reg>,<mem>,<imm> a bit more. Oh well, not tonight.

@cuavas
Copy link
Member Author

cuavas commented Mar 16, 2025

Here’s an example of conditional select instructions now being used in PowerPC exception handling:

    mov w22, 0x500                          ; 16A08052                | mov     i3,$500
    ldr w9, [x27, 0x1054]                   ; 695750B9                | test    [spr+$F60],$8000000,Z
    tst w9, 0x8000000                       ; 3F010572
    mov w10, 0x1000                         ; 0A008252                | mov     i3,$1000,nz
    csel w22, w10, w22, 3                   ; 5611961A
    ldr w9, [x27, 0x1054]                   ; 695750B9                | test    [spr+$F60],$4000000,Z
    tst w9, 0x4000000                       ; 3F010672
    mov w10, 0x1010                         ; 0A028252                | mov     i3,$1010,nz
    csel w22, w10, w22, 3                   ; 5611961A
    ldr w9, [x27, 0x1F4]                    ; 69F741B9                | test    [spr+$100],[spr+$108],Z
    ldr w10, [x27, 0x1FC]                   ; 6AFF41B9
    tst w9, w10                             ; 3F010A6A
    mov w10, 0x500                          ; 0AA08052                | mov     i3,$500,nz
    csel w22, w10, w22, 3                   ; 5611961A

Previously it looked like:

    mov w22, 0x500                          ; 16A08052
    ldr w9, [x27, 0x1054]                   ; 695750B9
    tst w9, 0x8000000                       ; 3F010572
    b.eq L2                                 ; 00000054
    mov w22, 0x1000                         ; 16008252
L2:
    ldr w9, [x27, 0x1054]                   ; 695750B9
    tst w9, 0x4000000                       ; 3F010672
    b.eq L3                                 ; 00000054
    mov w22, 0x1010                         ; 16028252
L3:
    ldr w9, [x27, 0x1F4]                    ; 69F741B9
    ldr w10, [x27, 0x1FC]                   ; 6AFF41B9
    tst w9, w10                             ; 3F010A6A
    b.eq L4                                 ; 00000054
    mov w22, 0x500                          ; 16A08052
L4:

That should be less of a drain on branch prediction resources.

And here it’s getting “clever” generating a constant 1 for a conditional MOV in Hyperstone E1 code:

    cmp w21, w20                            ; BF02146B                | cmp     i2,i1,SZVC
    cset x12, 5                             ; EC279F9A
    bfi x28, x12, 0, 1                      ; 9C0140B3
    mov w10, 2                              ; 4A008052                | mov     i3,$2,z
    csel w22, w10, w22, 2                   ; 5601961A
    csinc w22, w22, wzr, 4                  ; D6269F1A                | mov     i3,$1,c
    b.ge PC$4                               ; 0A000054                | jmp     $       4,ge
    orr w22, w22, 4                         ; D6021E32                | or      i3,i3,$4
PC$4:                                       ;                         | label   $       4

@grant2258
Copy link

Just want to clarify it is crashing even after todays latest commit. The other day I didnt skip the how to play screen and and used the keys and assumed it was working because the keys where taking input. Its the screen right after the how to play crashes when it says ready. I did record a video but there is a 25mb limit on here so cant post it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants