some performance investigation #5
BTW you were complaining about https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_node/as_68.html
need to use |
BTW2
The "w/ HPE" option could go down by 140 ns (100 ns in the stream 2 case), as the stacking is not necessary (except s0, s1). EDIT: note that IRQ code in SRAM might have some penalty for HPE.
Thank you for the clarification of the Perhaps we could make a new, extra docs folder with what you've found, but in a more publicly readable and absorbable format, i.e. with a markdown table, etc. Also, did you see anything interesting surrounding word alignment in your tests? I just regret that I have a hard time absorbing the information above well enough to get a deeper understanding of what's really going on inside the chip. I am also going to send this to Macyler, who will likely be doing other testing.
so far:
- loads/stores are 2 cycles
- flash with 4-byte lines and linear prefetch working (could be in core or in flash like cm0 are doing, I'll check later)
- unaligned long instructions seem to (sometimes) have an initial one-cycle penalty and then execute normally
- a taken branch (to an aligned location) is 3 cycles at 0ws and 5 cycles at 1ws: 1 extra cycle for finishing the prefetch of the next instruction and another when waiting for the target location
compressed branching:
Seems that the linear prefetch is triggered when 2nd instruction in bundle gets executed. //compressed baseline:
0ws: 20000 //1st op no skip: //1st op over one:
0ws: 21000 //-1 3 //1st op over two: //1st op over three: //1st op over five: //2nd op no skip: //2nd op over one:
0ws: 21000 //-1 3 //2nd op over two: //2nd op over three: //2nd op over five: //trim one nop from baseline uncompressed1 cycle penalty for unaligned branch. //norvc baseline (allbig) //norvc noskip //norvc over one //norvc over two //unaligned norvc baseline (1x c.nop at beginning and end)
0ws: 20001 //unaligned norvc noskip //unaligned norvc over one //unaligned norvc over two |
2 cycle ops seem to swap the timings of branching from earlier/later op //baseline //one lw sram //trim one nop //lw from sram //two lw sram //trim one nop //two lw from sram //one lw from flash //two lw from flash //trim one nop //lw from flash //trim one nop //two lw from flash two loads can be either
or
or swapped ops in bundles - no difference |
if prefetcher is pressured enough with long instructions after 2 cycle ones, it seems to be back to 4e/5l
//1 lw sram, 1 big nop //1 lw sram, 1 big nop // trim one nop //2 lw sram, 1 big nop //2 lw sram, 1 big nop // trim one nop //1 lw sram, 2 big nop //1 lw sram, 2 big nop // trim one nop //2 lw sram, 2 big nop //2 lw sram, 2 big nop // trim one nop //2lw 4big //2lw 4big //trim one //2lw 3big //2lw 3big //trim one |
I am really sorry, with your syntax, I do not understand what you are trying to say. Please use a different syntax to describe what you are finding? I am not able to extract any info from it :( |
those are the cycle counts, for a given scenario, in this template https://github.com/jnk0le/random/blob/master/pipeline%20cycle%20test/ch32v_pipetest_tmpl.S (1 additional cycle per loop makes 1000 cycles, loop invariant stuff can be filtered out) for a quick summary:
|
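The counting convention described above (the template runs the test body in a loop, so 1 extra cycle per iteration shows up as 1000 extra cycles) can be sketched as a small helper. This is only an illustration: the function name `split_delta` is made up here, and treating the remainder as loop-invariant overhead is an assumption based on the remark that loop-invariant stuff can be filtered out.

```python
ITERATIONS = 1000  # from the thread: 1 additional cycle per loop makes 1000 cycles


def split_delta(measured, baseline, iterations=ITERATIONS):
    """Split a cycle-count difference into (cycles per iteration, leftover cycles).

    The leftover part is assumed to be loop-invariant overhead
    (hypothetical interpretation, not confirmed by the template itself).
    """
    per_iteration, remainder = divmod(measured - baseline, iterations)
    return per_iteration, remainder


# Counts quoted in this thread, compared against the 20000-cycle baseline.
print(split_delta(21000, 20000))  # (1, 0)
print(split_delta(20001, 20000))  # (0, 1)
```

Read this way, a result of 21000 against a 20000 baseline means one extra cycle on every pass through the loop, while 20001 would be a single one-off cycle.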
Are you on the Discord server? This feels like a largely parallel effort to what Macyler is doing.
I don't have discord account, though it's possible to see those channels without creating one. |
I am not sure. This is the specific channel https://discord.com/channels/665433554787893289/1110284149450878979 |
(I'm Macyler) My project is still a mess right now and I haven't documented the results so far, but it's here: https://github.com/CaiB/CH32V003-Architecture-Exploration/tree/main What I've found so far is that it seems like alignment makes little difference to in-order execution. |
regarding the compressed loads etc. it should be stuff from Zce v0.50 (or older) https://github.com/riscv/riscv-code-size-reduction/releases/tag/V0.50.1-TOOLCHAIN-DEV The turnaround of spec to shipped silicon is about right in this case.
it could get compressed opcode ( |
v0.70 c.not -> illegal instruction |
You need to use mmooooorreeeee worrrrrdssssss. I don't have enough background to know what you are referring to. What is v0.50 c.not? Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb? |
yes, it has the same encoding as the already implemented Zcmb instructions (which did load something into a5)
wait, that's not zcmb, bit 12 is part of an offset. there is also no c.lb and E: (c.lbu and c.lhu are there) |
Wait really?!? Bleh, I feel like I need a guide for all of the opcodes that can be used. I really want to enable bit timing correction. |
regarding your struggle with uncompressed instructions, I did some simple tests with 10 straightlined instructions:
(I'll put template here once ready)
10x c.nop
0ws: 10
1ws: 10
2ws (invalid in RM): 21
10x big nop
0ws: 10
1ws: 20
2ws (invalid in RM): 41
c.nop + 9x nop
0ws: 11
1ws: 20
2ws (invalid in RM): 41
2x c.nop + 8x nop
0ws: 10
1ws: 18
2ws (invalid in RM): 37
3x c.nop + 7x nop
0ws: 11
1ws: 18
2ws (invalid in RM): 37
5x c.nop + 5x nop
0ws: 10
1ws: 16
2ws (invalid in RM): 33
c.nop + 8x nop + c.nop
0ws: 11
1ws: 18
2ws (invalid in RM): 37
repeating 1x nop then 2x c.nop (10 insn total)
0ws: 10
1ws: 14
2ws (invalid in RM): 29
repeating c.nop, nop (10 insn total)
0ws: 10
1ws: 16
2ws (invalid in RM): 33
10x c.lw (or c.sw) from sram
0ws: 20
1ws: 20
2ws (invalid in RM): 21
10x lw (or sw) from sram
0ws: 20
1ws: 20
2ws (invalid in RM): 41
unaligned lw/sw causes unaligned load/store exception
word unaligned lb/sb doesn't add penalty cycles
0ws: 11
1ws: 12
2ws (invalid in RM): 25
0ws: 10
1ws: 12
2ws (invalid in RM): 25
looks like flash prefetching works, 4byte lines.
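As a sanity check on the "4-byte lines" reading, the labelled 1ws and 2ws numbers above happen to fit a very simple rule: the run takes whichever is longer, the execution time or the time to fetch the 4-byte words the code occupies. The sketch below is only a curve fit to the figures in this comment, not documented hardware behaviour. The constants (2 cycles per word at 1ws, 4 cycles per word plus 1 at 2ws), the assumption that the sequence starts word-aligned, and all names are assumptions introduced here. It does not try to explain the occasional 11 at 0ws.

```python
from math import ceil

C, N = 2, 4  # instruction sizes in bytes: compressed (c.nop, c.lw) and full-size (nop, lw)


def predict(sizes, exec_cycles_per_insn, ws):
    """Predict cycles for a straight-line run fetched from flash at 1 or 2 wait states."""
    words = ceil(sum(sizes) / 4)  # 4-byte flash lines, start assumed word-aligned
    exec_cycles = exec_cycles_per_insn * len(sizes)
    if ws == 1:
        fetch_cycles = 2 * words      # fitted constant, not from a datasheet
    elif ws == 2:
        fetch_cycles = 4 * words + 1  # fitted constant, not from a datasheet
    else:
        raise ValueError("only 1ws and 2ws are modelled")
    return max(exec_cycles, fetch_cycles)


# (label, instruction sizes, cycles per instruction, measured 1ws, measured 2ws)
CASES = [
    ("10x c.nop", [C] * 10, 1, 10, 21),
    ("10x big nop", [N] * 10, 1, 20, 41),
    ("c.nop + 9x nop", [C] + [N] * 9, 1, 20, 41),
    ("2x c.nop + 8x nop", [C] * 2 + [N] * 8, 1, 18, 37),
    ("3x c.nop + 7x nop", [C] * 3 + [N] * 7, 1, 18, 37),
    ("5x c.nop + 5x nop", [C] * 5 + [N] * 5, 1, 16, 33),
    ("c.nop + 8x nop + c.nop", [C] + [N] * 8 + [C], 1, 18, 37),
    ("repeating nop, 2x c.nop", ([N, C, C] * 4)[:10], 1, 14, 29),
    ("repeating c.nop, nop", [C, N] * 5, 1, 16, 33),
    ("10x c.lw from sram", [C] * 10, 2, 20, 21),
    ("10x lw from sram", [N] * 10, 2, 20, 41),
]

for label, sizes, exec_cycles, measured_1ws, measured_2ws in CASES:
    ok = (predict(sizes, exec_cycles, 1) == measured_1ws
          and predict(sizes, exec_cycles, 2) == measured_2ws)
    print(f"{label}: {'matches' if ok else 'differs'}")
```

If the fit holds more generally, code density matters more than instruction count once wait states are enabled, which is consistent with the mixed c.nop/nop rows above.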
Note that you are using 48 MHz with the 1ws config. You can put code in SRAM (e.g.
.section .data.yourfunc, "x"
) to get 0ws, but this introduces potential contention with DMA.