some performance investigation #5

Open
jnk0le opened this issue May 21, 2023 · 21 comments


@jnk0le

jnk0le commented May 21, 2023

Regarding your struggle with uncompressed instructions, I ran some simple tests with 10 straight-line instructions:
(I'll put template here once ready)

10x c.nop

0ws: 10
1ws: 10
2ws (invalid in RM): 21

10x big nop

0ws: 10
1ws: 20
2ws (invalid in RM): 41

c.nop + 9x nop

0ws: 11
1ws: 20
2ws (invalid in RM): 41

2x c.nop + 8x nop

0ws: 10
1ws: 18
2ws (invalid in RM): 37

3x c.nop + 7x nop

0ws: 11
1ws: 18
2ws (invalid in RM): 37

5x c.nop + 5x nop

0ws: 10
1ws: 16
2ws (invalid in RM): 33

c.nop + 8x nop + c.nop

0ws: 11
1ws: 18
2ws (invalid in RM): 37

repeating 1x nop then 2x c.nop (10 insn total)

0ws: 10
1ws: 14
2ws (invalid in RM): 29

repeating c.nop, nop (10 insn total)

0ws: 10
1ws: 16
2ws (invalid in RM): 33

10x c.lw (or c.sw) from sram

0ws: 20
1ws: 20
2ws (invalid in RM): 21

10x lw (or sw) from sram

0ws: 20
1ws: 20
2ws (invalid in RM): 41

An unaligned lw/sw causes an unaligned load/store exception.
A word-unaligned lb/sb doesn't add penalty cycles.

	6x c.nop +
	c.slli a2, 1
	c.or a2, a0
	c.addi s1, -1 // # of bits left.
	andi a4, s1, 31 // mask off so we only look at bottom 5 bits

0ws: 11
1ws: 12
2ws (invalid in RM): 25

	5x c.nop +
	c.slli a2, 1
	c.or a2, a0
	c.addi s1, -1 // # of bits left.
	andi a4, s1, 31 // mask off so we only look at bottom 5 bits
	c.nop

0ws: 10
1ws: 12
2ws (invalid in RM): 25

Looks like flash prefetching works, with 4-byte lines.
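
As a rough cross-check of the 4-byte-line observation: the 1ws totals for the nop mixes above match a model in which every 4-byte flash line touched costs 2 cycles. The sketch below is only a fit to the numbers listed in this comment, not documented hardware behaviour. It assumes the block starts on a 4-byte boundary and contains only single-cycle instructions, and `flash_1ws_cycles` is a hypothetical helper name.

```c
#include <assert.h>

/* Instruction sizes in bytes: c.nop is 2, the "big" nop is 4. */
enum { RVC_BYTES = 2, RV32_BYTES = 4, FLASH_LINE_BYTES = 4 };

/*
 * Hypothetical helper: empirical fit to the 1ws rows above, not documented
 * hardware behaviour. Assumes the block starts on a 4-byte boundary and
 * contains only single-cycle instructions. Each flash line touched is
 * charged 2 cycles (1 access + 1 wait state).
 */
unsigned flash_1ws_cycles(unsigned n_compressed, unsigned n_full)
{
    unsigned bytes = n_compressed * RVC_BYTES + n_full * RV32_BYTES;
    unsigned lines = (bytes + FLASH_LINE_BYTES - 1) / FLASH_LINE_BYTES;
    return 2u * lines;
}

/* Compare the fit against a few of the measured 1ws totals listed above. */
void flash_1ws_selfcheck(void)
{
    assert(flash_1ws_cycles(10, 0) == 10); /* 10x c.nop */
    assert(flash_1ws_cycles(0, 10) == 20); /* 10x big nop */
    assert(flash_1ws_cycles(2, 8) == 18);  /* 2x c.nop + 8x nop */
    assert(flash_1ws_cycles(5, 5) == 16);  /* 5x c.nop + 5x nop */
}
```

The model deliberately ignores the occasional extra cycle seen at 0ws and the 2ws rows, which follow a different pattern.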

Note that you are using 48 MHz with the 1ws config. You can put code in SRAM (e.g. .section .data.yourfunc, "x") to get 0ws, but then there is potential contention with DMA.

@jnk0le
Author

jnk0le commented May 21, 2023

BTW you were complaining about .align dumping too much padding.

https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_node/as_68.html

For other systems, including the i386 using a.out format, it is the number of low-order zero bits the location counter must have after advancement. For example `.align 3' advances the location counter until it a multiple of 8. If the location counter is already a multiple of 8, no change is needed.

You need to use .balign when specifying the alignment in bytes.
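
To make the difference concrete, here is a small host-side sketch of how much padding each directive inserts, assuming a target where .align takes a power-of-two exponent as the quoted passage describes. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Padding inserted by `.balign n`, where n is the alignment in bytes. */
uint32_t balign_padding(uint32_t location, uint32_t align_bytes)
{
    /* The alignment must be a non-zero power of two. */
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    return (align_bytes - (location % align_bytes)) % align_bytes;
}

/* Padding inserted by `.align n` when n is read as an exponent (2^n bytes). */
uint32_t align_pow2_padding(uint32_t location, uint32_t exponent)
{
    assert(exponent < 32);
    return balign_padding(location, UINT32_C(1) << exponent);
}
```

With the location counter at offset 2, asking for "4" means 2 bytes of padding under .balign but 14 bytes if the 4 is read as an exponent (16-byte alignment), which would explain the unexpected amount of padding.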

@jnk0le
Author

jnk0le commented May 21, 2023

BTW2

W/O HPE: 444ns W/ HPE: 589ns

"w/ HPE" option could go down by 140 ns (100ns in stream 2 case) as the stacking is not necessary (except s0, s1)

EDIT: note that IRQ code in SRAM might have some penalty for HPE.
You can also use table-free interrupts to (probably) bring the HPE-less case down a bit.

@cnlohr
Owner

cnlohr commented May 22, 2023

Thank you for the clarification of the .balign thing. I'll give that a shot in my next livestream.

Perhaps we could make a new, extra docs folder with what you've found, but in a more publicly readable and absorbable format, e.g. with markdown tables.

Also, do you see any interesting stuff surrounding word alignment in your tests? I just regret that I have a hard time absorbing the information above to obtain the deeper understanding of what's really going on inside the chip. I am also going to send this to Macyler who will likely be doing other testing.

@jnk0le
Author

jnk0le commented May 22, 2023

so far:

Loads and stores take 2 cycles.

Flash has 4-byte lines with working linear prefetch (could be in the core or in the flash like CM0 parts do, I'll check later),
therefore only 16-bit (single-cycle) instructions can execute at full speed.

Unaligned long instructions seem to (sometimes) have an initial one-cycle penalty and then execute normally.

A taken branch (to an aligned location) is 3 cycles at 0ws and 5 cycles at 1ws: 1 extra cycle for finishing the prefetch of the next instruction and another when waiting for the target location.

@jnk0le
Author

jnk0le commented May 27, 2023

compressed branching:

  • 0ws is always 3 cycle
  • 1ws:
    • branch from an earlier op is 4 cycles (and +1 due to only 1 insn in unaligned location)
    • branch from later op is 5 cycles (and +1 due to only 1 insn in unaligned location)

It seems that the linear prefetch is triggered when the 2nd instruction in a bundle gets executed.
A branch also doesn't care about already prefetched instructions (no benefit in short forward branches).
Backward branching behaves exactly the same as forward.

//compressed baseline:

	FLASH->ACTLR = FLASH_ACTLR_LATENCY_0;
	printf("0ws: %lu\n", ch32v_pipetest_tmpl());
	FLASH->ACTLR = FLASH_ACTLR_LATENCY_1;
	printf("1ws: %lu\n", ch32v_pipetest_tmpl()+2);

0ws: 20000
1ws: 22000

//1st op no skip:
0ws: 22000 //3
1ws: 26000 //5

//1st op over one:

	beqz a0, 2f
	nop

2:	nop
3:	nop

0ws: 21000 //-1 3
1ws: 24000 //-1 4

//1st op over two:
0ws: 20000 //-2 3
1ws: 24000 //-2 5

//1st op over three:
0ws: 19000 //-3 3
1ws: 22000 //-3 4

//1st op over five:
0ws: 17000 //-5 3
1ws: 20000 //-5 4

//2nd op no skip:
0ws: 22000 //3
1ws: 26000 //5

//2nd op over one:

	nop
	beqz a0, 3f

2:	nop
3:	nop

0ws: 21000 //-1 3
1ws: 26000 //-1 6

//2nd op over two:
0ws: 20000 //-2 3
1ws: 24000 //-2 5

//2nd op over three:
0ws: 19000 //-3 3
1ws: 24000 //-3 6

//2nd op over five:
0ws: 17000 //-5 3
1ws: 22000 //-5 6

//trim one nop from baseline
0ws: 19000
1ws: 20002

uncompressed

1 cycle penalty for unaligned branch.

//norvc baseline (allbig)
0ws: 20000
1ws: 38000

//norvc noskip
0ws: 22000
1ws: 40000

//norvc over one
0ws: 21000
1ws: 38000

//norvc over two
0ws: 20000
1ws: 36000

//unaligned norvc baseline (1x c.nop at beginning and end)

1: // replace code below
.option rvc
	nop
.option norvc
	nop
	[...]
	nop
.option rvc
	nop

	addi a5, a5, -1
	bnez a5, 1b // 3 cycle taken at 0ws, 5 at 1ws

0ws: 20001
1ws: 36000

//unaligned norvc noskip
0ws: 23000
1ws: 40000

//unaligned norvc over one
0ws: 22000
1ws: 38000

//unaligned norvc over two
0ws: 21000
1ws: 36000

@jnk0le
Author

jnk0le commented May 28, 2023

2-cycle ops seem to swap the timings of branching from an earlier/later op.
EDIT: long 1-cycle instructions are not experiencing this swap.

//baseline
0ws: 20000
1ws: 22000

//one lw sram
0ws: 21000
1ws: 22001 //4??

//trim one nop //lw from sram
0ws: 20000
1ws: 22000 //5??

//two lw sram
0ws: 22000
1ws: 24000 //5

//trim one nop //two lw from sram
0ws: 21000
1ws: 22001 //4

//one lw from flash
0ws: 21000
1ws: 24001

//two lw from flash
0ws: 22000
1ws: 28000

//trim one nop //lw from flash
0ws: 20000
1ws: 24000

//trim one nop //two lw from flash
0ws: 21000
1ws: 26001

two loads can be either

	lw a2, 0(a1)
	lw a2, 0(a1)

	nop
	nop

or

	lw a2, 0(a1)
	nop
	
	lw a2, 0(a1)
	nop

or swapped ops in bundles - no difference

@jnk0le
Author

jnk0le commented May 29, 2023

If the prefetcher is pressured enough with long instructions after 2-cycle ones, it seems to go back to 4 cycles from the earlier op and 5 from the later op (4e/5l).

	lw a2, 0(a1)
	nop

	lw a2, 0(a1)
	nop

	nop
	nop

.option norvc
	nop
//.option rvc
	nop
.option rvc

//1 lw sram, 1 big nop
0ws: 21000
1ws: 22002 //4 from earlier

//1 lw sram, 1 big nop // trim one nop
0ws: 20000
1ws: 22000 //5 from later

//2 lw sram, 1 big nop
0ws: 22000
1ws: 24000 //e???

//2 lw sram, 1 big nop // trim one nop
0ws: 21000
1ws: 22001 //l???

//1 lw sram, 2 big nop
0ws: 21000
1ws: 24000 //l

//1 lw sram, 2 big nop // trim one nop
0ws: 20000
1ws: 22002 //e

//2 lw sram, 2 big nop
0ws: 22000
1ws: 24000 //l

//2 lw sram, 2 big nop // trim one nop
0ws: 21000
1ws: 22002 //e

//2lw 4big
0ws: 22000
1ws: 26000 //l

//2lw 4big //trim one
0ws: 21000
1ws: 24002 //e

//2lw 3big
0ws: 22000
1ws: 24002 //e

//2lw 3big //trim one
0ws: 21000
1ws: 24000 //l

@cnlohr
Owner

cnlohr commented May 29, 2023

I am really sorry, with your syntax, I do not understand what you are trying to say. Please use a different syntax to describe what you are finding? I am not able to extract any info from it :(

@jnk0le
Author

jnk0le commented May 30, 2023

Those are the total cycle counts for a given scenario run in this template: https://github.com/jnk0le/random/blob/master/pipeline%20cycle%20test/ch32v_pipetest_tmpl.S (1 additional cycle per loop iteration adds 1000 cycles to the total; loop-invariant overhead can be filtered out).
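
A minimal host-side sketch of how the totals can be decoded, for anyone reading the earlier comments. The branch formula is inferred from annotations like "//-1 3" and is an assumption: it treats the branch as replacing one single-cycle nop from the baseline and jumping over `skipped` single-cycle nops. Both function names are hypothetical.

```c
#include <assert.h>

/* 1 extra cycle per loop iteration shows up as 1000 extra cycles in the total. */
enum { LOOP_ITERATIONS = 1000 };

/*
 * Extra cycles per iteration relative to a baseline run. Rounds to the
 * nearest whole cycle so that small loop-invariant offsets (e.g. a total
 * of 22001 or 20002) are filtered out.
 */
long extra_cycles_per_iteration(long total, long baseline)
{
    long diff = total - baseline;
    long half = LOOP_ITERATIONS / 2;
    return (diff >= 0 ? diff + half : diff - half) / LOOP_ITERATIONS;
}

/*
 * Assumed model (inferred from the annotations, not confirmed): the branch
 * replaces one single-cycle nop and skips `skipped` single-cycle nops.
 */
long taken_branch_cycles(long total, long baseline, long skipped)
{
    assert(skipped >= 0);
    return extra_cycles_per_iteration(total, baseline) + 1 + skipped;
}
```

For example, the compressed "1st op over one" case at 1ws (24000 against a baseline of 22000) decodes to 2 + 1 + 1 = 4 cycles, which matches the "//-1 4" annotation.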

for a quick summary:

  • At 0ws: everything is cycle perfect.
  • At 1ws: the prefetching is weird enough that one cannot easily predict the execution/branch timings, especially the branch anomalies.

@cnlohr
Owner

cnlohr commented May 30, 2023

Are you on the Discord server? This feels like a largely parallel effort to what Macyler is doing.

@jnk0le
Author

jnk0le commented May 30, 2023

I don't have a Discord account, though it may be possible to see those channels without creating one.

@cnlohr
Owner

cnlohr commented May 30, 2023

I am not sure. This is the specific channel https://discord.com/channels/665433554787893289/1110284149450878979

@CaiB

CaiB commented May 30, 2023

(I'm Macyler) My project is still a mess right now and I haven't documented the results so far, but it's here: https://github.com/CaiB/CH32V003-Architecture-Exploration/tree/main

What I've found so far is that it seems like alignment makes little difference to in-order execution.
Executing the same instruction 3 and 7 times in a row:
image
From this it looks like you get 2 non-compressed instructions in a row, then any more will slow you down to 2 CPI. lui being a weird exception.
I haven't done any more testing in this direction, currently trying to set up opcode fuzzing to try and find some of the undocumented instructions.

@jnk0le
Author

jnk0le commented May 31, 2023

Regarding the compressed loads etc.: it should be stuff from Zce v0.50 (or older) https://github.com/riscv/riscv-code-size-reduction/releases/tag/V0.50.1-TOOLCHAIN-DEV

The turnaround of spec to shipped silicon is about right in this case.

lui being a weird exception.

It could be getting a compressed opcode (c.lui can address all registers except x0 and x2); then everything is as expected.

@cnlohr
Owner

cnlohr commented May 31, 2023

Do you know that it is? Or just a guess?

Also, seeing C.NOT has me all like
image

@jnk0le
Author

jnk0le commented May 31, 2023

ok, tried the 0.50 c.lb

	li a0, 0x12345678
	sw a0, 1024(gp)
	addi a1, gp, 1024

	.2byte (0x2002 | (3 << 7) | (0 << 2)) //a1 addr // 0 offset // load to s0
	.2byte (0x2002 | (3 << 7) | (1 << 2) | (1 << 11)) //a1 addr // 1 offset // load to s1

and got the result of v0.70 cm.lhu, so that's definitely Zcmb. Not sure about the sp ones, as those were dropped from 0.70 Zce.
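
For reference, the two .2byte expressions above work out to the halfwords 0x2182 and 0x2986. The sketch below rebuilds them on the host. It reads the `<< 7` field as the base register and the `<< 2` field as the destination, as in the standard compressed load layout, which is an assumption about this encoding, and it passes any other bits through verbatim. `probe_halfword` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Base bit pattern copied from the .2byte expressions above. */
#define PROBE_BASE 0x2002u

/*
 * rs1_prime and rd_prime are 3-bit compressed register numbers
 * (0 = s0/x8, 1 = s1/x9, 2 = a0/x10, 3 = a1/x11, ... 7 = a5/x15).
 * extra_bits is OR-ed in unchanged, e.g. (1u << 11) from the second test.
 */
uint16_t probe_halfword(unsigned rs1_prime, unsigned rd_prime, unsigned extra_bits)
{
    assert(rs1_prime < 8 && rd_prime < 8);
    return (uint16_t)(PROBE_BASE | (rs1_prime << 7) | (rd_prime << 2) | extra_bits);
}
```

Calling `probe_halfword(3, 0, 0)` gives the first halfword and `probe_halfword(3, 1, 1u << 11)` gives the second.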
image

@jnk0le
Author

jnk0le commented May 31, 2023

v0.70 c.not -> illegal instruction
v0.50 c.not -> that's c.lbu of v0.70 Zcmb

@cnlohr
Owner

cnlohr commented May 31, 2023

You need to use mmooooorreeeee worrrrrdssssss. I don't have enough background to know what you are referring to.

What is v0.50 c.not? Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?

@jnk0le
Author

jnk0le commented May 31, 2023

Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?

Yes, it has the same encoding as the already implemented Zcmb instructions (which did load something into a5).

@jnk0le
Author

jnk0le commented Jun 1, 2023

Wait, that's not Zcmb; bit 12 is part of an offset.

There is also no c.lb, c.sb, or c.sh in their "documentation" of the xw extension.

E: (c.lbu and c.lhu are there)

@cnlohr
Owner

cnlohr commented Jun 1, 2023

wait, that's not zcmb, bit 12 is part of an offset.

there is also no c.lb and c.sb in their "documentation" of xw extension.

Wait really?!? Bleh, I feel like I need a guide for all of the opcodes that can be used.

I really want to enable bit timing correction.
