some performance investigation #5

jnk0le · 2023-05-21T17:42:48Z

Regarding your struggle with uncompressed instructions, I did some simple tests with 10 straight-line instructions. The numbers are cycle counts at each flash wait-state (ws) setting:
(I'll put template here once ready)

10x c.nop

0ws: 10
1ws: 10
2ws (invalid in RM): 21

10x big nop

0ws: 10
1ws: 20
2ws (invalid in RM): 41

c.nop + 9x nop

0ws: 11
1ws: 20
2ws (invalid in RM): 41

2x c.nop + 8x nop
0ws: 10
1ws: 18
2ws (invalid in RM): 37

3x c.nop + 7x nop

0ws: 11
1ws: 18
2ws (invalid in RM): 37

5x c.nop + 5x nop

0ws: 10
1ws: 16
2ws (invalid in RM): 33

c.nop + 8x nop + c.nop

0ws: 11
1ws: 18
2ws (invalid in RM): 37

repeating 1x nop then 2x c.nop (10 insn total)

0ws: 10
1ws: 14
2ws (invalid in RM): 29

repeating c.nop, nop (10 insn total)
0ws: 10
1ws: 16
2ws (invalid in RM): 33

10x c.lw (or c.sw) from sram
0ws: 20
1ws: 20
2ws (invalid in RM): 21

10x lw (or sw) from sram
0ws: 20
1ws: 20
2ws (invalid in RM): 41

unaligned lw/sw causes unaligned load/store exception
word unaligned lb/sb doesn't add penalty cycles

	6x c.nop +
	c.slli a2, 1
	c.or a2, a0
	c.addi s1, -1 // # of bits left.
	andi a4, s1, 31 // mask off so we only look at bottom 5 bits

0ws: 11
1ws: 12
2ws (invalid in RM): 25

	5x c.nop +
	c.slli a2, 1
	c.or a2, a0
	c.addi s1, -1 // # of bits left.
	andi a4, s1, 31 // mask off so we only look at bottom 5 bits
	c.nop

0ws: 10
1ws: 12
2ws (invalid in RM): 25

Looks like flash prefetching works, with 4-byte lines.
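One way to read the 1ws numbers for the nop mixes above is as a rough fit (not something confirmed in this thread): the core is limited by whichever is slower, one instruction per cycle or one 4-byte flash line every 2 cycles. The sketch below assumes the sequence starts at a 4-byte-aligned address and contains only single-cycle instructions; the function name is made up for illustration.

```c
/* Hypothetical helper: estimated cycles at 1 wait state for a straight-line
 * run of single-cycle instructions. Assumes a 4-byte-aligned start, 4-byte
 * flash lines and 2 cycles per line fetch. This is a fit to the measurements
 * above, not a documented formula. */
unsigned est_cycles_1ws(unsigned n_compressed, unsigned n_full)
{
    unsigned insns = n_compressed + n_full;
    unsigned bytes = 2u * n_compressed + 4u * n_full;
    unsigned lines = (bytes + 3u) / 4u;    /* round up to whole flash lines */
    unsigned fetch = 2u * lines;           /* assumed 2 cycles per line at 1ws */
    return fetch > insns ? fetch : insns;  /* the slower of fetch and issue wins */
}
```

For example, `est_cycles_1ws(2, 8)` gives 18 and `est_cycles_1ws(5, 5)` gives 16, matching the measured 1ws values for those mixes. Loads and stores are not covered by this estimate.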

Note that you are using 48 MHz with the 1ws config. You can put code in sram (e.g. .section .data.yourfunc, "x") to get 0ws, but then there is potential contention with DMA.


jnk0le · 2023-05-21T17:46:05Z

BTW you were complaining about .align dumping too much padding.

https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_node/as_68.html

For other systems, including the i386 using a.out format, it is the number of low-order zero bits the location counter must have after advancement. For example `.align 3' advances the location counter until it a multiple of 8. If the location counter is already a multiple of 8, no change is needed.

You need to use .balign if you want to specify the alignment in bytes.
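To make the difference concrete, the padding each directive inserts can be worked out by hand. The sketch below assumes the target treats `.align n` as a count of low-order zero bits (as the quoted passage describes) and `.balign n` as a byte count; both function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes of padding needed to advance `loc` to a multiple of `bytes`,
 * which is what `.balign bytes` requests. `bytes` must be a power of two. */
uint32_t balign_padding(uint32_t loc, uint32_t bytes)
{
    return (bytes - (loc & (bytes - 1u))) & (bytes - 1u);
}

/* Padding for `.align n` on targets where n is the number of low-order
 * zero bits, i.e. alignment to 2^n bytes. */
uint32_t align_padding(uint32_t loc, unsigned n)
{
    return balign_padding(loc, (uint32_t)1u << n);
}
```

With the location counter at 0x102, `balign_padding(0x102, 4)` is 2 bytes, while `align_padding(0x102, 4)` is 14 bytes because it aligns to 16.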

jnk0le · 2023-05-21T18:09:56Z

BTW2

W/O HPE: 444ns W/ HPE: 589ns

The "w/ HPE" case could go down by 140 ns (100 ns in the stream 2 case), as the stacking is not necessary (except for s0 and s1).

EDIT: note that IRQ code in sram might incur some penalty with HPE.
You can also use table-free interrupts to bring the (probably) HPE-less case down a bit.

cnlohr · 2023-05-22T00:34:55Z

Thank you for the clarification of the .balign thing. I'll give that a shot in my next livestream.

Perhaps we could make a new, extra docs folder with what you've found, but in a more publicly readable and absorbable format, i.e. with a markdown table, etc.

Also, do you see any interesting stuff surrounding word alignment in your tests? I just regret that I have a hard time absorbing the information above to obtain the deeper understanding of what's really going on inside the chip. I am also going to send this to Macyler who will likely be doing other testing.

jnk0le · 2023-05-22T08:19:43Z

so far:

loads and stores take 2 cycles

flash has 4-byte lines and linear prefetch is working (it could be in the core or in the flash, like CM0 parts do; I'll check later)
therefore, at 1ws, only 16-bit (single-cycle) instructions can execute at full speed

unaligned long instructions seem to (sometimes) have an initial one-cycle penalty and then execute normally

a taken branch (to an aligned location) is 3 cycles at 0ws and 5 cycles at 1ws: 1 extra cycle for finishing the prefetch of the next instruction and another while waiting for the target location

jnk0le · 2023-05-27T19:39:54Z

compressed branching:

0ws is always 3 cycle
1ws:
- branch from an earlier op is 4 cycles (and +1 due to only 1 insn in unaligned location)
- branch from later op is 5 cycles (and +1 due to only 1 insn in unaligned location)

It seems that the linear prefetch is triggered when the 2nd instruction in a bundle gets executed.
A branch also doesn't care about already prefetched instructions (no benefit in short forward branches).
Backward branching behaves exactly the same as forward.

//compressed baseline:

	FLASH->ACTLR = FLASH_ACTLR_LATENCY_0;
	printf("0ws: %lu\n", ch32v_pipetest_tmpl());
	FLASH->ACTLR = FLASH_ACTLR_LATENCY_1;
	printf("1ws: %lu\n", ch32v_pipetest_tmpl()+2);

0ws: 20000
1ws: 22000

//1st op no skip:
0ws: 22000 //3
1ws: 26000 //5

//1st op over one:

	beqz a0, 2f
	nop

2:	nop
3:	nop

0ws: 21000 //-1 3
1ws: 24000 //-1 4

//1st op over two:
0ws: 20000 //-2 3
1ws: 24000 //-2 5

//1st op over three:
0ws: 19000 //-3 3
1ws: 22000 //-3 4

//1st op over five:
0ws: 17000 //-5 3
1ws: 20000 //-5 4

//2nd op no skip:
0ws: 22000 //3
1ws: 26000 //5

//2nd op over one:

	nop
	beqz a0, 3f

2:	nop
3:	nop

0ws: 21000 //-1 3
1ws: 26000 //-1 6

//2nd op over two:
0ws: 20000 //-2 3
1ws: 24000 //-2 5

//2nd op over three:
0ws: 19000 //-3 3
1ws: 24000 //-3 6

//2nd op over five:
0ws: 17000 //-5 3
1ws: 22000 //-5 6

//trim one nop from baseline
0ws: 19000
1ws: 20002

uncompressed

1 cycle penalty for unaligned branch.

//norvc baseline (allbig)
0ws: 20000
1ws: 38000

//norvc noskip
0ws: 22000
1ws: 40000

//norvc over one
0ws: 21000
1ws: 38000

//norvc over two
0ws: 20000
1ws: 36000

//unaligned norvc baseline (1x c.nop at beginning and end)

1: // replace code below
.option rvc
	nop
.option norvc
	nop
	[...]
	nop
.option rvc
	nop

	addi a5, a5, -1
	bnez a5, 1b // 3 cycle taken at 0ws, 5 at 1ws

0ws: 20001
1ws: 36000

//unaligned norvc noskip
0ws: 23000
1ws: 40000

//unaligned norvc over one
0ws: 22000
1ws: 38000

//unaligned norvc over two
0ws: 21000
1ws: 36000

jnk0le · 2023-05-28T22:11:58Z

2-cycle ops seem to swap the timings of branching from the earlier/later op
EDIT: long 1-cycle instructions do not experience this swap

//baseline
0ws: 20000
1ws: 22000

//one lw sram
0ws: 21000
1ws: 22001 //4??

//trim one nop //lw from sram
0ws: 20000
1ws: 22000 //5??

//two lw sram
0ws: 22000
1ws: 24000 //5

//trim one nop //two lw from sram
0ws: 21000
1ws: 22001 //4

//one lw from flash
0ws: 21000
1ws: 24001

//two lw from flash
0ws: 22000
1ws: 28000

//trim one nop //lw from flash
0ws: 20000
1ws: 24000

//trim one nop //two lw from flash
0ws: 21000
1ws: 26001

two loads can be either

	lw a2, 0(a1)
	lw a2, 0(a1)

	nop
	nop

or

	lw a2, 0(a1)
	nop
	
	lw a2, 0(a1)
	nop

or swapped ops in bundles - no difference

jnk0le · 2023-05-29T23:46:09Z

If the prefetcher is pressured enough by long instructions after the 2-cycle ones, the branch timing seems to go back to 4 cycles from the earlier op and 5 from the later op (4e/5l):

	lw a2, 0(a1)
	nop

	lw a2, 0(a1)
	nop

	nop
	nop

.option norvc
	nop
//.option rvc
	nop
.option rvc

//1 lw sram, 1 big nop
0ws: 21000
1ws: 22002 //4 from earlier

//1 lw sram, 1 big nop // trim one nop
0ws: 20000
1ws: 22000 //5 from later

//2 lw sram, 1 big nop
0ws: 22000
1ws: 24000 //e???

//2 lw sram, 1 big nop // trim one nop
0ws: 21000
1ws: 22001 //l???

//1 lw sram, 2 big nop
0ws: 21000
1ws: 24000 //l

//1 lw sram, 2 big nop // trim one nop
0ws: 20000
1ws: 22002 //e

//2 lw sram, 2 big nop
0ws: 22000
1ws: 24000 //l

//2 lw sram, 2 big nop // trim one nop
0ws: 21000
1ws: 22002 //e

//2lw 4big
0ws: 22000
1ws: 26000 //l

//2lw 4big //trim one
0ws: 21000
1ws: 24002 //e

//2lw 3big
0ws: 22000
1ws: 24002 //e

//2lw 3big //trim one
0ws: 21000
1ws: 24000 //l

cnlohr · 2023-05-29T23:52:08Z

I am really sorry, with your syntax, I do not understand what you are trying to say. Please use a different syntax to describe what you are finding? I am not able to extract any info from it :(

jnk0le · 2023-05-30T08:14:06Z

those are the cycle counts, for a given scenario, in this template https://github.com/jnk0le/random/blob/master/pipeline%20cycle%20test/ch32v_pipetest_tmpl.S (1 additional cycle per loop makes 1000 cycles, loop invariant stuff can be filtered out)

for a quick summary:

At 0ws: everything is cycle perfect.
At 1ws: the prefetching is weird enough that one cannot easily predict the execution/branch timings, especially the branch anomalies.
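The conversion from these totals to the per-branch cycle counts noted in the trailing comments of the listings can be sketched as follows. This is one reading of the data, under stated assumptions: the template runs 1000 loop iterations, the branch replaces one single-cycle nop of the baseline, and every skipped instruction would otherwise have cost one cycle. The function names are made up for illustration.

```c
/* Assumed number of loop iterations in the test template. */
#define PIPETEST_ITERATIONS 1000L

/* Round a measured total to whole cycles per iteration, which filters
 * out small loop-invariant offsets such as 20002 or 22001. */
long cycles_per_iteration(long total_cycles)
{
    return (total_cycles + PIPETEST_ITERATIONS / 2) / PIPETEST_ITERATIONS;
}

/* Estimated cost of one taken branch. Assumes the branch replaced one
 * single-cycle nop from the baseline and that each skipped instruction
 * would otherwise have cost one cycle. */
long taken_branch_cycles(long baseline_total, long measured_total, long skipped)
{
    long delta = cycles_per_iteration(measured_total)
               - cycles_per_iteration(baseline_total);
    return delta + 1 + skipped;
}
```

For the compressed 1ws case, `taken_branch_cycles(22000, 24000, 1)` gives 4 and `taken_branch_cycles(22000, 26000, 1)` gives 6, which match the "1st op over one" and "2nd op over one" annotations.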

cnlohr · 2023-05-30T22:07:18Z

Are you on the Discord server? This feels like a largely parallel effort to what Macyler is doing.

jnk0le · 2023-05-30T23:05:53Z

I don't have a Discord account, though it may be possible to see those channels without creating one?

cnlohr · 2023-05-30T23:32:58Z

I am not sure. This is the specific channel https://discord.com/channels/665433554787893289/1110284149450878979

CaiB · 2023-05-30T23:52:40Z

(I'm Macyler) My project is still a mess right now and I haven't documented the results so far, but it's here: https://github.com/CaiB/CH32V003-Architecture-Exploration/tree/main

What I've found so far is that it seems like alignment makes little difference to in-order execution.
Executing the same instruction 3 and 7 times in a row:

From this it looks like you get 2 non-compressed instructions in a row, then any more will slow you down to 2 CPI. lui being a weird exception.
I haven't done any more testing in this direction, currently trying to set up opcode fuzzing to try and find some of the undocumented instructions.

jnk0le · 2023-05-31T17:49:47Z

regarding the compressed loads etc. it should be stuff from Zce v0.50 (or older) https://github.com/riscv/riscv-code-size-reduction/releases/tag/V0.50.1-TOOLCHAIN-DEV

The turnaround of spec to shipped silicon is about right in this case.

> lui being a weird exception.

It could be getting the compressed opcode (c.lui can address all registers except x0 and x2), in which case everything is as expected.

cnlohr · 2023-05-31T18:14:19Z

Do you know that it is? Or just a guess?

Also, seeing C.NOT has me all like

jnk0le · 2023-05-31T18:45:26Z

ok, tried the 0.50 c.lb

	li a0, 0x12345678
	sw a0, 1024(gp)
	addi a1, gp, 1024

	.2byte (0x2002 | (3 << 7) | (0 << 2)) //a1 addr // 0 offset // load to s0
	.2byte (0x2002 | (3 << 7) | (1 << 2) | (1 << 11)) //a1 addr // 1 offset // load to s1

and got result of v0.70 cm.lhu, so that's definitely Zcmb. Not sure about sp ones as those were dropped from 0.70 Zce
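For readers following the raw encodings, the two `.2byte` expressions above can be reproduced with a small helper. This is only a sketch: the field positions are taken directly from the shifts in the snippet, the helper names are made up, and nothing here asserts which extension defines the encoding.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the .2byte expressions above: base bits
 * 0x2002, a 3-bit base-register field at bit 7, a 3-bit destination
 * field at bit 2, plus any extra offset bits passed in as-is. */
uint16_t encode_test_load(unsigned rs1_field, unsigned rd_field, uint16_t offset_bits)
{
    return (uint16_t)(0x2002u | ((rs1_field & 7u) << 7)
                              | ((rd_field & 7u) << 2)
                              | offset_bits);
}

/* 3-bit compressed register fields name x8..x15 (s0, s1, a0..a5). */
unsigned full_reg_number(unsigned field)
{
    return 8u + (field & 7u);
}
```

`encode_test_load(3, 0, 0)` yields 0x2182 and `encode_test_load(3, 1, 1u << 11)` yields 0x2986; field value 3 maps to x11 (a1), and fields 0 and 1 map to x8 (s0) and x9 (s1), as in the comments on the two lines.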

jnk0le · 2023-05-31T18:58:43Z

v0.70 c.not -> illegal instruction
v0.50 c.not -> that's c.lbu of v0.70 Zcmb

cnlohr · 2023-05-31T19:09:18Z

You need to use mmooooorreeeee worrrrrdssssss. I don't have enough background to know what you are referring to.

What is v0.50 c.not? Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?

jnk0le · 2023-05-31T19:15:32Z

> Are you saying that it was overridden, like the instruction as defined in v0.50 actually is c.lbu and that the processor does execute c.lbu as defined in Zcmb?

Yes, it has the same encoding as the already implemented Zcmb instructions (which did load something into a5).

jnk0le · 2023-06-01T20:25:14Z

wait, that's not zcmb, bit 12 is part of an offset.

there is also no c.lb and ~~c.sb~~ c.sh in their "documentation" of xw extension.

E: (c.lbu and c.lhu are there)

cnlohr · 2023-06-01T21:43:07Z

> wait, that's not zcmb, bit 12 is part of an offset.
>
> there is also no c.lb and c.sb in their "documentation" of xw extension.

Wait really?!? Bleh, I feel like I need a guide for all of the opcodes that can be used.

I really want to enable bit timing correction.

duk-37 mentioned this issue May 23, 2023

Better instruction sequence for CRC calculations #7
