Commit 1ccaddc

committed
fixes
1 parent 5e9d496 commit 1ccaddc

File tree

1 file changed

+63
-12
lines changed

content/post/multiply.md

Lines changed: 63 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -12,10 +12,19 @@ counter as the output to, say, an XOR instruction. Or an AND instruction.
1212

1313
Or a multiply instruction.
1414

15-
The ARM7TDMI's multiplication instruction has a pretty interesting side effect. Here the manual says that
15+
The ARM7TDMI has six different multiply instructions. The type signatures are:
16+
- u32 = u32 x u32
17+
- u64 = u32 x u32
18+
- i64 = i32 x i32
19+
- u32 = u32 x u32 + u32
20+
- u64 = u32 x u32 + u64
21+
- i64 = i32 x i32 + i64
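
To make those signatures concrete, here is a minimal C sketch of what each one computes. The function names are the usual ARM mnemonics (`MUL`, `UMULL`, `SMULL`, `MLA`, `UMLAL`, `SMLAL`), matched to the list in order; that pairing and the `u32` / `i64` style typedefs are my own assumptions for illustration, and nothing here models flags or timing.

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;
typedef int32_t  i32;
typedef int64_t  i64;

// u32 = u32 x u32: only the low 32 bits of the product are kept.
u32 mul(u32 rm, u32 rs) { return rm * rs; }

// u64 = u32 x u32: widen first so the full 64-bit product survives.
u64 umull(u32 rm, u32 rs) { return (u64)rm * rs; }

// i64 = i32 x i32: same idea, but the operands are sign extended.
i64 smull(i32 rm, i32 rs) { return (i64)rm * rs; }

// u32 = u32 x u32 + u32: multiply accumulate, truncated to 32 bits.
u32 mla(u32 rm, u32 rs, u32 rn) { return rm * rs + rn; }

// u64 = u32 x u32 + u64: long multiply accumulate (wraps modulo 2^64).
u64 umlal(u32 rm, u32 rs, u64 acc) { return (u64)rm * rs + acc; }

// i64 = i32 x i32 + i64: the add is done in unsigned arithmetic so that
// wraparound is well defined in C, then reinterpreted as signed.
i64 smlal(i32 rm, i32 rs, i64 acc) {
    return (i64)((u64)((i64)rm * rs) + (u64)acc);
}
```

For example, `umull(0xFFFFFFFF, 0xFFFFFFFF)` keeps the full 64-bit product, while `mul` with the same operands would only return its low half.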
22+
23+
Why are we talking about these instructions? Well, the ARM7TDMI's multiplication instructions have a pretty interesting side effect. The manual says that
1624
after a multiplication instruction executes, the carry and overflow flags are `UNPREDICTABLE`.
1725

1826
![An image of the ARM7TDMI manual explaining that the carry and overflow flags are `UNPREDICTABLE` after a multiply instruction.](/manual.png)
27+
<small>A short description of carry and overflow flags after a multiplication instruction from the ARM7TDMI manual. <sup>[[1](#cite1)]</sup></small>
1928

2029
As if anything else in this godforsaken CPU was predictable. What this means is that software cannot and
2130
should not rely on the value of the carry flag after multiplication executes. It can be set to anything. Any
@@ -32,6 +41,12 @@ emulate at all. Software doesn't rely on it. And if software _did_ rely on it, t
3241
developers got what was coming to them. But the carry flag is a meme, and it's a really tough puzzle, and
3342
that was motivation enough for me to give it a go. Little did I know it'd take _3 years_ of on and off work.
3443

44+
<<<<<<< HEAD
45+
=======
46+
Now is probably the time to say that this blog post assumes a base level of knowledge - comfort in the C programming language and bitwise math is recommended. Also, if you ever have any questions, any at all, while reading this blog post, feel free to reach out to me [here](
47+
https://github.com/bmchtech/blog/discussions).
48+
49+
>>>>>>> ab429a6 (fixes)
3550
# Standard Algorithm
3651
What's the simplest, most basic multiplication algorithm you can think of to multiply a <span style="color:#3a7dc9"> **multiplier**</span> with a <span style="color:#DC6A76"> **multiplicand**</span>? One really easy way is to
3752
leverage the distributive property of multiplication like so:
@@ -151,6 +166,7 @@ struct BoothRecodingOutput booth_recode(u64 input, BoothChunk booth_chunk) {
151166
}
152167
}
153168
```
169+
For the curious, more information about Booth Recoding can be found in this resource. <sup>[[2](#cite2)]</sup>
154170
155171
# How to Add Stuff ✨ Efficiently ✨
156172
Now that we have the addends, it's time to actually add them up to produce the result. However, using a
@@ -161,7 +177,7 @@ determined. Can we eliminate this issue?
161177
162178
Introducing... *drum roll*... carry save adders (CSAs)! These are genius - instead of outputting a single `N-bit` result, CSAs output one `N-bit` result without carry propagation, and one `N-bit` list of carries computed from each bit. At first this seems kind of silly - are CSAs really adding two `N-bit` operands and
163179
producing two `N-bit` results? What's the point? The point is that you can actually fit in an extra operand,
164-
and turn three `N-bit` operands into two `N-bit` results. Like so:
180+
and turn three `N-bit` operands into two `N-bit` results. <sup>[[3](#cite3)]</sup> Like so:
165181
```c
166182
struct CSAOutput {
167183
u64 output;
@@ -196,6 +212,12 @@ The reason we multiply `carries` by two is because, if we think about how a full
196212
from bit `i` is added to bits `i + 1` of the addends. So, bit `i` of carries has double the "weight" of bit `i` of
197213
result. This is a **very** important detail that will come in handy later, so do make sure you understand
198214
this.
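
If the "double the weight" bit is hard to picture, here is a small self-contained check. `csa_sketch` and `struct CsaSketchOutput` are hypothetical names made up for this example (they are not the functions used elsewhere in this post); the sketch just assumes a CSA is a bitwise XOR for the sum and a bitwise majority for the carries.

```c
#include <stdint.h>

typedef uint64_t u64;

struct CsaSketchOutput {
    u64 output; // per-bit sum, with no carry propagation
    u64 carry;  // per-bit carry out, NOT yet shifted
};

// Hypothetical helper: compress three operands into two.
struct CsaSketchOutput csa_sketch(u64 x, u64 y, u64 z) {
    struct CsaSketchOutput result;
    result.output = x ^ y ^ z;                   // sum bit of a full adder
    result.carry  = (x & y) | (y & z) | (z & x); // carry bit of a full adder
    return result;
}

// The invariant: x + y + z == output + 2 * carry (modulo 2^64).
// Returns 1 when it holds for the given operands.
int csa_invariant_holds(u64 x, u64 y, u64 z) {
    struct CsaSketchOutput r = csa_sketch(x, y, z);
    return x + y + z == r.output + 2 * r.carry;
}
```

With `x = 5`, `y = 3`, `z = 6`, every bit position produces a carry and no sum bit, so `output` is `0`, `carry` is `7`, and `0 + 2 * 7` gives back `14`.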
215+
<<<<<<< HEAD
216+
=======
217+
218+
Using CSAs, the ARM7TDMI can sum the addends much faster. <sup>[[4, p. 94](#cite4)]</sup>
219+
220+
>>>>>>> ab429a6 (fixes)
199221
# Parallelism
200222
Until now, we've mostly treated "generate the addends" and "add the addends" as two separate, entirely
201223
discrete steps of the algorithm. But, turns out, we can do both of these steps _at the same time_. We
@@ -208,7 +230,7 @@ results back to the very top of the CSA array for the next cycle. We can initial
208230
CSA array with `0`s. Or, if we want to be clever, we can implement multiply accumulate by initializing one
209231
of those two inputs with the accumulate value, and get multiply accumulate for free. This trick is what the
210232
ARM7TDMI employs to do multiply accumulate. (This is a moot point, because the CPU is stupid and can only read two register values at a time per cycle. So, using an accumulate causes the CPU to take
211-
an extra cycle _anyway_).
233+
an extra cycle _anyway_). <sup>[[4, p.95](#cite4)]</sup>
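
Here is a rough sketch of why the accumulate comes for free. To keep it short it assumes plain shift-and-add addends (one per multiplier bit) instead of Booth recoding, and handles one addend per loop iteration rather than four per cycle, so it is not the ARM7TDMI's actual datapath. `mla_via_csa` is a hypothetical name for this example only.

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

// Multiply accumulate using only carry save additions, plus a single
// ordinary add at the very end.
u64 mla_via_csa(u32 multiplier, u32 multiplicand, u64 accumulate) {
    u64 sum   = accumulate; // seeding the partial sum is the whole trick
    u64 carry = 0;          // partial carry, already shifted left by one

    for (int i = 0; i < 32; i++) {
        // Addend for bit i of the multiplier: multiplicand << i, or zero.
        u64 addend = ((multiplier >> i) & 1) ? ((u64)multiplicand << i) : 0;

        // One CSA step: three inputs in, two outputs out.
        u64 new_sum   = sum ^ carry ^ addend;
        u64 new_carry = ((sum & carry) | (carry & addend) | (addend & sum)) << 1;

        sum   = new_sum;
        carry = new_carry;
    }

    // The only carry-propagating addition happens here.
    return sum + carry;
}
```

Passing `0` as `accumulate` gives a plain multiply, which is the "initialize with `0`s" case described above.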
212234
213235
214236
# Early Termination
@@ -218,16 +240,22 @@ cycles of CSA compression, where each cycle `i` processes bits `8 * i` to `8 * i
218240
zeros, then, we can skip that cycle, since the addends produced will be all zeros, which cannot possibly
219241
affect the values of the partial result + partial carry. We can do the same trick if the remaining upper bits
220242
are all ones (assuming we are performing a signed multiplication), as those also produce addends that
243+
<<<<<<< HEAD
221244
are all zeros.
245+
=======
246+
are all zeros. <sup>[[4, p.95](#cite4)]</sup>
247+
248+
>>>>>>> ab429a6 (fixes)
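
As a sketch of the early termination rule exactly as stated above (8 multiplier bits per cycle, stop once the remaining upper bits are all zeros, or all ones for a signed multiply), something like the following would count the cycles. `multiply_cycles` is a hypothetical helper for illustration; it only restates the rule from this section and is not the final algorithm.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t u32;

// Returns how many CSA cycles (1 to 4) are needed before the remaining
// upper bits of the multiplier can no longer affect the result.
int multiply_cycles(u32 multiplier, bool is_signed) {
    for (int cycles = 1; cycles < 4; cycles++) {
        u32 upper    = multiplier >> (8 * cycles);  // bits not yet processed
        u32 all_ones = 0xFFFFFFFFu >> (8 * cycles); // same width, every bit set

        if (upper == 0) return cycles;                      // only zero addends left
        if (is_signed && upper == all_ones) return cycles;  // only sign bits left
    }
    return 4; // no early termination possible
}
```

For instance, a multiplier of `0xFFFFFF00` terminates after one cycle when treated as signed, but needs all four cycles when treated as unsigned.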
222249
# Putting it all together
223250
224251
Here's a rough diagram, provided by Steve Furber in his book, Arm System-On-Chip Architecture:
225252
226253
![An image of the high level overview of the multiplier's organization, provided by Steve Furber in his book, Arm System-On-Chip Architecture](/booth.png)
254+
<small> An image of the high level overview of the multiplier's organization, provided by Steve Furber in his book, Arm System-On-Chip Architecture. <sup>[[4, p.95](#cite4)]</sup> </small>
227255
228-
Partial Sum / Partial Carry contain the results obtained by the CSAs, and are rotated right by 8 on each cycle. Rm is recoded using booth's algorithm to produce the addends for the CSA array.
256+
Partial Sum / Partial Carry contain the results obtained by the CSAs, and are rotated right by 8 on each cycle. Rm is recoded using Booth's algorithm to produce the addends for the CSA array. <sup>[[4, p.95](#cite4)]</sup>
229257
230-
Ok, but remember when I said (make sure I said this) that there will be an elegant way to handle booth's negation of the addends? The way the algorithm gets around this is kind of genius. Remember how the carry output of a CSA has to be left shifted by 1? Well, this left-shift creates a zero in the LSB of the carry output of the CSA, so why don't we just put the carry in that bit? Like so:
258+
Ok, but remember when I said (make sure I said this) that there will be an elegant way to handle booth's negation of the addends? The way the algorithm gets around this is kind of genius. Remember how the carry output of a CSA has to be left shifted by 1? Well, this left-shift creates a zero in the LSB of the carry output of the CSA, so why don't we just put the carry in that bit? <sup>[[5, p. 12](#cite5)]</sup> Like so:
231259
<a name="perform_csa_array"></a>
232260
233261
```c
@@ -330,22 +358,22 @@ So fast forward about a year, I'm out for a walk and I decide to give this probl
330358

331359
I mean, it's kind of dumb, right? The entire issue is that the <span style="color:#3a7dc9"> **multiplier**</span> is _too big_. Left shifting it would only exacerbate this issue. Congrats, we went from being able to process 7 bits on the first cycle to 6.
332360

333-
But pay attention to the **first addend** that would be produced. The corresponding **chunk** would either be `000` or `100`. Two options, both of which are really easy to compute. This is a **chunk** that would only exist on the first cycle of the algorithm. Coincidentally, if you refer to the diagram[have actual link or figure #] up above, you'll notice that, in the first cycle of the algorithm, we have an extra input in the CSA array that we initialized to zero. What if, instead, we initialize it to the addend produced by this mythical **chunk**?
361+
But pay attention to the **first addend** that would be produced. The corresponding **chunk** would either be `000` or `100`. Two options, both of which are really easy to compute. This is a **chunk** that would only exist on the first cycle of the algorithm. Coincidentally, if you refer to the diagram[have actual link or figure #] up above, you'll notice that, in the first cycle of the algorithm, we have an extra input in the CSA array that we initialized to zero. What if, instead, we initialize it to the addend produced by this mythical **chunk**? <sup>[[5, p. 14](#cite5)]</sup>
334362

335363
It'd solve the issue. It'd get us the extra bit we needed, and make us match the ARM7TDMI's cycle counts completely.
336364

337365
But that's not all. Remember the carry flag from earlier? With this simple change, we go from matching hardware about 50% of the time (no better than randomly guessing) to matching hardware _**85%**_ of the time. This sudden increase was something no other theory was able to do, and made me really confident that I was on to something. However, this percentage only happens if we set the carry flag to bit `30` of the partial carry result, which seems super arbitrary. It turns out that bit of the partial carry result had a special meaning I did not realize at the time, and I would only find out that meaning much, much later.
338366

339367
# Mathematical Black Magic
340368

341-
It feels like we are finally making some sort of progress, however my algorithm still failed to calculate the carry flag properly around 15% of the time, and failed way more than that on long / signed multiplies. It was around this time that I found two patents [link later] that almost _entirely_ explained the algorithm. No idea how these hadn't been found up until this point, but they were quite illuminating.
369+
It feels like we are finally making some sort of progress. However, my algorithm still failed to calculate the carry flag properly around 15% of the time, and failed way more than that on long / signed multiplies. It was around this time that I found two patents that almost _entirely_ explained the algorithm. No idea how these hadn't been found up until this point, but they were quite illuminating. <sup>[[5](#cite5)], [[6](#cite6)]</sup>
342370

343371
After reading the patents, it turns out my implementation of the CSA array is slightly flawed (see [`perform_csa_array`](#perform_csa_array) above). In particular, that function uses CSAs with a width of _64_ bits. That's way too large and wastes space on the chip - the actual hardware gets away with only using _31_.
344372

345373
Another difference is that my algorithm has no way yet of supporting long accumulate values. Sure, I can initialize the partial output with the accumulate value, but the partial output is only 32 bits wide.
346374

347375

348-
Turns out, the patents describe a way to deal with both of these issues at once, using some mathematical trickery. This is the hardest part of the algorithm, so hang in there. (cite)
376+
Turns out, the patents describe a way to deal with both of these issues at once, using some mathematical trickery. Pretty much the entire rest of this section is derived from [5, pp. 14-17]. This is the hardest part of the algorithm, so hang in there.
349377

350378
Roughly, on each CSA, we want to add three numbers together to produce two numbers. Let's give these five numbers some names. Define `S` to be a 33-bit value (even though the actual S is 32 bits, adding an extra bit allows us to handle both signed and unsigned multiplication) representing the previous CSA's sum, `C` to be a 33-bit value representing the previous CSA's carry, and `S'` and `C'` to be 33-bit values representing the resulting CSA sum / carry. Finally, define `X` to be a 34-bit value containing the current Booth addend. Then we have:
351379

@@ -449,7 +477,7 @@ Meaning `C'[32] = !A[2i+35]`.
449477

450478

451479

452-
And with that, we managed to go from using 64 bits of CSA, to only 33. Our final algorithm for the CSAs is as follows:
480+
And with that, we managed to go from using 64 bits of CSA, to only 33. [5, pp. 14-17] Our final algorithm for the CSAs is as follows:
453481

454482

455483
```C
@@ -556,7 +584,7 @@ Since `partial_sum` and `partial_carry` are shift registers that get rotated wit
556584

557585
Spoiler alert, the value of the carry flag after a multiply instruction comes from the carryout of this barrel shifter.
558586

559-
So, what rotation values does the ARM7TDMI use? According to the patents, for an unsigned multiply, all (1 or 2) uses of the barrel shifter do:
587+
So, what rotation values does the ARM7TDMI use? According to one of the patents, for an unsigned multiply, all (1 or 2) uses of the barrel shifter do this. <sup>[[6, p. 9](#cite6)]</sup>
560588

561589
| # Iterations | Type | Rotation |
562590
| - | - | - |
@@ -565,7 +593,7 @@ So, what rotation values does the ARM7TDMI use? According to the patents, for an
565593
| 3 |ROR|6 |
566594
| 4 |ROR|30 |
567595

568-
Signed multiplies differ from unsigned multiplies in their **second** barrel shift. The second one for signed multiplies looks like this:
596+
Signed multiplies differ from unsigned multiplies in their **second** barrel shift. The second one for signed multiplies looks like this. <sup>[[6, p. 9](#cite6)]</sup>
569597

570598
| # Iterations | Type | Rotation |
571599
| - | - | - |
@@ -576,7 +604,7 @@ Signed multiplies differ from unsigned multiplies in their **second** barrel shi
576604

577605
I'm not going to lie, I couldn't make sense of these rotation values. At all. Maybe they were wrong, since the patents already had a couple of major errors at this point. No idea. Turns out it doesn't _really_ matter for calculating the carry flag of a multiply instruction. Observe the operation of the ARM7TDMI's `ROR` and `ASR`.
578606

579-
Code from fleroviux's NanoBoyAdvance:
607+
Code from fleroviux's wonderful NanoBoyAdvance. <sup>[[7](#cite7)]</sup>
580608
```C++
581609
void ROR(u32& operand, u8 amount, int& carry, bool immediate) {
582610
// Note that in booth's algorithm, the immediate argument will be true, and
@@ -705,3 +733,26 @@ if (is_long(flavor)) {
705733
```
706734

707735
Anyway, that's basically it. If you're interested in the full code, take a look [here](https://github.com/zaydlang/multiplication-algorithm/tree/master).
736+
737+
# Works Cited
738+
739+
<a name="cite1"></a>
740+
[1] “Advanced RISC Machines ARM ARM 7TDMI Data Sheet,” 1995. Accessed: Oct. 21, 2024. [Online]. Available: https://www.dwedit.org/files/ARM7TDMI.pdf
741+
742+
<a name="cite2"></a>
743+
[2] “ASIC Design for Signal Processing,” Geoffknagge.com, 2024. https://www.geoffknagge.com/fyp/booth.shtml
744+
745+
<a name="cite3"></a>
746+
[3] Wikipedia Contributors, “Carry-save adder,” Wikipedia, Sep. 17, 2024. https://en.wikipedia.org/wiki/Carry-save_adder
747+
748+
<a name="cite4"></a>
749+
[4] S. Furber, Arm System-On-Chip Architecture, 2nd ed. Pearson Education India, 2001.
750+
751+
<a name="cite5"></a>
752+
[5] D. J. Seal, G. Larri, and D. V. Jaggar, “Data Processing Using Multiply-accumulate Instructions,” Jul. 14, 1994
753+
754+
<a name="cite6"></a>
755+
[6] G. Larri, “Data Processing Method And Apparatus Including Iterative Multiplier,” Mar. 11, 1994
756+
757+
<a name="cite7"></a>
758+
[7] fleroviux. "NanoBoyAdvance." GitHub. Available: https://github.com/nba-emu/NanoBoyAdvance.
