counter as the output to, say, an XOR instruction. Or an AND instruction.

Or a multiply instruction.

The ARM7TDMI has six different multiply instructions. The type signatures are:

- u32 = u32 x u32
- u64 = u32 x u32
- i64 = i32 x i32
- u32 = u32 x u32 + u32
- u64 = u32 x u32 + u64
- i64 = i32 x i32 + i64
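
To make these concrete, here's a rough sketch (just an illustration, not something from the manual) of how those six signatures line up with the ARM7TDMI's multiply mnemonics, with parameter names loosely following the usual Rm / Rs / Rn register roles:

```c
#include <stdint.h>

typedef uint32_t u32; typedef uint64_t u64;
typedef int32_t  i32; typedef int64_t  i64;

u32 mul  (u32 rm, u32 rs)          { return rm * rs; }             // MUL:   u32 = u32 x u32
u64 umull(u32 rm, u32 rs)          { return (u64) rm * rs; }       // UMULL: u64 = u32 x u32
i64 smull(i32 rm, i32 rs)          { return (i64) rm * rs; }       // SMULL: i64 = i32 x i32
u32 mla  (u32 rm, u32 rs, u32 rn)  { return rm * rs + rn; }        // MLA:   u32 = u32 x u32 + u32
u64 umlal(u32 rm, u32 rs, u64 acc) { return (u64) rm * rs + acc; } // UMLAL: u64 = u32 x u32 + u64
i64 smlal(i32 rm, i32 rs, i64 acc) { return (i64) rm * rs + acc; } // SMLAL: i64 = i32 x i32 + i64
```
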

Why are we talking about these instructions? Well, the ARM7TDMI's multiply instructions have a pretty interesting side effect. Here the manual says that
after a multiplication instruction executes, the carry and overflow flags are `UNPREDICTABLE`.

<small>A short description of carry and overflow flags after a multiplication instruction from the ARM7TDMI manual. <sup>[[1](#cite1)]</sup></small>

As if anything else in this god forsaken CPU was predictable. What this means is that software cannot and
should not rely on the value of the carry flag after multiplication executes. It can be set to anything. Any

emulate at all. Software doesn't rely on it. And if software _did_ rely on it, the
developers got what was coming to them. But the carry flag is a meme, and it's a really tough puzzle, and
that was motivation enough for me to give it a go. Little did I know it'd take _3 years_ of on and off work.

Now is probably the time to say that this blog post assumes a base level of knowledge - comfort in the C programming language and bitwise math is recommended. Also, if you ever have any questions, any at all, while reading this blog post, feel free to reach out to me [here](https://github.com/bmchtech/blog/discussions).

# Standard Algorithm
What's the simplest, most basic multiplication algorithm you can think of to multiply a <span style="color:#3a7dc9"> **multiplier**</span> with a <span style="color:#DC6A76"> **multiplicand**</span>? One really easy way is to
leverage the distributive property of multiplication like so:
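
In C, a rough sketch of that idea (not how the hardware does it, just the naive shift-and-add) looks something like this:

```c
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

// Naive shift-and-add: the multiplier is a sum of 2^i over its set bits, so
// multiplier * multiplicand is the sum of (multiplicand << i) over those bits.
u64 multiply(u32 multiplier, u32 multiplicand) {
    u64 result = 0;
    for (int i = 0; i < 32; i++) {
        if ((multiplier >> i) & 1) {
            result += (u64) multiplicand << i;
        }
    }
    return result;
}
```
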
For the curious, more information about Booth Recoding can be found in this resource. <sup>[[2](#cite2)]</sup>
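
To give a rough idea of what that recoding does, here's a sketch of the standard radix-4 Booth table: each overlapping 3-bit **chunk** of the multiplier (bits `2i+1`, `2i`, `2i-1`) selects a small multiple of the multiplicand, and the negative multiples are what make the carry handling interesting later on.

```c
// Standard radix-4 (modified) Booth recoding: maps one 3-bit chunk of the
// multiplier to the multiple of the multiplicand used as that chunk's addend.
int booth_multiple(int chunk) {
    switch (chunk & 7) {
        case 0: case 7: return  0; // 000, 111
        case 1: case 2: return +1; // 001, 010
        case 3:         return +2; // 011
        case 4:         return -2; // 100
        case 5: case 6: return -1; // 101, 110
        default:        return  0; // unreachable
    }
}
```
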
# How to Add Stuff ✨ Efficiently ✨
Now that we have the addends, it's time to actually add them up to produce the result. However, using a

determined. Can we eliminate this issue?

Introducing... *drum roll*... carry save adders (CSAs)! These are genius - instead of outputting a single `N-bit` result, CSAs output one `N-bit` result without carry propagation, and one `N-bit` list of carries computed from each bit. At first this seems kind of silly - are CSAs really adding two `N-bit` operands and
producing two `N-bit` results? What's the point? The point is that you can actually fit in an extra operand,
and turn three `N-bit` operands into two `N-bit` results. <sup>[[3](#cite3)]</sup> Like so:

```c
struct CSAOutput {
    u64 output; // the per-bit sums, with no carries propagated
    u64 carry;  // the carries computed from each bit
};
```
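
To make that concrete, here's a minimal sketch of a carry-save addition of three operands, plus how its two outputs would eventually be recombined (the helper names are just illustrative, reusing the `CSAOutput` struct from above):

```c
// Each bit of `output` is the sum of the three input bits with carries ignored;
// each bit of `carry` is the carry generated by that column.
struct CSAOutput perform_csa(u64 a, u64 b, u64 c) {
    struct CSAOutput result;
    result.output = a ^ b ^ c;
    result.carry  = (a & b) | (b & c) | (a & c);
    return result;
}

// To recover the true sum, the carries get added back in, multiplied by two:
u64 resolve(struct CSAOutput csa) {
    return csa.output + csa.carry * 2;
}
```
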
The reason we multiply `carries` by two is because, if we think about how a full adder works, the carry
from bit `i` is added to bit `i + 1` of the addends. So, bit `i` of carries has double the "weight" of bit `i` of
result. This is a **very** important detail that will come in handy later, so do make sure you understand
this.

Using CSAs, the ARM7TDMI can sum up the addends much faster. <sup>[[4, p. 94](#cite4)]</sup>

# Parallelism
Until now, we've mostly treated "generate the addends" and "add the addends" as two separate, entirely
discrete steps of the algorithm. But, turns out, we can do both of these steps _at the same time_. We

results back to the very top of the CSA array for the next cycle. We can initialize two of the inputs of the
CSA array with `0`s. Or, if we want to be clever, we can initialize one
of those two inputs with the accumulate value, and get multiply accumulate for free. This trick is what the
ARM7TDMI employs to do multiply accumulate. (This is a moot point, because the CPU is stupid and can only read two register values at a time per cycle. So, using an accumulate causes the CPU to take
an extra cycle _anyway_). <sup>[[4, p.95](#cite4)]</sup>
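
A tiny sketch of that seeding trick (the names here are illustrative, not the actual implementation):

```c
#include <stdbool.h>

// The two CSA-array inputs that normally take the fed-back partial results have
// nothing to take on the first cycle, so they can be seeded with the accumulate
// operand (or zero for a plain multiply) at no extra cost.
void init_partials(u64 *partial_sum, u64 *partial_carry,
                   u64 accumulate, bool is_accumulate) {
    *partial_sum   = is_accumulate ? accumulate : 0;
    *partial_carry = 0;
}
```
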
# Early Termination

The multiplier does a maximum of 4 cycles of CSA compression, where each cycle `i` processes bits `8 * i` to `8 * i + 7` of the multiplier. If the remaining upper bits of the multiplier are all
zeros, then, we can skip that cycle, since the addends produced will be all zeros, which cannot possibly
affect the values of the partial result + partial carry. We can do the same trick if the remaining upper bits
are all ones (assuming we are performing a signed multiplication), as those also produce addends that
are all zeros. <sup>[[4, p.95](#cite4)]</sup>
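
Here's a rough sketch of that early-termination check, assuming 8 bits of the multiplier are consumed per cycle as described above (names are illustrative):

```c
#include <stdbool.h>

// After `cycles_done` cycles, bits below 8 * cycles_done of the multiplier have
// already been fed into the CSA array. If every remaining bit is zero (or, for
// a signed multiply, every remaining bit is one), the leftover cycles can only
// produce all-zero addends, so the multiplier can stop early.
bool can_terminate_early(u32 multiplier, int cycles_done, bool is_signed) {
    u64 remaining = (u64) multiplier >> (8 * cycles_done);
    u64 all_ones  = (u64) 0xFFFFFFFF >> (8 * cycles_done);
    return remaining == 0 || (is_signed && remaining == all_ones);
}
```
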
# Putting it all together
Here's a rough diagram, provided by Steve Furber in his book, Arm System-On-Chip Architecture:

<small>A high-level overview of the multiplier's organization, from Steve Furber's book, Arm System-On-Chip Architecture. <sup>[[4, p.95](#cite4)]</sup></small>

Partial Sum / Partial Carry contain the results obtained by the CSAs, and are rotated right by 8 on each cycle. Rm is recoded using Booth's algorithm to produce the addends for the CSA array. <sup>[[4, p.95](#cite4)]</sup>

Ok, but remember when I said that there would be an elegant way to handle Booth's negation of the addends? The way the algorithm gets around this is kind of genius. Remember how the carry output of a CSA has to be left shifted by 1? Well, this left-shift creates a zero in the LSB of the carry output of the CSA, so why don't we just put the carry in that bit? <sup>[[5, p. 12](#cite5)]</sup> Like so:

<a name="perform_csa_array"></a>
```c
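// A minimal sketch of the idea rather than the full listing: each Booth addend
// comes with a "negation" carry bit, and the left shift of the CSA carry output
// frees up bit 0, which is exactly where that bit gets placed. This reuses
// `perform_csa` / `CSAOutput` from earlier; the parameter shapes (4 addends per
// cycle, with their negation bits passed alongside) are assumptions made for
// illustration.
struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry,
                                   u64 addends[4], u64 booth_carries[4]) {
    struct CSAOutput csa = { partial_sum, partial_carry };
    for (int i = 0; i < 4; i++) {
        struct CSAOutput result = perform_csa(csa.output, addends[i], csa.carry);
        csa.output = result.output;
        // Left-shifting the carries opens up a zero in bit 0 - put Booth's
        // carry for this addend right there.
        csa.carry = (result.carry << 1) | booth_carries[i];
    }
    // Note: csa.carry comes out already left-shifted here, so the final sum is
    // simply csa.output + csa.carry.
    return csa;
}
```
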

So fast forward about a year, I'm out for a walk and I decide to give this problem another go.

I mean, it's kind of dumb, right? The entire issue is that the <span style="color:#3a7dc9"> **multiplier**</span> is _too big_. Left shifting it would only exacerbate this issue. Congrats, we went from being able to process 7 bits on the first cycle to 6.

But pay attention to the **first addend** that would be produced. The corresponding **chunk** would either be `000` or `100`. Two options, both of which are really easy to compute. This is a **chunk** that would only exist on the first cycle of the algorithm. Coincidentally, if you refer to the diagram up above, you'll notice that, in the first cycle of the algorithm, we have an extra input in the CSA array that we initialized to zero. What if, instead, we initialize it to the addend produced by this mythical **chunk**? <sup>[[5, p. 14](#cite5)]</sup>

It'd solve the issue. It'd get us the extra bit we needed, and make us match the ARM7TDMI's cycle counts completely.

But that's not all. Remember the carry flag from earlier? With this simple change, we go from matching hardware about 50% of the time (no better than randomly guessing) to matching hardware _**85%**_ of the time. This sudden increase was something no other theory had achieved, and it made me really confident that I was on to something. However, this percentage only happens if we set the carry flag to bit `30` of the partial carry result, which seems super arbitrary. It turns out that bit of the partial carry result had a special meaning I did not realize at the time, and I would only find out that meaning much, much later.
# Mathematical Black Magic

It feels like we are finally making some sort of progress. However, my algorithm still failed to calculate the carry flag properly around 15% of the time, and failed way more than that on long / signed multiplies. It was around this time that I found two patents that almost _entirely_ explained the algorithm. No idea how these hadn't been found up until this point, but they were quite illuminating. <sup>[[5](#cite5)], [[6](#cite6)]</sup>

After reading the patents, it turns out my implementation of the CSA array is slightly flawed (see [`perform_csa_array`](#perform_csa_array) above). In particular, that function uses CSAs with a width of _64_ bits. That's way too large and wastes space on the chip - the actual hardware gets away with only using _31_.

Another difference is that my algorithm has no way yet of supporting long accumulate values. Sure, I can initialize the partial output with the accumulate value, but the partial output is only 32 bits wide.

Turns out, the patents describe a way to deal with both of these issues at once, using some mathematical trickery. Pretty much the entire rest of this section is derived from one of the patents. <sup>[[5, pp. 14-17](#cite5)]</sup> This is the hardest part of the algorithm, so hang in there.

Roughly, on each CSA, we want to add three numbers together to produce two numbers. Let's give these five numbers some names. Define `S` to be a 33-bit value (even though the actual S is 32 bits, adding an extra bit allows us to handle both signed and unsigned multiplication) representing the previous CSA's sum, `C` to be a 33-bit value representing the previous CSA's carry, and `S'` and `C'` to be 33-bit values representing the resulting CSA sum / carry. Finally, define `X` to be a 34-bit value containing the current Booth addend. Then we have:

Meaning `C'[32] = !A[2i+35]`.

And with that, we managed to go from using 64 bits of CSA to only 33. <sup>[[5, pp. 14-17](#cite5)]</sup> Our final algorithm for the CSAs is as follows:

```C
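// A condensed sketch rather than the full listing: one 33-bit CSA step, assuming
// `s` and `c` are the previous 33-bit partial sum / carry, `x` is the current
// (already aligned) Booth addend, and `booth_carry` is its negation bit. The
// special handling of the top carry bit (the !A[2i+35] term derived above) and
// the long-accumulate plumbing are left out here.
#define MASK_33 0x1FFFFFFFFull

struct CSAOutput csa_step_33(u64 s, u64 c, u64 x, u64 booth_carry) {
    struct CSAOutput r;
    u64 carries = (s & c) | (c & x) | (s & x);           // per-bit carries
    r.output = (s ^ c ^ x) & MASK_33;                    // carry-free sum, 33 bits wide
    r.carry  = ((carries << 1) | booth_carry) & MASK_33; // Booth's carry in the freed LSB
    return r;
}
```
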

Since `partial_sum` and `partial_carry` are shift registers that get rotated with each cycle, the ARM7TDMI uses its barrel shifter to perform these rotations.
Spoiler alert, the value of the carry flag after a multiply instruction comes from the carryout of this barrel shifter.

So, what rotation values does the ARM7TDMI use? According to one of the patents, for an unsigned multiply, all (1 or 2) uses of the barrel shifter do this. <sup>[[6, p. 9](#cite6)]</sup>

| # Iterations | Type | Rotation |
| - | - | - |
| 3 | ROR | 6 |
| 4 | ROR | 30 |

Signed multiplies differ from unsigned multiplies in their **second** barrel shift. The second one for signed multiplies looks like this. <sup>[[6, p. 9](#cite6)]</sup>

| # Iterations | Type | Rotation |
| - | - | - |

I'm not going to lie, I couldn't make sense of these rotation values. At all. Maybe they were wrong, since the patents already had a couple major errors at this point. No idea. Turns out it doesn't _really_ matter for calculating the carry flag of a multiply instruction. Observe the operation of the ARM7TDMI's `ROR` and `ASR`.

Code from fleroviux's wonderful NanoBoyAdvance. <sup>[[7]](#cite7)</sup>

```c
// Note that in booth's algorithm, the immediate argument will be true, and

if (is_long(flavor)) {
```
Anyway, that's basically it. If you're interested in the full code, take a look [here](https://github.com/zaydlang/multiplication-algorithm/tree/master).
# Works Cited
<aname="cite1"></a>
740
+
[1] “Advanced RISC Machines ARM ARM 7TDMI Data Sheet,” 1995. Accessed: Oct. 21, 2024. [Online]. Available: https://www.dwedit.org/files/ARM7TDMI.pdf
741
+
742
+
<aname="cite2"></a>
743
+
[2] “ASIC Design for Signal Processing,” Geoffknagge.com, 2024. https://www.geoffknagge.com/fyp/booth.shtml
744
+
745
+
<aname="cite3"></a>
746
+
[3] Wikipedia Contributors, “Carry-save adder,” Wikipedia, Sep. 17, 2024. https://en.wikipedia.org/wiki/Carry-save_adder
747
+
748
+
<aname="cite4"></a>
749
+
[4] Furber, Arm System-On-Chip Architecture, 2/E. Pearson Education India, 2001.
750
+
751
+
<aname="cite5"></a>
752
+
[5] D. J. Seal, G. Larri, and D. V. Jaggar, “Data Processing Using Multiply-accumulate Instructions,” Jul. 14, 1994
753
+
754
+
<aname="cite6"></a>
755
+
[6] G. Larri, “Data Processing Method And Apparatus Including Iterative Multiplier,” Mar. 11, 1994

<a name="cite7"></a>
[7] fleroviux, “NanoBoyAdvance,” GitHub. https://github.com/nba-emu/NanoBoyAdvance