
Support grad backprop when add/sub use broadcast #87

Open

nogginly wants to merge 3 commits into crystal-data:master from nogginly:grad_add_sub_broadcast

Conversation

@nogginly
Contributor

@christopherzimmerman, this fixes gradient backpropagation for addition/subtraction when broadcasting occurs because one operand has a different rank than the other.

  • Modified the Add/Sub gates to subclass TwoOpGate
  • Made TwoOpGate an abstract class with an abstract #backward method (since every subclass has to implement it)
  • Modified the +/- operator definitions to pass in the operands
  • Modified the add/sub backward methods to accept the operands
  • Added a private convenience method to backprop the add/sub gradient
    • I originally had a faster path for handling scalar add/sub, but had to remove it until Tensor#sum has an OCL implementation.
    • I'm not sure whether there's a faster way to implement the rank-by-rank sum that I'm using iteratively; let me know if there is (see the sketch after this list).
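
For illustration only (the Crystal changes themselves aren't shown in this thread), here is a NumPy sketch of the kind of reduction that convenience method performs: sum the broadcast gradient over the extra leading axes, then keep-dims sum over any axes that were size 1 in the original operand. The helper name `reduce_grad_to_shape` is made up for the example.

```python
import numpy as np

def reduce_grad_to_shape(grad: np.ndarray, shape: tuple) -> np.ndarray:
    """Hypothetical helper: collapse a broadcast gradient back to `shape`."""
    # Sum away the extra leading axes added by broadcasting.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum (keepdims) over axes that were broadcast from size 1.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# Example: c = a + b with a.shape == (2, 3) and b.shape == (3,).
# The upstream gradient has shape (2, 3); b's gradient must come back as (3,).
upstream = np.ones((2, 3))
print(reduce_grad_to_shape(upstream, (3,)))   # => [2. 2. 2.]
```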

I've included tests, which are based on results I get from PyTorch.

@christopherzimmerman
Member

@nogginly thanks for the PR. I'm going to need a bit longer to review this, since when I implemented this I made some design decisions around broadcasted operations that differ from other libraries.

I always set the gradient to match the shape of the value that was actually used in the calculation, so in this case, since the broadcast happens during the operation, the gradient will match that shape. Are you saying that PyTorch aggregates that back down to match the dimensions of the initial variable, before the operation?

@christopherzimmerman
Member

Also, for the rank-by-rank sum, there is a "view_along_axis" iterator in some version of this code that gives a view into multiple axes; it can probably be used to reduce multiple rank sums at once. I'll look for it.
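
This isn't view_along_axis itself, just the shape of the idea in NumPy terms: collapsing all of the extra broadcast axes in one reduction, rather than looping rank by rank, gives the same result.

```python
import numpy as np

# Broadcast example: the operand had shape (4,), the gradient arrived as (2, 3, 4).
grad = np.ones((2, 3, 4))

# Iterative rank-by-rank reduction (roughly what the PR does now).
step = grad
while step.ndim > 1:
    step = step.sum(axis=0)

# Single-pass reduction over all extra leading axes at once.
once = grad.sum(axis=(0, 1))

assert np.array_equal(step, once)   # both are shape (4,) with value 6.0
```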

@nogginly
Contributor Author

> I always set the gradient to match the shape of the value that was actually used in the calculation, so in this case, since the broadcast happens during the operation, the gradient will match that shape. Are you saying that PyTorch aggregates that back down to match the dimensions of the initial variable, before the operation?

Hello @christopherzimmerman. Yes, that is correct; I wrote a simple test using PyTorch and got exactly that (a sketch of the check is below). I ran into this while implementing a two-layer MLP: when I tried to update the biases using the gradient, the shapes were off and matmul failed, which is how I discovered this. The tests I put in are based on those PyTorch test cases.
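
Roughly the check I ran (not the PR's actual test file, just a minimal PyTorch version of the behaviour being described): the gradient of the lower-rank operand comes back in that operand's shape, summed over the broadcast axis.

```python
import torch

a = torch.ones(2, 3, requires_grad=True)   # rank-2 operand
b = torch.ones(3, requires_grad=True)      # rank-1 operand, broadcast over dim 0

c = a + b                                  # c has shape (2, 3)
c.sum().backward()

print(a.grad.shape)   # torch.Size([2, 3])
print(b.grad.shape)   # torch.Size([3]) -- aggregated back down, not (2, 3)
print(b.grad)         # tensor([2., 2., 2.])
```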
