Issues with implementation #2

Closed
felipemello1 opened this issue Feb 17, 2022 · 4 comments

felipemello1 commented Feb 17, 2022

Hi all, I tried to implement the info loss in my own GNN. I am using a custom convolution on a custom dataset that might have leakage, so that might be the source of error, but I am still trying to understand why the model behaves the way it does. I would appreciate any ideas/feedback.

My model does link prediction on small subgraphs: for each edge I want to predict, I sample a subgraph around it.

I am implementing the info_loss just like in your code:
info_loss = (edge_att * torch.log(edge_att/r + 1e-6) + (1-edge_att) * torch.log((1-edge_att)/(1-r+1e-6) + 1e-6)).mean()
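
As I understand it, this term is the KL divergence between Bern(edge_att) and Bern(r), averaged over edges. Wrapped as a function, it would look roughly like the sketch below (the name info_loss_fn is mine, just for illustration):

```python
import torch

def info_loss_fn(edge_att: torch.Tensor, r: float, eps: float = 1e-6) -> torch.Tensor:
    """KL divergence between Bern(edge_att) and Bern(r), averaged over edges.

    edge_att: per-edge keep probabilities p in (0, 1); r: the target keep ratio.
    """
    kl = edge_att * torch.log(edge_att / r + eps) \
        + (1 - edge_att) * torch.log((1 - edge_att) / (1 - r + eps) + eps)
    return kl.mean()
```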

If I don't use any sort of info loss, when I train my model, my edge attention looks like this:
[histogram of edge attention without any info loss]

If I use an L1 loss (just minimizing edge_att.mean()), my edge attention looks like this:
[histogram of edge attention with the L1 loss]

If I use the L1 loss but multiply it by 1e-3, it looks like this:
[histogram of edge attention with the L1 loss scaled by 1e-3]

However, if I use the info loss proposed in your paper, my edge attention clusters around values close to r. For example, for r = 0.3, I get the attention distribution below. If I use r = 0.5, the dense part of the histogram moves to the middle, and if I use something like r = 0.7 or r = 0.9, all my attention weights end up closer to 1.
[histogram of edge attention with the info loss, r = 0.3]

I tried to understand the intuition behind it by plotting the curve of info_loss vs. edge attention for different values of r:
[plots of info_loss vs. edge attention for different values of r]
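
For reference, such curves can be reproduced with something like the snippet below (numpy/matplotlib; the small epsilon terms are dropped for clarity):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-3, 1 - 1e-3, 500)  # candidate edge attention values
for r in (0.3, 0.5, 0.7):
    # KL(Bern(p) || Bern(r)), epsilons dropped for clarity
    kl = p * np.log(p / r) + (1 - p) * np.log((1 - p) / (1 - r))
    plt.plot(p, kl, label=f"r = {r}")
plt.xlabel("edge_att (p)")
plt.ylabel("info_loss")
plt.legend()
plt.show()
```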

So basically the info_loss is approximately zero when the attention is close to r, and positive everywhere else. This forces my model to keep the attention always close to r (which I am not sure I understand why), and apparently that is exactly what my model is doing. What confuses me is that your paper recommends r between 0.5 and 0.9; however, in my current setting, this forces the majority of my edge attention weights to be > 0.5, instead of making them sparse.

I wonder if I am doing something wrong, if info_loss should have a smaller weight, or if my concrete_sampler should have a higher temperature to force a Bernoulli-like distribution. Or maybe my model simply doesn't really need the edges and is fine with any edge attention value, finding a way to reach the same solution from the node embeddings alone, for example, without message passing. Or maybe I have too much dropout during training? (I do both node and edge dropout.)

Please let me know if you have any ideas. Thanks in advance!


siqim commented Feb 17, 2022

Hi Felipe, thanks for your questions! They're actually very good questions and I'll try to resolve them:

  1. We do not recommend using an L1-norm loss on p, as it encourages too much sparsity and can easily cause the model to collapse. In Figure 7 of our paper, we observe similar behavior to what you describe: if we replace the info_loss with the L1 loss, the model becomes much more sensitive to the penalty coefficient, and when the coefficient is not well tuned, it collapses (learning p to be all zeros) early in training.

  2. Yes, the info_loss will push attention weights (denoted as p in Figure 1 of our paper) towards r, as info_loss is the KL divergence between Bern(p) and Bern(r), which is zero only when p = r. However, since p is the probability of keeping the corresponding edge during training, the other loss term (cross-entropy) pushes the model to learn a larger p for edges that are critical for its predictions; otherwise those critical edges would have too high a probability of being dropped, which would result in a large cross-entropy loss and hurt model predictions. So it is expected that many of the learned p values are close to r (those edges do not affect model predictions much), while the critical edges should have p greater than r to keep the cross-entropy loss small. The ranking of p then represents the importance of edges.

  3. GSAT doesn't encourage generating sparse subgraphs. We find r = 0.7 generally works very well for all datasets in our experiments, which means roughly 70% of edges are kept during training (still fairly dense). This is because GSAT does not try to provide interpretability by finding a small/sparse subgraph of the original input graph (which is what previous literature does); instead, it provides interpretability by pushing the critical edges to have relatively lower stochasticity during training.

  4. When r = 0.7, attention weights for non-critical edges should be around 0.7, and those for critical edges should be relatively larger. If, in this case, all your attention weights are close to 1, then probably (1) all your edges are important and dropping any of them results in a large cross-entropy loss; or (2) your cross-entropy loss is so large that the info_loss does not impose enough penalty, in which case you may need a smaller r, say 0.5. But there might be other reasons depending on your specific settings. You could tune r over {0.5, 0.7, 0.9} based on your validation prediction performance.

  5. If by sparsity you mean edge attention being 0 or 1 (the Bernoulli-like distribution you mentioned), then yes, the majority of edge attention values will not be discrete at test time. This is because the sampling procedure (Gumbel-softmax) is only used during training. So, during training you should see the sampled attention (denoted as \alpha in Figure 1) being (roughly) discrete; but at test time \alpha = p, p is close to r, and r > 0.5, hence the majority of p (or \alpha) should be > 0.5 (a small sampler sketch follows this list).

  6. If you don't use info_loss, this corresponds to the case \beta = 0 in Table 4 of our paper, which means we don't control the values of the learned p at all. GSAT may still work and provide interpretability and generalizability in this case, because the general idea of GSAT still applies. However, our ablation study shows that this setting may suffer from initialization problems and produce results with high variance across random seeds: with no regularization on p, bad network initializations can easily lead to trivial solutions, e.g. learning p to be all ones or learning large p for non-critical edges.

  7. We find the penalty coefficient of info_loss (denoted as \beta in our paper) does not need to be tuned extensively to get good results, so we use 1/|E| for all datasets in our experiments (the .mean() does this), which provides good enough results. That said, we do observe that tuning \beta can yield even better performance, but it may take a lot of extra effort. We did not try to tune the temperature in the Gumbel-softmax trick, so it is possible that tuning it gives better results.

  8. We did not use node/edge dropout in our experiments. My intuition is that GSAT is already dropping nodes/edges for us, so it may conflict with such regularization methods.
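
To make point 5 more concrete, here is a minimal sketch of a binary-concrete (Gumbel-sigmoid) sampler of the kind described above. It is an illustration under stated assumptions (per-edge logits and a fixed temperature), not necessarily the exact implementation in our repo:

```python
import torch

def sample_edge_att(att_logits: torch.Tensor, temp: float, training: bool) -> torch.Tensor:
    """Binary-concrete relaxation of per-edge Bernoulli sampling.

    att_logits: per-edge logits, so p = sigmoid(att_logits).
    Training: add Logistic noise and squash with a temperature -> roughly discrete alpha.
    Testing: no sampling, simply return alpha = p.
    """
    if training:
        u = torch.empty_like(att_logits).uniform_(1e-10, 1 - 1e-10)
        logistic_noise = torch.log(u) - torch.log(1 - u)
        return torch.sigmoid((att_logits + logistic_noise) / temp)
    return torch.sigmoid(att_logits)
```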

I hope this resolves your questions, but if you have any others, please feel free to let us know!


felipemello1 commented Feb 17, 2022

Thanks for such a helpful, detailed response! The algorithm gets more beautiful the more you think about it. I guess I was trying to think about GSAT in terms of increasing graph sparsity, which is obviously not the point.

If you don't mind me asking: if I had to extract a graph (nodes, edges) with the smallest possible number of relevant edges, what would you recommend? I can see two simple options:

  1. A threshold, for example edge_att > 0.9;
  2. Top-k, for example, the N highest scores.

But I am afraid these heuristics might be a bit arbitrary. Please let me know if you recommend any particular approach, and, again, thank you so much!


siqim commented Feb 17, 2022

As the learned p will be around r, what we do is normalize p to [0, 1] and use the normalized p as the transparency of edges when plotting a graph. See line 96 and line 126 here.

If you had to extract a subgraph, I think the options you mentioned make sense and I would do the same thing. Currently, we don't have a very good way to directly output a critical subgraph, as GSAT only tells us the relative importance of edges. You could try the following (a small extraction sketch follows this list):

  1. First normalize p to [0, 1]; then a good threshold is easier to find. You would probably need to try several thresholds, say from [0.8, 1.0], and see which one produces the most plausible subgraph. This approach needs some prior knowledge about the dataset, though;
  2. If you don't have prior knowledge about the dataset, another way could be to try different thresholds to extract subgraphs and train new models only on these extracted subgraphs. Then, based on the results of those new models, choose the threshold that gives you the best validation performance. This approach sounds intuitive to me, but we haven't tried it and it may require a lot more computation;
  3. If you have a budget (a fixed number of edges to keep), top-k selection is perfect.
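
A minimal sketch of the normalization plus the two extraction heuristics, assuming a PyG-style edge_index of shape [2, num_edges]; the function names are just for illustration:

```python
import torch

def normalize_att(edge_att: torch.Tensor) -> torch.Tensor:
    """Min-max normalize edge attention to [0, 1] (e.g. for plotting transparency)."""
    return (edge_att - edge_att.min()) / (edge_att.max() - edge_att.min() + 1e-12)

def extract_by_threshold(edge_index: torch.Tensor, edge_att: torch.Tensor, thr: float) -> torch.Tensor:
    """Keep edges whose normalized attention is at least `thr`."""
    mask = normalize_att(edge_att) >= thr
    return edge_index[:, mask]

def extract_topk(edge_index: torch.Tensor, edge_att: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k edges with the highest attention."""
    idx = torch.topk(edge_att, k).indices
    return edge_index[:, idx]
```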


siqim commented Feb 21, 2022

I will be closing this issue now. But if you have any other questions, please feel free to let us know!

@siqim siqim closed this as completed Feb 21, 2022
@siqim siqim pinned this issue Jul 25, 2022