I found this awesome project recently and I'm trying to use the fla layers in a non-LLM task, where we have a very long sequence and only the hidden state of the last "token" is useful. The current recurrent kernels, for example gated_deltanet, always return the hidden state of every token, which allocates a huge amount of memory. Is there any way to avoid this memory allocation, other than calling the kernel token by token in a for loop?

Replies: 1 comment 1 reply

@Fadelis98 Hey
This is not true, as we only materialize the last hidden state.
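
For the "very long sequence, only the final state matters" use case, one option (a sketch of my own, not something stated in this thread) is to run the chunked kernel over the sequence in slices and thread the recurrent state through `initial_state` / `output_final_state`, discarding the per-token outputs of earlier slices. The import path, the argument names (`g`, `beta`, `initial_state`, `output_final_state`), and the `(B, T, H, ...)` layout below follow my reading of recent fla releases and may differ in your installed version; all tensor values are placeholders.

```python
# Sketch only: verify names, shapes, and dtypes against your installed fla version.
import torch
from fla.ops.gated_delta_rule import chunk_gated_delta_rule  # assumed import path

B, T, H, K, V = 1, 65536, 4, 128, 128   # (batch, time, heads, key dim, value dim)
chunk_len = 4096                        # peak memory now scales with chunk_len, not T

device, dtype = 'cuda', torch.bfloat16
q    = torch.randn(B, T, H, K, device=device, dtype=dtype)
k    = torch.randn(B, T, H, K, device=device, dtype=dtype)
v    = torch.randn(B, T, H, V, device=device, dtype=dtype)
g    = torch.randn(B, T, H, device=device).sigmoid().log()   # placeholder log-gates
beta = torch.rand(B, T, H, device=device, dtype=dtype)       # placeholder betas in (0, 1)

state = None  # recurrent state of shape (B, H, K, V); only one copy is kept alive
for s in range(0, T, chunk_len):
    e = min(s + chunk_len, T)
    o, state = chunk_gated_delta_rule(
        q[:, s:e], k[:, s:e], v[:, s:e], g[:, s:e], beta[:, s:e],
        initial_state=state,        # carry the state over from the previous chunk
        output_final_state=True,    # return only the state after this chunk's last token
    )
    del o  # per-token outputs of earlier chunks are not needed for this use case

# `state` now holds the hidden state after the final token of the whole sequence.
```

If the per-token outputs are not needed at all, dropping `o` each iteration keeps the footprint at roughly one chunk of activations plus a single `(B, H, K, V)` state, avoiding both the token-by-token Python loop and any full-sequence buffer.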