Hi folks,

I have a question about the "group attention" / SelfExtend / context-shift code as it relates to hybrid memory models.
I am experimenting with several hybrid but not fully recurrent models.
In the code (main.cpp / server.cpp / passkey.cpp) I see the same call sequences for shifting content and for group-attention context reduction: `llama_memory_seq_rm()` + `llama_memory_seq_add()` for the shift, and `llama_memory_seq_div()` + `llama_memory_seq_add()` for SelfExtend. I am testing with `LFM2-VL-450M-Q8_0.gguf` and also `falcon-h1-0.5b-instruct-q8_0.gguf`.
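For concreteness, the shift sequence I mean looks roughly like this (a sketch following the main.cpp pattern, not the exact code; `n_keep`, `n_discard` and `n_past` are illustrative names that the caller would set up):

```cpp
#include "llama.h"

// rough sketch of the context-shift pattern from main.cpp:
// drop a chunk after the kept prefix, then slide the tail left
static void context_shift(llama_context * ctx, int n_keep, int n_discard, int n_past) {
    llama_memory_t mem = llama_get_memory(ctx);

    // remove [n_keep, n_keep + n_discard) from sequence 0 ...
    // (note: the bool result of seq_rm() is ignored here, just as in main.cpp)
    llama_memory_seq_rm (mem, 0, n_keep, n_keep + n_discard);
    // ... and shift the remaining tail back so positions stay contiguous
    llama_memory_seq_add(mem, 0, n_keep + n_discard, n_past, -n_discard);
}
```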
The `LFM2-VL-450M-Q8_0.gguf` model returns:

```
is_recurrent: 0
is_hybrid: 1
can_shift: 1
```
as it should, and reading the code:

```cpp
bool llama_memory_hybrid::get_can_shift() const {
    // Shifting is trivially supported for recurrent
    return mem_attn->get_can_shift();
}
```
one might expect `llama_memory_seq_rm()` to work for shifting the KV cache content, but it does not, because of:
```cpp
bool llama_memory_hybrid::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    // Try removing from the recurrent cache first since it may fail. If it does
    // fail, the cache will not have been mutated.
    if (!mem_recr->seq_rm(seq_id, p0, p1)) {
        return false;
    }
    return mem_attn->seq_rm(seq_id, p0, p1);
}
```
which, when calling into the recurrent cache, understandably fails in:
```cpp
bool llama_memory_recurrent::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
    //printf("[DEBUG] calling llama_memory_recurrent::seq_rm` with `seq_id=%d, p0=%d, p1=%d`\n", seq_id, p0, p1);
    uint32_t new_head = size;

    if (p0 < 0) {
        p0 = 0;
    }

    if (p1 < 0) {
        p1 = std::numeric_limits<llama_pos>::max();
    }

    // models like Mamba or RWKV can't have a state partially erased
    if (seq_id >= (int64_t) size) {
        // could be fatal
        return false;
    }
    if (0 <= seq_id) {
        int32_t & tail_id = cells[seq_id].tail;
        if (tail_id >= 0) {
            const auto & cell = cells[tail_id];
            // partial intersection is invalid
            if ((0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)) {
                //printf("[DEBUG] inside `llama_memory_recurrent::seq_rm`: partial intersection is invalid, so returning false\n");
                return false;
            }
            // invalidate tails which will be cleared
            if (p0 <= cell.pos && cell.pos < p1) {
                tail_id = -1;
            }
        }
    } else {
        ...
    }
}
```

checking the condition `(0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)`.
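To make the failure concrete, here is a tiny standalone mirror of just that range check (illustrative only, not the real class; `tail_pos` stands in for the `pos` of the sequence's tail cell), with the cases that matter:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// standalone mirror of only the "partial intersection" check quoted above
static bool rm_would_succeed(int32_t tail_pos, int32_t p0, int32_t p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
    return !((0 < p0 && p0 < tail_pos) || (0 < p1 && p1 <= tail_pos));
}

int main() {
    const int32_t tail = 99;                   // e.g. 100 tokens decoded, tail cell at pos 99
    assert(!rm_would_succeed(tail,  10,  20)); // middle chunk            -> rejected
    assert(!rm_would_succeed(tail,   0,  50)); // front chunk             -> rejected
    assert( rm_would_succeed(tail,   0,  -1)); // full erase              -> accepted
    assert( rm_would_succeed(tail, 100,  -1)); // entirely past the tail  -> accepted
    return 0;
}
```

The shift in main.cpp removes `[n_keep, n_keep + n_discard)`, i.e. a chunk in the middle of the sequence, which is exactly what the check rejects, so for a hybrid model `llama_memory_hybrid::seq_rm()` returns `false` before the attention cache is touched.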
The SelfExtend code path with `grp_attn_n` and `grp_attn_w` silently succeeds, but it makes further calls to `init_batch()` fail, unable to `find_slot()`.
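For reference, by "SelfExtend code path" I mean the group-attention update roughly as it appears in main.cpp, adapted here to the `llama_memory_*` names (`ga_n`/`ga_w` correspond to `grp_attn_n`/`grp_attn_w`; treat the exact arithmetic as a sketch). As far as I can tell `llama_memory_seq_add()` and `llama_memory_seq_div()` return `void`, so there is nothing to check at the call site, hence "silently":

```cpp
#include "llama.h"

// rough adaptation of the group-attention (SelfExtend) update from main.cpp;
// ga_n = grp_attn_n, ga_w = grp_attn_w, ga_i = current group-attention index
static void self_extend(llama_context * ctx, int & ga_i, int & n_past, int ga_n, int ga_w) {
    llama_memory_t mem = llama_get_memory(ctx);

    while (n_past >= ga_i + ga_w) {
        const int ib = (ga_n*ga_i)/ga_w;
        const int bd = (ga_w/ga_n)*(ga_n - 1);
        const int dd = (ga_w/ga_n) - ib*bd - ga_w;

        llama_memory_seq_add(mem, 0, ga_i,                n_past,              ib*bd);
        llama_memory_seq_div(mem, 0, ga_i + ib*bd,        ga_i + ib*bd + ga_w, ga_n);
        llama_memory_seq_add(mem, 0, ga_i + ib*bd + ga_w, n_past + ib*bd,      dd);

        n_past -= bd;
        ga_i   += ga_w/ga_n;
    }
}
```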
Questions:

1. Is this expected behavior?
2. Can I detect a situation like this earlier, rather than attempting a `llama_memory_seq_rm()` call that fails and returns `false`?
3. In main.cpp, server.cpp, and passkey.cpp I do not see the `bool` results of `llama_memory_seq_rm()` calls being checked, nor any error reporting/recovery/mitigation for failure. That probably means the surrounding code path has guarantees that the call will succeed, but I have failed to find such guarantees. What am I missing there?
4. I did not deeply investigate `grp_attn_n` and `grp_attn_w`, because I believe the hybrid models in question may not need them at all if I correctly detect the "thou cannot/should not shift or self-extend" condition. If that is incorrect, any hints on what could go wrong for hybrid models?
5. Should there be checks, and at least error logging, for all `llama_memory_seq_rm()` calls that return `false` on failure? (A rough sketch of what I mean follows the list.)
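To illustrate what I mean in questions 2, 3 and 5, here is a minimal defensive wrapper (my own sketch, not code from the tree). It checks the `bool`, logs, and falls back to removing the whole sequence, which does pass the recurrent partial-intersection check, so the caller can re-decode the prompt:

```cpp
#include <cstdio>
#include "llama.h"

// hypothetical helper, not part of llama.cpp: attempt the usual shift; if the
// removal is rejected (as for hybrid/recurrent memory), log it and fall back
// to erasing the whole sequence. Returns false when the caller must re-decode.
static bool try_context_shift(llama_context * ctx, llama_seq_id seq,
                              int n_keep, int n_discard, int n_past) {
    llama_memory_t mem = llama_get_memory(ctx);

    if (!llama_memory_seq_rm(mem, seq, n_keep, n_keep + n_discard)) {
        fprintf(stderr, "context shift rejected for seq %d, dropping the whole sequence\n", seq);
        llama_memory_seq_rm(mem, seq, -1, -1); // full-range removal is accepted
        return false;                          // caller re-decodes from scratch
    }

    llama_memory_seq_add(mem, seq, n_keep + n_discard, n_past, -n_discard);
    return true;
}
```

The fallback policy is just one option; the point is only that the `false` result is observable and could at least be logged.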
Any help and clarification would be greatly appreciated.