
Conversation

@gabe-l-hart (Collaborator)

Closes #16768

cc @leok7v

Description

This PR addresses the context-shift failure that occurs when a hybrid-recurrent model hits its context limit and attempts to perform a context shift. The main change is to loosen the restriction in llama_memory_recurrent::seq_rm so that it only refuses a partial erasure when the erased range includes the final token in the sequence. Since recurrent states are fixed-size, any partial erasure that does not include the final token can be treated as a no-op.
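
For a rough sketch of the relaxed condition (the exact one-line diff is quoted in the review comments below), the check in llama_memory_recurrent::seq_rm changes along these lines, where cell.pos is the position of the final token stored for the sequence:

#include <cstdint>

// Simplified sketch, not the literal llama.cpp code. The erased range is
// [p0, p1) and cell_pos stands in for cell.pos, the final token's position.
static bool can_erase(int32_t p0, int32_t p1, int32_t cell_pos) {
    // before: any partial intersection with the cached range was refused
    //   if ((0 < p0 && p0 < cell_pos) || (0 < p1 && p1 <= cell_pos)) { ... }
    // after: refuse only a partial erasure that covers the final token,
    // since the fixed-size recurrent state cannot be rolled back past it
    if (0 < p0 && p0 <= cell_pos && p1 > cell_pos) {
        return false;
    }
    // a range strictly before the final token is a no-op for a fixed-size
    // state, so it is reported as a success
    return true;
}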

Testing

To validate the result, you can use the following command, which artificially limits the context length in order to force a context shift:

# You can use any granite-4.0 model here
./bin/llama-cli -hf ggml-org/granite-4.0-h-small-Q8_0-GGUF --jinja -c 100 --context-shift -p "tell me a story"

Without this fix, generation fails with init_batch: failed to prepare attention ubatches; with the fix, generation continues successfully and produces output that remains relevant to the previous context.

The recurrent state is always assumed to be the state as of the last update from the final token in the sequence. When doing a partial erasure, if the range does not include the final token, the erasure can be considered a success: any memory used for the sequence prior to the final token (which is no memory at all, since the state is fixed-size) has been successfully removed.
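
As a hedged illustration of what this means for the caller during a context shift (assuming the llama_memory_* accessors from current llama.cpp; the position values are made up for the example):

// Hypothetical caller-side view of a context shift on a recurrent memory.
// Erasing a middle range [n_keep, n_keep + n_discard) does not touch the
// final token, so with this PR it returns true as a no-op instead of failing.
llama_memory_t mem = llama_get_memory(ctx);

const llama_pos n_keep    = 4;   // example: tokens kept at the start
const llama_pos n_discard = 32;  // example: tokens dropped from the middle

// the range excludes the final token -> "already compressed", succeeds
const bool ok = llama_memory_seq_rm(mem, /*seq_id=*/0, n_keep, n_keep + n_discard);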

There is one potential case this doesn't address: pruning the cache to remove sensitive data from the context. That wouldn't work for partial (mid-sequence) removal from an attention cache either, since the KV state is linearly dependent and states at later sequence positions would still be derived from the sensitive data even once it is no longer cached, so I don't think it is relevant here. It is worth noting, though, that the semantics of this change for a partial erasure in the middle of the cache are essentially "my context is already compressed," not "all trace of the removed tokens has been removed."

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
This prefix matching explicitly attempts to remove the tokens at the end of the sequence that don't match. That is the one operation that can't be performed on a recurrent cache, because the state is updated in place, so if this removal fails we need to clear the whole cache.
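
A minimal sketch of that fallback on the caller side, assuming the bool-returning llama_memory_seq_rm from the linked issue (n_matching is a hypothetical name for the length of the matching prefix):

// Try to trim the non-matching suffix. On a recurrent cache this can fail
// because the state was updated in place, so fall back to a full clear.
if (!llama_memory_seq_rm(mem, seq_id, n_matching, -1)) {
    // suffix removal is impossible for an in-place state: a negative p0/p1
    // means "the entire range", so this drops the whole sequence ...
    llama_memory_seq_rm(mem, seq_id, -1, -1);
    // ... after which the prompt must be re-decoded from scratch
    n_matching = 0;
}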

ggml-org#16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
-    // partial intersection is invalid
-    if ((0 < p0 && p0 < cell.pos) || (0 < p1 && p1 <= cell.pos)) {
+    // partial intersection is invalid if it includes the final pos
+    if ((0 < p0 && p0 <= cell.pos && p1 > cell.pos)) {
Member:

Suggested change:
-    if ((0 < p0 && p0 <= cell.pos && p1 > cell.pos)) {
+    if (0 < p0 && p0 <= cell.pos && p1 > cell.pos) {

Member:

Why do we check strictly larger than 0 rather than 0 <= p0?


Development

Successfully merging this pull request may close these issues:

The result of bool llama_memory_seq_rm() is not checked (#16768)
