[Questions] Clarification on Multi-hop Inference Pipeline, Re-routing Overhead, SFT Details, and Naming Convention
Hi, thanks for the great work! After carefully reading the paper, I have several questions regarding the inference pipeline, training details, and the naming of the method. I'd appreciate any clarification.
1. Clarification on the Multi-hop Inference Pipeline
Based on my reading of Section 3.5 and Figure 3, I reconstructed the following inference pipeline for Memory Interleave. Could you confirm whether this understanding is correct?
```
loop:
  1. Encode the current query context through the model to obtain Q^R
  2. Route Q^R against all cached routing keys K̄^R → select Top-16 documents
  3. Load the selected documents' compressed K̄, V̄ from CPU to GPU
  4. Autoregressively generate tokens with attention context = [Top-16 compressed KV ; local KV]
  5. If the model generates [doc_id]<|object_ref_end|>:
       → Fetch the original text of the referenced document
       → Append the original text to the current query context
       → Go back to step 1 (re-encode, re-route)
  6. If the model generates <End-of-Retrieve>:
       → Transition to final answer generation
       → Exit loop
```
Specific sub-questions:
- Is the pipeline above identical for both single-hop and multi-hop queries (i.e., a single unified pipeline where single-hop queries simply exit the loop after one iteration)?
- When appending original document text at step 5, does the system re-encode only the newly appended text (reusing the KV cache from the previous iteration for earlier tokens), or does it re-encode the entire expanded context from scratch?
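To make sure we're talking about the same control flow, here is a minimal toy sketch of the loop as I understand it. Everything here is my own reconstruction: `encode`, `generate`, and `fetch_doc` are placeholders for the model forward pass, decoding over `[Top-16 compressed KV ; local KV]`, and the original-text lookup; none of these names come from the paper.

```python
import numpy as np

TOP_K = 16
END = "<End-of-Retrieve>"
REF_END = "<|object_ref_end|>"

def route(q_r, routing_keys, k=TOP_K):
    # Step 2: cosine similarity of Q^R against all cached routing keys
    sims = routing_keys @ q_r / (
        np.linalg.norm(routing_keys, axis=1) * np.linalg.norm(q_r) + 1e-9)
    return np.argpartition(-sims, min(k, len(sims) - 1))[:k]

def memory_interleave(context, encode, routing_keys, generate, fetch_doc,
                      max_hops=5):
    # Steps 1-6 of the pipeline; all callables are my own placeholders.
    for _ in range(max_hops):
        q_r = encode(context)                 # step 1: re-encode query context
        doc_ids = route(q_r, routing_keys)    # step 2: Top-16 routing
        out = generate(context, doc_ids)      # steps 3-4: load KV and decode
        if out.endswith(END):                 # step 6: exit to final answer
            return out
        ref = out.split(REF_END)[0]           # step 5: [doc_id] was emitted
        context = context + " " + fetch_doc(ref)  # append text, re-loop
    return context
```

If this matches the actual control flow, my sub-questions above are really about what happens inside `encode` on the second and later iterations.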
2. Re-routing Overhead in Multi-hop Scenarios
Each iteration of the Memory Interleave loop requires:
- Re-encoding the appended original document text through the full model forward pass
- Re-routing Q^R against all ~1.56M routing key entries (for 100M tokens) across 18 layers
- Re-loading potentially different Top-16 documents' content KV from CPU
For complex multi-hop queries that may require 3-5 iterations, this overhead compounds. Have you measured the per-hop latency breakdown? Specifically:
- What is the latency of the routing step alone (cosine similarity against all routing key entries) at the 100M token scale?
- How does the end-to-end multi-hop inference latency compare to an equivalent iterative RAG pipeline (e.g., multi-turn RAG with reranking)?
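For context on why I'm asking about the routing step specifically, here is the back-of-envelope CPU estimate I ran. The dimensions are my assumptions (~1.56M routing keys per layer, 18 routed layers, key dim 128; the actual dims may well differ), and this ignores GPU execution, batching, and memory transfer, so it is only a rough lower-bound intuition, not a claim about your system.

```python
import time
import numpy as np

# Assumed scale (my guesses, not from the paper): ~1.56M routing keys per
# layer, 18 routed layers, key dimension 128.
N_KEYS, N_LAYERS, DIM = 1_560_000, 18, 128
SAMPLE = 100_000  # measure on a slice, then extrapolate linearly

rng = np.random.default_rng(0)
keys = rng.standard_normal((SAMPLE, DIM)).astype(np.float32)
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # pre-normalized keys
q = rng.standard_normal(DIM).astype(np.float32)
q /= np.linalg.norm(q)

t0 = time.perf_counter()
sims = keys @ q                          # cosine similarity via dot product
top16 = np.argpartition(-sims, 16)[:16]  # Top-16 selection
dt = time.perf_counter() - t0

# Linear extrapolation to the full key bank across all routed layers.
est_ms = dt * (N_KEYS / SAMPLE) * N_LAYERS * 1e3
print(f"~{est_ms:.1f} ms per routing pass (single-threaded CPU estimate)")
```

Even if the real routing step is much faster than this on GPU, it runs once per hop, which is why a per-hop latency breakdown would be very informative.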
3. SFT Data Construction and Loss Computation
The paper mentions a two-stage SFT curriculum (Section 3.3.2) but provides limited details:
- Stage 1: SFT on QA tasks with 8K context length
- Stage 2: Extended to 64K context with data cleaning
I have the following questions:
- Data construction: Could you provide more details on how the SFT training data was constructed? Specifically:
  - How were the multi-hop retrieval chains decomposed into individual training samples (as mentioned: "each retrieval chain is divided into multiple training samples")?
  - Were the document IDs and `<End-of-Retrieve>` / `<|object_ref_end|>` tokens manually annotated in the training data, or generated through some automated pipeline?
- Loss computation: During SFT, what loss function was used?
  - Is it the standard next-token prediction loss (cross-entropy) only on the response tokens?
  - Was `L_aux` (the contrastive routing loss from pre-training) still active during SFT, or was it dropped?
  - Was the loss computed over the generated document IDs as well, or only over the final answer tokens?
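To make the loss-masking question concrete, here is the distinction I'm asking about, in a toy numpy sketch (my own formulation, not the paper's): a standard next-token cross-entropy where the supervision mask either includes or excludes the emitted doc-ID tokens.

```python
import numpy as np

def masked_ce(logits, targets, loss_mask):
    """Next-token cross-entropy averaged over the tokens where loss_mask == 1.
    logits: (T, V); targets: (T,); loss_mask: (T,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float((nll * loss_mask).sum() / max(loss_mask.sum(), 1))

T, V = 6, 10
rng = np.random.default_rng(0)
logits = rng.standard_normal((T, V))
targets = rng.integers(0, V, size=T)
# Hypothetical layout: tokens 0-2 are the prompt, token 3 is an emitted
# doc-ID token, tokens 4-5 are the final answer.
answer_only  = np.array([0, 0, 0, 0, 1, 1], dtype=np.float32)
with_doc_ids = np.array([0, 0, 0, 1, 1, 1], dtype=np.float32)
print(masked_ce(logits, targets, answer_only),
      masked_ce(logits, targets, with_doc_ids))
```

Which of these two masks (if either) matches what was done during SFT?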
- Potential data leakage: Since the SFT data presumably includes specific document IDs paired with specific queries, does this create a dependency on the document corpus used during training? In other words, how does the model generalize to entirely new document collections not seen during SFT?
4. Naming: "Memory Sparse Attention" vs. "Sparse Retrieval with Attention-based Fusion"
The name "Memory Sparse Attention" implies a modification to the attention mechanism itself that introduces sparsity (similar to Longformer, BigBird, or NSA). However, from my understanding, MSA does not modify the internal attention computation — the standard dense attention is preserved. The "sparsity" in MSA refers to selecting a sparse subset of external documents via a separate router projector, and then fusing their compressed KV caches into the standard attention context.
| Aspect | Traditional Sparse Attention | MSA |
| --- | --- | --- |
| Sparsity scope | Within a single sequence | Across an external document bank |
| Sparsity granularity | Token-level | Document-level |
| Selection mechanism | Attention scores / fixed patterns | Separate router projector + cosine similarity |
| Operates on | Full-resolution token representations | Compressed (mean-pooled) KV cache |
Given these differences, would it be more accurate to characterize MSA as "Sparse Retrieval with Attention-based Fusion" rather than a sparse attention mechanism? I'd be interested to hear the authors' perspective on how MSA relates to the sparse attention lineage versus the retrieval-augmented generation lineage.
Thanks for your time! Looking forward to your response.