Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism (2504.18574v2)

Published 22 Apr 2025 in cs.LG and cs.AI

Abstract: State-space models (SSMs) offer efficient alternatives to Transformers for long sequences, but their fixed-size recurrent state limits capability on algorithmic tasks, such as retrieving past context. In this work, we examine how in-context retrieval operates in Transformer- and SSM-based LLMs and find that both rely on a similar Gather-and-Aggregate (G&A) mechanism: a Gather Head extracts relevant information pieces from context, which an Aggregate Head integrates into a single representation. In both architectures, G&A concentrates in a few heads, forming critical bottlenecks even for simple retrieval. For example, we show that disabling a single Gather or Aggregate Head in a pruned Llama-3.1-8B impairs retrieving the correct answer letter in MMLU, reducing its accuracy from 66% to 25% (random guessing). Moreover, this retrieval bottleneck can obscure limited knowledge demands of tasks as the pruned model succeeds on MMLU with functioning G&A heads yet fails on other knowledge benchmarks. The bottleneck similarly extends to tasks where SSMs typically underperform, such as GSM8K, BBH, and dialogue comprehension. We show that SSMs' retrieval challenges manifest in these heads, creating smoother attention patterns instead of the sharp token transitions effective G&A requires. Thus, the Transformer-SSM retrieval gap exists in just a few heads, rather than the entire LLM. This suggests a unified explanation for Transformer vs. SSM performance gap while showing how to merge their strengths. We find that pretrained hybrid models, where SSMs are combined with a few attention layers, delegate the role of Aggregate Heads to attention. Similarly, replacing a single G&A head in a pretrained SSM with an attention variant boosts retrieval and benchmark scores.

Summary

  • The paper demonstrates that a small set of Gather and Aggregate heads is crucial for in-context retrieval, with disabling one head dropping accuracy from 66% to near random.
  • It compares Transformers and SSMs, showing that SSMs need additional heads to mimic the sharp, discrete retrieval patterns of Transformers.
  • The findings provide practical guidance for designing hybrid architectures by strategically leveraging attention layers to support critical retrieval functions.

This paper, "Understanding the Skill Gap in Recurrent LLMs: The Role of the Gather-and-Aggregate Mechanism" (2504.18574), investigates the performance differences between Transformer-based and State-Space Model (SSM)-based LLMs, particularly focusing on their ability to perform in-context retrieval. The authors find that a key skill, concentrated in a small number of heads, is responsible for much of the observed performance gap on retrieval-intensive tasks.

The core finding is that both Transformer and SSM architectures develop a common mechanism for in-context retrieval, termed Gather-and-Aggregate (G&A). This mechanism involves two types of specialized heads:

  1. Gather Heads: These heads identify and condense relevant information segments in the input context into a single representative token vector. In Transformers, this often manifests as attention patterns where the last token of a segment (like a newline at the end of a multiple-choice option) attends to all preceding tokens in that segment, summarizing the content.
  2. Aggregate Heads: Located in subsequent layers, these heads process the summaries created by the Gather Heads. They attend to the representative tokens and combine their information to extract the specific data needed for the model's output, often resembling a weighted summation or argmax-like selection over the gathered summaries.
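
To make the two-step mechanism concrete, the following is a minimal toy sketch (not the paper's code) of how a Gather step can compress each context segment into one representative vector and an Aggregate step can then select among those representatives. The segment layout, dimensions, and random projections are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # illustrative hidden size
segments = [rng.normal(size=(5, d)),     # e.g. tokens of option "A. ..."
            rng.normal(size=(7, d)),     # option "B. ..."
            rng.normal(size=(4, d))]     # option "C. ..."

def gather(segment, w_q, w_k):
    """Gather step: the segment's last (delimiter) token attends over the
    whole segment and stores a summary of it in its own position."""
    q = segment[-1] @ w_q                        # query from the delimiter token
    k = segment @ w_k                            # keys for every token in the segment
    attn = np.exp(q @ k.T); attn /= attn.sum()   # softmax over the segment
    return attn @ segment                        # weighted summary vector

def aggregate(summaries, probe, w_q, w_k):
    """Aggregate step: a later head attends only over the per-segment
    summaries and (approximately argmax-)selects the relevant one."""
    q = probe @ w_q
    k = summaries @ w_k
    attn = np.exp(q @ k.T); attn /= attn.sum()
    return attn @ summaries, attn

w_q = rng.normal(size=(d, d)) * 0.1
w_k = rng.normal(size=(d, d)) * 0.1

summaries = np.stack([gather(s, w_q, w_k) for s in segments])
probe = rng.normal(size=d)               # stand-in for the final question token
selected, weights = aggregate(summaries, probe, w_q, w_k)
print("aggregate weights over segment summaries:", np.round(weights, 3))
```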

The paper demonstrates that this G&A functionality is highly concentrated. For example, in a pruned Llama-3.1-8B model, disabling a single Gather or Aggregate Head drastically reduced MMLU accuracy from 66% to near random guessing (25%). This highlights that these few heads act as critical bottlenecks for tasks requiring retrieval.
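A head-ablation experiment of this kind can be reproduced, at least in spirit, with a forward pre-hook that zeroes one head's contribution before the attention output projection. This is a minimal sketch assuming a Hugging Face Llama-style checkpoint (standard `model.model.layers[i].self_attn.o_proj` layout); the layer and head indices below are placeholders, not the specific heads identified in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"        # any Llama-style checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layer_idx, head_idx = 17, 0                   # placeholder indices, not the paper's
head_dim = model.config.hidden_size // model.config.num_attention_heads

def ablate_head(module, args):
    # The input to o_proj is the concatenation of all heads' outputs;
    # zeroing one head's slice removes its contribution entirely.
    (hidden,) = args
    hidden = hidden.clone()
    hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0
    return (hidden,)

o_proj = model.model.layers[layer_idx].self_attn.o_proj
handle = o_proj.register_forward_pre_hook(ablate_head)

inputs = tokenizer("Question: ...\nA. ...\nB. ...\nAnswer:", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # score MMLU-style items with/without the hook

handle.remove()                               # restore the unablated model
```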

A significant practical implication of this finding is that the MMLU benchmark, often seen as primarily measuring world knowledge, is heavily dependent on this in-context retrieval capability. The authors show that a model with limited general knowledge (performing poorly on other knowledge benchmarks) can still score well on MMLU if its G&A mechanism is intact, but fails catastrophically if these specific heads are disabled. For the models examined, this suggests MMLU success reflects the algorithmic skill of retrieving from context more than distributed factual knowledge.

The paper explains the performance gap between Transformers and SSMs through the lens of how G&A is implemented. While SSMs also develop the G&A mechanism, their architecture (fixed-size hidden state, smoother temporal mixing) makes it difficult to produce the sharp, discrete token transitions and selective attention patterns that effective G&A relies on, particularly for Aggregate Heads. SSMs exhibit smoother attention/mixing patterns than the sharp, localized attention of Transformers (illustrated in the paper's comparison of head patterns). This results in less precise gathering and aggregation, forcing SSMs to compensate with more heads to reach comparable, though often still lower, performance.
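One way to quantify this "smooth vs. sharp" contrast (my framing, not necessarily a metric the paper uses) is the entropy of each head's attention or mixing weights: sharp, retrieval-friendly heads concentrate mass on a few tokens and have low entropy, while smoother SSM-style mixing spreads mass widely and has high entropy. A small sketch:

```python
import torch

def attention_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    """Mean entropy (nats) over the rows of one head's attention/mixing matrix.

    attn_weights: (num_queries, num_keys), each row summing to 1.
    Lower values indicate sharper, more argmax-like token selection.
    """
    eps = 1e-12
    row_entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return row_entropy.mean()

# Toy comparison: a one-hot "sharp" pattern vs. a smooth causal pattern.
sharp = torch.zeros(8, 8)
sharp[torch.arange(8), torch.tensor([0, 0, 3, 3, 3, 6, 6, 6])] = 1.0

mask = torch.triu(torch.ones(8, 8, dtype=torch.bool), diagonal=1)
smooth = torch.softmax(torch.randn(8, 8).masked_fill(mask, float("-inf")), dim=-1)

print(attention_entropy(sharp))    # ~0: mass on a single token per query
print(attention_entropy(smooth))   # larger: mass spread over many tokens
```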

This analysis provides insights into hybrid models (combining SSM and attention layers) like Zamba and Llamba. The paper shows that these hybrids effectively bridge the retrieval gap by delegating the demanding Aggregate Head function to the attention layers, which are better suited for sharp aggregation.
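As a hedged illustration of this design idea (not the actual Zamba or Llamba code), a hybrid stack can be written as a list of block types in which attention is reserved for the positions where sharp aggregation is needed. `SSMBlock` and `AttentionBlock` below are placeholder modules; the mid-network positions follow the paper's observation that G&A heads emerge around layers 16-17 of 32-layer models.

```python
import torch
import torch.nn as nn

class SSMBlock(nn.Module):
    """Placeholder for a recurrent/SSM mixing block (e.g. a Mamba-style layer)."""
    def __init__(self, d_model):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)   # stand-in for the SSM scan
    def forward(self, x):
        return x + self.mix(x)

class AttentionBlock(nn.Module):
    """Softmax-attention block used only where sharp aggregation is needed
    (no causal mask here; illustration only)."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def build_hybrid(d_model=512, n_layers=32, attention_at=(16, 17)):
    """Mostly-SSM stack with attention only at the mid-network positions
    where Aggregate-style heads tend to emerge."""
    return nn.Sequential(*[
        AttentionBlock(d_model) if i in attention_at else SSMBlock(d_model)
        for i in range(n_layers)
    ])

hybrid = build_hybrid()
x = torch.randn(1, 128, 512)    # (batch, seq, d_model)
y = hybrid(x)                   # same shape out
```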

Implementation and Application Guidance:

  1. Identifying G&A Heads: The paper suggests identifying G&A heads through systematic ablation studies on tasks requiring in-context retrieval, such as MMLU or synthetic KV-retrieval (compared in the paper's KV-retrieval table). Heads whose removal causes a significant drop in performance on such tasks are candidates; inspecting their attention/mixing patterns (e.g., attending to segment ends or final tokens) can confirm their role. A sketch of such an ablation sweep follows this list.
  2. Designing Efficient Hybrid Architectures:
    • Training from Scratch: When building hybrid models from the ground up, the paper observes that G&A heads naturally emerge in the middle layers of networks (e.g., layers 16-17 in 32-layer models). A practical approach is to strategically place attention layers in these mid-network positions to efficiently support G&A, rather than distributing attention throughout the entire model, thus balancing retrieval capability with SSM efficiency.
    • Distillation: When distilling a Transformer into an SSM-based hybrid, the Aggregate Head layer (layer 17 in Llama-3.1-8B) is identified as a critical bottleneck for retrieval performance (shown in the paper's hybrid-replacement experiments). A practical strategy is to retain the teacher's attention layer specifically at the position corresponding to the Aggregate Head, while distilling the other layers into SSMs. This preserves the crucial retrieval capability where it matters most.
  3. Analyzing Task Dependence: The paper highlights that the task format significantly impacts retrieval demands. Tasks requiring models to explicitly retrieve information embedded within the context (like MMLU's letter labels or chat-style dialogues) amplify the importance of strong G&A. Developers should be mindful of task format when evaluating SSM performance or designing models for specific applications.
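
A minimal sketch of the ablation sweep mentioned in item 1, under the assumption that the reader supplies the `eval_retrieval` and `ablate_head` helpers (they are placeholders, not the paper's code): iterate over layer/head pairs, ablate one head at a time on a retrieval task, and rank heads by the resulting accuracy drop.

```python
def find_ga_candidates(model, eval_retrieval, ablate_head, n_layers, n_heads, top_k=10):
    """Rank attention heads by how much ablating them hurts a retrieval task.

    eval_retrieval(model) -> accuracy on e.g. MMLU or synthetic KV-retrieval.
    ablate_head(model, layer, head) -> context manager that zeroes that head.
    Both helpers are placeholders the reader supplies.
    """
    baseline = eval_retrieval(model)
    drops = []
    for layer in range(n_layers):
        for head in range(n_heads):
            with ablate_head(model, layer, head):
                acc = eval_retrieval(model)
            drops.append(((layer, head), baseline - acc))
    # Heads whose removal costs the most accuracy are G&A candidates;
    # confirm them by inspecting their attention/mixing patterns.
    return sorted(drops, key=lambda x: x[1], reverse=True)[:top_k]
```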

Limitations and Considerations:

  • The identification of G&A layers can be complex, as interactions might span non-adjacent layers. Activation patching might be needed in such cases.
  • The paper focuses on a specific set of models and benchmarks. While the G&A mechanism appears general, its exact manifestation and the criticality of specific heads may vary across architectures and training data.
  • While G&A is critical for certain retrieval tasks, other mechanisms are involved in general in-context learning and copying, which warrant further investigation.

In summary, the paper provides a mechanistic explanation for a key limitation of SSMs in language modeling, localizing the issue to specific heads and a shared computational pattern. It offers practical guidance for improving SSM performance through targeted hybridization, focusing computational resources on the specific algorithmic bottleneck that differentiates SSMs from Transformers on retrieval tasks.
