
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection (2505.07293v1)

Published 12 May 2025 in cs.CL

Abstract: Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs, often introducing domain-specific biases. Due to the attention heads being crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method without supervision signal. Our approach enables a small pretrained LLM to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models, offering a promising and scalable path for reasoning-centric data selection.

Authors (4)
  1. Kai Hua
  2. Steven Wu
  3. Ge Zhang
  4. Ke Shen

Summary

The paper introduces AttentionInfluence, a novel method for selecting high-quality pretraining data for LLMs. The authors highlight the challenge of obtaining reasoning-intensive data efficiently and scalably, noting that existing methods often rely on supervised classifiers trained on human or LLM-generated labels, which can be labor-intensive, introduce bias, and may not generalize well.

AttentionInfluence proposes a training-free and supervision-free approach that leverages the intrinsic mechanisms of pretrained LLMs, specifically attention heads, to identify data samples conducive to developing reasoning abilities. The core idea is based on mechanistic interpretability research (Geva et al., 2020; Olsson et al., 2022; Wu et al., 2024) suggesting that specific attention heads (termed "retrieval heads") are crucial for in-context learning and reasoning. The method uses a small pretrained model to act as a data selector.

The practical implementation of AttentionInfluence involves three main steps:

  1. Detecting Specific Important Heads: The paper focuses on identifying retrieval heads. This is done by evaluating a small pretrained model (a 1.3B LLaMA2-like model in their experiments) on a synthetic 3-shot retrieval task. Attention scores for each head are computed, and heads are ranked by their retrieval score. The top 5% are identified as important retrieval heads.
  2. Masking Operation: A "weak" reference model is created by masking the identified important heads in the base pretrained model. Masking means setting the attention weights of these heads to a uniform distribution (1/L, where L is the sequence length). This effectively disables their learned attention patterns, degrading the model's performance on tasks where these heads are important.
  3. Calculating AttentionInfluence Score: For each data sample in the corpus, the token-level cross-entropy loss is computed using both the base model ($\mathcal{L}_{\mathrm{base}}$) and the masked reference model ($\mathcal{L}_{\mathrm{ref}}$). The AttentionInfluence score for a sample is calculated as the relative loss difference:

    $$\text{AttentionInfluence Score} = \frac{\mathcal{L}_{\mathrm{ref}} - \mathcal{L}_{\mathrm{base}}}{\mathcal{L}_{\mathrm{base}}}$$

    A higher score indicates that the sample's processing is more disrupted by masking the important heads, suggesting the sample relies more heavily on the mechanisms governed by these heads (e.g., retrieval and reasoning). Scores are compared within the same data domain. Data samples with high AttentionInfluence scores are then selected.

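The scoring and selection steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the field names and toy loss values are made up, while the "rank within a domain and keep the top fraction" logic follows the description above.

```python
# Sketch of the AttentionInfluence scoring and selection steps.
# loss_base / loss_ref are the mean cross-entropy of a sample under the
# base model and the head-masked reference model, respectively.

def attention_influence_score(loss_base: float, loss_ref: float) -> float:
    """Relative loss increase caused by masking the important heads."""
    return (loss_ref - loss_base) / loss_base

def select_top_fraction(samples, fraction=0.20):
    """Rank samples (within one domain) by score, keep the top fraction."""
    ranked = sorted(samples, key=lambda s: s["score"], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

# Toy example: sample B's loss is disrupted most by masking, so it is
# assumed to rely most on the retrieval-head mechanisms and ranks first.
samples = [
    {"id": "A", "score": attention_influence_score(2.0, 2.1)},  # ~0.05
    {"id": "B", "score": attention_influence_score(2.0, 2.6)},  # ~0.30
    {"id": "C", "score": attention_influence_score(2.0, 2.2)},  # ~0.10
]
top = select_top_fraction(samples, fraction=0.34)
print([s["id"] for s in top])  # → ['B']
```

In the paper's setup, the ranking is done separately per domain (web, code, math, ...) with a 20% cutoff, so that domains with systematically higher or lower losses are not over- or under-selected.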
To validate AttentionInfluence, the authors used a 1.3B model to select data from the SmolLM-Corpus (241B tokens), selecting the top 20% (approximately 73B tokens). They then pretrained a 7B dense model using the full SmolLM-Corpus mixed with the selected subset. This was compared to a baseline 7B model trained only on the full SmolLM-Corpus. Both models were trained for 1T tokens.

The experimental results demonstrate that pretraining with data selected by AttentionInfluence leads to consistent and substantial performance improvements on various benchmarks, particularly those requiring knowledge-intensive tasks and complex reasoning (Table 1). Improvements ranged from 1.4pp to 3.5pp on benchmarks like MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval. The performance advantage emerged early in training and was sustained throughout the training process (Figure 1).

The paper further validates the method's reliability and the quality of selected data through analyses:

  • LLM-As-A-Judge: Using GPT-4o to evaluate samples, data selected by AttentionInfluence received significantly higher "Reasoning Scores" compared to data selected by a classifier (FineWeb-Edu Classifier), while maintaining comparable "Education Scores" (Table 4).
  • Sample Characteristics: AttentionInfluence tends to select longer samples, especially in code and math domains, containing more complex code, richer textual context, and elaborate formula-based reasoning.
  • Diversity: Word frequency analysis showed high overlap but also distinct terms favored by each method (e.g., "sklearn" vs. "19th"), suggesting complementarity. Clustering and visualization of sample embeddings showed that AttentionInfluence yields a more balanced and diverse distribution of selected data across content categories compared to the classifier (Figure 6, 7).
  • Scalability (Weak-to-Strong): The method exhibits a weak-to-strong scaling property. While the 1.3B model is used for selection, the data it selects significantly improves the performance of a larger 7B model. Using a larger model (7B) for the selection process itself resulted in selecting even higher quality and more generalizable data, leading to further performance gains on challenging tasks (Table 7, Figure 9).

The authors propose that AttentionInfluence provides a scalable and efficient path for reasoning-centric data selection without the need for human labeling, LLM generation, or training separate classifiers. It leverages the model's internal computation to infer data utility for specific capabilities.

Limitations and future directions discussed include: the computational cost of scaling to very large selector models or extremely long training horizons; the unexplored impact on post-training stages such as reinforcement learning; application to very long text contexts; alternative methods for identifying important heads; and the combined effects of multiple heads and the role of MLPs. The framework is flexible, allowing customization by defining different proxy tasks to identify heads relevant to other target capabilities.

For practical implementation:

  1. Choose a Selector Model: Start with a moderately sized pretrained model (e.g., 1.3B or 7B parameters) based on a common architecture like LLaMA2.
  2. Implement Proxy Task: Create a dataset for a proxy task relevant to the desired data quality (e.g., the synthetic retrieval task for reasoning-intensive data).
  3. Identify Important Heads: Run the selector model on the proxy task dataset, collect attention scores, calculate the retrieval score (or other task-specific score) for each head, and rank them. Select a percentage of top heads (e.g., 5%).
  4. Implement Masking: Modify the forward pass of the selector model to allow masking specific attention heads. As described above, this means replacing each masked head's attention weights with a uniform distribution (1/L over the sequence); zeroing out the head's contribution after the softmax is a simpler but cruder alternative.
  5. Compute Loss Difference: Iterate through the large pretraining corpus. For each sample, compute the standard cross-entropy loss ($\mathcal{L}_{\mathrm{base}}$) and the loss with the identified heads masked ($\mathcal{L}_{\mathrm{ref}}$).
  6. Calculate and Store Scores: Compute the AttentionInfluence score $(\mathcal{L}_{\mathrm{ref}} - \mathcal{L}_{\mathrm{base}}) / \mathcal{L}_{\mathrm{base}}$ for each sample. Store these scores organized by domain, since scores are compared within the same domain.
  7. Select Data: Rank samples within domains by their AttentionInfluence scores and select a desired percentage (e.g., top 20%).
  8. Compose Training Set: Combine the selected high-quality subset with the original corpus (or other datasets) for training the target LLM.
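The masking operation in step 4 can be sketched on post-softmax attention probabilities. This is a minimal, framework-agnostic illustration assuming causal attention (each query attends uniformly over the keys it can see), not the authors' actual implementation:

```python
def mask_head_uniform(attn, masked_heads):
    """attn[h][i][j]: probability that head h's query position i attends
    to key position j (causal, so j <= i). Masked heads are replaced by
    a uniform distribution over each query's visible keys, discarding
    the learned pattern while keeping every row a valid distribution."""
    out = []
    for h, head in enumerate(attn):
        if h in masked_heads:
            seq_len = len(head)
            head = [[1.0 / (i + 1)] * (i + 1) + [0.0] * (seq_len - i - 1)
                    for i in range(seq_len)]
        out.append(head)
    return out

# One head over a length-2 sequence; after masking, the second query
# attends uniformly (0.5 / 0.5) instead of its learned 0.2 / 0.8 split.
attn = [[[1.0, 0.0], [0.2, 0.8]]]
masked = mask_head_uniform(attn, masked_heads={0})
print(masked[0][1])  # → [0.5, 0.5]
```

In a real transformer implementation the same effect is typically achieved with a forward hook or a head-mask argument applied after the attention softmax, so that only the listed heads are altered and all others run unchanged.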

Computational considerations include running inference with the selector model twice (once standard, once masked) over the entire pretraining corpus, which can be significant but is generally much cheaper than training classifiers on labeled data or training large LLMs just for data selection. The choice of selector model size and the percentage of data selected will impact performance and computational cost.
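For completeness, the head-identification step (step 3 of the recipe) reduces to ranking heads by a task-specific score and keeping a small top fraction. The scores below are hypothetical; in the paper they come from the synthetic 3-shot retrieval task, and the cutoff is the top 5%:

```python
def top_heads(scores, fraction=0.05):
    """scores: {(layer, head): retrieval_score}. Return the top fraction
    of heads by score; these become the heads to mask."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[:k])

# Hypothetical retrieval scores for a 2-layer, 2-head toy model.
scores = {(0, 0): 0.10, (0, 1): 0.92, (1, 0): 0.35, (1, 1): 0.08}
print(top_heads(scores, fraction=0.25))  # → {(0, 1)}
```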