Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More (2502.07490v2)

Published 11 Feb 2025 in cs.CL and cs.LG

Abstract: LLMs have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction autoregressively using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for LLMs.

Summary

  • The paper presents a novel MEAP approach that integrates masked token prediction into autoregressive modeling, enhancing key information retrieval from long contexts.
  • It achieves an 11.77 percentage point improvement over NTP in lost-in-the-middle scenarios during supervised fine-tuning, and with just 60B training tokens matches the key-information retrieval performance of models trained with over 200B tokens.
  • MEAP also outperforms standard NTP in multi-document question answering by up to 27.2 percentage points and shows greater resilience against hallucinations.

An Expert Overview of "Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More"

The paper "Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More" introduces a novel training paradigm for LLMs, specifically designed to address the challenges associated with accurately retrieving key information from extensive contexts. This paradigm, termed Mask-Enhanced Autoregressive Prediction (MEAP), integrates Masked LLMing (MLM) into the Next-Token Prediction (NTP) framework while maintaining the latter's scalability and efficiency.

MEAP is implemented as a straightforward modification to the standard autoregressive decoder-only Transformer architecture. It involves randomly masking a small number of tokens in the input sequence and then performing standard next-token prediction as usual. By circumventing the need for complex encoder-decoder architectures or bidirectional attention mechanisms typically associated with MLM, MEAP avoids additional computational overhead during both pre-training and inference phases.
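As a rough illustration, a MEAP-style training step can be written as a small modification to an ordinary causal language modeling step. The sketch below assumes a PyTorch decoder-only model that returns per-position logits; the mask ratio, the mask token id, and the choice to keep the original tokens as prediction targets are illustrative assumptions rather than details taken verbatim from the paper.

```python
import torch
import torch.nn.functional as F

def meap_step(model, input_ids, mask_token_id, mask_ratio=0.15):
    """One MEAP-style training step: corrupt a small fraction of the input
    tokens with a mask token, then run ordinary next-token prediction on
    the corrupted sequence (assumed setup, not the paper's exact recipe)."""
    # Randomly pick positions to replace with the mask token.
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    corrupted = input_ids.masked_fill(mask, mask_token_id)

    # Standard causal (decoder-only) forward pass on the corrupted sequence.
    logits = model(corrupted)  # (batch, seq_len, vocab_size)

    # Usual next-token-prediction loss: predict the original token at t+1.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return loss
```

Because the architecture, attention pattern, and loss are unchanged, such a step costs the same as plain NTP; only the input sequence is perturbed.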

Extensive experiments show that MEAP outperforms traditional NTP on tasks demanding key information retrieval and reasoning over long contexts. The paradigm performs on par with or better than NTP on commonsense reasoning tasks and also benefits supervised fine-tuning in scenarios prone to "lost-in-the-middle" errors, with an observed improvement of 11.77 percentage points over NTP. This improvement is attributed to the model's ability to produce more distinguishable attention patterns, concentrating attention on non-masked, task-relevant tokens and minimizing focus on the broader peripheral context.

Key Findings and Contributions

The paper presents empirical results supporting MEAP's strong performance. Noteworthy findings include:

  • In the Needle in a Haystack task, MEAP-trained models reach 85.8% accuracy after 60 billion training tokens, comparable to NTP-trained models that required significantly more data (up to 200 billion tokens).
  • In Multi-Document Question Answering, MEAP outperforms NTP by up to 27.2 percentage points in key information retrieval tasks, displaying proficiency in handling vast and complex document contexts.
  • Compared with NTP, MEAP shows greater resilience against the common issue of hallucinations in LLMs, improving the accuracy of outputs across several summarization datasets.

Theoretical Implications

MEAP exemplifies an innovative approach to augmenting NTP with the benefits of MLM without compromising the intrinsic properties of either paradigm. By withholding a small fraction of input tokens, it encourages the model to distribute attention more deliberately over the remaining tokens. The proposed masking strategy subtly alters attention scoring, amplifying task-relevant signals while suppressing the influence of extraneous context. This refinement yields greater data efficiency and sustained performance across model sizes and task complexities.
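As a hedged illustration of how this "more distinguishable attention" claim could be probed (not the paper's exact analysis), one could compare the entropy of attention distributions from NTP- and MEAP-trained models: lower average entropy means the attention mass is concentrated on fewer tokens.

```python
import torch

def mean_attention_entropy(attn_weights, eps=1e-9):
    """Average entropy of attention distributions.

    `attn_weights` is assumed to be a (batch, heads, query, key) tensor of
    softmax outputs, as exposed by many Transformer implementations.
    Lower values indicate sharper, more concentrated attention."""
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return entropy.mean().item()
```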

Practical Implications and Future Directions

MEAP stands as a promising training strategy poised for integration into existing LLM pipelines without necessitating architectural overhauls or significant resource investments. Its ability to maintain efficiency while enhancing contextual understanding reinforces its utility for both pre-training and fine-tuning. Future work might extend MEAP to other sequential modeling tasks and examine its adaptability across different languages and context lengths.

The paper challenges established assumptions about LLM training objectives, proposing a paradigm that strengthens in-context retrieval by shaping how attention is distributed over the input. As such, MEAP could mark a valuable direction for practical and scalable improvements in LLM training.
