RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder (2205.12035v2)

Published 24 May 2022 in cs.CL

Abstract: Despite pre-training's progress in many important NLP tasks, it remains to explore effective pre-training strategies for dense retrieval. In this paper, we propose RetroMAE, a new retrieval oriented pre-training paradigm based on Masked Auto-Encoder (MAE). RetroMAE is highlighted by three critical designs. 1) A novel MAE workflow, where the input sentence is polluted for encoder and decoder with different masks. The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered based on the sentence embedding and the decoder's masked input via masked language modeling. 2) Asymmetric model structure, with a full-scale BERT like transformer as encoder, and a one-layer transformer as decoder. 3) Asymmetric masking ratios, with a moderate ratio for encoder: 15~30%, and an aggressive ratio for decoder: 50~70%. Our framework is simple to realize and empirically competitive: the pre-trained models dramatically improve the SOTA performances on a wide range of dense retrieval benchmarks, like BEIR and MS MARCO. The source code and pre-trained models are made publicly available at https://github.com/staoxiao/RetroMAE so as to inspire more interesting research.

Authors (4)
  1. Shitao Xiao (38 papers)
  2. Zheng Liu (312 papers)
  3. Yingxia Shao (54 papers)
  4. Zhao Cao (36 papers)
Citations (84)

Summary

An Analytical Overview of RetroMAE: Enhancing Retrieval-Oriented Language Models

The paper "RetroMAE: Pre-Training Retrieval-oriented LLMs Via Masked Auto-Encoder" introduces a novel approach to pre-training LLMs specifically focused on dense retrieval tasks, which substantially contribute to enhancing NLP applications such as search engines and recommender systems. RetroMAE innovatively employs a Masked Auto-Encoder (MAE) mechanism with specific modifications in its architecture and masking strategy, aiming to address the limitations of conventional token-level pre-training models like BERT and RoBERTa in capturing sentence-level representations crucial for retrieval tasks.

Key Design Aspects

RetroMAE distinguishes itself through three critical innovations:

  1. MAE Workflow: The authors propose a distinct MAE workflow in which the input sentence undergoes different masking for the encoder and the decoder. The sentence embedding is derived from the encoder's masked input, while the original sentence is reconstructed from this embedding combined with the decoder's masked input through masked language modeling (MLM); a code sketch of the full workflow follows this list.
  2. Asymmetric Architecture: The model pairs a full-scale BERT-like transformer encoder with a streamlined one-layer transformer decoder. This choice concentrates capacity in the encoder, which must produce a discriminative sentence embedding, while the deliberately weak decoder ensures that reconstruction cannot succeed without a high-quality embedding.
  3. Asymmetric Masking Ratios: Different masking ratios are applied to the encoder and decoder to enhance learning effectiveness. The encoder uses a moderate masking ratio of 15%-30%, while a more aggressive ratio of 50%-70% is applied for the decoder. This design choice ensures that the decoder cannot solely rely on its input for reconstruction, thus necessitating a high-quality sentence encoding.
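
To make the interplay of these three designs concrete, here is a minimal sketch assuming PyTorch and Hugging Face transformers. Class, parameter, and variable names (RetroMAESketch, enc_mask_ratio, dec_mask_ratio) are illustrative rather than the authors' implementation, and details such as special-token and padding handling and the paper's enhanced decoding mechanism are omitted.

```python
# Minimal sketch of RetroMAE-style asymmetric masked auto-encoding (illustrative, not the official code).
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class RetroMAESketch(nn.Module):
    def __init__(self, enc_mask_ratio=0.3, dec_mask_ratio=0.5, mask_token_id=103):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")      # full-scale encoder
        self.decoder = BertModel(BertConfig(num_hidden_layers=1))          # one-layer decoder
        hidden = self.encoder.config.hidden_size
        self.mlm_head = nn.Linear(hidden, self.encoder.config.vocab_size)  # reconstruction head
        self.enc_mask_ratio, self.dec_mask_ratio = enc_mask_ratio, dec_mask_ratio
        self.mask_token_id = mask_token_id

    def random_mask(self, input_ids, ratio):
        # Replace a random subset of tokens with [MASK] (special tokens ignored for brevity).
        mask = torch.rand(input_ids.shape, device=input_ids.device) < ratio
        return input_ids.masked_fill(mask, self.mask_token_id), mask

    def forward(self, input_ids, attention_mask):
        # 1) Encoder: moderately masked input -> [CLS] sentence embedding.
        enc_ids, _ = self.random_mask(input_ids, self.enc_mask_ratio)
        sent_emb = self.encoder(enc_ids, attention_mask=attention_mask).last_hidden_state[:, 0]

        # 2) Decoder: aggressively masked input, conditioned on the sentence embedding
        #    by placing it at the first position of the decoder's input sequence.
        dec_ids, dec_mask = self.random_mask(input_ids, self.dec_mask_ratio)
        dec_emb = self.decoder.embeddings(dec_ids)
        dec_emb = torch.cat([sent_emb.unsqueeze(1), dec_emb[:, 1:]], dim=1)
        dec_hidden = self.decoder.encoder(dec_emb).last_hidden_state

        # 3) Masked language modeling loss on the decoder's masked positions only.
        logits = self.mlm_head(dec_hidden)
        return nn.functional.cross_entropy(logits[dec_mask], input_ids[dec_mask])
```

Because the reconstruction loss flows through a single-layer decoder fed with heavily masked text, the gradient pressure to recover the original sentence concentrates on the quality of the encoder's [CLS] embedding, which is exactly the asset needed for dense retrieval.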

Empirical Evaluation and Results

RetroMAE's efficacy is validated across prominent benchmarks, where it improves substantially over prior retrieval-oriented pre-trained models. In zero-shot evaluation on the BEIR benchmark it outperforms competing pre-training approaches, and when fine-tuned on MS MARCO and Natural Questions it surpasses existing baselines, underlining its robustness on both out-of-domain and in-domain retrieval tasks.

Implications and Future Prospects

The RetroMAE framework carries significant implications for both theoretical research and practical applications in AI and NLP. By leveraging the masked auto-encoder paradigm with asymmetry in both architecture and masking strategy, it aligns the pre-training task more closely with the demands of downstream dense retrieval, which may influence the design of future retrieval-oriented models.

From a practical standpoint, the efficiency and performance gains demonstrated by RetroMAE suggest its utility for search engines, recommender systems, and other applications that rely on retrieving semantically related text. Its consistent results across benchmarks also indicate that the approach generalizes well beyond any single retrieval domain.
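
To illustrate the deployment pattern, the sketch below encodes a query and candidate passages with the [CLS] embedding and ranks the passages by inner product, as is standard for dense retrieval with bi-encoders. It is a minimal illustration assuming Hugging Face transformers; the checkpoint identifier "Shitao/RetroMAE" is an assumption (substitute whichever checkpoint the linked repository provides), and in practice the pre-trained encoder would first be fine-tuned on a labeled retrieval dataset such as MS MARCO.

```python
# Illustrative retrieval with a RetroMAE-style encoder: rank passages by
# the inner product of [CLS] embeddings. The checkpoint id is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shitao/RetroMAE")  # assumed checkpoint id
model = AutoModel.from_pretrained("Shitao/RetroMAE")
model.eval()

def embed(texts):
    # The [CLS] hidden state serves as the sentence embedding, as in the paper.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0]

query = embed(["what is dense retrieval"])
passages = embed([
    "Dense retrieval encodes queries and documents as vectors and matches them by similarity.",
    "The 2022 World Cup was held in Qatar.",
])
scores = query @ passages.T                 # inner-product relevance scores
print(scores, scores.argsort(descending=True))
```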

Looking forward, further work could extend the RetroMAE design principles to larger model scales and additional pre-training corpora to fully realize their potential. The integration of more advanced model architectures or newer pre-training tasks could also be investigated to push retrieval-oriented model performance even further.
