Scaling Sparse and Dense Retrieval in Decoder-Only LLMs
Zeng et al. (2025) provide a rigorous evaluation of the scaling behavior of decoder-only LLMs for information retrieval, focusing on the dense and sparse retrieval paradigms and on the impact of different fine-tuning objectives. Using the Llama 3 model family at three parameter scales (1B, 3B, and 8B), the study examines the effectiveness of contrastive loss (CL) and knowledge distillation (KD), both individually and in combination, across in-domain and out-of-domain benchmarks.
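For concreteness, the two fine-tuning objectives can be sketched as follows. This is a minimal PyTorch illustration rather than the paper's implementation: the tensor shapes, the temperature value, and the KL-divergence form of the distillation term are assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(q, d_pos, d_neg, temperature=0.05):
    """InfoNCE-style contrastive loss with one positive and N negatives per query.

    q:     (B, H) query embeddings
    d_pos: (B, H) positive document embeddings
    d_neg: (B, N, H) negative document embeddings
    """
    pos = (q * d_pos).sum(dim=-1, keepdim=True)           # (B, 1)
    neg = torch.einsum("bh,bnh->bn", q, d_neg)            # (B, N)
    logits = torch.cat([pos, neg], dim=-1) / temperature  # (B, 1 + N)
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)                # positive sits at index 0


def kd_loss(student_scores, teacher_scores):
    """Distillation loss: match the student's score distribution over a
    candidate list to a (typically cross-encoder) teacher's distribution."""
    return F.kl_div(
        F.log_softmax(student_scores, dim=-1),
        F.softmax(teacher_scores, dim=-1),
        reduction="batchmean",
    )
```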
Key Findings
- Contrastive Loss Efficacy: Scaling behavior is pronounced when models are fine-tuned with CL: larger models yield substantial performance improvements. In contrast, KD-trained models exhibit minimal gains from increased model size, particularly in dense retrieval setups.
- Sparse vs. Dense Retrieval: Sparse retrieval models consistently outperform their dense counterparts across benchmarks and are especially robust in out-of-domain testing. Sparse models are also more resilient to suboptimal supervision signals, suggesting that they capture lexical nuances better than dense models, which rely more on semantic abstraction.
- Combination of CL and KD: By integrating CL and KD in sparse retrieval models at the 8B scale, the researchers achieve state-of-the-art results, significantly outperforming prior models such as ColBERTv2 and RepLLaMA on the BEIR and TREC DL benchmarks (a sketch of one such combined objective follows this list).
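A natural way to combine the two training signals is a weighted sum of the contrastive and distillation terms. The linear combination and the `alpha` weight below are illustrative assumptions rather than the paper's exact recipe; the code reuses `contrastive_loss` and `kd_loss` from the earlier sketch.

```python
def combined_loss(q, d_pos, d_neg, student_scores, teacher_scores, alpha=0.5):
    """Weighted sum of the contrastive (CL) and distillation (KD) terms.

    The simple linear combination and the alpha=0.5 default are
    illustrative assumptions, not details taken from the paper.
    """
    cl = contrastive_loss(q, d_pos, d_neg)
    kd = kd_loss(student_scores, teacher_scores)
    return alpha * cl + (1 - alpha) * kd
```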
Methodology
Models were trained on the MS MARCO dataset and evaluated on the MS MARCO Dev, TREC DL, and BEIR benchmarks. Sparse and dense retrieval methods were compared under fixed computational constraints, enabling a controlled analysis of scaling effects. Sparse retrieval produces high-dimensional, vocabulary-sized sparse vectors, allowing it to retain more lexical information than dense retrieval's low-dimensional semantic embeddings. Two modifications to the decoder-only LLMs, enabling bidirectional attention and training with masked next-token prediction (MNTP), were crucial in adapting the models for retrieval tasks.
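The sparse side can be illustrated with a SPLADE-style encoder built on top of the LLM's own vocabulary projection. The log(1 + ReLU) gating and max-pooling below follow SPLADE and are an assumption about how the sparse representation is formed, not a confirmed detail of the paper.

```python
import torch


def splade_sparse_repr(hidden_states, lm_head_weight, attention_mask):
    """SPLADE-style sparse encoding: project token hidden states onto the
    vocabulary, gate with log(1 + ReLU), and max-pool over the sequence.

    hidden_states:  (B, T, H) final-layer token representations
    lm_head_weight: (V, H) LM-head projection (V = vocabulary size)
    attention_mask: (B, T) 1 for real tokens, 0 for padding
    """
    logits = hidden_states @ lm_head_weight.T         # (B, T, V)
    weights = torch.log1p(torch.relu(logits))         # sparsifying activation
    weights = weights.masked_fill(attention_mask.unsqueeze(-1) == 0, 0.0)
    return weights.max(dim=1).values                  # (B, V) sparse lexical vector
```

Because each output dimension corresponds to a vocabulary term, the resulting vectors can be indexed and searched with a standard inverted index, which is what makes the representation "sparse" in practice.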
Implications and Future Directions
These findings have substantial implications for the development of retrieval systems. The consistent advantage of sparse retrieval over dense retrieval across scales and benchmarks underscores the importance of lexical information in retrieval tasks, and may push the research community to revisit and refine sparse retrieval paradigms, especially in the context of large-scale LLMs. Additionally, the synergy between CL and KD points toward training strategies that balance direct contrastive supervision with teacher-guided distillation.
Future work could broaden the range of teacher models used for KD to span different architectures and scales, eventually enabling training recipes that adapt to specific data domains and retrieval challenges. Expanding the evaluation to additional datasets, and tailoring retrieval models to multi-modal scenarios, would further clarify their applicability and effectiveness.
In conclusion, the paper presents pivotal findings on retrieval paradigms and model scaling in LLMs, offering a critical reflection on past approaches while laying a foundation for future advances in both the theoretical understanding and the practical deployment of retrieval systems.