Scaling Sparse and Dense Retrieval in Decoder-Only LLMs (2502.15526v1)

Published 21 Feb 2025 in cs.IR

Abstract: Scaling LLMs has shown great potential for improving retrieval model performance; however, previous studies have mainly focused on dense retrieval trained with contrastive loss (CL), neglecting the scaling behavior of other retrieval paradigms and optimization techniques, such as sparse retrieval and knowledge distillation (KD). In this work, we conduct a systematic comparative study on how different retrieval paradigms (sparse vs. dense) and fine-tuning objectives (CL vs. KD vs. their combination) affect retrieval performance across different model scales. Using MSMARCO passages as the training dataset, decoder-only LLMs (Llama-3 series: 1B, 3B, 8B), and a fixed compute budget, we evaluate various training configurations on both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks. Our key findings reveal that: (1) Scaling behaviors emerge clearly only with CL, where larger models achieve significant performance gains, whereas KD-trained models show minimal improvement, performing similarly across the 1B, 3B, and 8B scales. (2) Sparse retrieval models consistently outperform dense retrieval across both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks, and they demonstrate greater robustness to imperfect supervised signals. (3) We successfully scale sparse retrieval models with the combination of CL and KD losses at 8B scale, achieving state-of-the-art (SOTA) results in all evaluation sets.

Summary

Scaling Sparse and Dense Retrieval in Decoder-Only LLMs

The paper by Zeng et al. (2025) provides a rigorous evaluation of the scaling behavior of decoder-only LLMs for information retrieval, concentrating on the dense and sparse retrieval paradigms and on the impact of different fine-tuning objectives. Using the Llama-3 series of models at three parameter scales (1B, 3B, and 8B), the study examines the effectiveness of contrastive loss (CL) and knowledge distillation (KD), individually and in combination, across both in-domain and out-of-domain benchmarks.

Key Findings

  1. Contrastive Loss Efficacy: The paper finds that scaling behavior is pronounced when models are fine-tuned using CL. Larger models yield substantial performance improvements, demonstrating that increased parameters contribute positively when using CL. In contrast, KD-trained models exhibit minimal gains with increased model size, particularly within dense retrieval setups.
  2. Sparse vs. Dense Retrieval: Sparse retrieval models consistently outperform their dense counterparts across benchmarks, and are especially robust in out-of-domain testing. They also prove more resilient to imperfect supervised signals, suggesting that their high-dimensional lexical representations capture term-level matching signals that dense models' lower-dimensional semantic embeddings abstract away.
  3. Combination of CL and KD: By integrating both CL and KD into sparse retrieval models at the 8B scale, the researchers achieve state-of-the-art results, significantly outperforming prior models such as ColBERTv2 and RepLLaMA on benchmarks including BEIR and TREC DL (a sketch of the combined objective follows this list).
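
To make the joint objective concrete, here is a minimal sketch of one common way to combine a CL and a KD term: an InfoNCE-style contrastive loss plus a KL-divergence distillation loss over a cross-encoder teacher's scores. The mixing weight `alpha`, the temperature `tau`, and the function names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combined_cl_kd_loss(q_reps, d_reps, teacher_scores, alpha=0.5, tau=1.0):
    """Sketch of a joint CL + KD objective (hyperparameters are assumed).

    q_reps:         (B, dim) query representations
    d_reps:         (B, K, dim) candidate passage representations; the
                    first candidate per query is treated as the positive
    teacher_scores: (B, K) relevance scores from a cross-encoder teacher
    """
    # Student scores: dot product between each query and its K candidates.
    student_scores = torch.einsum("bd,bkd->bk", q_reps, d_reps)

    # Contrastive loss (InfoNCE): the positive sits at index 0.
    labels = torch.zeros(q_reps.size(0), dtype=torch.long, device=q_reps.device)
    cl = F.cross_entropy(student_scores, labels)

    # Knowledge distillation: KL divergence between the student's and the
    # teacher's score distributions over the same candidate list.
    kd = F.kl_div(
        F.log_softmax(student_scores / tau, dim=-1),
        F.softmax(teacher_scores / tau, dim=-1),
        reduction="batchmean",
    )
    return alpha * cl + (1 - alpha) * kd
```

In a setup like this, the positive passage and the hard negatives share one candidate list, so the same student scores feed both terms; the paper's finding is that this combination scales successfully for sparse models at 8B.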

Methodology

Models were trained on the MSMARCO passage dataset and evaluated on the MSMARCO Dev, TREC DL, and BEIR benchmarks. Sparse and dense retrieval methods were compared under a fixed compute budget, isolating the effect of model scale from that of training cost. Sparse retrieval produces high-dimensional sparse vectors, enabling it to retain more lexical information than dense retrieval's low-dimensional semantic embeddings. The bidirectional attention masking and masked next-token prediction (MNTP) modifications were crucial in adapting the decoder-only LLMs for retrieval tasks.
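
As a rough illustration of the two representation types, the sketch below contrasts a mean-pooled dense embedding with a SPLADE-style vocabulary-space sparse vector, both derived from a language model's outputs. The pooling choices and the log-saturation formula are assumptions modeled on common practice (e.g., LLM2Vec-style adaptation and SPLADE), not necessarily the paper's exact recipe.

```python
import torch

def dense_representation(hidden_states, attention_mask):
    """Dense: mean-pool the final-layer hidden states (assumed pooling).

    hidden_states:  (B, T, H) last-layer hidden states
    attention_mask: (B, T) 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()        # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)         # (B, H)
    return summed / mask.sum(dim=1).clamp(min=1e-9)    # (B, H)

def sparse_representation(logits, attention_mask):
    """Sparse: SPLADE-style vocabulary-sized vector (assumed formulation).

    logits: (B, T, V) LM-head logits; V is the vocabulary size, so the
    resulting vector is high-dimensional but mostly zero, encoding
    explicit term-level (lexical) signals.
    """
    weights = torch.log1p(torch.relu(logits))                        # (B, T, V)
    weights = weights * attention_mask.unsqueeze(-1).to(weights.dtype)
    return weights.max(dim=1).values                                 # (B, V)
```

Because a sparse vector of this kind lives in vocabulary space, its nonzero entries map directly to terms and it can be served from an inverted index, which is one plausible reason for the lexical robustness the paper observes out of domain.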

Implications and Future Directions

These findings have substantial implications for the development of retrieval systems. The consistent advantage of sparse retrieval over dense models across scales and benchmarks underscores the importance of lexical information in retrieval tasks, and may push the research community to revisit and refine sparse retrieval paradigms, especially in the context of large-scale LLMs. Additionally, the synergy of CL and KD points toward training strategies that balance the advantages of direct contrastive supervision and teacher-guided distillation.

Future work could broaden the scope of teacher models in KD to cover varied architectures and scales, eventually enabling training regimes that adapt to specific data domains and retrieval challenges. Expanding the evaluation to additional datasets, and tailoring retrieval models to multi-modal scenarios, could further clarify their applicability and effectiveness.

In conclusion, the paper presents pivotal findings on retrieval paradigms and model scaling in LLMs, offering a critical reflection on past approaches while laying a foundation for future advances in both the theoretical understanding and the practical deployment of retrieval systems.