- The paper demonstrates that appending a single Mixture-of-Experts block (SB-MoE) after the final Transformer layer can considerably enhance dense retrieval performance, particularly for lightweight architectures such as TinyBERT.
- The study employs a rigorous evaluation on TinyBERT, BERT, and Contriever across four benchmark datasets to compare SB-MoE models against traditional fine-tuning.
- The study finds that SB-MoE improves generalization while adding minimal model complexity, offering practical benefits for IR tasks, with gains that depend on tuning the number of experts.
Investigating Mixture of Experts in Dense Retrieval
This paper presents a systematic study of integrating Mixture-of-Experts (MoE) architectures into Dense Retrieval Models (DRMs). The researchers focus on the efficacy of adding a single Mixture-of-Experts block (SB-MoE) after the final Transformer layer of a DRM, aiming to address the generalizability and robustness limitations typically associated with DRMs in Information Retrieval (IR).
Introduction to the Study
DRMs have shown superiority over traditional sparse, lexicon-based retrieval models by effectively capturing the semantic context of queries and documents. However, they require extensive labeled data and often generalize poorly across diverse tasks. This paper proposes adding a single Mixture-of-Experts block to enhance the retrieval capacity of DRMs. The underlying principle of MoE is to use multiple sub-networks, or experts, each of which learns to specialize on different parts of the task, with a gating network routing inputs to experts without explicit supervision of that assignment. Prior research incorporated MoE layers within every Transformer layer, a setup that substantially increases model complexity. In contrast, the SB-MoE studied here is applied only to the output embedding of the final layer, aiming to improve performance without a significant increase in model size.
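To make the idea concrete, below is a minimal PyTorch sketch of what such a single MoE block might look like: a handful of feed-forward experts whose outputs are mixed by a softmax gate applied to the pooled output embedding. The expert width, activation, and gating design are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SBMoE(nn.Module):
    """Single MoE block applied to the encoder's final output embedding.

    Illustrative sketch: each expert is a small feed-forward network and a
    softmax gate mixes their outputs; details may differ from the paper's setup.
    """

    def __init__(self, hidden_dim: int, num_experts: int = 4, expert_dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, expert_dim),
                nn.ReLU(),
                nn.Linear(expert_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # embedding: (batch, hidden_dim) pooled output of the last Transformer layer
        gate_weights = F.softmax(self.gate(embedding), dim=-1)        # (batch, num_experts)
        expert_outputs = torch.stack(
            [expert(embedding) for expert in self.experts], dim=1     # (batch, num_experts, hidden_dim)
        )
        # Weighted mixture of expert outputs becomes the final dense representation
        return torch.einsum("be,beh->bh", gate_weights, expert_outputs)
```

Placing a single block at the output keeps the parameter overhead limited to the experts and the gate, rather than multiplying it across every Transformer layer.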
Experimental Setup
The empirical evaluation of the SB-MoE architecture was conducted on three DRMs, TinyBERT, BERT, and Contriever, across four benchmark datasets: Natural Questions (NQ) and HotpotQA for open-domain question answering, and two collections from the Multi-Domain Benchmark covering the Political Science and Computer Science domains. By varying the number of experts and comparing SB-MoE-enhanced models to fine-tuned baselines, the researchers addressed two main questions: how SB-MoE compares to standard fine-tuning, and how the number of experts influences retrieval performance.
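As a rough illustration of how such an experiment could be wired up, the sketch below attaches the SB-MoE block from the previous snippet to a pretrained Hugging Face encoder in a bi-encoder setup. The TinyBERT checkpoint name, mean pooling, and dot-product scoring are assumptions made for illustration and need not match the paper's exact pipeline.

```python
import torch
from transformers import AutoModel


class MoEDenseRetriever(torch.nn.Module):
    """Bi-encoder sketch where a shared SB-MoE block refines the pooled embedding."""

    def __init__(self, model_name: str = "huawei-noah/TinyBERT_General_4L_312D",
                 num_experts: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # SBMoE is the block from the previous snippet
        self.moe = SBMoE(hidden_dim=hidden, num_experts=num_experts)

    def encode(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token embeddings (an assumption; CLS pooling is another option)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.moe(pooled)

    def forward(self, query_inputs, doc_inputs):
        q = self.encode(**query_inputs)
        d = self.encode(**doc_inputs)
        return (q * d).sum(-1)  # dot-product relevance score
```

In practice such a model would be trained with a contrastive or negative-log-likelihood objective over positive and negative passages, as is standard for dense retrievers.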
Key Findings
The results showed that SB-MoE consistently outperformed standard fine-tuning for DRMs with fewer parameters, such as TinyBERT; for instance, it yielded marked NDCG@10 improvements on HotpotQA. Larger models such as BERT and Contriever showed only marginal gains, suggesting that SB-MoE's benefits are most pronounced in lightweight architectures. Performance was also sensitive to the number of experts, and the best-performing setting appears to be domain-specific, so thorough tuning of this hyperparameter is recommended to maximize performance across different IR tasks.
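Since the best number of experts appears to be dataset-dependent, a simple grid search over that hyperparameter is a natural tuning strategy. The sketch below illustrates such a sweep; `train_retriever` and `evaluate_ndcg_at_10` are hypothetical placeholders for the actual training and evaluation code, not functions from the paper.

```python
def tune_num_experts(dataset, candidate_counts=(2, 4, 6, 8, 12)):
    """Grid search over the number of experts, keeping the best dev nDCG@10."""
    best_count, best_ndcg = None, float("-inf")
    for num_experts in candidate_counts:
        model = train_retriever(dataset.train, num_experts=num_experts)  # hypothetical
        ndcg = evaluate_ndcg_at_10(model, dataset.dev)                   # hypothetical
        print(f"experts={num_experts:2d}  nDCG@10={ndcg:.4f}")
        if ndcg > best_ndcg:
            best_count, best_ndcg = num_experts, ndcg
    return best_count, best_ndcg
```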
Implications and Future Directions
This paper highlights the potential of modular MoE architectures for enhancing the effectiveness and efficiency of DRMs across task domains. The approach improves retrieval quality without a proportionate increase in model size or training time, which is particularly beneficial for applications built on lightweight models. For broader adoption, future research could focus on dynamically determining the optimal number of experts from dataset characteristics, or on more sophisticated gating techniques that further refine expert selection.
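As one example of a more selective gating mechanism, a top-k gate activates only the k highest-scoring experts per input, in the spirit of sparsely-gated MoE layers. The sketch below is a generic routing layer of that kind, offered as an illustration of a possible direction rather than something evaluated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKGate(nn.Module):
    """Route each embedding to its k highest-scoring experts only."""

    def __init__(self, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.k = k

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        logits = self.gate(embedding)                      # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # keep the k best experts
        # Renormalise over the selected experts; unselected experts get zero weight
        weights = torch.zeros_like(logits)
        weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
        return weights  # sparse mixing weights to apply to expert outputs
```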
Overall, this work supports the argument that judiciously integrating MoE architectures with neural retrieval models can yield tangible performance improvements and better cross-task adaptability without an unwarranted increase in computational demand. Further investigation could extend MoE integration to additional model architectures and tasks, improving retrieval across a broader range of AI-driven IR applications.