Analyzing Efficient Training Strategies for Dense Retrieval Systems
The paper "Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling" presents notable advancements in the resource-efficient training of dense retrieval (DR) models. Dense retrieval systems, particularly those using BERT-based dual-encoders, have shown promise in first-stage retrieval by offering low-latency query responses through nearest neighbor searches. However, the computational expense associated with training these models remains a significant barrier to their broader adoption.
This paper introduces TAS-Balanced, a novel batch sampling method that combines Topic Aware Sampling (TAS) with balanced margin sampling of passage pairs. Two ideas drive the efficiency gains: queries are clustered once before training so that each batch draws topically related queries, and passage pairs are sampled in a balanced way across teacher-score margins so that easy, high-margin pairs do not dominate and each batch stays informative.
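A compact sketch of how such a batch could be composed, assuming queries have been clustered offline and a `margin_of` function returns the teacher's score margin for a pair; the function names, binning scheme, and defaults here are illustrative, not the paper's exact implementation:

```python
import random
from collections import defaultdict

def tas_balanced_batch(query_clusters, pairs_by_query, margin_of,
                       batch_size=32, n_bins=10):
    """Compose one training batch: queries from a single topic cluster,
    passage pairs balanced across teacher-margin bins."""
    # Topic Aware Sampling: all queries in a batch come from one cluster,
    # so in-batch negatives are topically related and more informative.
    cluster = random.choice(list(query_clusters.values()))
    queries = random.sample(cluster, min(batch_size, len(cluster)))

    batch = []
    for q in queries:
        pairs = pairs_by_query[q]                    # candidate (pos, neg) pairs
        margins = [margin_of(q, pos, neg) for pos, neg in pairs]
        # Balanced margin sampling: bucket pairs by the teacher's score
        # margin, then draw from a uniformly chosen bucket so that easy,
        # large-margin pairs do not dominate the batch.
        lo, hi = min(margins), max(margins)
        width = (hi - lo) / n_bins or 1.0            # guard against zero width
        bins = defaultdict(list)
        for pair, m in zip(pairs, margins):
            bins[min(int((m - lo) / width), n_bins - 1)].append(pair)
        pos, neg = random.choice(random.choice(list(bins.values())))
        batch.append((q, pos, neg))
    return batch
```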
The TAS-Balanced method is complemented by a dual-teacher supervision framework that combines a pairwise teacher with an in-batch negative teacher. By using the concatenated BERT cross-encoder (BERT$_\text{CAT}$) for pairwise teaching and ColBERT for the in-batch negative signal, the authors capitalize on both efficient training and high-quality retrieval results. This strategy allows training on a single consumer-grade GPU within 48 hours, a significant reduction in resource requirements compared to methods like ANCE and RocketQA.
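A sketch of what the combined objective could look like in PyTorch, assuming teacher scores have been precomputed; the tensor names and the equal weighting of the two terms are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def margin_mse(s_pos, s_neg, t_pos, t_neg):
    # Distill the teacher's score *margin* rather than its absolute scores.
    return F.mse_loss(s_pos - s_neg, t_pos - t_neg)

def dual_supervision_loss(s_pos, s_neg, s_inbatch,
                          tcat_pos, tcat_neg,
                          tcol_pos, tcol_inbatch):
    """Dual-teacher objective (sketch). Shapes: (B,) for the sampled
    pos/neg pairs, (B, B) for in-batch score matrices where entry (i, j)
    scores query i against the j-th passage in the batch."""
    # Pairwise signal: BERT_CAT teaches the margin of each sampled pair.
    pairwise = margin_mse(s_pos, s_neg, tcat_pos, tcat_neg)
    # In-batch signal: ColBERT teaches the margins between each query's
    # positive and every other passage in the batch.
    inbatch = margin_mse(s_pos.unsqueeze(1), s_inbatch,
                         tcol_pos.unsqueeze(1), tcol_inbatch)
    return pairwise + inbatch
```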
The empirical evaluation highlights TAS-Balanced's state-of-the-art performance on the TREC Deep Learning Track datasets. The method achieves 64 ms latency per query while outperforming BM25 by 44% on nDCG@10, and it improves on the previously best-performing DR models by 5% nDCG@10. Notably, the authors report it as the first dense retriever to surpass every competing method on recall at every cutoff on the TREC-DL evaluation sets.
The exploration of different batch sampling strategies and loss functions further reinforces the robustness and adaptability of TAS-Balanced. In particular, the Margin-MSE loss in the dual-supervision framework shows a consistent advantage across datasets, driving improvements in both recall and precision metrics. Varying the random seeds for cluster, query, and passage pair selection produces minimal performance variability, underscoring the technique's stability.
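For reference, Margin-MSE optimizes the student model $M_s$ to reproduce the teacher $M_t$'s score margin between a relevant passage $p^+$ and a non-relevant passage $p^-$ for a query $q$:

$$\mathcal{L}_{\text{Margin-MSE}}(q, p^+, p^-) = \operatorname{MSE}\!\left(M_s(q, p^+) - M_s(q, p^-),\; M_t(q, p^+) - M_t(q, p^-)\right)$$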
The implications of this work are substantial for the practical deployment of neural search engines. By lowering the hardware and training-time thresholds, TAS-Balanced broadens community access and facilitates further research into dense retrieval and related applications. This accessibility is paramount given the growing need for efficient and scalable NLP solutions.
Looking forward, the integration of TAS-Balanced into broader search architectures shows promise. Combining it with re-ranking models such as mono-duo-T5 indicates considerable room for improving overall search pipeline effectiveness. While current re-rankers already benefit from the increased recall, there remains potential for further optimization tailored to dense retriever-generated candidate sets.
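A schematic of such a two-stage pipeline; both callables below are placeholders rather than real interfaces:

```python
def search(query, dense_retriever, rerank, k_candidates=1000, k_final=10):
    """Two-stage pipeline sketch: the dense retriever supplies a
    high-recall candidate set; a slower cross-encoder-style re-ranker
    (e.g., mono-duo-T5) reorders the top candidates."""
    candidates = dense_retriever(query, top_k=k_candidates)  # fast ANN search
    return rerank(query, candidates)[:k_final]               # precise re-scoring
```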
In conclusion, the paper delivers a comprehensive recipe for training effective dense retrieval models efficiently. TAS-Balanced significantly reduces the computational burden while maintaining or improving retrieval effectiveness, setting a benchmark for future work on scalable, resource-efficient training methodologies for neural information retrieval.