
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List (2104.07186v1)

Published 15 Apr 2021 in cs.IR and cs.CL

Abstract: Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with an inverted list index. Recent neural IR models shift toward soft semantic matching of all query-document terms, but they lose the computational efficiency of exact-match systems. This paper presents COIL, a contextualized exact-match retrieval architecture that brings semantic lexical matching. COIL scoring is based on the contextualized representations of overlapping query-document tokens. The new architecture stores contextualized token representations in inverted lists, bringing together the efficiency of exact match and the representation power of deep language models. Our experimental results show COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency.

Citations (198)

Summary

  • The paper introduces COIL, a novel architecture combining exact lexical matching with contextualized token embeddings to enhance semantic search.
  • It overcomes vocabulary mismatches by integrating deep language model insights into inverted lists, outperforming both classic and neural retrieval systems.
  • COIL achieves higher retrieval metrics with lower latency, demonstrating superior efficiency on benchmarks like MSMARCO and TREC DL tasks.

Contextualized Inverted Lists: Advancing Information Retrieval with COIL

The paper "COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List" introduces COIL, a novel architecture for information retrieval that combines the robustness of classical exact lexical match systems with the semantic depth of modern neural models. The authors propose leveraging token contextualization in retrieval to overcome the vocabulary and semantic mismatch issues that affect traditional and neural systems respectively.

Theoretical Contribution

Classical information retrieval (IR) mechanisms, such as BM25, rely heavily on exact lexical matching, which limits their semantic understanding and adaptability. While recent neural IR models have shifted toward soft semantic matching, they often sacrifice computational efficiency. COIL presents an innovative architecture that captures the benefits of both paradigms. By matching contextualized representations of overlapping tokens rather than relying solely on lexical similarity, COIL enhances semantic matching capabilities without introducing significant computational drawbacks.

Key to COIL's architecture is its use of inverted lists, a well-established indexing method that efficiently maps terms to the documents containing them. In contrast to traditional approaches that store term frequencies or other statistics, COIL's inverted lists contain contextualized vector representations of tokens produced by deep language models. This modification allows the retrieval system to discern and exploit the semantic context in which each term occurs.
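Concretely, the paper's token-level (COIL-tok) score sums, over each query token that also appears in the document, the maximum dot product between the query token's contextualized vector and the document's vectors for that same token. A minimal sketch with toy vectors (function and variable names here are illustrative, not the authors' implementation):

```python
import numpy as np

def coil_tok_score(query_vecs, doc_vecs):
    """COIL-tok relevance: for each query token that also appears in the
    document (exact lexical match), take the max dot product between the
    query token's contextualized vector and the document's vectors for
    that same token, then sum over query tokens.

    query_vecs / doc_vecs: dict token -> (n_occurrences, dim) array of
    contextualized embeddings for that token's occurrences.
    """
    score = 0.0
    for tok, q_mat in query_vecs.items():
        if tok not in doc_vecs:  # non-overlapping tokens contribute nothing
            continue
        sims = q_mat @ doc_vecs[tok].T   # (n_q, n_d) occurrence similarities
        score += sims.max(axis=1).sum()  # best document occurrence per query occurrence
    return score
```

Because the vectors are contextual, "bank" in "river bank" and "bank" in "bank deposit" yield different embeddings, so an exact lexical match only scores highly when the senses also agree.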

Experimental Evaluation

Empirical results demonstrate COIL's potential: it consistently outperforms both classical lexical retrievers and state-of-the-art neural retrievers on the MSMARCO passage and document ranking collections, including the two benchmarks from the TREC 2019 Deep Learning track. Notably, COIL matches the effectiveness of ColBERT, a neural IR model that performs all-to-all soft matching between query and document tokens, while operating at lower complexity and computational expense.

COIL's design comes in two variants: COIL-tok, which relies solely on contextualized token matching, and COIL-full, which adds matching between CLS token vectors to handle vocabulary mismatch. The latter achieves higher recall and delivers significant gains in metrics such as MRR and NDCG.
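The COIL-full combination can be sketched as a dense CLS dot product added on top of the token-level score; the CLS term scores semantically related documents even when no surface tokens overlap, which is what lifts recall over COIL-tok. A self-contained toy version (names are illustrative):

```python
import numpy as np

def coil_full_score(q_cls, d_cls, q_toks, d_toks):
    """COIL-full = dense CLS match + COIL-tok exact-match score.

    q_cls / d_cls: (dim,) [CLS] vectors for the whole query / document.
    q_toks / d_toks: dict token -> (n_occurrences, dim) vector arrays.
    """
    tok_score = sum(
        (q_toks[t] @ d_toks[t].T).max(axis=1).sum()  # best occurrence match
        for t in q_toks if t in d_toks               # overlapping tokens only
    )
    # CLS term provides a semantic fallback under vocabulary mismatch:
    # it is nonzero even when tok_score is zero.
    return float(q_cls @ d_cls) + tok_score
```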

Efficiency and Scalability

One significant advantage of COIL is its ability to deliver improvements without substantially increasing latency. The paper provides a comprehensive analysis of how varying token and CLS vector dimensions impact the trade-off between retrieval effectiveness and efficiency. COIL operates faster than many dense vector retrieval systems and exhibits lower latency compared to ColBERT, even without optimizations such as approximate search or vector quantization that could further enhance its performance.
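The latency advantage follows from the index organization: all document vectors for a given term can be stacked into one matrix at indexing time, so scoring a query token against every posting of that term reduces to a single matrix product followed by a per-document max. A toy sketch of this idea, assuming precomputed per-document token vectors (the helper names are hypothetical, not from the paper's code):

```python
import numpy as np
from collections import defaultdict

def build_inverted_index(corpus):
    """corpus: list (over docs) of dict term -> (n_occ, dim) vector arrays.
    Returns term -> (stacked posting-vector matrix, parallel doc-id array)."""
    buf = defaultdict(lambda: ([], []))
    for doc_id, doc in enumerate(corpus):
        for term, vecs in doc.items():
            buf[term][0].append(vecs)
            buf[term][1].extend([doc_id] * len(vecs))
    return {t: (np.vstack(mats), np.array(ids)) for t, (mats, ids) in buf.items()}

def retrieve(index, query_vecs, num_docs):
    """Score every document against the query, one matrix product per term."""
    scores = np.zeros(num_docs)
    for term, q_mat in query_vecs.items():
        if term not in index:
            continue
        d_mat, doc_ids = index[term]
        sims = q_mat @ d_mat.T               # (n_q, n_postings) in one shot
        for row in sims:
            per_doc = np.full(num_docs, -np.inf)
            np.maximum.at(per_doc, doc_ids, row)   # max over each doc's postings
            scores += np.where(np.isinf(per_doc), 0.0, per_doc)
    return scores
```

Traversal touches only the postings of terms the query actually contains, which is why COIL's cost scales with term overlap rather than with the full collection, unlike single-vector dense retrieval.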

Implications and Future Directions

COIL signifies a substantial shift in retrieval strategies by demonstrating that exact lexical match systems, when augmented with contextualized representations, can achieve semantic-rich search capabilities. This development underscores the potential for hybrid models that merge classical IR structures with modern neural techniques.

Looking forward, COIL's architecture can inspire future explorations into indexing methodologies, vector quantization, and query processing optimizations. The successful integration of dense vector retrieval with lexical match signals presents a pathway towards more efficient, robust IR systems that do not compromise on performance or interpretability. Additionally, deploying COIL within production systems could further validate its real-world applicability and stimulate enhancements in large-scale semantic indexing and retrieval. Overall, COIL provides a framework that can stimulate continued innovation at the intersection of textual and semantic information retrieval.
