Autoregressive Search Engines: Generating Substrings as Document Identifiers (2204.10628v1)

Published 22 Apr 2022 in cs.CL and cs.IR

Abstract: Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive LLMs are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to the retrieval problem with minimal intervention to the models' architecture. Previous work has explored ways to partition the search space into hierarchical structures and retrieve documents by autoregressively generating their unique identifier. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers. This setup allows us to use an autoregressive model to generate and score distinctive ngrams, that are then mapped to full passages through an efficient data structure. Empirically, we show this not only outperforms prior autoregressive approaches but also leads to an average improvement of at least 10 points over more established retrieval solutions for passage-level retrieval on the KILT benchmark, establishing new state-of-the-art downstream performance on some datasets, while using a considerably lighter memory footprint than competing systems. Code and pre-trained models at https://github.com/facebookresearch/SEAL.

Citations (133)

Summary

  • The paper introduces SEAL, a method that leverages autoregressive LMs to generate valid substrings as document identifiers, improving retrieval performance.
  • It integrates constrained decoding with an FM-Index to ensure generated ngrams match the corpus, enhancing both efficiency and accuracy.
  • Experiments show SEAL achieves an average improvement of more than 10 points in passage-level R-precision on the KILT benchmark, which includes datasets such as Natural Questions.

Overview of "Autoregressive Search Engines: Generating Substrings as Document Identifiers"

The paper "Autoregressive Search Engines: Generating Substrings as Document Identifiers," authored by Bevilacqua et al., introduces a novel approach to information retrieval for knowledge-intensive NLP tasks, focusing on leveraging the capabilities of autoregressive LLMs (AR LMs). This approach, named SEAL, resolves the requirement of combining an extensible search engine with a machine reading component that is tasked with both question-answering and evidence retrieval from vast corpora.

Methodology

The core innovation of SEAL is its use of AR LMs to generate substrings that act as identifiers for retrieving documents. Unlike previous methods that partition the search space into predefined hierarchical structures, this work uses the ngrams occurring in a passage as its identifiers, avoiding rigid structural constraints.
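
To make the identifier space concrete, the following minimal sketch (illustrative only, not the paper's implementation) enumerates every ngram of a passage as a candidate identifier; the toy passage and the `max_len` cap are assumptions for the example.

```python
def passage_ngrams(tokens, max_len=10):
    """Enumerate every contiguous ngram of a passage, up to max_len tokens.

    In SEAL's setup, any of these substrings can serve as an identifier
    pointing back to the passage, so no hierarchical ID scheme is imposed.
    """
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield tuple(tokens[start:end])

# Illustrative usage: a toy passage and its distinct candidate identifiers.
passage = "the fm index supports fast substring search".split()
identifiers = set(passage_ngrams(passage, max_len=3))
print(len(identifiers), "candidate ngram identifiers")  # 18
```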

Here are the key elements of their approach:

  • Constrained Decoding with FM-Index: By integrating an AR LM, specifically BART, with the FM-Index, a compressed full-text substring index, the model constrains generation to substrings that actually occur in the corpus. The FM-Index efficiently enumerates the tokens that can extend a given prefix, ensuring that every generated identifier corresponds to an existing substring (a simplified sketch of this constraint follows the list).
  • Intersection-Based Scoring: A new scoring function combines the conditional ngram probabilities produced by the AR LM with the ngrams' corpus frequencies from the index. This lets the system prioritize distinctive ngrams, echoing TF-IDF principles in a generative setting, and aggregates the scores of the distinct ngrams matched in each passage (see the second sketch below).
  • Efficiency and Effectiveness: The approach yields substantial improvements over traditional passage-retrieval methods, achieving state-of-the-art results on benchmarks such as KILT and Natural Questions, while maintaining a considerably smaller memory footprint than competing systems.
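
The constrained-decoding idea can be sketched as follows. A real FM-Index answers "which tokens can extend this prefix?" queries in compressed space over the whole corpus; here a plain dictionary of observed continuations stands in for it, and all names and the toy corpus are illustrative.

```python
from collections import defaultdict

def build_continuations(corpus_tokens, max_len=5):
    """Map each observed prefix ngram to the tokens that can extend it.

    This dictionary is a simplified stand-in for the FM-Index, which
    supports the same prefix-extension query without materializing
    every ngram explicitly.
    """
    cont = defaultdict(set)
    for doc in corpus_tokens:
        for i in range(len(doc)):
            for j in range(i, min(i + max_len, len(doc))):
                cont[tuple(doc[i:j])].add(doc[j])
    return cont

def allowed_next_tokens(prefix, cont):
    """Restrict the LM's next-token choices to corpus continuations.

    During beam search, tokens outside this set receive zero probability,
    so every generated ngram is guaranteed to occur in the corpus.
    """
    return cont.get(tuple(prefix), set())

# Illustrative usage over a two-passage toy corpus.
corpus = [
    "carbon dioxide traps heat in the atmosphere".split(),
    "the atmosphere of mars is mostly carbon dioxide".split(),
]
cont = build_continuations(corpus)
print(allowed_next_tokens(["carbon"], cont))  # {'dioxide'}
```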

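The intersection-based scoring can likewise be sketched. The reweighting below is a TF-IDF-flavored stand-in rather than the paper's exact formula: rare ngrams with high LM probability dominate, and a passage accumulates the weights of the distinct ngrams it contains. All inputs and names are hypothetical.

```python
import math
from collections import defaultdict

def score_passages(scored_ngrams, index_freq, corpus_size, ngram_to_passages):
    """Aggregate generated-ngram scores into passage scores (sketch).

    scored_ngrams:     {ngram: LM log-probability} from constrained search.
    index_freq:        {ngram: corpus frequency}, as an FM-Index reports.
    ngram_to_passages: {ngram: passages containing it}.
    """
    passage_scores = defaultdict(float)
    for ngram, lm_log_prob in scored_ngrams.items():
        # Rarity bonus from index frequencies (TF-IDF-flavored).
        rarity = math.log(corpus_size / (1 + index_freq.get(ngram, 0)))
        weight = max(0.0, lm_log_prob + rarity)
        # Intersection-style aggregation over distinct matched ngrams.
        for pid in ngram_to_passages.get(ngram, ()):
            passage_scores[pid] += weight
    return sorted(passage_scores.items(), key=lambda kv: -kv[1])

# Illustrative usage: a distinctive ngram outranks a frequent one.
ranked = score_passages(
    scored_ngrams={("carbon", "dioxide"): -0.7, ("the",): -0.1},
    index_freq={("carbon", "dioxide"): 2, ("the",): 900},
    corpus_size=1000,
    ngram_to_passages={("carbon", "dioxide"): ["p1", "p2"], ("the",): ["p1", "p3"]},
)
print([pid for pid, _ in ranked])  # ['p1', 'p2', 'p3']
```
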
Experimental Evaluation

The empirical evaluation shows SEAL outperforming established methods, including dense retrievers such as Dense Passage Retrieval (DPR) and autoregressive approaches such as GENRE, on several retrieval benchmarks. In particular, SEAL achieves an improvement of more than 10 points in passage-level R-precision on the KILT benchmark, and downstream task performance improves when the proposed retriever supplies the evidence.
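
For reference, R-precision, the headline metric here, is the fraction of the top-R retrieved passages that are relevant, where R is the number of relevant passages for the query; a minimal implementation:

```python
def r_precision(retrieved_ids, relevant_ids):
    """R-precision: precision over the top-R results, with R = |relevant set|."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    top_r = retrieved_ids[:r]
    return sum(1 for pid in top_r if pid in relevant_ids) / r

# Illustrative usage: two of the top three retrieved passages are relevant.
print(r_precision(["p3", "p1", "p9"], {"p1", "p3", "p7"}))  # 0.666...
```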

Implications and Future Directions

SEAL offers an efficient retrieval architecture that could be scaled up or adapted to applications beyond standard document retrieval. By trading rigidly structured search spaces for the flexibility of AR LMs, the work narrows the gap between generative models' language-understanding capabilities and retrieval tasks.

Future work could pursue larger underlying models or faster constrained decoding, and explore dynamic index updates to address the challenges posed by evolving datasets and content changes.

Conclusion

"Autoregressive Search Engines: Generating Substrings as Document Identifiers" presents significant strides in the field by leveraging autoregressive models in retrieval tasks. The SEAL methodology not only advances state-of-the-art performance metrics in established benchmarks but also proposes a promising alternative to conventional and modern retrieval systems. The approach appreciates the nuanced challenges of passage retrieval, laying the groundwork for innovative applications in AI-driven knowledge systems.
