
Learning to Tokenize for Generative Retrieval (2304.04171v1)

Published 9 Apr 2023 in cs.IR

Abstract: Conventional document retrieval techniques are mainly based on the index-retrieve paradigm. It is challenging to optimize pipelines based on this paradigm in an end-to-end manner. As an alternative, generative retrieval represents documents as identifiers (docid) and retrieves documents by generating docids, enabling end-to-end modeling of document retrieval tasks. However, it is an open question how one should define the document identifiers. Current approaches to the task of defining document identifiers rely on fixed rule-based docids, such as the title of a document or the result of clustering BERT embeddings, which often fail to capture the complete semantic information of a document. We propose GenRet, a document tokenization learning method to address the challenge of defining document identifiers for generative retrieval. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. Three components are included in GenRet: (i) a tokenization model that produces docids for documents; (ii) a reconstruction model that learns to reconstruct a document based on a docid; and (iii) a sequence-to-sequence retrieval model that generates relevant document identifiers directly for a designated query. By using an auto-encoding framework, GenRet learns semantic docids in a fully end-to-end manner. We also develop a progressive training scheme to capture the autoregressive nature of docids and to stabilize training. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets to assess the effectiveness of GenRet. GenRet establishes the new state-of-the-art on the NQ320K dataset. In particular, compared to generative retrieval baselines, GenRet achieves significant improvements on unseen documents. GenRet also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.

Learning to Tokenize for Generative Retrieval: A Novel Approach for Document Identification

Introduction to Generative Retrieval and the Problem Space

The landscape of document retrieval has shifted significantly with the advent of pre-trained language models (LMs), moving from the traditional index-retrieve paradigm to more sophisticated approaches such as dense retrieval (DR) models. These models leverage LMs to learn dense representations of queries and documents, substantially alleviating the lexical mismatch problem. However, DR models have limitations of their own, stemming from the index-retrieve pipeline itself and from the misalignment between their learning objectives and the pre-training objectives of LMs.
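
As a point of reference, dense retrieval reduces to a nearest-neighbor search over a separately maintained embedding index, as in the toy sketch below. The random vectors stand in for encoder outputs; it is this separate index step that generative retrieval removes.

```python
import torch

# Toy dense retrieval: queries and documents live in the same embedding
# space, and retrieval is a similarity search over a pre-built index.
doc_emb = torch.randn(1000, 64)    # 1,000 document embeddings (the "index")
query_emb = torch.randn(1, 64)     # one query embedding

scores = query_emb @ doc_emb.T     # (1, 1000) inner-product similarity scores
top = scores.topk(k=5, dim=-1)     # the index step: nearest documents win
print(top.indices.tolist())
```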

A new paradigm, generative retrieval, emerges as an alternative, characterizing documents with identifiers (docids) and retrieving these documents by generating their docids end-to-end. This presents a promising avenue for better leveraging large LMs but introduces the challenge of defining appropriate document identifiers that can accurately capture document semantics.
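
To make the paradigm concrete, the sketch below shows the core inference idea in generative retrieval: each docid is a short sequence of discrete codes, and decoding is constrained by a prefix trie so the model can only emit docids that actually exist in the corpus. The corpus, codes, and `score` function here are toy stand-ins (a trained seq2seq model would supply token logits), not the paper's implementation.

```python
from collections import defaultdict

# Toy corpus: each document is identified by a short sequence of discrete codes.
DOCIDS = {
    (3, 1, 4): "doc-A",
    (3, 1, 5): "doc-B",
    (2, 7, 1): "doc-C",
}

def build_trie(docids):
    """Map each docid prefix to the set of valid next codes."""
    trie = defaultdict(set)
    for docid in docids:
        for i in range(len(docid)):
            trie[docid[:i]].add(docid[i])
    return trie

def score(prefix, code):
    """Stand-in for model logits; a real system scores codes with a seq2seq LM."""
    return -code

def constrained_greedy_decode(trie, length=3):
    prefix = ()
    for _ in range(length):
        valid = trie[prefix]  # only codes that extend to a real docid
        prefix += (max(valid, key=lambda c: score(prefix, c)),)
    return prefix

trie = build_trie(DOCIDS)
docid = constrained_greedy_decode(trie)
print(docid, "->", DOCIDS[docid])  # (2, 7, 1) -> doc-C
```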

Overview of GenRet

To tackle the challenge of generating semantically meaningful docids, the paper introduces GenRet, a document tokenization learning method designed for generative retrieval. GenRet couples a discrete auto-encoding framework with a sequence-to-sequence retrieval model to tokenize documents into concise, discrete representations. The approach comprises three key components (a minimal sketch follows the list):

  • A tokenization model that generates docids for documents.
  • A reconstruction model that leverages these docids to reconstruct the original documents, ensuring the semantic integrity of the identified docids.
  • An end-to-end optimized generative model that accurately retrieves documents for a given query by autoregressively generating relevant docids.
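
The following is a minimal PyTorch sketch of how such a discrete auto-encoder can be wired together. The linear layers stand in for the paper's Transformer-based tokenization and reconstruction models, and the nearest-neighbor codebook lookup with a straight-through estimator is a common way to make the discrete step differentiable; treat this as an illustration of the idea, not GenRet's actual architecture.

```python
import torch
import torch.nn as nn

class DiscreteDocTokenizer(nn.Module):
    """Toy discrete auto-encoder: encode a document embedding, snap it to the
    nearest codebook entry (one discrete docid token), then reconstruct."""

    def __init__(self, dim: int = 64, codebook_size: int = 256):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)  # stand-in for a Transformer tokenization model
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Linear(dim, dim)  # stand-in for the reconstruction model

    def forward(self, doc_emb: torch.Tensor):
        z = self.encoder(doc_emb)                     # continuous representation
        dists = torch.cdist(z, self.codebook.weight)  # (batch, codebook_size)
        codes = dists.argmin(dim=-1)                  # discrete docid tokens
        quantized = self.codebook(codes)
        # Straight-through estimator: gradients bypass the non-differentiable argmin.
        quantized = z + (quantized - z).detach()
        recon = self.decoder(quantized)
        return recon, codes

model = DiscreteDocTokenizer()
doc_emb = torch.randn(4, 64)   # four fake document embeddings
recon, codes = model(doc_emb)
print(codes.tolist())          # one discrete code per document
```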

Methodology and Implementation

The efficacy of GenRet is attributed to its progressive training scheme, which captures the autoregressive nature of docid generation. Training combines a series of losses: a reconstruction loss that ensures the docid captures the document's semantics, a commitment loss that prevents the model from forgetting previously assigned docids, and a retrieval loss that directly optimizes retrieval performance. Additionally, GenRet addresses the challenge of docid diversity through a parameter initialization strategy and a novel docid re-assignment procedure based on diverse clustering techniques. A toy sketch of the combined objective follows.
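
The sketch below shows one way the three losses can be combined in a single training step. The commitment term follows the VQ-VAE convention of pulling the encoder output toward its assigned code, and the retrieval term is a cross-entropy over codes; the 0.25 weight and the exact formulations are assumptions for illustration, not the paper's reported objective.

```python
import torch
import torch.nn.functional as F

# Shapes for a toy batch; all tensors below are random stand-ins.
batch, dim, vocab = 4, 64, 256

z = torch.randn(batch, dim, requires_grad=True)  # tokenization-model outputs
codebook = torch.randn(vocab, dim)               # docid code embeddings
doc_emb = torch.randn(batch, dim)                # documents to reconstruct
query_logits = torch.randn(batch, vocab, requires_grad=True)  # retrieval model's scores

codes = torch.cdist(z, codebook).argmin(dim=-1)  # assigned docid tokens
quantized = codebook[codes]
quantized_st = z + (quantized - z).detach()      # straight-through estimator

recon_loss = F.mse_loss(quantized_st, doc_emb)         # semantic capture (decoder omitted)
commit_loss = F.mse_loss(z, quantized.detach())        # keep encoder near its codes
retrieval_loss = F.cross_entropy(query_logits, codes)  # generate the right docid
total = recon_loss + 0.25 * commit_loss + retrieval_loss  # 0.25 is an assumed weight
total.backward()
print(float(total))
```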

Experimental Results and Implications

GenRet was rigorously evaluated against existing state-of-the-art models across several benchmark datasets, including NQ320K, MS MARCO, and BEIR. The results were promising, establishing new performance benchmarks on the NQ320K dataset and demonstrating significant improvements, especially in the retrieval of unseen documents. GenRet's ability to considerably outperform previous methods in generalization reflects its robustness and versatility across various retrieval tasks.

Theoretical and Practical Contributions

This work makes several notable contributions to the domain of document retrieval. GenRet's discrete auto-encoding framework represents a pioneering approach to learning semantic docids, offering a significant step toward resolving the lexical mismatch problem inherent in traditional retrieval methods. The proposed progressive training scheme and diverse clustering techniques further enhance the model's capability to produce and utilize semantically rich docids. From a practical standpoint, GenRet offers a scalable solution to the ever-growing demand for effective and efficient document retrieval systems.

Looking Ahead

Despite the demonstrable advancements introduced by GenRet, the exploration of document tokenization for generative retrieval is in its nascent stages. Future research directions could include expanding the model's scalability to accommodate larger document collections and further refining the tokenization learning process. Additionally, integrating generative pre-training within document tokenization presents a promising avenue for enhancing the semantic understanding of LMs.

In conclusion, GenRet marks a significant step forward in the quest for optimizing document retrieval tasks. Its innovative approach to learning document identifiers opens up new possibilities for leveraging generative models in information retrieval, setting the stage for future advancements in this exciting field.

Authors (10)
  1. Weiwei Sun (93 papers)
  2. Lingyong Yan (29 papers)
  3. Zheng Chen (221 papers)
  4. Shuaiqiang Wang (68 papers)
  5. Haichao Zhu (9 papers)
  6. Pengjie Ren (95 papers)
  7. Zhumin Chen (78 papers)
  8. Dawei Yin (165 papers)
  9. Maarten de Rijke (263 papers)
  10. Zhaochun Ren (117 papers)
Citations (53)