Exploring Training and Inference Scaling Laws in Generative Retrieval (2503.18941v2)

Published 24 Mar 2025 in cs.IR and cs.CL

Abstract: Generative retrieval reformulates retrieval as an autoregressive generation task, where LLMs generate target documents directly from a query. As a novel paradigm, the mechanisms that underpin its performance and scalability remain largely unexplored. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance. We propose a novel evaluation metric inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods. Our experiments show that n-gram-based methods align strongly with training and inference scaling laws. We find that increasing model size, training data scale, and inference-time compute all contribute to improved performance, highlighting the complementary roles of these factors in enhancing generative retrieval. Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. Our findings underscore that model sizes, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.

Summary

Analysis of Scaling Laws in Generative Retrieval

The paper "Exploring Training and Inference Scaling Laws in Generative Retrieval" introduces a framework for investigating the scaling behavior of generative retrieval systems, which utilize LLMs to generate document identifiers autoregressively. This paper discerns the influence of model size, data scale, and computational resources on the performance of generative retrieval, providing insights pivotal for enhancing the design and optimization of these systems.

Key Findings and Methodologies

The authors introduce a novel evaluation metric inspired by contrastive entropy, referred to as Contrastive Generation Loss (CGL). This metric addresses a limitation of traditional discrete retrieval metrics by offering a continuous performance signal that captures nuanced variations in retrieval effectiveness across different generative retrieval methods. By evaluating the probability of generating the correct document identifier for a given query, CGL provides a more comprehensive view of retrieval performance.
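
The summary does not reproduce the paper's exact formula, but a contrastive-entropy-style score can be sketched as follows: softmax the target identifier's sequence log-likelihood against the log-likelihoods of sampled negative identifiers, then take the negative log-probability assigned to the target. The function name and the negative-sampling scheme below are assumptions for illustration only:

```python
import numpy as np

def contrastive_generation_loss(target_loglik, negative_logliks):
    """Illustrative contrastive-entropy-style metric, NOT the paper's
    exact formula: softmax the target docid's sequence log-likelihood
    against sampled negatives, return -log p(target)."""
    logits = np.concatenate(([target_loglik], np.asarray(negative_logliks)))
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return log_z - target_loglik  # = -log p(target | query, candidates)

# Toy usage: the target docid is more likely than three negatives,
# so the loss is small; a weaker model would yield a larger value.
print(contrastive_generation_loss(-2.0, [-5.0, -6.5, -4.2]))
```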

The experiments were conducted with prominent architectures, including T5 and LLaMA models of varying sizes, across several retrieval strategies. Highlights from the results include:

  • Model Size Scaling: The results show a power-law relationship between model size and retrieval performance for n-gram-based generative retrieval methods: larger models perform systematically better, and LLaMA models exhibit a steeper scaling curve than T5 models (a power-law fitting sketch follows this list).
  • Data Size Scaling: Increasing the training data volume yields substantial gains for both n-gram-based and codebook-based approaches, but n-gram-based methods scale more strongly, reflecting closer alignment with LLM capabilities.
  • Inference Scaling: Devoting more compute to inference likewise improves retrieval performance substantially. N-gram-based approaches benefit most from larger inference budgets, with LLaMA models showing the most pronounced scaling.
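
Scaling-law claims of this kind are typically quantified by fitting a saturating power law, L(x) = a·x^(−b) + c, where x is model parameters, training examples, or inference compute, and a steeper fitted exponent b means stronger scaling. The following is a minimal fitting sketch; the data points are placeholders, not the paper's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # L(x) = a * x^(-b) + c: loss decays as a power of scale x
    # toward an irreducible floor c.
    return a * np.power(x, -b) + c

# Illustrative model sizes (parameters) and losses; made-up values,
# NOT the paper's measurements.
sizes = np.array([6.0e7, 2.2e8, 7.7e8, 3.0e9, 7.0e9])
losses = np.array([3.10, 2.62, 2.24, 1.93, 1.78])

(a, b, c), _ = curve_fit(power_law, sizes, losses,
                         p0=[50.0, 0.2, 1.0], maxfev=20000)
print(f"fit: L(x) = {a:.2f} * x^(-{b:.3f}) + {c:.2f}")
```

The same functional form can be fit separately against model size, data scale, and inference compute to compare the three scaling dimensions.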

Theoretical and Practical Implications

The paper underscores several insights for future AI-driven retrieval systems. Primarily, larger architectures, particularly decoder-only models like LLaMA, are better positioned to convert additional data and compute into retrieval performance, which informs how computational budgets should be allocated when building and deploying such systems.

Furthermore, the findings reveal significant headroom in inference-time scaling, a relatively underexplored aspect of generative retrieval. The results also emphasize the importance of aligning retrieval methods with the inherent strengths of LLMs and motivate further investigation into adaptive scaling methods.

Future Directions in AI

These results invite several avenues for future work. Analyses on more complex and diverse datasets could test whether the observed scaling behavior extends to more intricate retrieval tasks. Investigating hybrid training objectives and architectures may reveal further ways to improve the robustness and efficiency of generative retrieval systems.

Additionally, research into better training schemes for codebook-based methods could unlock their scaling advantages, given sufficient data and training epochs. Integrating ranking losses or discriminative training strategies (sketched below) may mitigate the learning challenges associated with novel identifier types and improve their scaling behavior.
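
As one hypothetical instantiation of such a ranking loss, a pairwise margin objective over identifier log-likelihoods could complement the standard generation loss. The function below is a sketch under that assumption, not a proposal from the paper:

```python
import torch
import torch.nn.functional as F

def identifier_margin_loss(pos_loglik, neg_loglik, margin=1.0):
    """Pairwise margin ranking loss over docid log-likelihoods
    (illustrative): push the correct identifier's sequence
    log-likelihood above a sampled negative's by at least `margin`."""
    return F.relu(margin - (pos_loglik - neg_loglik)).mean()

# Toy usage with a batch of two query-docid pairs; the first pair
# violates the margin and therefore contributes a gradient.
pos = torch.tensor([-2.0, -3.0], requires_grad=True)
neg = torch.tensor([-2.5, -6.0])
loss = identifier_margin_loss(pos, neg)
loss.backward()
print(loss.item())  # 0.25
```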

Overall, this paper offers a comprehensive understanding of the generative retrieval landscape, inviting academia and industry alike to harness the power of scaling laws for advancing information retrieval technology. It serves both as a benchmark and a guiding framework for subsequent explorations in the field.
