QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation (2511.15996v1)

Published 20 Nov 2025 in cs.IR and cs.CL

Abstract: We present QueryGym, a lightweight, extensible Python toolkit that supports LLM-based query reformulation. Such a toolkit is timely because recent work on LLM-based query reformulation has shown notable increases in retrieval effectiveness. However, while different authors have sporadically shared implementations of their methods, there is no unified toolkit that provides a consistent implementation of such methods, which hinders fair comparison, rapid experimentation, consistent benchmarking and reliable deployment. QueryGym addresses this gap by providing a unified framework for implementing, executing, and comparing LLM-based reformulation methods. The toolkit offers: (1) a Python API for applying diverse LLM-based methods, (2) a retrieval-agnostic interface supporting integration with backends such as Pyserini and PyTerrier, (3) a centralized prompt management system with versioning and metadata tracking, (4) built-in support for benchmarks like BEIR and MS MARCO, and (5) a completely open-source extensible implementation available to all researchers. QueryGym is publicly available at https://github.com/radinhamidi/QueryGym.

Summary

The paper introduces QueryGym, a toolkit unifying diverse LLM query reformulation methods to enhance retrieval performance.
It offers a modular Python API with retrieval-agnostic interfaces and centralized prompt management for reproducible experiments.
The toolkit supports benchmarks like BEIR and MS MARCO, ensuring consistent and fair comparisons across varied IR tasks.

QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation

QueryGym is a Python toolkit for research on LLM-based query reformulation, a technique for improving retrieval effectiveness. By unifying disparate implementations of reformulation methods behind one interface, it enables fair comparison, consistent benchmarking, and reliable deployment across information retrieval (IR) scenarios.

Motivation and Objectives

Query reformulation and expansion are critical to the performance of information retrieval systems, particularly when initial user queries are vague or contextually incomplete. Modern advances in LLMs have facilitated the generation of enriched query variants that can bridge the gap between user intent and document relevance. However, the existing landscape lacks a coherent, reproducible software framework that supports systematic development and experimentation in this domain.

The paper identifies three key issues: fragmented implementations tied to specific benchmarks, the absence of standardized interfaces, and poor reproducibility caused by undocumented configuration dependencies. QueryGym addresses these gaps with a single framework for implementing and comparing LLM-based reformulation techniques, which improves reproducibility and reduces engineering overhead.

Figure 1: Inheritance hierarchy for the main classes in the QueryGym Python package.

Framework Design and Capabilities

QueryGym is a structured, modular toolkit designed to facilitate the development and testing of LLM-based query reformulation methods. The toolkit is organized into several key components:

Python API: Provides a standardized interface for integrating various LLM-based methods, simplifying their application across retrieval tasks.
Retrieval-Agnostic Interface: Supports seamless integration with different retrieval backends such as Pyserini and PyTerrier, allowing flexibility in IR pipeline configurations without the need to reimplement retrieval logic.
Centralized Prompt Management: Keeps prompts in one place with versioning and metadata tracking, so that prompt engineering is reproducible and transparent.
Benchmark Support: Natively supports datasets such as BEIR and MS MARCO, and allows for the incorporation of custom data through flexible loaders.
Open-Source Commitment: By maintaining an open-source framework, QueryGym encourages broader participation and innovation in LLM-based reformulation research.
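The components above can be illustrated with a minimal, self-contained sketch. It is not QueryGym's actual API: every name here (`Prompt`, `PromptRegistry`, `Retriever`, `TermOverlapRetriever`, `PromptedExpansion`, `fake_llm`) is hypothetical, the "LLM" is a canned function, and the retriever is a toy term-overlap ranker standing in for a backend such as Pyserini or PyTerrier. The sketch only shows how a versioned prompt store and a backend-neutral search interface might fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class Prompt:
    """A prompt template plus the metadata needed to reproduce a run (hypothetical)."""
    name: str
    version: str
    template: str
    metadata: Dict[str, str] = field(default_factory=dict)


class PromptRegistry:
    """Stores prompts keyed by (name, version) so an experiment can pin an exact version."""

    def __init__(self) -> None:
        self._prompts: Dict[Tuple[str, str], Prompt] = {}

    def register(self, prompt: Prompt) -> None:
        key = (prompt.name, prompt.version)
        if key in self._prompts:
            # Refuse silent overwrites: a published version must stay immutable.
            raise ValueError(f"prompt {key} is already registered")
        self._prompts[key] = prompt

    def get(self, name: str, version: str) -> Prompt:
        return self._prompts[(name, version)]


class Retriever(Protocol):
    """Backend-neutral interface: any object with this method can be plugged in."""

    def search(self, query: str, k: int) -> List[str]: ...


class TermOverlapRetriever:
    """Toy in-memory backend that ranks documents by shared query terms."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self._docs = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

    def search(self, query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & words), doc_id) for doc_id, words in self._docs.items()]
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for score, doc_id in ranked if score > 0][:k]


LLM = Callable[[str], str]  # assumption: the model is just "prompt text in, text out"


class PromptedExpansion:
    """Appends LLM-generated terms to the original query."""

    def __init__(self, prompt: Prompt, llm: LLM) -> None:
        self.prompt = prompt
        self.llm = llm

    def reformulate(self, query: str) -> str:
        generated = self.llm(self.prompt.template.format(query=query))
        return f"{query} {generated}"


def fake_llm(prompt_text: str) -> str:
    """Stand-in for a real model call; always returns the same expansion terms."""
    return "sparse lexical ranking"


registry = PromptRegistry()
registry.register(
    Prompt("expand", "v1", "List search keywords related to: {query}", {"author": "example"})
)

method = PromptedExpansion(registry.get("expand", "v1"), fake_llm)
retriever: Retriever = TermOverlapRetriever({
    "d1": "BM25 is a sparse lexical ranking function",
    "d2": "dense retrieval encodes text into vectors",
})

reformulated = method.reformulate("bm25")
hits = retriever.search(reformulated, k=1)
print(reformulated, hits)
```

Because `PromptedExpansion` depends only on a prompt and a callable, and the search step depends only on the `Retriever` protocol, either side can be swapped without touching the other. That separation is the design idea the list describes, whatever form it takes in the real toolkit.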

Illustrative Use Cases

QueryGym is demonstrated across various use cases to highlight its application breadth and ease of integration:

Basic Query Reformulation: Allows rapid iteration over reformulation strategies using built-in reformulation methods and LLMs. The toolkit automates batch processing and result tracking, facilitating direct comparison of reformulated queries.
Contextual Reformulation with Retrieval Integration: Demonstrates QueryGym's capability to incorporate retrieval context through integration with retrieval engines. This setup supports reformulation methods requiring external context, leveraging existing IR tools seamlessly.
Benchmarking and Comparison: Enables systematic comparison of multiple reformulation methods across datasets using a controlled experimental setup. QueryGym's pipeline ensures uniform parameter management and method standardization.
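The benchmarking use case can be sketched as a loop that runs every reformulation method over the same queries, the same retriever, and the same cutoff, then reports one metric per method. The code below is an illustrative toy, not QueryGym's pipeline: the corpus, queries, relevance judgments, the `canned_expansion` method (a stand-in for an LLM-based method), and the choice of recall@k are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set

Reformulator = Callable[[str], str]

# Toy corpus, queries, and relevance judgments (all invented for illustration).
DOCS: Dict[str, str] = {
    "d1": "bm25 is a sparse lexical ranking function",
    "d2": "dense retrieval encodes text into vectors",
    "d3": "query expansion adds related terms to a query",
}
QUERIES: Dict[str, str] = {"q1": "lexical ranking", "q2": "embedding search"}
QRELS: Dict[str, Set[str]] = {"q1": {"d1"}, "q2": {"d2"}}


def search(query: str, k: int) -> List[str]:
    """Rank documents by the number of terms they share with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.split())), doc_id) for doc_id, text in DOCS.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0][:k]


def identity(query: str) -> str:
    """Baseline: leave the query unchanged."""
    return query


SYNONYMS = {"embedding": "dense vectors", "search": "retrieval"}


def canned_expansion(query: str) -> str:
    """Stand-in for an LLM-based method: appends canned related terms."""
    extra = [SYNONYMS[term] for term in query.split() if term in SYNONYMS]
    return " ".join([query, *extra])


def recall_at_k(method: Reformulator, k: int) -> float:
    """Mean fraction of relevant documents found in the top k, over all queries."""
    total = 0.0
    for qid, query in QUERIES.items():
        retrieved = set(search(method(query), k))
        relevant = QRELS[qid]
        total += len(retrieved & relevant) / len(relevant)
    return total / len(QUERIES)


def compare(methods: Dict[str, Reformulator], k: int = 1) -> Dict[str, float]:
    """Evaluate every method under identical data, retriever, and cutoff."""
    return {name: recall_at_k(fn, k) for name, fn in methods.items()}


results = compare({"identity": identity, "canned_expansion": canned_expansion}, k=1)
for name, score in results.items():
    print(f"{name}: recall@1 = {score:.2f}")
```

On this toy data the unmodified queries reach a recall@1 of 0.50, because "embedding search" shares no terms with its relevant document, while the expanded queries reach 1.00. Holding everything except the reformulation method fixed is what makes the two numbers comparable.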

Conclusions and Impact

QueryGym provides an extensible environment for developing and benchmarking LLM-based query reformulation methods. Its design targets modularity, reproducibility, and scalability, so that researchers can try new reformulation strategies without rebuilding the surrounding retrieval and evaluation code. By encouraging consistent usage across studies, it offers a common foundation for structured experimentation in LLM-driven query reformulation. The toolkit is open source and available at https://github.com/radinhamidi/QueryGym, where the community can contribute and extend it.