
Multi-Query GSM-Symbolic Benchmark

Updated 4 July 2025
  • Multi-Query GSM-Symbolic is a dataset paradigm that augments GSM-Symbolic by pairing each context with multiple, related queries to study precomputed reasoning.
  • It employs synthetic query generation to keep difficulty and relevance consistent across queries, enabling robust evaluation of sleep-time compute strategies.
  • Empirical results show up to a 2.5× reduction in per-query cost by leveraging precomputed context, enhancing efficiency in interactive language applications.

Multi-Query GSM-Symbolic is a dataset and experimental paradigm introduced to study the interplay between “sleep-time compute” and LLM inference efficiency, specifically in scenarios where multiple queries are posed about a shared context. By extending the GSM-Symbolic benchmark with multiple related queries per context, Multi-Query GSM-Symbolic enables quantification of how pre-computed reasoning or representations—performed offline or “at sleep-time”—can be amortized over several user questions, optimizing both computational cost and real-world latency for LLM applications.

1. Definition and Motivation

Multi-Query GSM-Symbolic is defined as an augmentation of the GSM-Symbolic dataset where each context (a natural language scenario or passage) is paired with numerous related queries rather than just one. The construction facilitates empirical evaluation of “sleep-time compute” strategies, where an LLM “thinks” about or preprocesses the context before test-time queries arrive. The fundamental motivation is to study how such pre-computed knowledge or representations (if reused across multiple queries) reduce the average test-time compute per query and to establish conditions under which this amortization yields substantial benefits.

The benchmark reflects realistic settings such as document question-answering, code analysis, and knowledge-based dialog, where a user or agent may issue several related queries about the same underlying information. The goal is to measure the degree to which an LLM’s heavy-lift reasoning about context can be shared over multiple queries, lowering latency and resource use per query for production deployments (2504.13171).

2. Dataset Construction and Methodology

Multi-Query GSM-Symbolic builds on stateful variants of GSM-Symbolic, initially by splitting each example into a static context and a primary (ground truth) query. For the multi-query extension, additional queries are synthetically generated for each context using a prompt template and an LLM (specifically, OpenAI’s o3-mini), with care to maintain complexity and scope commensurate with the original question.

  • Synthetic Generation: For each context, the LLM is prompted to “write as many equally difficult new questions as you can about the above passage that don't overlap with the original answer, and then answer each one.” This process creates a set of challenging, diverse, and contextually-relevant queries, mimicking plausible human or agentic question-asking behaviors.
  • Dataset Size: In the main experiments (P1), the dataset consists of 1,095 distinct contexts and 12,043 total questions (1,095 originals + 10,948 synthetic). In an additional partition (P2), there are 500 contexts and 5,497 questions.
  • Example Types: Each question associated with a context is validated to ensure similar reasoning steps and difficulty as the original, ensuring that cost and performance metrics across queries are comparable and representative.
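The generation step above can be sketched as a simple prompt-and-parse loop. This is a minimal illustration, not the paper's actual pipeline: the `complete` callable stands in for any LLM API (the paper used OpenAI's o3-mini), and the line-based parsing and `GENERATION_PROMPT` wrapper are hypothetical simplifications of the quoted prompt.

```python
# Hypothetical sketch of the synthetic multi-query generation step.
# `complete(prompt) -> str` is a stand-in for an LLM call (o3-mini in the paper).

GENERATION_PROMPT = (
    "{context}\n\n"
    "Original question: {question}\n\n"
    "Write as many equally difficult new questions as you can about the "
    "above passage that don't overlap with the original answer, and then "
    "answer each one."
)

def build_generation_prompt(context: str, original_question: str) -> str:
    """Fill the prompt template used to elicit additional queries for a context."""
    return GENERATION_PROMPT.format(context=context, question=original_question)

def expand_context(context: str, original_question: str, complete) -> list[str]:
    """Return the original question plus synthetic ones parsed from LLM output.

    Parsing is a naive line-based split on question marks; the real pipeline
    additionally validated difficulty and non-overlap with the original answer.
    """
    raw = complete(build_generation_prompt(context, original_question))
    synthetic = [ln.strip() for ln in raw.splitlines() if ln.strip().endswith("?")]
    return [original_question] + synthetic
```

Running this over every context, with validation of reasoning depth per question, yields the multi-query splits described above (e.g., P1's 1,095 contexts expanded to 12,043 questions).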

3. Sleep-Time Compute and Amortization Strategy

The experimental paradigm is designed to isolate and measure the benefit of amortizing expensive context processing across multiple queries:

  • Sleep-Time Phase: The model receives the context and is allowed to precompute or “think ahead” in an unconstrained manner, producing a structured context representation or set of inferences (denoted c′).
  • Test-Time Phase: For each query q_1, q_2, ..., q_k relating to the same context, the model answers by leveraging c′, ideally avoiding repetitive high-latency computation.
  • Performance Metric: The principal metric is average cost per query, formalized as

\text{Avg. Cost per Query} = \frac{\text{Total Sleep-time Compute} + \text{Total Test-time Compute}}{\text{Number of Queries}}

Token cost is weighted: in the experiments, test-time tokens are assigned a cost 10× that of sleep-time tokens to reflect latency and system cost asymmetry in practical deployments.

  • Result: With increasing numbers of queries per context, the amortization effect becomes more pronounced. For example, with 10 queries per context, the average per-query cost can be reduced by up to 2.5× compared to the single-query baseline.
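The cost metric and amortization effect above can be made concrete with a short calculation. This is an illustrative sketch: the 10× test-time weighting matches the experiments described here, but the token budgets (`sleep`, `cheap_answer`, `full_answer`) are hypothetical numbers chosen for readability, not figures from the paper.

```python
# Amortized cost metric under the paper's weighting: test-time tokens
# cost 10x sleep-time tokens. Token budgets below are illustrative only.

TEST_TIME_WEIGHT = 10.0  # test-time tokens are 10x as costly as sleep-time tokens

def avg_cost_per_query(sleep_tokens: float, test_tokens_per_query: float,
                       num_queries: int) -> float:
    """(total sleep-time compute + total test-time compute) / number of queries."""
    total = sleep_tokens + TEST_TIME_WEIGHT * test_tokens_per_query * num_queries
    return total / num_queries

# Hypothetical budgets: a one-off sleep-time pass makes each answer cheap,
# while the no-precompute baseline pays full reasoning cost per query.
sleep = 2000         # one-off sleep-time tokens per context
cheap_answer = 50    # test-time tokens per query when reusing c'
full_answer = 300    # test-time tokens per query without precomputation

for k in (1, 10):
    with_sleep = avg_cost_per_query(sleep, cheap_answer, k)
    baseline = avg_cost_per_query(0, full_answer, k)
    print(f"k={k}: sleep-time={with_sleep:.0f}, baseline={baseline:.0f}")
```

The fixed sleep-time term is divided by the number of queries, so the advantage over the baseline grows with k; the exact ratio depends on the actual token budgets measured in the experiments.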

4. Comparative Analysis and Baseline Methods

Multi-Query GSM-Symbolic supports direct comparison against several baseline answer-generation strategies:

| Method | Description | Amortization Across Queries? | Typical Cost/Query |
| --- | --- | --- | --- |
| Sleep-time Compute | Preprocess context at sleep-time; reuse for all queries | Yes (main focus) | 2.5× lower at ≥10 queries |
| Standard Test-Time Compute | Each query processed independently at query time | No | Highest |
| Parallel Test-Time Scaling | Multiple answers sampled/tested in parallel, per query | No | High, requires validation |
| Context-Only Baseline | Model must “guess” answer from context (no explicit question) | Indirectly | Low accuracy |

Empirical results show that sleep-time compute with Multi-Query GSM-Symbolic Pareto-dominates standard and parallel scaling approaches for the same accuracy, provided that queries are sufficiently related and predictable with respect to the context (2504.13171).

5. Practical Implications and Applications

The Multi-Query GSM-Symbolic paradigm has significant implications for the deployment and design of LLM-backed systems:

  • Interactive Systems: Applications such as document QA, coding tools, and conversational assistants frequently encounter users submitting multiple queries per shared background; sleep-time compute strategies can achieve substantial per-query cost and latency reduction in these settings.
  • Resource Allocation: By rebalancing inference workloads from test-time (latency-sensitive) to background compute (sleep-time), system throughput can be improved and operational costs optimized.
  • Generality: The techniques and benchmarks generalize to any task where context recurs across multiple queries, including search, support agents, and iterative analysis workflows.

A plausible implication is that as workloads become more “multi-query” in nature, organizations and platform providers may realize operational and economic advantages by investing in advance preprocessing strategies, mirroring sleep-time compute’s amortization properties observed in this benchmark.

6. Limitations and Future Directions

While Multi-Query GSM-Symbolic demonstrates strong amortization benefits, its effectiveness is modulated by several factors:

  • Query Predictability: When future queries are difficult to anticipate from the given context, the ability to reuse sleep-time inferences diminishes, reducing amortization gains.
  • Representation Strategies: The choice of how to represent and cache the precomputed information (e.g., in natural language vs. structured form) may impact both answer quality and efficiency.
  • Dynamic and Evolving Contexts: The current setup primarily addresses static context with a fixed set of related queries. Generalizing to dynamic, multi-turn, or evolving scenarios (e.g., conversation, streaming analytics) is a direction for future work.

Authors suggest that optimal allocation between sleep-time and test-time compute, better learning of query distributions, and the exploration of richer context representation strategies remain open areas for research and practical improvement (2504.13171).

7. Summary Table: Core Attributes of Multi-Query GSM-Symbolic

| Aspect | Description |
| --- | --- |
| Definition | GSM-Symbolic dataset extension with multiple queries per context |
| Purpose | Empirical testbed for amortizing sleep-time compute over several related queries |
| Dataset Construction | Synthetically generated queries per passage using LLM prompting |
| Amortization Effect | Up to 2.5× reduction in average per-query cost with ≥10 queries/context |
| Comparative Baselines | Outperforms per-query test-time and parallel scaling under practical cost models |
| Limitation | Benefits decline with less predictable/unrelated queries |
| Applications | QA systems, agentic assistants, coding tools, document parsing where multi-query is prevalent |
| Future Directions | Adaptive compute allocation, query forecasting, richer context representations, dynamic context |

Multi-Query GSM-Symbolic establishes a principled, empirically grounded foundation for evaluating and designing LLM-based systems under realistic “multi-question, shared-context” conditions, directly informing compute allocation policies and system architecture for efficient large-scale deployment.
