Sleep-time Compute: Beyond Inference Scaling at Test-time (2504.13171v1)

Published 17 Apr 2025 in cs.AI and cs.CL

Abstract: Scaling test-time compute has emerged as a key ingredient for enabling LLMs to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.

Summary

  • The paper presents sleep-time compute, a method that leverages idle periods to pre-compute enriched context for faster, cost-efficient LLM inference.
  • It demonstrates that pre-computing context reduces test-time token usage by up to 5 times while maintaining comparable accuracy.
  • The approach effectively amortizes cost over multiple queries, making it ideal for persistent-context applications like document Q&A and coding agents.

This paper introduces sleep-time compute, a technique to reduce the latency and cost associated with scaling LLM inference at test-time, particularly for tasks involving persistent context (2504.13171). The core idea is to leverage the time when an LLM application is idle (between user interactions) to perform computations on the available context before a specific user query arrives. This pre-computation aims to make answering the subsequent query faster and cheaper.

Many LLM applications, such as document Q&A, coding agents, or conversational assistants, operate on a context (e.g., a document, codebase, or conversation history) that exists before the user's next input (the query). Standard test-time compute methods perform all reasoning only after receiving both the context $c$ and the query $q$, denoted $T_B(q, c) \rightarrow a$, where $B$ is the test-time compute budget. This can lead to high latency and redundant computation when multiple queries relate to the same complex context.

Sleep-time compute introduces a pre-processing step, $S(c) \rightarrow c'$. During "sleep time", when only the context $c$ is known, the LLM is prompted to anticipate potential future queries, perform relevant inferences, summarize, or otherwise transform $c$ into a new, enriched representation $c'$. This can involve standard test-time scaling techniques applied during the sleep phase. The paper implements this with prompts and dedicated function calls (rethink_memory, finish_rethinking_memory) that let the model iteratively update the context representation (see the paper's appendix on implementation details).
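
Concretely, the sleep-time phase could be implemented roughly as below. This is a minimal sketch rather than the paper's code: `ToolCall` and `llm_step` are placeholder abstractions over any chat-completion API with tool calling, and the prompt wording is illustrative; only the tool names rethink_memory and finish_rethinking_memory come from the paper.

```python
# Minimal sketch of the sleep-time phase: iteratively rewrite the context c into
# an enriched representation c'. `ToolCall` and `llm_step` are placeholder
# abstractions over any chat-completion API with tool calling; the prompt text is
# illustrative. Only the tool names follow the paper.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    args: dict

SLEEP_TIME_PROMPT = (
    "You are given a context that a user may later ask questions about. "
    "Anticipate likely questions, derive useful intermediate results, and call "
    "rethink_memory(new_memory) to overwrite the stored representation. "
    "Call finish_rethinking_memory() when nothing useful remains to add."
)

def sleep_time_compute(
    context: str,
    llm_step: Callable[[str, str], ToolCall],  # (system prompt, current memory) -> chosen tool call
    max_steps: int = 10,
) -> str:
    """Transform the raw context c into an enriched representation c' while the app is idle."""
    memory = context
    for _ in range(max_steps):
        call = llm_step(SLEEP_TIME_PROMPT, memory)
        if call.name == "finish_rethinking_memory":
            break
        if call.name == "rethink_memory":
            memory = call.args["new_memory"]  # overwrite the working representation
    return memory  # this is c'
```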

At test time, when the user query $q$ arrives, the model uses the pre-computed context $c'$ instead of the original $c$: $T_b(q, c') \rightarrow a$. Because $c'$ already contains useful pre-computed information, the test-time budget $b$ can often be much smaller than the budget $B$ needed to achieve similar accuracy ($b \ll B$), reducing latency and cost. Furthermore, the cost of generating $c'$ can be amortized when multiple queries $q_1, q_2, \ldots, q_N$ are asked about the same original context $c$.
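
The test-time side then becomes a single, cheap call against $c'$. Again a hedged sketch with placeholder callables: `llm_answer` stands in for any completion call that accepts a max-token budget, and the prompt wording and default budget are illustrative.

```python
# Test-time sketch: answer q against the pre-computed c' with a small budget b.
# `llm_answer` is a placeholder for a completion call that accepts a max-token
# budget; the prompt wording and the default budget are illustrative.
from typing import Callable

def answer_query(
    query: str,
    enriched_context: str,
    llm_answer: Callable[[str, int], str],
    test_time_budget: int = 512,  # b, intended to be much smaller than the budget B needed on raw c
) -> str:
    prompt = (
        "Use the pre-computed notes below to answer the question directly, "
        "reusing intermediate results instead of re-deriving them.\n\n"
        f"Notes (c'):\n{enriched_context}\n\nQuestion: {query}"
    )
    return llm_answer(prompt, test_time_budget)

# Amortization in use: one sleep-time pass, then many cheap test-time queries.
# c_prime = sleep_time_compute(context, llm_step)
# answers = [answer_query(q, c_prime, llm_answer) for q in user_queries]
```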

Experimental Setup and Datasets

To evaluate sleep-time compute, the paper introduces modified versions of existing benchmarks:

  1. Stateful GSM-Symbolic: Derived from GSM-Symbolic P1/P2 (math word problems) by splitting each problem into a context (the initial statements) and a query (the final question); an illustrative split is sketched after this list. Used with GPT-4o and GPT-4o-mini.
  2. Stateful AIME: Derived from AIME 2024/2025 math competition problems, similarly split into context and query. Used with reasoning models such as o1, o3-mini, Claude 3.7 Sonnet (Extended Thinking), and DeepSeek-R1.
  3. Multi-Query GSM-Symbolic: Extends Stateful GSM-Symbolic by synthetically generating multiple related questions per context using o3-mini (detailed in the paper's appendix), designed to test cost amortization.
  4. SWE-Features: A new software engineering benchmark focusing on implementing new features that require multi-file edits in large repositories (Aider-AI/aider, ComfyUI). The context consists of related pull requests (PRs), and the query is the target PR description. Evaluation uses F1 score on the set of modified files.
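
To make the context/query split concrete, here is a purely hypothetical illustration; the problem text is invented and not drawn from GSM-Symbolic.

```python
# Hypothetical illustration of the context/query split behind the stateful
# benchmarks (the problem text is invented, not taken from GSM-Symbolic): the
# narrative becomes the persistent context c, and only the final question is
# held back as the query q.
stateful_example = {
    "context": (
        "A bakery makes 120 rolls each morning. It sells three quarters of them "
        "before noon and gives 10 of the remaining rolls to a food bank."
    ),
    "query": "How many rolls does the bakery have left at the end of the day?",
}
```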

Key Findings and Practical Implications

  • Improved Test-Time Efficiency: Sleep-time compute significantly improves the trade-off between test-time compute (measured in tokens) and accuracy. On Stateful GSM-Symbolic and Stateful AIME, it achieves accuracy comparable to the baseline (standard test-time compute) using approximately 5x fewer test-time tokens, which translates directly to lower latency and potentially lower cost for user-facing queries. Sleep-time compute also generally outperforms parallel test-time scaling (pass@k) at similar test-time token budgets.
  • Benefits of Scaling Sleep-time Compute: Investing more compute during the sleep phase further enhances performance. Scaling sleep-time compute (by generating multiple context representations in parallel for non-reasoning models, or by increasing reasoning effort for reasoning models) pushes the accuracy curve higher, yielding gains of up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME at similar test-time budgets. This suggests that applications can trade offline processing cost for better online performance.
  • Cost Amortization: When multiple queries relate to the same context, the initial cost of sleep-time compute can be amortized. Using Multi-Query GSM-Symbolic and assuming test-time tokens are 10x more expensive than sleep-time tokens (due to latency-optimization costs), the average cost per query can be reduced by up to 2.5x when there are 10 queries per context; a back-of-the-envelope version of this arithmetic is sketched after this list. This is highly relevant for applications like document analysis or coding assistance where users ask follow-up questions.
  • Impact of Query Predictability: Sleep-time compute provides the most significant benefit when the upcoming user query is more predictable from the context. An analysis on Stateful GSM-Symbolic, using Llama2-70B log-probabilities to measure predictability, showed that the accuracy improvement from sleep-time compute over the baseline widens for more predictable queries. This implies that sleep-time compute is most valuable when the application can reasonably anticipate the type of information a user might request next.
  • Real-World Application (SWE-Features): The benefits extend to realistic agentic tasks. On the SWE-Features benchmark, sleep-time compute (where the agent explores the repository based on related PRs during sleep time) improved the F1 score for implementing features, particularly at lower test-time compute budgets, achieving similar performance to the baseline with roughly 1.5x fewer test-time steps (tokens).
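
The cost-amortization finding above reduces to simple arithmetic. In the sketch below, the 10x test-time vs. sleep-time price ratio follows the paper's assumption, while the token counts are invented for illustration and are not the paper's measurements.

```python
# Back-of-the-envelope amortization. The 10x test-time vs. sleep-time price ratio
# follows the paper's assumption; all token counts below are invented for
# illustration and are not the paper's measurements.
PRICE_TEST, PRICE_SLEEP = 10.0, 1.0  # relative price per token

def avg_cost_per_query(n_queries: int,
                       sleep_tokens: int,
                       test_tokens_with_sleep: int,
                       test_tokens_baseline: int) -> tuple[float, float]:
    """Return (baseline cost per query, sleep-time cost per query) at comparable accuracy."""
    baseline = PRICE_TEST * test_tokens_baseline
    with_sleep = (PRICE_SLEEP * sleep_tokens) / n_queries + PRICE_TEST * test_tokens_with_sleep
    return baseline, with_sleep

# e.g. 4000 sleep-time tokens shared by 10 queries, each query then needing
# 200 test-time tokens instead of 1000:
base, amortized = avg_cost_per_query(10, 4000, 200, 1000)
print(base / amortized)  # ~4.2x cheaper per query with these invented numbers
```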

Implementation Considerations

  • Prompting: Sleep-time compute is guided by prompts instructing the model to process the context, make inferences, and anticipate useful information (the full prompts appear in the paper's appendices). At test time, prompts instruct the model to leverage the pre-computed information (the rethink_memory_block).
  • Control over Compute: Test-time compute is scaled using prompts that request different verbosity levels (for non-reasoning models) or via API parameters/budget forcing (for reasoning models). Sleep-time compute is scaled through parallel generation or by varying reasoning effort; a sketch of the parallel-generation variant follows after this list.
  • Trade-offs: There's a trade-off between the cost/effort of sleep-time compute and the gains at test-time. At very high test-time compute budgets, the baseline (only test-time compute) sometimes slightly outperformed sleep-time compute, possibly because the pre-computed context might contain irrelevant information for a specific query.
  • When to Use: Sleep-time compute is most suitable for stateful applications where context persists between interactions, where test-time latency/cost is a concern, and where future queries have some predictability based on the current context.
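
As a rough illustration of the parallel-generation variant mentioned under Control over Compute: the sketch below fans out several independent sleep-time passes and hands every candidate representation to test time. `generate_repr` is a placeholder for one sleep-time pass, and the simple concatenation merge is an assumption rather than the paper's exact procedure; for reasoning models, the analogous knob would be a reasoning-effort or thinking-budget parameter instead of parallel sampling.

```python
# Sketch of scaling sleep-time compute for non-reasoning models: fan out k
# independent sleep-time passes and hand every candidate representation to
# test time. `generate_repr` is a placeholder for one sleep-time pass; the
# concatenation merge is an assumption, not the paper's exact procedure.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def scale_sleep_time_compute(
    context: str,
    generate_repr: Callable[[str], str],
    k: int = 4,
) -> str:
    with ThreadPoolExecutor(max_workers=k) as pool:
        candidates = list(pool.map(lambda _: generate_repr(context), range(k)))
    return "\n\n---\n\n".join(candidates)  # combined enriched context c'
```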

In summary, sleep-time compute offers a practical approach to mitigate the high latency and cost of advanced LLM reasoning by shifting computation from test-time to idle periods. It involves pre-processing context to create an enriched representation that allows for faster and more efficient query answering, with benefits demonstrated across mathematical reasoning and software engineering tasks.
