Sleep-Time Compute Paradigm

Updated 4 July 2025
  • Sleep-time compute is a paradigm that precomputes context insights offline to enable faster, cost-effective query responses.
  • It decomposes computation into an intensive sleep-time phase for summarization and a lightweight test-time phase for efficient inference.
  • Its application in conversational agents, QA systems, and coding tools shows up to 5x test-time savings and significant accuracy improvements.

Sleep-time compute is a computational paradigm in machine learning and artificial intelligence that leverages offline (“sleep”) periods to precompute inferences, representations, or other useful intermediates about a context before a specific user query arrives. In contrast to traditional test-time compute, where all reasoning occurs upon receipt of context and query together, sleep-time compute anticipates potential queries and processes context in advance. The result is a substantial reduction in test-time latency and inference cost, especially for systems where the context is persistent and queries are either predictable or numerous.

1. Conceptual Framework

Sleep-time compute decomposes the interaction with a predictive model or LLM into two temporally distinct computation phases:

  • Sleep-time phase: The system receives and processes the context c, possibly anticipating the types of queries q that may later be asked. A compute-intensive function S(c) is performed offline to produce a re-representation c'; this might include generating summaries, inferred relationships, or candidate answers.
  • Test-time phase: When the actual query q arrives, the model utilizes the precomputed c' in a lightweight function T_b(q, c'), yielding an answer a. Here, the test-time compute budget b is significantly smaller than what would be needed to process both c and q from scratch (as in the standard setup T_B(q, c), with b ≪ B).

This approach capitalizes on the fact that in many practical LLM or agentic settings, context changes slowly or is shared across multiple queries or users, while the set of likely queries is often predictable in form or subject.
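
To make the two phases concrete, here is a minimal sketch of the decomposition, assuming only a generic `llm` callable (prompt in, text out); the function names and prompt wording are illustrative, not the paper’s implementation:

```python
from typing import Callable

# Hypothetical stand-in for any chat-completion client: prompt in, text out.
LLM = Callable[[str], str]

def sleep_time_compute(llm: LLM, context: str) -> str:
    """S(c): expensive offline pass that re-represents the context c as c'."""
    prompt = (
        "Study the following context. Write down the summaries, inferred "
        "relationships, and likely question-answer pairs it supports:\n\n"
        + context
    )
    return llm(prompt)  # c': enriched representation, cached for later queries

def test_time_compute(llm: LLM, query: str, c_prime: str) -> str:
    """T_b(q, c'): cheap online pass that answers q against the cached c'."""
    prompt = (
        f"Notes on the context:\n{c_prime}\n\n"
        f"Question: {query}\nAnswer concisely using the notes."
    )
    return llm(prompt)

# Usage: pay S(c) once while the system is idle, then answer many queries cheaply.
# c_prime = sleep_time_compute(llm, document)
# answer = test_time_compute(llm, "What changed in Q3?", c_prime)
```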

2. Methodological Implementation

The core implementation consists of a two-phase workflow:

  • Offline (sleep-time) phase:
    • The system is provided the persistent context, such as a passage, program, document, or environment state.
    • It applies intensive reasoning and summarization—using multiple runs of prompt-based functions such as rethink_memory—to extract and store as much useful intermediate knowledge as possible.
    • This process is compute-intensive but not time-critical, as it occurs during the model's idle ("sleep") intervals, and its cost can be amortized across many queries.
  • Online (test-time) phase:
    • Upon receipt of a user query (which was not necessarily known in advance but is expected to be relevant to the context), the system combines the query with the preprocessed context representation for efficient inference.
    • The model then produces an answer using minimal test-time resources.

For example, in the paper’s experiments, reasoning benchmarks such as Stateful GSM-Symbolic and Stateful AIME were recast so that all background facts were provided during the sleep-time phase, while the direct question was deferred until test time.
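
The offline refinement loop can be sketched as follows, again assuming the generic `llm` callable from the earlier sketch. `rethink_memory` is the prompt-based function the paper names, but this prompt wording and the `num_passes` parameter are illustrative reconstructions, not the paper’s exact implementation:

```python
from typing import Callable

LLM = Callable[[str], str]  # prompt in, text out (as in the earlier sketch)

def rethink_memory(llm: LLM, memory: str, context: str) -> str:
    """One prompt-based refinement step: rewrite the stored memory in light
    of the context, adding new inferences and intermediate results."""
    prompt = (
        f"Current memory:\n{memory or '(empty)'}\n\n"
        f"Context:\n{context}\n\n"
        "Rewrite the memory so it also captures any new inferences, "
        "intermediate results, or likely answers the context supports."
    )
    return llm(prompt)

def sleep_phase(llm: LLM, context: str, num_passes: int = 4) -> str:
    """Offline phase: iterate rethink_memory. More passes means more
    sleep-time compute, whose cost is amortized across later queries."""
    memory = ""
    for _ in range(num_passes):
        memory = rethink_memory(llm, memory, context)
    return memory  # c', stored alongside the context for test-time use
```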

3. Empirical Results and Performance

Sleep-time compute demonstrates substantial test-time resource savings and accuracy improvements under the studied benchmarks:

  • Test-time compute reduction: For both Stateful GSM-Symbolic and Stateful AIME, models using sleep-time compute required approximately 5x less test-time computation to achieve the same accuracy as models applying all compute at test-time.
  • Accuracy gains: Scaling the amount of sleep-time compute (e.g., by increasing the number or depth of candidate inferences about the context) can further increase final accuracy—by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME.
  • Amortization in Multi-Query Settings: In a new multi-query variant of GSM-Symbolic (where multiple, related queries are asked about a shared context), sleep-time compute can be amortized across queries, lowering the average test-time cost by 2.5x (see the illustrative calculation after this list).
  • Query predictability correlation: The utility of sleep-time compute is empirically correlated with the predictability of future queries given context. That is, the simpler it is to guess or generate the questions from the context, the more valuable precomputing context inferences becomes.
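
A back-of-the-envelope calculation makes the amortization arithmetic concrete. The token budgets below are invented for illustration, chosen to reflect a ~5x per-query saving as reported above; they are not measurements from the paper:

```python
# Invented token budgets for one shared context (illustrative only).
SLEEP = 10_000     # one-time offline cost of S(c)
TEST_FAST = 500    # per-query test-time cost T_b(q, c') with precomputation
TEST_FULL = 2_500  # per-query cost T_B(q, c) when reasoning from scratch (5x)

for n in (1, 5, 20):
    total_sleep = SLEEP + n * TEST_FAST  # offline cost amortized over n queries
    total_baseline = n * TEST_FULL
    print(f"{n:>2} queries: sleep-time={total_sleep:>6}, "
          f"baseline={total_baseline:>6}, ratio={total_baseline / total_sleep:.2f}x")

# With these numbers, break-even occurs near 5 queries; by 20 queries the
# total cost is ~2.5x lower with sleep-time compute.
```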

4. Applications and Use Cases

The paradigm is especially well-suited for scenarios such as:

  • Conversational assistants and chatbots: Where context (conversation history, user profile) persists and multiple follow-up questions are common.
  • Enterprise QA or document retrieval systems: Precomputing knowledge graphs or semantic parses for a corpus enables rapid query responses.
  • Software engineering agents: Pre-analyzing codebases or recent changes (e.g., pull requests) lets LLM agents answer code-review or debugging queries more efficiently, as demonstrated in the paper’s SWE-Features case study.
  • Educational assistants and math tutors: By anticipating typical queries about curricular content, sleep-time compute enables instant response without redundant reasoning.

A plausible implication is that the greater the overlap or coherence among the queries posed against a shared context, the more sleep-time compute will benefit latency, aggregate cost, and even answer accuracy.

5. Limitations and Areas for Further Research

Several practical and theoretical limitations are acknowledged:

  • Limited benefit for unpredictable queries: If user queries cannot be anticipated or are unrelated to the persistent context, sleep-time computation affords little advantage.
  • Simplicity of context–query decomposition: Real-world LLM use often involves interactive, changing, or multi-turn contexts, complicating when and how to apply sleep-time compute.
  • Potential for irrelevant precomputation: Excess or unfocused offline computation may dilute the relevance of context representations, requiring improved methods for selection and compression.
  • Extensibility: The framework’s efficacy as context changes (e.g., context updates in ongoing sessions, incremental knowledge) warrants further investigation.

Research directions proposed include dynamic allocation strategies, better predictive models of query probability, and integrating sleep-time compute with synthetic data generation or lifelong learning paradigms.

6. Comparative Summary Table

| Aspect | Traditional Test-Time Compute | Sleep-Time Compute |
| --- | --- | --- |
| Compute location | Entirely at user query time | Bulk of context reasoning offline (idle time) |
| Latency | High | Low |
| Cost per query | High (all reasoning per query) | Low (amortized over context or multiple queries) |
| Scalability for repeated queries | None | High |
| Sensitivity to query predictability | Low | High |
| Example best-use scenarios | One-shot, unpredictable QA | FAQ, multi-turn chat, coding agents, etc. |

Sleep-time compute provides a principled, resource-efficient, and latency-saving paradigm for contexts where user queries can be forecast or are contextually related, enabling models to precompute and cache inferences during non-interactive periods. Empirical analyses show strong reductions in test-time cost and potential for improved accuracy, especially as the number or predictability of queries per context increases, with broad relevance for production and research systems that leverage ongoing context in language-based reasoning.