Sleep-Time Compute Paradigm
- Sleep-time compute is a paradigm that precomputes context insights offline to enable faster, cost-effective query responses.
- It decomposes computation into a compute-intensive offline sleep-time phase that summarizes and pre-processes context, and a lightweight test-time phase for efficient inference.
- Its application in conversational agents, QA systems, and coding tools shows up to 5x test-time savings and significant accuracy improvements.
Sleep-time compute is a computational paradigm in machine learning and artificial intelligence that leverages offline (“sleep”) periods to precompute inferences, representations, or other useful intermediates about a context before a specific user query arrives. In contrast to traditional test-time compute, where all reasoning occurs upon receipt of context and query together, sleep-time compute anticipates potential queries and processes context in advance. The result is a substantial reduction in test-time latency and inference cost, especially for systems where the context is persistent and queries are either predictable or numerous.
1. Conceptual Framework
Sleep-time compute decomposes the interaction with a predictive model or LLM into two temporally distinct computation phases:
- Sleep-time phase: The system receives and processes the context $c$, possibly anticipating the types of queries that may later be asked. A compute-intensive function $f_{\text{sleep}}$ is applied offline to produce a re-representation $c' = f_{\text{sleep}}(c)$; this might include generating summaries, inferred relationships, or candidate answers.
- Test-time phase: When the actual query $q$ arrives, the model utilizes the precomputed $c'$ in a lightweight function $f_{\text{test}}$, yielding an answer $a = f_{\text{test}}(q, c')$. Here, the test-time compute budget $b$ is significantly less than the budget $B$ that would be needed to process both $q$ and $c$ from scratch (as in the standard setup, where $a = f(q, c)$ is computed with budget $B \gg b$).
This approach capitalizes on the fact that in many practical LLM or agentic settings, context changes slowly or is shared across multiple queries or users, while the set of likely queries is often predictable in form or subject.
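Stated compactly (the budget notation below is a generic paraphrase of the setup, not necessarily the paper's exact symbols):

$$
c' = f_{\text{sleep}}(c) \;\;\text{(offline, budget } B_{\text{sleep}}\text{)}, \qquad a = f_{\text{test}}(q, c') \;\;\text{(online, budget } b \ll B\text{)}
$$

For $N$ queries sharing one context, the average per-query cost falls from $B$ to roughly $B_{\text{sleep}}/N + b$; this amortization effect is quantified in the results below.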
2. Methodological Implementation
The core implementation consists of a two-phase workflow:
- Offline (sleep-time) phase:
- The system is provided the persistent context, such as a passage, program, document, or environment state.
- It applies intensive reasoning and summarization, using multiple runs of prompt-based functions such as `rethink_memory`, to extract and store as much useful intermediate knowledge as possible.
- This process is compute-intensive but not time-critical, as it occurs during idle ("sleep") intervals of the model, and its cost can be amortized across many queries.
- Online (test-time) phase:
- Upon receipt of a user query (which was not necessarily known in advance but is expected to be relevant to the context), the system combines the query with the preprocessed context representation for efficient inference.
- The model then produces an answer using minimal test-time resources.
For example, in the paper’s experiments, reasoning benchmarks such as Stateful GSM-Symbolic and Stateful AIME were recast such that all background facts were included in the sleep-time phase, while the direct question was deferred until test-time.
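A minimal sketch of this two-phase workflow in Python, assuming only a generic `complete(prompt) -> str` chat-completion call; the `rethink_memory`-style loop and all names here are illustrative, not the paper's actual API:

```python
def complete(prompt: str) -> str:
    """Stand-in for any chat-completion API call (assumption: wire up a real client)."""
    raise NotImplementedError

def sleep_time_process(context: str, passes: int = 3) -> str:
    """Offline phase: repeatedly 'rethink' the context, accumulating
    inferred facts, summaries, and likely question/answer pairs."""
    memory = context
    for _ in range(passes):
        memory = complete(
            "Re-read the context and notes below. Add any inferences, "
            "summaries, or likely question/answer pairs that could help "
            "answer future queries.\n\n"
            f"Context and notes:\n{memory}"
        )
    return memory  # the re-representation c'

def answer_query(query: str, memory: str) -> str:
    """Online phase: lightweight inference over the precomputed memory."""
    return complete(
        f"Using these precomputed notes:\n{memory}\n\n"
        f"Answer concisely: {query}"
    )
```

The heavy `sleep_time_process` call runs once per context during idle time; `answer_query` then runs per query against the cached result.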
3. Empirical Results and Performance
Sleep-time compute demonstrates substantial test-time resource savings and accuracy improvements under the studied benchmarks:
- Test-time compute reduction: For both Stateful GSM-Symbolic and Stateful AIME, models using sleep-time compute required approximately 5x less test-time computation to achieve the same accuracy as models applying all compute at test-time.
- Accuracy gains: Scaling the amount of sleep-time compute (e.g., by increasing the number or depth of candidate inferences about the context) can further increase final accuracy—by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME.
- Amortization in multi-query settings: In a new multi-query variant of GSM-Symbolic (where multiple related queries are asked about a shared context), sleep-time compute can be amortized across queries, lowering the average test-time cost by 2.5x (a back-of-the-envelope illustration follows this list).
- Query predictability correlation: The utility of sleep-time compute is empirically correlated with the predictability of future queries given context. That is, the simpler it is to guess or generate the questions from the context, the more valuable precomputing context inferences becomes.
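To make the amortization concrete, here is a back-of-the-envelope calculation using the $B_{\text{sleep}}/N + b$ expression from Section 1; the token budgets are hypothetical, chosen only to illustrate the shape of the effect:

```python
# Hypothetical token budgets (illustrative numbers, not from the paper).
B = 10_000        # standard setup: full reasoning over (q, c) per query
B_sleep = 20_000  # one-time offline reasoning producing c'
b = 2_000         # lightweight per-query reasoning over (q, c')

for n in (1, 5, 10):
    avg = B_sleep / n + b  # amortized per-query cost with sleep-time compute
    print(f"{n:>2} queries: {avg:7,.0f} vs {B:,} tokens/query "
          f"(savings factor {B / avg:.1f}x)")
```

With these made-up budgets, sleep-time compute loses on a single query, breaks even near $N = B_{\text{sleep}}/(B - b)$ (here, 2.5 queries), and reaches a 2.5x saving by ten queries.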
4. Applications and Use Cases
The paradigm is especially well-suited for scenarios such as:
- Conversational assistants and chatbots: Where context (conversation history, user profile) persists and multiple follow-up questions are common.
- Enterprise QA or document retrieval systems: Precomputing knowledge graphs or semantic parses for a corpus enables rapid query responses.
- Software engineering agents: Pre-analyzing codebases or recent changes (e.g., pull requests) means LLM agents can answer code review or debugging queries more efficiently, as demonstrated in the paper's SWE-Features case study.
- Educational assistants and math tutors: By anticipating typical queries about curricular content, sleep-time compute enables instant response without redundant reasoning.
A plausible implication is that the greater the overlap or coherence among queries on a context, the more sleep-time compute will benefit latency, aggregate cost, and even answer accuracy.
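Operationally, these use cases share one pattern: a cache of precomputed representations keyed by a fingerprint of the context, invalidated whenever the context changes. A minimal sketch (the class and method names are hypothetical, not from the source):

```python
import hashlib

class SleepTimeCache:
    """Cache of precomputed context representations, keyed by a content
    hash so that any edit to the context invalidates the stale entry."""

    def __init__(self, process):
        self.process = process            # e.g., sleep_time_process above
        self._store: dict[str, str] = {}

    def get(self, context: str) -> str:
        key = hashlib.sha256(context.encode()).hexdigest()
        if key not in self._store:        # miss: pay the sleep-time cost once
            self._store[key] = self.process(context)
        return self._store[key]           # hit: reuse across queries or users
```

Keying on a content hash makes invalidation automatic: any edit to the context produces a new key, so the next query pays the sleep-time cost again, which connects to the extensibility concerns in the limitations below.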
5. Limitations and Areas for Further Research
Several practical and theoretical limitations are acknowledged:
- Limited benefit for unpredictable queries: If user queries cannot be anticipated or are unrelated to the persistent context, sleep-time computation affords little advantage.
- Simplistic context–query decomposition: Real-world LLM use often involves interactive, changing, or multi-turn contexts, complicating when and how to apply sleep-time compute.
- Potential for irrelevant precomputation: Excess or unfocused offline computation may dilute the relevance of context representations, requiring improved methods for selection and compression.
- Extensibility: The framework’s efficacy as context changes (e.g., context updates in ongoing sessions, incremental knowledge) warrants further investigation.
Research directions proposed include dynamic allocation strategies, better predictive models of query probability, and integrating sleep-time compute with synthetic data generation or lifelong learning paradigms.
6. Comparative Summary Table
| Aspect | Traditional Test-Time Compute | Sleep-Time Compute |
|---|---|---|
| Compute location | Entirely at user query time | Bulk of context reasoning offline (idle time) |
| Latency | High | Low |
| Cost per query | High (all reasoning per query) | Low (amortized over context or multiple queries) |
| Scalability for repeated queries | None | High |
| Sensitivity to query predictability | Low | High |
| Example best-use scenarios | One-shot, unpredictable QA | FAQ, multi-turn chat, coding agents, etc. |
Sleep-time compute provides a principled, resource-efficient, and latency-saving paradigm for contexts where user queries can be forecasted or are contextually related, enabling models to precompute and cache inferences during non-interactive periods. Empirical analyses show strong reductions in test-time cost and potential for improved accuracy, especially as the number or predictability of queries per context increases, with broad relevance for production and research systems that leverage ongoing context in language-based reasoning.