Sleep-Time Compute Paradigm

Updated 4 July 2025
  • Sleep-time compute is a paradigm that precomputes context insights offline to enable faster, cost-effective query responses.
  • It decomposes computation into an intensive sleep-time phase for summarization and a lightweight test-time phase for efficient inference.
  • Its application in conversational agents, QA systems, and coding tools shows up to 5x test-time savings and significant accuracy improvements.

Sleep-time compute is a computational paradigm in machine learning and artificial intelligence that leverages offline (“sleep”) periods to precompute inferences, representations, or other useful intermediates about a context before a specific user query arrives. In contrast to traditional test-time compute, where all reasoning occurs upon receipt of context and query together, sleep-time compute anticipates potential queries and processes context in advance. The result is a substantial reduction in test-time latency and inference cost, especially for systems where the context is persistent and queries are either predictable or numerous.

1. Conceptual Framework

Sleep-time compute decomposes the interaction with a predictive model or LLM into two temporally distinct computation phases:

  • Sleep-time phase: The system receives and processes the context c, possibly anticipating the types of queries q that may later be asked. A compute-intensive function S(c) is performed offline to produce a re-representation c'; this might include generating summaries, inferred relationships, or candidate answers.
  • Test-time phase: When the actual query q arrives, the model utilizes the precomputed c' in a lightweight function T_b(q, c'), yielding an answer a. Here, the test-time compute budget b is significantly smaller than what would be needed to process both c and q from scratch (as in the standard setup T_B(q, c), with b ≪ B).

This approach capitalizes on the fact that in many practical LLM or agentic settings, context changes slowly or is shared across multiple queries or users, while the set of likely queries is often predictable in form or subject.
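
To make the two phases concrete, here is a minimal sketch of the decomposition, assuming only a generic `llm` callable (prompt in, text out); the function names and prompt wording are illustrative, not the paper’s implementation:

```python
from typing import Callable

# Hypothetical stand-in for any chat-completion client: prompt in, text out.
LLM = Callable[[str], str]

def sleep_time_compute(llm: LLM, context: str) -> str:
    """S(c): expensive offline pass that re-represents the context c as c'."""
    prompt = (
        "Study the following context. Write down the summaries, inferred "
        "relationships, and likely question-answer pairs it supports:\n\n"
        + context
    )
    return llm(prompt)  # c': enriched representation, cached for later queries

def test_time_compute(llm: LLM, query: str, c_prime: str) -> str:
    """T_b(q, c'): cheap online pass that answers q against the cached c'."""
    prompt = (
        f"Notes on the context:\n{c_prime}\n\n"
        f"Question: {query}\nAnswer concisely using the notes."
    )
    return llm(prompt)

# Usage: pay S(c) once while the system is idle, then answer many queries cheaply.
# c_prime = sleep_time_compute(llm, document)
# answer = test_time_compute(llm, "What changed in Q3?", c_prime)
```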

2. Methodological Implementation

The core implementation consists of a two-phase workflow:

  • Offline (sleep-time) phase:
    • The system is provided the persistent context, such as a passage, program, document, or environment state.
    • It applies intensive reasoning and summarization—using multiple runs of prompt-based functions such as rethink_memory—to extract and store as much useful intermediate knowledge as possible.
    • This process is compute-intensive but not time-critical, as it occurs during the model's idle ("sleep") intervals, and its cost can be amortized across many queries.
  • Online (test-time) phase:
    • Upon receipt of a user query (which was not necessarily known in advance but is expected to be relevant to the context), the system combines the query with the preprocessed context representation for efficient inference.
    • The model then produces an answer using minimal test-time resources.

For example, in the paper’s experiments, reasoning benchmarks such as Stateful GSM-Symbolic and Stateful AIME were recast so that all background facts were provided during the sleep-time phase, while the direct question was deferred until test time.
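
The offline refinement loop can be sketched as follows, again assuming the generic `llm` callable from the earlier sketch. `rethink_memory` is the prompt-based function the paper names, but this prompt wording and the `num_passes` parameter are illustrative reconstructions, not the paper’s exact implementation:

```python
from typing import Callable

LLM = Callable[[str], str]  # prompt in, text out (as in the earlier sketch)

def rethink_memory(llm: LLM, memory: str, context: str) -> str:
    """One prompt-based refinement step: rewrite the stored memory in light
    of the context, adding new inferences and intermediate results."""
    prompt = (
        f"Current memory:\n{memory or '(empty)'}\n\n"
        f"Context:\n{context}\n\n"
        "Rewrite the memory so it also captures any new inferences, "
        "intermediate results, or likely answers the context supports."
    )
    return llm(prompt)

def sleep_phase(llm: LLM, context: str, num_passes: int = 4) -> str:
    """Offline phase: iterate rethink_memory. More passes means more
    sleep-time compute, whose cost is amortized across later queries."""
    memory = ""
    for _ in range(num_passes):
        memory = rethink_memory(llm, memory, context)
    return memory  # c', stored alongside the context for test-time use
```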

3. Empirical Results and Performance

Sleep-time compute demonstrates substantial test-time resource savings and accuracy improvements under the studied benchmarks:

  • Test-time compute reduction: For both Stateful GSM-Symbolic and Stateful AIME, models using sleep-time compute required approximately 5x less test-time computation to achieve the same accuracy as models applying all compute at test-time.
  • Accuracy gains: Scaling the amount of sleep-time compute (e.g., by increasing the number or depth of candidate inferences about the context) can further increase final accuracy—by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME.
  • Amortization in Multi-Query Settings: In a new multi-query variant of GSM-Symbolic (where multiple, related queries are asked about a shared context), sleep-time compute can be amortized across queries, lowering the average test-time cost by 2.5x (see the illustrative calculation after this list).
  • Query predictability correlation: The utility of sleep-time compute is empirically correlated with the predictability of future queries given context. That is, the simpler it is to guess or generate the questions from the context, the more valuable precomputing context inferences becomes.
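
A back-of-the-envelope calculation makes the amortization arithmetic concrete. The token budgets below are invented for illustration, chosen to reflect a ~5x per-query saving as reported above; they are not measurements from the paper:

```python
# Invented token budgets for one shared context (illustrative only).
SLEEP = 10_000     # one-time offline cost of S(c)
TEST_FAST = 500    # per-query test-time cost T_b(q, c') with precomputation
TEST_FULL = 2_500  # per-query cost T_B(q, c) when reasoning from scratch (5x)

for n in (1, 5, 20):
    total_sleep = SLEEP + n * TEST_FAST  # offline cost amortized over n queries
    total_baseline = n * TEST_FULL
    print(f"{n:>2} queries: sleep-time={total_sleep:>6}, "
          f"baseline={total_baseline:>6}, ratio={total_baseline / total_sleep:.2f}x")

# With these numbers, break-even occurs near 5 queries; by 20 queries the
# total cost is ~2.5x lower with sleep-time compute.
```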

4. Applications and Use Cases

The paradigm is especially well-suited for scenarios such as:

  • Conversational assistants and chatbots: Where context (conversation history, user profile) persists and multiple follow-up questions are common.
  • Enterprise QA or document retrieval systems: Precomputing knowledge graphs or semantic parses for a corpus enables rapid query responses.
  • Software engineering agents: Pre-analyzing codebases or recent changes (e.g., pull requests) lets LLM agents answer code-review or debugging queries more efficiently, as demonstrated in the paper’s SWE-Features case study.
  • Educational assistants and math tutors: By anticipating typical queries about curricular content, sleep-time compute enables instant response without redundant reasoning.

A plausible implication is that the greater the overlap or coherence among the queries posed against a shared context, the more sleep-time compute will benefit latency, aggregate cost, and even answer accuracy.

5. Limitations and Areas for Further Research

Several practical and theoretical limitations are acknowledged:

  • Limited benefit for unpredictable queries: If user queries cannot be anticipated or are unrelated to the persistent context, sleep-time computation affords little advantage.
  • Simplicity of context–query decomposition: Real-world LLM use often involves interactive, changing, or multi-turn contexts, complicating when and how to apply sleep-time compute.
  • Potential for irrelevant precomputation: Excess or unfocused offline computation may dilute the relevance of context representations, requiring improved methods for selection and compression.
  • Extensibility: The framework’s efficacy as context changes (e.g., context updates in ongoing sessions, incremental knowledge) warrants further investigation.

Research directions proposed include dynamic allocation strategies, better predictive models of query probability, and integrating sleep-time compute with synthetic data generation or lifelong learning paradigms.

6. Comparative Summary Table

| Aspect | Traditional Test-Time Compute | Sleep-Time Compute |
| --- | --- | --- |
| Compute location | Entirely at user query time | Bulk of context reasoning offline (idle time) |
| Latency | High | Low |
| Cost per query | High (all reasoning per query) | Low (amortized over context or multiple queries) |
| Scalability for repeated queries | None | High |
| Sensitivity to query predictability | Low | High |
| Example best-use scenarios | One-shot, unpredictable QA | FAQ, multi-turn chat, coding agents, etc. |

Sleep-time compute provides a principled, resource-efficient, and latency-saving paradigm for contexts where user queries can be forecast or are contextually related, enabling models to precompute and cache inferences during non-interactive periods. Empirical analyses show strong reductions in test-time cost and potential for improved accuracy, especially as the number or predictability of queries per context increases, with broad relevance for production and research systems that leverage ongoing context in language-based reasoning.