Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning (2507.16784v1)

Published 22 Jul 2025 in cs.CL

Abstract: To break the context limits of LLMs that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single LLM inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al, 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.

Summary

  • The paper presents the TIM model, which recursively decomposes complex tasks into subtask trees to preserve higher-level instructions.
  • It details TIMRUN, an inference runtime that prunes irrelevant subtasks to optimize memory use and improve throughput.
  • Experimental results show that the system maintains strong performance on STEM benchmarks while generalizing effectively to novel tasks.

Thread Inference Model (TIM) for Long-Horizon Reasoning

This paper introduces the Thread Inference Model (TIM) and its dedicated runtime, TIMRUN, designed to overcome the context length limitations of LLMs when applied to long-horizon reasoning tasks. TIM employs a novel approach of recursively decomposing complex tasks into subtasks, while TIMRUN manages memory efficiently by pruning irrelevant subtasks during inference.

Thread-2: Recursive Subtask Trees

The authors model reasoning trajectories as recursive subtask trees, where each node represents a task comprising a thought process, optional tool use, subtasks, and a conclusion. This structure, named Thread-2, improves upon the original Thread framework by ensuring that subtasks have access to the instructions of higher-level tasks, preventing information gaps and enabling end-to-end inference within a single LLM call (Figure 1).

Figure 1: Latent information compression for all context tokens versus structural latent information compression focusing on the working memory enabled by parsing the reasoning trajectory.

A key aspect of Thread-2 is its subtask-pruning mechanism, which reduces the complexity of the reasoning context by selectively retaining only the most relevant information. Completed subtask lists are pushed onto a fixed-size subtask stack, and when the stack exceeds a defined threshold, the earliest subtask list is pruned from the working memory. The authors implement schema-constrained JSON decoding, which allows tool parameters to be extracted efficiently. During reasoning, TIM waits for tool responses from the runtime and extends its KV cache by encoding them as batches of new input tokens. An example of the JSON schema used for constrained decoding is shown in Figure 2.

Figure 2: The pydantic class we use to create the JSON schema for constrained decoding.
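The figure itself is not reproduced here, so the following is a minimal sketch of what such a pydantic class might look like, assuming pydantic v2 and field names (thought, tooluse, subtasks, conclusion) inferred from the task structure described above; the actual class in the paper may differ.

```python
from __future__ import annotations

from typing import Optional
from pydantic import BaseModel


class ToolUse(BaseModel):
    """A single tool invocation with its extracted parameters (hypothetical fields)."""
    tool_name: str
    parameters: dict


class Task(BaseModel):
    """One node of the Thread-2 reasoning tree (hypothetical sketch)."""
    thought: str                            # reasoning that motivates this task
    tooluse: Optional[ToolUse] = None       # optional tool call
    subtasks: Optional[list[Task]] = None   # recursive decomposition (tree depth)
    conclusion: str                         # summary passed back to the parent task


# Resolve the self-reference and emit a JSON schema usable for constrained decoding.
Task.model_rebuild()
schema = Task.model_json_schema()
```

Constraining generation to such a schema is what lets the runtime parse the reasoning trajectory on the fly and locate tool parameters and completed subtasks without extra post-processing.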

TIMRUN: Inference Runtime for Thread-2

TIMRUN is an inference runtime system specifically designed for the TIM model, addressing the deployment challenges posed by the Thread-2 reasoning framework. TIMRUN supports the reuse of GPU memory and positional embeddings for output generation, enabling long-horizon reasoning that exceeds the output limit of many LLMs. Subtask pruning is essential for efficiently implementing TIM and sustaining long-term reasoning. TIMRUN keeps a pruning buffer, a stack that temporarily caches a small set of prunable subtasks, retaining just enough redundancy to ensure lossless information flow. The subtask pruning process is shown in Figure 3.

Figure 3: While TIM is decoding the conclusion of task 2, tokens in tasks 1.1.1 and 1.1.2, including the enclosed tool calls and responses, have been pruned from the KV cache.
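The sketch below illustrates the idea of the pruning buffer under stated assumptions: the real TIMRUN runtime operates on paged KV-cache blocks inside the inference engine, whereas the `kv_cache.free` interface and the default threshold here are hypothetical.

```python
from collections import deque
from typing import Deque, List


class PruningBuffer:
    """Hypothetical sketch of a rule-based pruning buffer.

    Completed subtask lists are pushed onto a small stack; once the stack
    grows past a threshold, the oldest entry is evicted and its tokens are
    released from the KV cache so GPU pages and positions can be reused.
    """

    def __init__(self, kv_cache, max_entries: int = 2):
        self.kv_cache = kv_cache                 # assumed to expose free(token_ids)
        self.max_entries = max_entries           # redundancy kept for lossless information flow
        self.buffer: Deque[List[int]] = deque()  # token ids of completed subtask lists

    def on_subtasks_complete(self, token_ids: List[int]) -> None:
        self.buffer.append(token_ids)
        if len(self.buffer) > self.max_entries:
            evicted = self.buffer.popleft()
            # Release KV-cache pages and positional slots for the pruned subtasks.
            self.kv_cache.free(evicted)
```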

TIMRUN initiates tool calls directly within the runtime, extracting relevant parameters and forwarding the request to the external tool. This approach reduces inter-module communication, streamlining agent development and deployment. A comparison of the communications among clients, tools, and different inference runtimes is shown in Figure 4.

Figure 4: Comparing the communications among clients, tools, and different inference runtimes.
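A minimal sketch of this flow follows, assuming hypothetical interfaces: the `endpoint` key, the HTTP transport, and the `tokenizer` / `kv_cache.extend` parameters are placeholders used only to illustrate dispatching a tool call inside the runtime rather than in the client.

```python
import json

import requests


def dispatch_tool_call(task_json: str, tokenizer, kv_cache):
    """Hypothetical sketch: issue a tool call from inside the runtime.

    The runtime parses the constrained JSON output, forwards the extracted
    parameters to the external tool, then encodes the response as new input
    tokens appended to the KV cache so generation can continue directly.
    """
    task = json.loads(task_json)
    tool = task.get("tooluse")
    if not tool:
        return None

    # Forward the request to the external tool without a client round trip.
    response = requests.post(tool["endpoint"], json=tool["parameters"], timeout=30)

    # Encode the tool response and extend the KV cache with the new tokens.
    new_token_ids = tokenizer.encode(response.text)
    kv_cache.extend(new_token_ids)
    return new_token_ids
```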

Experimental Results

The authors evaluated TIM models on MATH500, MMLU-STEM500, AMC 2022, AMC 2023, AIME 2024, and GPQA-Diamond to assess their STEM knowledge and reasoning abilities. The results demonstrate that subtask pruning in TIMRUN does not degrade overall performance, and in some cases it improves performance by retaining only the most relevant information in the KV cache. The authors also evaluated TIM models on agentic research tasks, using the BrowseComp and Datacommons QA benchmarks, both of which require multi-hop information retrieval, processing of tool responses, and reasoning. The reported performance shows that the TIM model generalizes well to novel tasks not encountered during training, offering greater efficiency by eliminating the need for carefully crafted few-shot examples and task-specific prompts.

Experiments testing the efficiency of TIMRUN focused on memory-management overhead and throughput. The results showed that TIMRUN sustains high throughput even when manipulating up to 90% of the KV cache in GPU memory, indicating that the overhead of subtask pruning is outweighed by the savings from keeping the working memory small.

Conclusion

The authors introduce a co-designed system consisting of the TIM model and its dedicated serving infrastructure, TIMRUN. Experiments show that generating a more concise KV cache increases inference throughput and enhances performance by helping the model focus on relevant context. The combination of TIM and TIMRUN delivers strong reasoning ability, more efficient inference and tool use, and greater flexibility and scalability for agentic tasks.
