RAG over Thinking Traces Can Improve Reasoning Tasks

Published 5 May 2026 in cs.IR, cs.AI, and cs.CL | (2605.03344v1)

Abstract: Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Interestingly, RAG on T3 also incurs little or no extra inference cost, and can even reduce inference cost by up to 15%. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.

Summary

  • The paper introduces a novel two-stage RAG framework that integrates intermediate thinking traces to enhance reasoning accuracy in tasks like mathematics and program synthesis.
  • Traces transformed with T3 strategies yield substantial accuracy gains, with relative improvements of up to +56.3% on AIME and similar trends on GPQA-Diamond and LiveCodeBench.
  • RAG over T3-transformed traces adds little or no inference cost and can reduce it by up to 15% relative to no RAG (and by 30-40% relative to full-trace RAG), because it supplies compact, process-level signals instead of generic web-based corpora.

Retrieval-Augmented Generation over Thinking Traces for Reasoning-Intensive Tasks

Introduction

The paper "RAG over Thinking Traces Can Improve Reasoning Tasks" (2605.03344) scrutinizes the effectiveness of retrieval-augmented generation (RAG) methodologies, challenging the pervasive belief that RAG is fundamentally less suitable for reasoning-heavy tasks such as advanced mathematics and program synthesis. Rather than focusing on the retrieval mechanism itself, the work systematically interrogates the nature of the retrieval corpus, positing that traditional document-based retrieval fails to provide the "process-level" signals essential for complex reasoning. The central claim is that intermediate "thinking traces," representing the problem-solving trajectories of state-of-the-art LLMs, constitute an information source more aligned with the demands of reasoning tasks than generic web or textbook corpora.

Methodological Advances

The authors introduce a two-stage RAG framework centered on "thinking traces." In the offline stage, a strong LLM ("thinker") generates full reasoning traces for a large set of problems. These traces are then transformed using a lightweight model into multiple retrieval-friendly formats via T3 ("Transformation of Thinking Traces"). Three transformation strategies are formalized:

  • Structural Normalization: Enforces canonical, procedural step-by-step scaffolds.
  • Semantic Distillation: Produces traces at escalating abstraction levels, culminating in core insights.
  • Reflection: Highlights negative knowledge, common mistakes, and prevention strategies.

At inference, the main solver (potentially a different LLM) retrieves the k=3 transformed traces most related to the query, providing process-level context to the generation pipeline. The retrieval corpus is strictly decontaminated to prevent overlap with evaluation sets.
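The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a dense encoder, and the function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a dense encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


# Hypothetical transformed traces (one line each, for illustration only).
corpus = [
    "count lattice paths using binomial coefficients",
    "use modular arithmetic to find the remainder",
    "apply dynamic programming over subsets",
    "binomial coefficients modulo a prime via Lucas theorem",
]
top = retrieve("remainder of binomial coefficients modulo a prime", corpus, k=3)
```

In a real system the toy embedding would be replaced by a trained retriever and the corpus by the decontaminated trace index; only the top-k selection logic carries over.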

Experimental Evidence

Benchmarks and Model Families

Experiments are conducted on the AIME 2025-2026 mathematics suite, LiveCodeBench (program synthesis), and GPQA-Diamond (graduate-level science QA). The evaluation covers a diverse set of high-capacity solver models (Gemini-2.5-Flash, GPT-OSS-120B, GPT-5).

Comparative Analysis

The core findings can be summarized as follows:

  • RAG over general-purpose corpora (Wikipedia, StackExchange, OpenWebMath, GitHub, ArXiv, or large datastores) yields inconsistent or even negative gains for reasoning benchmarks, often contaminating generations with irrelevant or misleading context regardless of retrieve-and-generate infrastructure.
  • RAG over raw thinking traces improves performance across all models and tasks. The magnitude is model-dependent but consistently positive: on AIME, Gemini-2.5-Flash rises from 53.3% to 80.0% (+50.1% relative), GPT-OSS-120B from 78.3% to 85.0% (+8.6%), and GPT-5 from 86.7% to 91.7% (+5.8%), despite a smaller corpus (~59K traces) than the web-scale baselines.
  • Transformed traces via T3 confer the strongest and most robust improvements. On AIME, T3-Semantic achieves +56.3% for Gemini-2.5-Flash, T3-Reflect reaches 93.3% for GPT-5 (+7.6%), and similar trends are observed on GPQA-Diamond and LiveCodeBench. Notably, the effect is most pronounced for weaker solvers, but persists for frontier models.
  • Retrieval over thinking traces outperforms output-only retrieval, signifying the importance of explicit process-level information.
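The gains quoted in this list are relative to each model's baseline, not percentage-point differences. A short check of the arithmetic, using only the AIME raw-trace accuracies reported above (the function name is illustrative):

```python
def relative_gain(baseline: float, with_rag: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (with_rag - baseline) / baseline * 100.0


# (baseline accuracy %, accuracy % with RAG over raw traces) as reported for AIME.
aime_raw_traces = {
    "Gemini-2.5-Flash": (53.3, 80.0),
    "GPT-OSS-120B": (78.3, 85.0),
    "GPT-5": (86.7, 91.7),
}

for model, (base, rag) in aime_raw_traces.items():
    gain = relative_gain(base, rag)
    print(f"{model}: {base:.1f}% -> {rag:.1f}% (relative gain {gain:+.1f}%)")
```

For Gemini-2.5-Flash, for example, the 26.7-point absolute increase corresponds to a relative gain of about 50.1% because the baseline is 53.3%.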

Cost-Accuracy Trade-offs

Through cost analysis, the authors show that T3-based retrieval, by reducing trace verbosity and focusing on relevant context, reduces inference cost (input + output token count) by as much as 15% compared to No RAG for high-end models, and by 30-40% versus full-trace RAG. In some cases, higher accuracy is achieved at reduced cost.

  • Quality of traces supersedes corpus size: Corpora generated by superior "thinker" models (e.g., Gemini-2-thinking) yield stronger downstream performance than larger, less coherent corpora.
  • Transformation amplifies transferability: Even solvers unlike the thinker model benefit when context units are adequately structured.
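The cost comparison in this section treats inference cost as input plus output token count. A minimal sketch of that accounting is shown below; the token counts are invented for illustration and are not results from the paper, and the `Run` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Token usage for one solver call (hypothetical container)."""
    input_tokens: int
    output_tokens: int

    @property
    def total(self) -> int:
        # Cost proxy used in this section: input + output tokens.
        return self.input_tokens + self.output_tokens


def cost_change(baseline: Run, candidate: Run) -> float:
    """Percent change in total tokens relative to the baseline (negative = cheaper)."""
    return (candidate.total - baseline.total) / baseline.total * 100.0


# Made-up numbers: retrieval adds input tokens but shortens the generated reasoning.
no_rag = Run(input_tokens=500, output_tokens=9500)
with_t3 = Run(input_tokens=1200, output_tokens=7600)

print(f"{cost_change(no_rag, with_t3):+.1f}%")  # -12.0%
```

The point of the sketch is that retrieved context can pay for itself when the extra input tokens are outweighed by shorter generations.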

Implications

Theoretical Significance

This work recasts the core bottleneck in reasoning-oriented RAG as a corpus design problem rather than purely a retrieval or integration bottleneck. It sharpens the distinction between knowledge-oriented grounding (for which document corpora suffice) and trajectory-oriented guidance fundamental to mathematical and agentic reasoning. The demonstrated generality of transformed thinking traces as retrieval units, transferable across model families, suggests decoupling parameter-based distillation from inference-time guidance mechanisms, a modular alternative to full model retraining or continual finetuning.

The approach also offers a platform for studying reasoning transfer, error correction, and process plagiarism, leveraging large corpora of structured solution attempts.

Practical Consequences

  • Retrieval-augmented reasoners can achieve strong cost-accuracy Pareto optimality, especially for computationally constrained or mid-capacity solvers.
  • Code, mathematics, and science domains benefit most, where explicit procedural heuristics or common error patterns are critical.
  • Corpus design emerges as a new axis for improving reasoning agents, suggesting the need for pre-competitive or consortium-based repositories of high-quality traces alongside benchmarks.

Future Prospects

Multiple directions present themselves:

  • Task-adaptive transformation: Exploring retrieval-time or adaptive transformations to further match context with query requirements.
  • Corpus diversification: Incorporating challenging multi-modal and open-ended reasoning tasks to generalize the approach.
  • Iterative and agentic retrieval schemes: Beyond vanilla (single-step) RAG, pathways exist for iterative, stepwise retrieval or integrating reasoning trace retrieval with tools and planning agents.

Conclusion

The study decisively reframes the limitations of retrieval-augmented generation for reasoning as a question of corpus alignment, not retrieval mechanism. By utilizing and transforming thinking traces as first-class retrieval objects, significant accuracy and efficiency gains are realized across a spectrum of LLMs and reasoning-intensive tasks, sharply outperforming web-scale document retrieval baselines. The results support treating process-level solution data as a critical resource for the future construction and deployment of reasoning-competent LLM agents.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple question with a big twist: Can AI solve math, science, and coding problems better if, instead of looking up facts on the web, it looks up how other AIs thought through similar problems? The authors show that the answer is yes. They call these step-by-step solution attempts "thinking traces," and they build a system that retrieves helpful traces to guide a model's reasoning. They also introduce T3, a way to clean and shrink those traces so they're easier to reuse.

What questions did the researchers ask?

  • If we use Retrieval-Augmented Generation (RAG), which means "look things up before answering," but retrieve thinking steps (not regular web pages), does reasoning improve?
  • Can we transform raw thinking traces into clearer, shorter "how-to" guides that help even more?
  • Do these ideas work across different tasks (math, science questions, coding) and different AI models?
  • Can this approach even lower the cost of running the AI?

How did they do it? (With simple analogies)

Think of an AI taking a test. Before answering a new question, it's allowed to quickly flip through a small "library" of worked solutions from past problems. But instead of reading full textbooks (web pages), it skims other students' scratch work: how they thought, what steps they tried, what mistakes they avoided. That's the main idea.

Here are the key pieces, explained in everyday language:

  • Retrieval-Augmented Generation (RAG): Before answering, the AI searches a library for helpful material to read. Traditionally, this library is made of web articles or documents. Here, the library is made of thinking traces: step-by-step solution attempts from other problems.
  • Thinking traces: These are like a student's scratch paper. They include the steps, key ideas, and sometimes the wrong turns taken while solving a problem, not just the final answer.
  • T3 (Transformation of Thinking Traces): Raw scratch work can be long and messy, so T3 turns it into tidy, quick-to-use notes. The paper uses three styles:
    • Struct (Structural Normalization): Turns a messy solution into a clean, numbered "recipe" of steps, like a well-formatted lab procedure.
    • Semantic (Semantic Distillation): Boils down the core idea ("the main trick") without all the low-level details, like a one-paragraph strategy summary.
    • Reflect (Reflection): A guide to mistakes and fixes: what people usually do wrong and how to avoid it, plus a short note on the correct approach. Think "common pitfalls" and "pro tips."
  • Chunking: Breaking long traces into smaller pieces so the AI retrieves only the most relevant parts. This reduces noise and distraction.
  • Benchmarks (the "tests" they used):
    • AIME 2025-2026: Very challenging math competition problems.
    • GPQA-Diamond: Tough graduate-level science questions.
    • LiveCodeBench: Programming problems that test code generation.
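To make the chunking idea from the list above concrete, here is a minimal sketch that splits a long trace into overlapping word windows. The window size and overlap are arbitrary assumptions; the summary does not say how the paper sizes its chunks.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the trace
    return chunks


# Tiny example: 12 placeholder "words", windows of 5 with an overlap of 2.
trace = " ".join(f"w{i}" for i in range(12))
pieces = chunk_words(trace, size=5, overlap=2)
```

Overlap keeps a reasoning step that straddles a boundary visible in at least one chunk, at the cost of some duplicated text in the index.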

The pipeline works like this:

  1. Offline, a strong model (a good "thinker") solves lots of practice problems and produces thinking traces.
  2. A smaller model uses T3 to clean and compress those traces into a retrieval-friendly library.
  3. At test time, when a new question arrives, a retriever grabs the top few most relevant trace snippets.
  4. The solver model reads those hints and writes the final answer.

The "thinker," the "transformer," and the "solver" can be different models, which makes the library reusable across systems.
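The last two pipeline steps (retrieve hints, then let the solver answer) come down to placing the retrieved snippets in the solver's prompt. The sketch below shows one way to assemble such a prompt; the template wording and the function name are assumptions, not the prompt used in the paper.

```python
def build_prompt(question: str, hints: list[str]) -> str:
    """Assemble a solver prompt that lists retrieved trace snippets before the question."""
    lines = [
        "You may use the following reasoning notes from related problems.",
        "They may be incomplete or irrelevant; rely on them only if helpful.",
        "",
    ]
    for i, hint in enumerate(hints, start=1):
        lines.append(f"[Note {i}] {hint.strip()}")
    lines += ["", f"Problem: {question.strip()}", "Answer:"]
    return "\n".join(lines)


# Hypothetical retrieved snippets for a hypothetical question.
prompt = build_prompt(
    "How many lattice paths go from (0,0) to (3,3)?",
    ["Count paths as arrangements of R and U moves.", "Avoid double counting symmetric paths."],
)
print(prompt)
```

Because the thinker, transformer, and solver are decoupled, the same `hints` list could be passed unchanged to any solver model.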

What did they find, and why is it important?

Here are the main takeaways:

  • Retrieving thinking traces beats retrieving web pages for reasoning. For hard math (AIME), pulling in thinking traces gave large, consistent gains, often much bigger than using standard web or textbook corpora. Even very strong models got better with trace retrieval.
  • Cleaned-up traces (T3) work even better than raw traces. Because raw scratch work can be long and noisy, T3's tidy formats (Struct, Semantic, Reflect) help the solver focus on what matters. Which format is best depends on the task:
    • Math often benefits from Reflect or Semantic (avoid common traps + capture the key idea).
    • Science and code also gain, though improvements are smaller than math.
  • Quality matters more than quantity. Traces created by a stronger "thinking" model helped more than having a larger pile of lower-quality traces. In other words, better worked examples beat just having more of them.
  • Seeing the steps helps more than seeing only final answers. Retrieving full reasoning (the "how") is usually more helpful than retrieving just the final outputs from similar problems.
  • It can even save money. Sometimes, giving the model the right hints up front means it writes less, which can reduce inference cost (in the best case reported, up to about 15% cheaper per question). In general, T3 gave the best balance of accuracy and cost.

Why this matters: Many people believed retrieval helps when you need facts, but not when you need to reason. This paper shows the limit wasn't retrieval; it was the kind of stuff we retrieved. If you retrieve the right "thinking help," reasoning improves.

What could this change in the real world?

  • Smarter study helpers and tutors: Imagine an AI math tutor that doesn't just show answers, but retrieves short, clear "how-to" strategies and common mistakes to avoid for a new problem.
  • Better AI tools for hard tasks: Coding assistants and scientific Q&A systems could become more reliable by reusing the best past reasoning patterns, not just documentation.
  • Reusable reasoning libraries: Teams can build shared libraries of thinking traces that help many models, including newer or smaller ones. This is like handing down high-quality study notes to the next class.
  • Lower costs (sometimes): If the hints are strong, the AI can do less trial-and-error, which can cut token usage and save money.
  • A new research direction: Instead of only training models to internalize reasoning, we can store, clean, and retrieve reasoning externally. That opens up safer, more controllable systems, because you can inspect, curate, and improve the library over time.

Final note: While results are strong, especially for math, gains vary by task and model, and picking the right transformation (Struct, Semantic, or Reflect) matters. Still, the big idea holds: giving AI access to good "worked examples of thinking," not just facts, can significantly improve how well it reasons.

Knowledge Gaps

Knowledge Gaps, Limitations, and Open Questions

Below is a concise, actionable list of what remains uncertain or unexplored in the paper and would benefit from targeted follow-up research:

  • Corpus scaling laws: How do gains scale with the number of thinking traces beyond 59k-114k? Identify saturation points, diminishing returns, and optimal corpus sizes per task.
  • Domain coverage: The trace corpus is math-heavy; quantify how performance changes with domain-balanced or domain-specific trace corpora for science QA, code, and other domains.
  • Task generality: Evaluate on broader reasoning tasks (multi-hop commonsense, planning/agents, theorem proving, formal logic, program synthesis with tool use, multi-turn tutoring).
  • Cross-modal reasoning: Assess whether thinking traces help multimodal reasoning (e.g., geometry with diagrams, charts, code + execution traces).
  • Cross-lingual transfer: Do traces in one language benefit queries in another? What representations enable cross-lingual reuse?
  • Trace quality assessment: Develop automatic metrics/filters to detect and discard low-quality or erroneous reasoning traces prior to indexing.
  • Robustness to noisy/misleading traces: Quantify the impact of incorrect or adversarially crafted traces on solver accuracy; design defenses (e.g., trace consistency checks).
  • Data poisoning risk: Analyze and harden the pipeline against malicious trace injection at index time (security model, provenance, signatures).
  • Decontamination sufficiency: A 13-gram Jaccard threshold may miss paraphrased or step-level leaks; study stronger semantic decontamination and step-structure similarity filters.
  • Contamination auditing: Provide per-problem audits of retrieved traces to ensure no near-duplicate solutions or verbatim reasoning from evaluation items slipped through.
  • Retriever choice and training: Only e5-base was used; benchmark specialized retrievers (Contriever, ColBERT, GTR, multi-vector, hybrid sparse+dense) and fine-tune retrievers on reasoning queries.
  • Retrieval hyperparameters: Systematically ablate top-k, chunk size, and passage granularity (e.g., step-level indexing vs fixed-length chunks) across tasks.
  • Query-aware transformation: T3 is offline and query-agnostic; explore lightweight, on-the-fly, query-conditioned rewrites or selective expansion of retrieved traces.
  • Transformation model quality: Only one small model (Gemini-2-Flash-Lite) was used; study how transformation quality and model family affect downstream gains and cost.
  • Transformation selection policy: Different T3 variants (Struct/Semantic/Reflect) win on different tasks; learn a per-query selector or mixture-of-transformations policy.
  • Compression vs fidelity: Characterize the trade-off between trace compression level and utility; identify minimal information needed for consistent gains.
  • Provenance and interpretability: Record and surface source thinker, confidence, and validation signals with traces to let solvers weight or ignore context appropriately.
  • Cost accounting completeness: Token-cost comparisons omit retrieval/index latency, memory footprint, and network overhead; include end-to-end wall-time and system cost.
  • Fair compute comparisons: Normalization across conditions (e.g., generation token budgets, number of samples) needs stricter control to isolate RAG effects.
  • Prompt-format sensitivity: Study how hint placement, formatting, and instruction phrasing affect solver uptake of retrieved traces and cost.
  • Solver-trace alignment: Analyze how model family, decoding settings, and CoT prompting modulate benefits; some models showed cost increases, and the cause is unexplained.
  • Combination with iterative RAG: Only retrieve-then-generate was tested; evaluate stepwise retrieval, planner-critic loops, and verifier-guided retrieval integration.
  • Knowledge vs process retrieval: For code and science tasks that need facts/APIs, investigate hybrid retrieval that mixes factual documents with thinking traces.
  • Error taxonomy and avoidance: Reflect helps in a case study; build a systematic library of common error patterns per domain and quantify their effect at scale.
  • Failure modes where RAG hurts: Identify problem categories where traces distract or mislead; develop gating mechanisms to abstain from retrieval.
  • Transfer limits across thinkers: Beyond three thinkers, probe transfer across wider families and model sizes; identify characteristics of "portable" traces.
  • Coverage estimation: Define and compute "reasoning coverage" of the trace index for a target distribution; design active curation to close coverage gaps.
  • Online updating: Explore mechanisms to add newly observed traces safely at runtime (experience replay) without catastrophic drift or contamination.
  • Evaluation breadth: Beyond exact-match and pass@1, assess faithfulness of reasoning, chain correctness, and calibration under trace-augmented decoding.
  • Per-subdomain analysis: Break down gains by math topic (algebra/geometry/combinatorics), science field (physics/chem/biology), and code category (algorithms/systems/ML).
  • Cross-lingual/code-switch prompts: Test robustness when queries contain mixed languages, symbols, or non-ASCII math notation.
  • Long-context limits: Investigate behavior under very long problems and multi-step chains near context limits; study summarization-and-retrieval hybrids.
  • Licensing and privacy: Clarify legal/ethical implications of releasing traces derived from proprietary models/datasets and handling traces with potential sensitive content.
  • Reproducibility risks: Hosted API models evolve; fix versions/seeds and release all prompts/indices to ensure results can be replicated over time.
  • Benchmark diversity: Add larger, more varied, and privately sourced tasks to validate external generalization and reduce overfitting to public benchmarks.
  • Theoretical grounding: Develop a formal account of when and why process-level retrieval should help (e.g., bias-variance trade-offs, inductive bias alignment).
  • Multi-agent settings: Test whether shared trace indices enable coordination and faster convergence for agentic teams or tool-using systems.
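The decontamination item in the list above mentions a 13-gram Jaccard threshold. A minimal sketch of such a filter follows; whitespace tokenization and the 0.5 threshold are assumptions made for illustration, since the summary gives neither.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Set of word n-grams; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_contaminated(trace_problem: str, eval_problem: str,
                    n: int = 13, threshold: float = 0.5) -> bool:
    """Flag a corpus problem whose n-gram overlap with an evaluation item meets the threshold."""
    return jaccard(ngrams(trace_problem, n), ngrams(eval_problem, n)) >= threshold


eval_q = ("find the number of ordered pairs of positive integers a and b "
          "such that a plus b equals one hundred")
verbatim = eval_q
paraphrase = "count ordered pairs (a, b) of positive integers whose sum is one hundred"

print(is_contaminated(verbatim, eval_q))    # True
print(is_contaminated(paraphrase, eval_q))  # False
```

The second call illustrates the gap the item points to: a paraphrase of the same problem shares no 13-gram with the original and passes the filter, which is why semantic or step-level checks are suggested as follow-up.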

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can be built now by leveraging thinking-trace RAG and the T3 transformations (Struct, Semantic, Reflect). Each item includes suggested tools/workflows and key dependencies to consider.

  • Coding copilot boost in IDEs (software)
    • What: Improve pass rates and reduce hallucinations in code generation by retrieving structured reasoning traces for similar tasks (e.g., algorithm patterns, debugging steps). Use T3-Struct for reusable scaffolds and T3-Reflect to surface common pitfalls.
    • Tools/workflows: VS Code/JetBrains extensions; a "TraceDB" index built from high-quality code traces (e.g., LiveCodeBench-like corpora); e5/Contriever retriever; top-k insertion into the prompt.
    • Assumptions/dependencies: Availability of decontaminated code reasoning traces; licensing/IP for traces; strong retriever quality; IDE context window limits.
  • Math and STEM tutoring assistants (education)
    • What: Deliver step-by-step hints, high-level insights, and โ€œwhat to avoidโ€ tips by retrieving thinking traces for similar problems. T3-Semantic provides concise insights; T3-Reflect helps with misconception diagnosis.
    • Tools/workflows: LMS/LXP integration; hint-augmented prompts; student-progress tagging to select appropriate "TraceCards."
    • Assumptions/dependencies: Age/level-aligned trace corpora; pedagogy-aligned hinting; privacy when storing student attempts.
  • Internal troubleshooting copilots for IT and DevOps (software/IT ops)
    • What: Retrieve structured postmortem-like reasoning traces for outages or build failures to guide triage. T3-Reflect highlights common misdiagnoses; T3-Struct provides action checklists.
    • Tools/workflows: Integration with ticketing (Jira/ServiceNow) and logs (ELK, Datadog); per-incident trace capture and transformation pipeline; RAG on top of past incidents.
    • Assumptions/dependencies: Secure storage for sensitive operational data; de-identification; retriever tuned to logs/error signatures.
  • Customer support copilot with diagnostic playbooks (services)
    • What: Shorten time-to-resolution by retrieving diagnostic reasoning traces from resolved tickets for similar symptoms. T3-Struct for SOP-like steps; T3-Reflect to prevent repeated errors.
    • Tools/workflows: CRM/CS platforms (Zendesk, Salesforce) plugins; symptom-to-trace retrieval; "quick checks" cards in the agent UI.
    • Assumptions/dependencies: Sufficient volume of high-quality resolved cases; privacy/PII handling; domain adaptation across products.
  • Data science/analytics assistants (software/enterprise analytics)
    • What: Retrieve prior reasoning over similar analyses (e.g., cohort definitions, metric anomalies, causal checks), improving reproducibility and speed.
    • Tools/workflows: Jupyter/Notebooks integration; "Analysis TraceStore" indexing SQL, Python, and narrative rationale; T3-Semantic to compress insights.
    • Assumptions/dependencies: Access to validated analytics write-ups; governance for metric definitions; versioning of datasets.
  • Scientific literature Q&A assistants (academia/science)
    • What: Use T3-Struct traces to guide multi-step reasoning for graduate-level Q&A (GPQA-like), complementing citation retrieval with process-level guidance.
    • Tools/workflows: Hybrid RAG (papers + thinking traces); top-k trace blending; answer normalization and verification scripts.
    • Assumptions/dependencies: High-quality, domain-matched traces; prevention of overreliance on outdated heuristics; careful decontamination.
  • Cost-optimized LLM workflows in enterprises (cross-industry)
    • What: Reduce inference cost by shifting some compute into compact retrieved context (observed up to ~15% cheaper for certain models) while maintaining or improving accuracy.
    • Tools/workflows: Cost-aware routing to T3 vs. no-RAG; pricing-aware prompt shaping (input vs. output token ratios); A/B testing.
    • Assumptions/dependencies: Model pricing regimes where input is cheaper than output; retrieval precision to avoid irrelevant context that inflates input tokens.
  • Recruitment and assessment tools for problem solving (HR/education)
    • What: Provide candidates with "hint packs" derived from T3-Reflect traces that surface common pitfalls in logic/math/coding tasks without revealing full solutions.
    • Tools/workflows: Assessment platforms plug-in; dynamic hint difficulty control; per-item decontamination checks.
    • Assumptions/dependencies: Valid alignment with test integrity; curated trace sets that don't leak answers.
  • Auditable reasoning for AI evaluation (policy/ML governance)
    • What: Store and retrieve thinking traces used during inference to support audit trails and error analysis; compare performance with vs. without traces.
    • Tools/workflows: "TraceLedger" for versioned storage; standardized metadata (model, date, domain, decontamination hash); evaluation harness integration.
    • Assumptions/dependencies: Organizational policy for trace retention; privacy and compliance controls; clear contamination policy.

Long-Term Applications

These opportunities likely require further research, larger-scale trace corpora, domain-specific governance, or tooling maturation.

  • Clinical decision support with process-aware guidance (healthcare)
    • What: Retrieve diagnostic reasoning pathways (differentials, red flags, "do-not-miss" checks) via T3-Reflect/Struct to assist clinicians.
    • Tools/workflows: EHR-integrated trace retrieval; verifiable pathways with references; guardrails for scope-of-practice.
    • Assumptions/dependencies: Clinically validated, peer-reviewed traces; regulatory approval; robust de-identification; liability frameworks.
  • Compliance and risk analysis copilots (finance/legal)
    • What: Retrieve argument structures, precedent-based reasoning, and common failure modes for regulatory interpretations or policy audits.
    • Tools/workflows: Domain-specific trace ontologies; chain-of-reasoning provenance; dual retrieval (statutes/precedents + traces).
    • Assumptions/dependencies: Up-to-date legal corpora; jurisdiction-aware decontamination; explainability standards.
  • Autonomous agents with reusable thought memory (robotics/agents)
    • What: Equip agents with a "Thought Memory" that retrieves prior task trajectories (planning, tool-use sequences, failure recovery) to improve long-horizon reliability.
    • Tools/workflows: Thought graph index (building on T3 + Retrieval-of-Thought concepts); dynamic assembly of templates; reflective updates post-execution.
    • Assumptions/dependencies: Safety evaluation in real environments; task generalization; latency constraints for on-device retrieval.
  • Industry playbooks as structured reasoning libraries (manufacturing/energy)
    • What: Convert tacit expert procedures into T3-Struct traces for maintenance, safety checks, and optimization; retrieve by equipment state and sensor signatures.
    • Tools/workflows: IoT telemetry-to-trace retrieval; procedural verification; multilingual trace normalization.
    • Assumptions/dependencies: Data partnerships; union/safety rules; continuous updates to reflect equipment changes.
  • National/sectoral "trace commons" (policy/standards)
    • What: Public repositories of decontaminated thinking traces for education, research, and benchmarking; standardized metadata and sharing protocols.
    • Tools/workflows: Open "Trace Commons" API; auditing and decontamination services; licensing regimes for trace reuse.
    • Assumptions/dependencies: Incentives for contribution; governance for misuse prevention; quality control and bias monitoring.
  • Marketplace and exchange for domain traces (platforms)
    • What: Curated, licensed trace bundles (e.g., tax preparation reasoning, lab protocols, cybersecurity triage) sold to organizations to upgrade their RAG stacks.
    • Tools/workflows: Quality scoring; contamination risk guarantees; vertical-specific retrieval adapters.
    • Assumptions/dependencies: IP clarity; contractual compliance; measurable ROI vs. in-house trace generation.
  • Curriculum design and adaptive learning with misconception libraries (education)
    • What: At-scale T3-Reflect libraries of common student errors to drive adaptive practice sequences and feedback generation.
    • Tools/workflows: Alignment with standards (e.g., CCSS, NGSS); teacher dashboards showing misconception heat maps; controlled hinting policies.
    • Assumptions/dependencies: Longitudinal data; fairness and accessibility; teacher adoption and training.
  • Safety, alignment, and oversight tools (AI safety)
    • What: Use trace retrieval to detect faulty reasoning patterns, steer models away from dangerous chains, and document mitigations.
    • Tools/workflows: "Red-flag Reflect" traces for high-risk domains; alignment policies encoded as retrievable checks; sandboxed evaluation suites.
    • Assumptions/dependencies: Consensus on risky patterns; continuous red-teaming; compatibility with frontier models' context limits.
  • Enterprise "reasoning memory" mesh (cross-industry)
    • What: Organization-wide index of validated reasoning traces across departments (engineering, ops, sales, finance) to avoid rediscovering solutions and propagate best practices.
    • Tools/workflows: Federated trace indexing; access control; lineage tracking; KPI-linked feedback loops (which traces helped).
    • Assumptions/dependencies: Data silos and governance; change management; security and compliance.
  • On-device/private trace caches (privacy-first applications)
    • What: Personal or edge caches of thinking traces for offline or privacy-sensitive use (e.g., medical note drafting, personal finance planning).
    • Tools/workflows: Lightweight retrievers; compression-focused T3 variants; periodic secure sync.
    • Assumptions/dependencies: Efficient on-device retrieval; secure enclaves; differential privacy for shared improvements.

Cross-cutting assumptions and dependencies

Before adopting these applications, consider the following factors that influence feasibility and ROI:

  • Trace quality over quantity: Results show stronger “thinker” models produce more useful traces than simply scaling corpus size.
  • Domain match: Math-heavy corpora transfer partially to code/science; domain-specific traces improve gains and reliability.
  • Decontamination and leakage: Rigorous n-gram/semantic filters to avoid test contamination, IP leakage, or answer leakage in assessments.
  • Retrieval precision: Chunking and transformation (T3) often outperform full raw traces; poor retrieval can negate benefits.
  • Cost model sensitivity: Benefits are largest when input tokens are cheap relative to output tokens; monitor vendor pricing and context-window costs.
  • Governance, privacy, and IP: Storing and sharing reasoning traces may expose sensitive logic or PII; require policy, contracts, and auditing.
  • Human factors: In education and high-stakes domains, ensure traces align with pedagogy, regulations, and expert workflows; provide override and verification steps.
  • Tooling maturity: Productionizing needs MLOps for trace generation, transformation, indexing, evaluation, and drift monitoring.
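
The cost model sensitivity point above can be made concrete with a small calculation. The following is a minimal sketch, not the paper's cost accounting: it assumes cost is linear in input and output tokens, and the `Pricing` class, the prices, and the token counts are all hypothetical values chosen for illustration. It shows how the same change in token usage (more input from retrieved traces, less output from shorter reasoning) can lower or raise cost depending on the input/output price ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real vendor prices)."""
    input_per_million: float
    output_per_million: float


def query_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one query under a simple linear token-pricing assumption."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def relative_cost_change(pricing: Pricing,
                         base_in: int, base_out: int,
                         rag_in: int, rag_out: int) -> float:
    """Relative change in cost when moving from the no-RAG to the RAG setting.

    Negative values mean RAG is cheaper than the baseline.
    """
    base = query_cost(pricing, base_in, base_out)
    rag = query_cost(pricing, rag_in, rag_out)
    return (rag - base) / base


if __name__ == "__main__":
    # Hypothetical token counts: retrieval adds 2,000 input tokens and the
    # model emits 1,500 fewer output tokens.
    base_in, base_out = 500, 10_000
    rag_in, rag_out = 2_500, 8_500

    for name, pricing in [("cheap input", Pricing(0.5, 4.0)),
                          ("flat pricing", Pricing(4.0, 4.0))]:
        change = relative_cost_change(pricing, base_in, base_out, rag_in, rag_out)
        print(f"{name}: {change:+.1%}")
```

Under these made-up numbers, RAG lowers cost when input tokens are much cheaper than output tokens and raises it when both are priced equally, which is why pricing should be re-checked per vendor before assuming savings.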

Glossary

  • Ablation: A controlled analysis that varies or removes components to measure their effect on performance. "We further provide an ablation on the number of retrieved documents in Appendix D, showing that retrieving three documents yields the most stable performance."
  • Average@4: An evaluation setting reporting the average score across four independent samples per query. "For larger benchmarks such as GPQA-Diamond and LiveCodeBench, we use 4 samples per query and report Average@4."
  • Average@8: An evaluation setting reporting the average score across eight independent samples per query. "For AIME, where the benchmark is small, we use 8 samples per query and report the average across them (Average@8)."
  • Chain-of-thought: A prompting technique that elicits explicit, step-by-step reasoning from a model. "Prior work has improved reasoning through prompting strategies such as chain-of-thought (Wei et al., 2022; Wang et al., 2022)"
  • Chunked trajectories: Reasoning traces split into fixed-length segments for retrieval and context efficiency. "we compare retrieval over full trajectories... with chunked trajectories, where traces are split into fixed-length segments of 512 tokens."
  • Contrastive form: A representation emphasizing differences (e.g., between mistakes and correct approaches) to guide reasoning. "This strategy rewrites a reasoning trace in a contrastive form focused on mistakes and how to avoid them."
  • Contamination: Unwanted overlap between training/retrieval data and evaluation queries that can inflate measured performance. "Because the trace corpus is built from a fully separate auxiliary problem set and generated by different models than those used at inference time, it also remains cleanly separated from the evaluation queries and reduces the risk of contamination."
  • Contriever: A dense retrieval model used to index and serve large corpora for neural search. "DS-Serve (Liu et al., 2026), which serves the full corpus with Contriever (Izacard et al., 2021), enabling a comparison under larger-scale retrieval with a different retriever."
  • Cost-accuracy trade-off: The balance between inference expense (e.g., tokens, compute) and achieved accuracy. "Figure 1 summarizes the average cost-accuracy trade-off across the three benchmarks."
  • Datastore: A large indexed store of tokenized data used to support retrieval at inference time. "Shao et al. (2024) show that increasing datastore size can improve retrieval-based LLMs and introduce MassiveDS, a 1.4T-token datastore for studying inference-time scaling."
  • Decontamination: The process of removing or filtering data that overlaps with test sets to prevent leakage. "Decontamination. Following prior work (Borgeaud et al., 2022; Lyu et al., 2025), we decontaminate both collections against the evaluation benchmarks by removing samples whose similarity to an evaluation query exceeds a 13-gram Jaccard threshold."
  • e5-base: A sentence/embedding encoder used as the primary retriever for queries and documents. "For retrieval, we use e5-base as our primary encoder for both queries and thinking traces to retrieve top-3 documents."
  • EleutherAI LM Evaluation Harness: A standardized toolkit for benchmarking LLMs across tasks. "We evaluate retrieval-augmented reasoning using the EleutherAI LM Evaluation Harness (Gao et al., 2024) with custom task definitions."
  • Exact-match accuracy: A metric counting a prediction as correct only if it exactly matches the gold answer. "For AIME and GPQA-Diamond, we report exact-match accuracy."
  • Factual grounding: Supplying verified information in the prompt to anchor generation and reduce errors. "RAG has become a standard approach for improving LLMs on knowledge-intensive tasks by retrieving external documents that provide factual grounding and reduce hallucinations"
  • Frontier models: The most advanced, contemporary LLMs. "We run extensive experiments across multiple frontier models, including GPT-OSS-120B (OpenAI Team, 2025a), GPT-5 (OpenAI Team, 2025b), and Gemini-2.5-Flash (Gemini Team, 2023)"
  • Hallucinations: Fabricated or ungrounded content generated by a model that does not reflect factual truth. "retrieving external documents that provide factual grounding and reduce hallucinations"
  • Inference-time scaling: Improving performance by scaling resources or retrieval during inference rather than training. "introduce MassiveDS, a 1.4T-token datastore for studying inference-time scaling."
  • Jaccard threshold: A similarity cutoff based on Jaccard index (here over n-grams) used to filter near-duplicates. "removing samples whose similarity to an evaluation query exceeds a 13-gram Jaccard threshold."
  • No RAG baseline: The comparison setting where the model answers without any retrieved context. "We compare our approach against the No RAG baseline where the model answers the query without any retrieved context."
  • Open-domain question answering: QA where evidence must be found from broad, unstructured sources rather than a closed set. "This paradigm has been highly effective for factual and open-domain question answering, where the main challenge is access to relevant information."
  • OpenRouter: A hosted API gateway providing an OpenAI-compatible interface to query models. "We query the target model through an OpenRouter-hosted OpenAI-compatible interface."
  • pass@1: A code-generation metric measuring the probability that the first sampled program passes all tests. "For LiveCodeBench, each sampled program is evaluated using the standard pass@1 criterion, and the reported score is the average over 4 samples."
  • Procedural scaffolds: Concise, structured step sequences that guide the solution process. "Structural normalization turns them into concise procedural scaffolds that are easier to match, read, and reuse as inference-time guidance."
  • Retrieval-augmented generation (RAG): A paradigm where a generator conditions on retrieved external context to improve outputs. "Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation."
  • Reflection: A T3 transformation that reframes traces to highlight mistakes and corrective strategies. "Reflection (Reflect). This strategy rewrites a reasoning trace in a contrastive form focused on mistakes and how to avoid them."
  • Retriever: The component that selects the most relevant items from a corpus given a query. "Given $C_{\mathcal{T}}$, a retriever $R$ returns the top-$k$ units $D(q; C_{\mathcal{T}}, k) = \{\tau_1, \ldots, \tau_k\}$."
  • Retrieval corpus: The collection of documents or traces against which queries are retrieved. "These transformed traces form the retrieval corpus."
  • Retrieval units: The atomic pieces (e.g., full or chunked traces) that are the targets of retrieval. "where each retrieval unit corresponds either to a full or chunked raw trajectory $\tau_i \in \mathcal{T}$."
  • Sampling temperature: A decoding parameter controlling randomness in token sampling. "we allow up to 16K generation tokens and use a sampling temperature of 0.6 when applicable."
  • Semantic Distillation: A T3 transformation that compresses traces to their core ideas, omitting lower-level steps. "Semantic Distillation (Semantic). This strategy keeps the core idea of a reasoning trace while removing lower-level detail."
  • Structural Normalization: A T3 transformation that rewrites traces into clean, canonical step-by-step procedures. "Structural Normalization (Struct). This strategy preserves the step-by-step structure of a reasoning trace while rewriting it into a cleaner, more canonical form."
  • Thinking traces: Intermediate reasoning trajectories produced during problem-solving, used here as a retrieval corpus. "retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts."
  • Top-3: A retrieval setting that returns the three highest-scoring items for a query. "For retrieval, we use e5-base as our primary encoder for both queries and thinking traces to retrieve top-3 documents."
  • Top-k: A retrieval setting that returns the k highest-scoring items for a query. "Given $C_{\mathcal{T}}$, a retriever $R$ returns the top-$k$ units $D(q; C_{\mathcal{T}}, k) = \{\tau_1, \ldots, \tau_k\}$."
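
The Decontamination and Jaccard threshold entries describe filtering corpus samples by n-gram overlap with evaluation queries. The following is a minimal sketch of that idea, not the paper's implementation: the function names, the whitespace tokenization, and the default threshold value of 0.5 are assumptions made for illustration, since the quoted text gives only the n-gram size (13) and not the cutoff.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Set of word-level n-grams (assumed tokenization: lowercase + whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def decontaminate(corpus: Iterable[str],
                  eval_queries: Iterable[str],
                  n: int = 13,
                  threshold: float = 0.5) -> List[str]:
    """Keep only samples whose n-gram Jaccard similarity to every
    evaluation query is at or below the threshold (threshold is hypothetical).

    Note: texts shorter than n tokens have no n-grams and are always kept.
    """
    eval_sets = [ngrams(q, n) for q in eval_queries]
    kept = []
    for doc in corpus:
        doc_set = ngrams(doc, n)
        if all(jaccard(doc_set, e) <= threshold for e in eval_sets):
            kept.append(doc)
    return kept
```

A set-based Jaccard comparison is quadratic in the number of corpus samples times evaluation queries; at the corpus sizes discussed in the paper, an inverted index over n-grams or MinHash would likely be needed, which this sketch omits.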
