RAG over Thinking Traces Can Improve Reasoning Tasks
Abstract: Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Interestingly, RAG on T3 also incurs little or no extra inference cost, and can even reduce inference cost by up to 15%. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
Explain it Like I'm 14
What is this paper about?
This paper asks a simple question with a big twist: Can AI solve math, science, and coding problems better if, instead of looking up facts on the web, it looks up how other AIs thought through similar problems? The authors show that the answer is yes. They call these step-by-step solution attempts "thinking traces," and they build a system that retrieves helpful traces to guide a model's reasoning. They also introduce T3, a way to clean and shrink those traces so they're easier to reuse.
What questions did the researchers ask?
- If we use Retrieval-Augmented Generation (RAG), which means "look things up before answering," but retrieve thinking steps (not regular web pages), does reasoning improve?
- Can we transform raw thinking traces into clearer, shorter "how-to" guides that help even more?
- Do these ideas work across different tasks (math, science questions, coding) and different AI models?
- Can this approach even lower the cost of running the AI?
How did they do it? (With simple analogies)
Think of an AI taking a test. Before answering a new question, it's allowed to quickly flip through a small "library" of worked solutions from past problems. But instead of reading full textbooks (web pages), it skims other students' scratch work: how they thought, what steps they tried, what mistakes they avoided. That's the main idea.
Here are the key pieces, explained in everyday language:
- Retrieval-Augmented Generation (RAG): Before answering, the AI searches a library for helpful material to read. Traditionally, this library is made of web articles or documents. Here, the library is made of thinking traces: step-by-step solution attempts from other problems.
- Thinking traces: These are like a student's scratch paper. They include the steps, key ideas, and sometimes the wrong turns taken while solving a problem, not just the final answer.
- T3 (Transformation of Thinking Traces): Raw scratch work can be long and messy, so T3 turns it into tidy, quick-to-use notes. The paper uses three styles:
- Struct (Structural Normalization): Turns a messy solution into a clean, numbered "recipe" of steps, like a well-formatted lab procedure.
- Semantic (Semantic Distillation): Boils down the core idea ("the main trick") without all the low-level details, like a one-paragraph strategy summary.
- Reflect (Reflection): A guide to mistakes and fixes: what people usually do wrong and how to avoid it, plus a short note on the correct approach. Think "common pitfalls" and "pro tips."
- Chunking: Breaking long traces into smaller pieces so the AI retrieves only the most relevant parts. This reduces noise and distraction.
- Benchmarks (the "tests" they used):
- AIME 2025-2026: Very challenging math competition problems.
- GPQA-Diamond: Tough graduate-level science questions.
- LiveCodeBench: Programming problems that test code generation.
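The chunking idea from the list above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `chunk_trace` is hypothetical, whitespace splitting stands in for a real tokenizer, and the 512-token segment length is the value quoted in the glossary entry on chunked trajectories.

```python
def chunk_trace(tokens, chunk_size=512):
    """Split a list of tokens into consecutive fixed-length segments.

    The last segment may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Assumption: whitespace splitting approximates tokenization for this demo.
trace = "step " * 1200
chunks = chunk_trace(trace.split(), chunk_size=512)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)  # [512, 512, 176]
```

Each chunk would then be indexed as its own retrieval unit, so a query can pull in only the relevant part of a long trace.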
The pipeline works like this:
- Offline, a strong model (a good โthinkerโ) solves lots of practice problems and produces thinking traces.
- A smaller model uses T3 to clean and compress those traces into a retrieval-friendly library.
- At test time, when a new question arrives, a retriever grabs the top few most relevant trace snippets.
- The solver model reads those hints and writes the final answer.
The "thinker," the "transformer," and the "solver" can be different models, which makes the library reusable across systems.
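The test-time half of this pipeline (retrieve a few trace snippets, then hand them to the solver as hints) might look roughly like the sketch below. Everything here is illustrative: `embed`, `retrieve`, and `build_prompt` are hypothetical names, the bag-of-words cosine similarity is a toy stand-in for the dense encoder the paper reports using (e5-base), and the prompt layout is an assumption rather than the paper's template.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system uses a dense encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=3):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, hints):
    """Place retrieved trace snippets before the problem (hypothetical layout)."""
    hint_block = "\n\n".join(f"Hint {i + 1}:\n{h}" for i, h in enumerate(hints))
    return f"{hint_block}\n\nProblem:\n{query}\n\nSolve step by step."


# A tiny stand-in for a library of transformed thinking traces.
corpus = [
    "count lattice paths with binomial coefficients and symmetry",
    "use modular arithmetic to find the remainder of a large power",
    "apply the triangle inequality to bound side lengths",
    "binary search over the answer when the predicate is monotonic",
]
query = "find the remainder when a large power is divided by 7"
hints = retrieve(query, corpus, k=2)
prompt = build_prompt(query, hints)
print(prompt)
```

The resulting `prompt` string is what a solver model would receive; the offline steps (generating and transforming traces) only determine what goes into `corpus`.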
What did they find, and why is it important?
Here are the main takeaways:
- Retrieving thinking traces beats retrieving web pages for reasoning. For hard math (AIME), pulling in thinking traces gave large, consistent gains, often much bigger than using standard web or textbook corpora. Even very strong models got better with trace retrieval.
- Cleaned-up traces (T3) work even better than raw traces. Because raw scratch work can be long and noisy, T3's tidy formats (Struct, Semantic, Reflect) help the solver focus on what matters. Which format is best depends on the task:
- Math often benefits from Reflect or Semantic (avoid common traps + capture the key idea).
- Science and code also gain, though improvements are smaller than math.
- Quality matters more than quantity. Traces created by a stronger "thinking" model helped more than having a larger pile of lower-quality traces. In other words, better worked examples beat just having more of them.
- Seeing the steps helps more than seeing only final answers. Retrieving full reasoning (the "how") is usually more helpful than retrieving just the final outputs from similar problems.
- It can even save money. Sometimes, giving the model the right hints up front means it writes less, which can reduce inference cost (in the best case reported, up to about 15% cheaper per question). In general, T3 gave the best balance of accuracy and cost.
Why this matters: Many people believed retrieval helps when you need facts, but not when you need to reason. This paper shows the limit wasn't retrieval; it was the kind of stuff we retrieved. If you retrieve the right "thinking help," reasoning improves.
What could this change in the real world?
- Smarter study helpers and tutors: Imagine an AI math tutor that doesn't just show answers, but retrieves short, clear "how-to" strategies and common mistakes to avoid for a new problem.
- Better AI tools for hard tasks: Coding assistants and scientific Q&A systems could become more reliable by reusing the best past reasoning patterns, not just documentation.
- Reusable reasoning libraries: Teams can build shared libraries of thinking traces that help many models, including newer or smaller ones. This is like handing down high-quality study notes to the next class.
- Lower costs (sometimes): If the hints are strong, the AI can do less trial-and-error, which can cut token usage and save money.
- A new research direction: Instead of only training models to internalize reasoning, we can store, clean, and retrieve reasoning externally. That opens up safer, more controllable systems, because you can inspect, curate, and improve the library over time.
Final note: While results are strong, especially for math, gains vary by task and model, and picking the right transformation (Struct, Semantic, or Reflect) matters. Still, the big idea holds: giving AI access to good "worked examples of thinking," not just facts, can significantly improve how well it reasons.
Knowledge Gaps
Knowledge Gaps, Limitations, and Open Questions
Below is a concise, actionable list of what remains uncertain or unexplored in the paper and would benefit from targeted follow-up research:
- Corpus scaling laws: How do gains scale with the number of thinking traces beyond 59k-114k? Identify saturation points, diminishing returns, and optimal corpus sizes per task.
- Domain coverage: The trace corpus is math-heavy; quantify how performance changes with domain-balanced or domain-specific trace corpora for science QA, code, and other domains.
- Task generality: Evaluate on broader reasoning tasks (multi-hop commonsense, planning/agents, theorem proving, formal logic, program synthesis with tool use, multi-turn tutoring).
- Cross-modal reasoning: Assess whether thinking traces help multimodal reasoning (e.g., geometry with diagrams, charts, code + execution traces).
- Cross-lingual transfer: Do traces in one language benefit queries in another? What representations enable cross-lingual reuse?
- Trace quality assessment: Develop automatic metrics/filters to detect and discard low-quality or erroneous reasoning traces prior to indexing.
- Robustness to noisy/misleading traces: Quantify the impact of incorrect or adversarially crafted traces on solver accuracy; design defenses (e.g., trace consistency checks).
- Data poisoning risk: Analyze and harden the pipeline against malicious trace injection at index time (security model, provenance, signatures).
- Decontamination sufficiency: A 13-gram Jaccard threshold may miss paraphrased or step-level leaks; study stronger semantic decontamination and step-structure similarity filters.
- Contamination auditing: Provide per-problem audits of retrieved traces to ensure no near-duplicate solutions or verbatim reasoning from evaluation items slipped through.
- Retriever choice and training: Only e5-base was used; benchmark specialized retrievers (Contriever, ColBERT, GTR, multi-vector, hybrid sparse+dense) and fine-tune retrievers on reasoning queries.
- Retrieval hyperparameters: Systematically ablate top-k, chunk size, and passage granularity (e.g., step-level indexing vs fixed-length chunks) across tasks.
- Query-aware transformation: T3 is offline and query-agnostic; explore lightweight, on-the-fly, query-conditioned rewrites or selective expansion of retrieved traces.
- Transformation model quality: Only one small model (Gemini-2-Flash-Lite) was used; study how transformation quality and model family affect downstream gains and cost.
- Transformation selection policy: Different T3 variants (Struct/Semantic/Reflect) win on different tasks; learn a per-query selector or mixture-of-transformations policy.
- Compression vs fidelity: Characterize the trade-off between trace compression level and utility; identify minimal information needed for consistent gains.
- Provenance and interpretability: Record and surface source thinker, confidence, and validation signals with traces to let solvers weight or ignore context appropriately.
- Cost accounting completeness: Token-cost comparisons omit retrieval/index latency, memory footprint, and network overhead; include end-to-end wall-time and system cost.
- Fair compute comparisons: Normalization across conditions (e.g., generation token budgets, number of samples) needs stricter control to isolate RAG effects.
- Prompt-format sensitivity: Study how hint placement, formatting, and instruction phrasing affect solver uptake of retrieved traces and cost.
- Solver-trace alignment: Analyze how model family, decoding settings, and CoT prompting modulate benefits; some models showed cost increases, and the cause is unexplained.
- Combination with iterative RAG: Only retrieve-then-generate was tested; evaluate stepwise retrieval, planner-critic loops, and verifier-guided retrieval integration.
- Knowledge vs process retrieval: For code and science tasks that need facts/APIs, investigate hybrid retrieval that mixes factual documents with thinking traces.
- Error taxonomy and avoidance: Reflect helps in a case study; build a systematic library of common error patterns per domain and quantify their effect at scale.
- Failure modes where RAG hurts: Identify problem categories where traces distract or mislead; develop gating mechanisms to abstain from retrieval.
- Transfer limits across thinkers: Beyond three thinkers, probe transfer across wider families and model sizes; identify characteristics of "portable" traces.
- Coverage estimation: Define and compute "reasoning coverage" of the trace index for a target distribution; design active curation to close coverage gaps.
- Online updating: Explore mechanisms to add newly observed traces safely at runtime (experience replay) without catastrophic drift or contamination.
- Evaluation breadth: Beyond exact-match and pass@1, assess faithfulness of reasoning, chain correctness, and calibration under trace-augmented decoding.
- Per-subdomain analysis: Break down gains by math topic (algebra/geometry/combinatorics), science field (physics/chem/biology), and code category (algorithms/systems/ML).
- Cross-lingual/code-switch prompts: Test robustness when queries contain mixed languages, symbols, or non-ASCII math notation.
- Long-context limits: Investigate behavior under very long problems and multi-step chains near context limits; study summarization-and-retrieval hybrids.
- Licensing and privacy: Clarify legal/ethical implications of releasing traces derived from proprietary models/datasets and handling traces with potential sensitive content.
- Reproducibility risks: Hosted API models evolve; fix versions/seeds and release all prompts/indices to ensure results can be replicated over time.
- Benchmark diversity: Add larger, more varied, and privately sourced tasks to validate external generalization and reduce overfitting to public benchmarks.
- Theoretical grounding: Develop a formal account of when and why process-level retrieval should help (e.g., bias-variance trade-offs, inductive bias alignment).
- Multi-agent settings: Test whether shared trace indices enable coordination and faster convergence for agentic teams or tool-using systems.
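The decontamination item in the list above refers to a 13-gram Jaccard filter. A minimal sketch of such a filter is shown here, under stated assumptions: the helper names are hypothetical, n-grams are built over whitespace-separated lowercase words, and the `threshold=0.5` cutoff is a placeholder, since the summary does not state the value the paper used.

```python
def ngrams(text, n=13):
    """Set of word-level n-grams; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def decontaminate(corpus, eval_queries, n=13, threshold=0.5):
    """Keep only documents whose n-gram overlap with every eval query is <= threshold."""
    eval_sets = [ngrams(q, n) for q in eval_queries]
    kept = []
    for doc in corpus:
        doc_set = ngrams(doc, n)
        if all(jaccard(doc_set, e) <= threshold for e in eval_sets):
            kept.append(doc)
    return kept


eval_query = ("find the number of ordered pairs of positive integers x and y "
              "such that x plus y equals one hundred")
leaked = eval_query + " show your work"
clean = ("use the shoelace formula to compute the area of a polygon "
         "given its vertex coordinates in order")
kept = decontaminate([leaked, clean], [eval_query])
print(kept)
```

As the list item notes, a filter like this only catches near-verbatim overlap; a paraphrased copy of an evaluation problem shares few 13-grams and would pass through.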
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that can be built now by leveraging thinking-trace RAG and the T3 transformations (Struct, Semantic, Reflect). Each item includes suggested tools/workflows and key dependencies to consider.
- Coding copilot boost in IDEs (software)
- What: Improve pass rates and reduce hallucinations in code generation by retrieving structured reasoning traces for similar tasks (e.g., algorithm patterns, debugging steps). Use T3-Struct for reusable scaffolds and T3-Reflect to surface common pitfalls.
- Tools/workflows: VS Code/JetBrains extensions; a "TraceDB" index built from high-quality code traces (e.g., LiveCodeBench-like corpora); e5/Contriever retriever; top-k insertion into the prompt.
- Assumptions/dependencies: Availability of decontaminated code reasoning traces; licensing/IP for traces; strong retriever quality; IDE context window limits.
- Math and STEM tutoring assistants (education)
- What: Deliver step-by-step hints, high-level insights, and "what to avoid" tips by retrieving thinking traces for similar problems. T3-Semantic provides concise insights; T3-Reflect helps with misconception diagnosis.
- Tools/workflows: LMS/LXP integration; hint-augmented prompts; student-progress tagging to select appropriate "TraceCards."
- Assumptions/dependencies: Age/level-aligned trace corpora; pedagogy-aligned hinting; privacy when storing student attempts.
- Internal troubleshooting copilots for IT and DevOps (software/IT ops)
- What: Retrieve structured postmortem-like reasoning traces for outages or build failures to guide triage. T3-Reflect highlights common misdiagnoses; T3-Struct provides action checklists.
- Tools/workflows: Integration with ticketing (Jira/ServiceNow) and logs (ELK, Datadog); per-incident trace capture and transformation pipeline; RAG on top of past incidents.
- Assumptions/dependencies: Secure storage for sensitive operational data; de-identification; retriever tuned to logs/error signatures.
- Customer support copilot with diagnostic playbooks (services)
- What: Shorten time-to-resolution by retrieving diagnostic reasoning traces from resolved tickets for similar symptoms. T3-Struct for SOP-like steps; T3-Reflect to prevent repeated errors.
- Tools/workflows: CRM/CS platforms (Zendesk, Salesforce) plugins; symptom-to-trace retrieval; "quick checks" cards in the agent UI.
- Assumptions/dependencies: Sufficient volume of high-quality resolved cases; privacy/PII handling; domain adaptation across products.
- Data science/analytics assistants (software/enterprise analytics)
- What: Retrieve prior reasoning over similar analyses (e.g., cohort definitions, metric anomalies, causal checks), improving reproducibility and speed.
- Tools/workflows: Jupyter/Notebooks integration; "Analysis TraceStore" indexing SQL, Python, and narrative rationale; T3-Semantic to compress insights.
- Assumptions/dependencies: Access to validated analytics write-ups; governance for metric definitions; versioning of datasets.
- Scientific literature Q&A assistants (academia/science)
- What: Use T3-Struct traces to guide multi-step reasoning for graduate-level Q&A (GPQA-like), complementing citation retrieval with process-level guidance.
- Tools/workflows: Hybrid RAG (papers + thinking traces); top-k trace blending; answer normalization and verification scripts.
- Assumptions/dependencies: High-quality, domain-matched traces; prevention of overreliance on outdated heuristics; careful decontamination.
- Cost-optimized LLM workflows in enterprises (cross-industry)
- What: Reduce inference cost by shifting some compute into compact retrieved context (observed up to ~15% cheaper for certain models) while maintaining or improving accuracy.
- Tools/workflows: Cost-aware routing to T3 vs. no-RAG; pricing-aware prompt shaping (input vs. output token ratios); A/B testing.
- Assumptions/dependencies: Model pricing regimes where input is cheaper than output; retrieval precision to avoid irrelevant context that inflates input tokens.
- Recruitment and assessment tools for problem solving (HR/education)
- What: Provide candidates with "hint packs" derived from T3-Reflect traces that surface common pitfalls in logic/math/coding tasks without revealing full solutions.
- Tools/workflows: Assessment platform plug-ins; dynamic hint difficulty control; per-item decontamination checks.
- Assumptions/dependencies: Valid alignment with test integrity; curated trace sets that don't leak answers.
- Auditable reasoning for AI evaluation (policy/ML governance)
- What: Store and retrieve thinking traces used during inference to support audit trails and error analysis; compare performance with vs. without traces.
- Tools/workflows: "TraceLedger" for versioned storage; standardized metadata (model, date, domain, decontamination hash); evaluation harness integration.
- Assumptions/dependencies: Organizational policy for trace retention; privacy and compliance controls; clear contamination policy.
Long-Term Applications
These opportunities likely require further research, larger-scale trace corpora, domain-specific governance, or tooling maturation.
- Clinical decision support with process-aware guidance (healthcare)
- What: Retrieve diagnostic reasoning pathways (differentials, red flags, "do-not-miss" checks) via T3-Reflect/Struct to assist clinicians.
- Tools/workflows: EHR-integrated trace retrieval; verifiable pathways with references; guardrails for scope-of-practice.
- Assumptions/dependencies: Clinically validated, peer-reviewed traces; regulatory approval; robust de-identification; liability frameworks.
- Compliance and risk analysis copilots (finance/legal)
- What: Retrieve argument structures, precedent-based reasoning, and common failure modes for regulatory interpretations or policy audits.
- Tools/workflows: Domain-specific trace ontologies; chain-of-reasoning provenance; dual retrieval (statutes/precedents + traces).
- Assumptions/dependencies: Up-to-date legal corpora; jurisdiction-aware decontamination; explainability standards.
- Autonomous agents with reusable thought memory (robotics/agents)
- What: Equip agents with a "Thought Memory" that retrieves prior task trajectories (planning, tool-use sequences, failure recovery) to improve long-horizon reliability.
- Tools/workflows: Thought graph index (building on T3 + Retrieval-of-Thought concepts); dynamic assembly of templates; reflective updates post-execution.
- Assumptions/dependencies: Safety evaluation in real environments; task generalization; latency constraints for on-device retrieval.
- Industry playbooks as structured reasoning libraries (manufacturing/energy)
- What: Convert tacit expert procedures into T3-Struct traces for maintenance, safety checks, and optimization; retrieve by equipment state and sensor signatures.
- Tools/workflows: IoT telemetry-to-trace retrieval; procedural verification; multilingual trace normalization.
- Assumptions/dependencies: Data partnerships; union/safety rules; continuous updates to reflect equipment changes.
- National/sectoral "trace commons" (policy/standards)
- What: Public repositories of decontaminated thinking traces for education, research, and benchmarking; standardized metadata and sharing protocols.
- Tools/workflows: Open "Trace Commons" API; auditing and decontamination services; licensing regimes for trace reuse.
- Assumptions/dependencies: Incentives for contribution; governance for misuse prevention; quality control and bias monitoring.
- Marketplace and exchange for domain traces (platforms)
- What: Curated, licensed trace bundles (e.g., tax preparation reasoning, lab protocols, cybersecurity triage) sold to organizations to upgrade their RAG stacks.
- Tools/workflows: Quality scoring; contamination risk guarantees; vertical-specific retrieval adapters.
- Assumptions/dependencies: IP clarity; contractual compliance; measurable ROI vs. in-house trace generation.
- Curriculum design and adaptive learning with misconception libraries (education)
- What: At-scale T3-Reflect libraries of common student errors to drive adaptive practice sequences and feedback generation.
- Tools/workflows: Alignment with standards (e.g., CCSS, NGSS); teacher dashboards showing misconception heat maps; controlled hinting policies.
- Assumptions/dependencies: Longitudinal data; fairness and accessibility; teacher adoption and training.
- Safety, alignment, and oversight tools (AI safety)
- What: Use trace retrieval to detect faulty reasoning patterns, steer models away from dangerous chains, and document mitigations.
- Tools/workflows: "Red-flag Reflect" traces for high-risk domains; alignment policies encoded as retrievable checks; sandboxed evaluation suites.
- Assumptions/dependencies: Consensus on risky patterns; continuous red-teaming; compatibility with frontier models' context limits.
- Enterprise "reasoning memory" mesh (cross-industry)
- What: Organization-wide index of validated reasoning traces across departments (engineering, ops, sales, finance) to avoid rediscovering solutions and propagate best practices.
- Tools/workflows: Federated trace indexing; access control; lineage tracking; KPI-linked feedback loops (which traces helped).
- Assumptions/dependencies: Data silos and governance; change management; security and compliance.
- On-device/private trace caches (privacy-first applications)
- What: Personal or edge caches of thinking traces for offline or privacy-sensitive use (e.g., medical note drafting, personal finance planning).
- Tools/workflows: Lightweight retrievers; compression-focused T3 variants; periodic secure sync.
- Assumptions/dependencies: Efficient on-device retrieval; secure enclaves; differential privacy for shared improvements.
Cross-cutting assumptions and dependencies
Before adopting these applications, consider the following factors that influence feasibility and ROI:
- Trace quality over quantity: Results show stronger "thinker" models produce more useful traces than simply scaling corpus size.
- Domain match: Math-heavy corpora transfer partially to code/science; domain-specific traces improve gains and reliability.
- Decontamination and leakage: Rigorous n-gram/semantic filters to avoid test contamination, IP leakage, or answer leakage in assessments.
- Retrieval precision: Chunking and transformation (T3) often outperform full raw traces; poor retrieval can negate benefits.
- Cost model sensitivity: Benefits are largest when input tokens are cheap relative to output tokens; monitor vendor pricing and context-window costs.
- Governance, privacy, and IP: Storing and sharing reasoning traces may expose sensitive logic or PII; require policy, contracts, and auditing.
- Human factors: In education and high-stakes domains, ensure traces align with pedagogy, regulations, and expert workflows; provide override and verification steps.
- Tooling maturity: Productionizing needs MLOps for trace generation, transformation, indexing, evaluation, and drift monitoring.
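The cost-model point above (retrieval helps most when input tokens are cheap relative to output tokens) can be made concrete with a small calculation. The sketch below is purely illustrative: the per-million-token prices and all token counts are hypothetical numbers chosen for the example, not figures from the paper or from any vendor.

```python
def query_cost(input_tokens, output_tokens, price_in, price_out):
    """Cost in dollars for one query; prices are dollars per million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def retrieval_pays_off(extra_input_tokens, saved_output_tokens, price_in, price_out):
    """True when the output tokens saved are worth more than the added context."""
    return saved_output_tokens * price_out > extra_input_tokens * price_in


# Hypothetical pricing: input tokens much cheaper than output tokens.
PRICE_IN, PRICE_OUT = 0.5, 4.0

# Hypothetical token counts: retrieved hints add 1,500 input tokens
# and the solver writes 2,000 fewer output tokens.
no_rag = query_cost(500, 10_000, PRICE_IN, PRICE_OUT)
with_rag = query_cost(500 + 1_500, 8_000, PRICE_IN, PRICE_OUT)
relative_saving = (no_rag - with_rag) / no_rag

print(f"no RAG: ${no_rag:.5f}  with RAG: ${with_rag:.5f}  saving: {relative_saving:.1%}")
print(retrieval_pays_off(1_500, 2_000, PRICE_IN, PRICE_OUT))
```

If retrieved context is long or irrelevant, `extra_input_tokens` grows without a matching drop in output tokens, and the same inequality shows the cost going up instead.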
Glossary
- Ablation: A controlled analysis that varies or removes components to measure their effect on performance. "We further provide an ablation on the number of retrieved documents in Appendix D, showing that retrieving three documents yields the most stable performance."
- Average@4: An evaluation setting reporting the average score across four independent samples per query. "For larger benchmarks such as GPQA-Diamond and LiveCodeBench, we use 4 samples per query and report Average@4."
- Average@8: An evaluation setting reporting the average score across eight independent samples per query. "For AIME, where the benchmark is small, we use 8 samples per query and report the average across them (Average@8)."
- Chain-of-thought: A prompting technique that elicits explicit, step-by-step reasoning from a model. "Prior work has improved reasoning through prompting strategies such as chain-of-thought (Wei et al., 2022; Wang et al., 2022)"
- Chunked trajectories: Reasoning traces split into fixed-length segments for retrieval and context efficiency. "we compare retrieval over full trajectories... with chunked trajectories, where traces are split into fixed-length segments of 512 tokens."
- Contrastive form: A representation emphasizing differences (e.g., between mistakes and correct approaches) to guide reasoning. "This strategy rewrites a reasoning trace in a contrastive form focused on mistakes and how to avoid them."
- Contamination: Unwanted overlap between training/retrieval data and evaluation queries that can inflate measured performance. "Because the trace corpus is built from a fully separate auxiliary problem set and generated by different models than those used at inference time, it also remains cleanly separated from the evaluation queries and reduces the risk of contamination."
- Contriever: A dense retrieval model used to index and serve large corpora for neural search. "DS-Serve (Liu et al., 2026), which serves the full corpus with Contriever (Izacard et al., 2021), enabling a comparison under larger-scale retrieval with a different retriever."
- Cost-accuracy trade-off: The balance between inference expense (e.g., tokens, compute) and achieved accuracy. "Figure 1 summarizes the average cost-accuracy trade-off across the three benchmarks."
- Datastore: A large indexed store of tokenized data used to support retrieval at inference time. "Shao et al. (2024) show that increasing datastore size can improve retrieval-based LLMs and introduce MassiveDS, a 1.4T-token datastore for studying inference-time scaling."
- Decontamination: The process of removing or filtering data that overlaps with test sets to prevent leakage. "Decontamination. Following prior work (Borgeaud et al., 2022; Lyu et al., 2025), we decontaminate both collections against the evaluation benchmarks by removing samples whose similarity to an evaluation query exceeds a 13-gram Jaccard threshold."
- e5-base: A sentence/embedding encoder used as the primary retriever for queries and documents. "For retrieval, we use e5-base as our primary encoder for both queries and thinking traces to retrieve top-3 documents."
- EleutherAI LM Evaluation Harness: A standardized toolkit for benchmarking LLMs across tasks. "We evaluate retrieval-augmented reasoning using the EleutherAI LM Evaluation Harness (Gao et al., 2024) with custom task definitions."
- Exact-match accuracy: A metric counting a prediction as correct only if it exactly matches the gold answer. "For AIME and GPQA-Diamond, we report exact-match accuracy."
- Factual grounding: Supplying verified information in the prompt to anchor generation and reduce errors. "RAG has become a standard approach for improving LLMs on knowledge-intensive tasks by retrieving external documents that provide factual grounding and reduce hallucinations"
- Frontier models: The most advanced, contemporary LLMs. "We run extensive experiments across multiple frontier models, including GPT-OSS-120B (OpenAI Team, 2025a), GPT-5 (OpenAI Team, 2025b), and Gemini-2.5-Flash (Gemini Team, 2023)"
- Hallucinations: Fabricated or ungrounded content generated by a model that does not reflect factual truth. "retrieving external documents that provide factual grounding and reduce hallucinations"
- Inference-time scaling: Improving performance by scaling resources or retrieval during inference rather than training. "introduce MassiveDS, a 1.4T-token datastore for studying inference-time scaling."
- Jaccard threshold: A similarity cutoff based on Jaccard index (here over n-grams) used to filter near-duplicates. "removing samples whose similarity to an evaluation query exceeds a 13-gram Jaccard threshold."
- No RAG baseline: The comparison setting where the model answers without any retrieved context. "We compare our approach against the No RAG baseline where the model answers the query without any retrieved context."
- Open-domain question answering: QA where evidence must be found from broad, unstructured sources rather than a closed set. "This paradigm has been highly effective for factual and open-domain question answering, where the main challenge is access to relevant information."
- OpenRouter: A hosted API gateway providing an OpenAI-compatible interface to query models. "We query the target model through an OpenRouter-hosted OpenAI-compatible interface."
- pass@1: A code-generation metric measuring the probability that the first sampled program passes all tests. "For LiveCodeBench, each sampled program is evaluated using the standard pass@1 criterion, and the reported score is the average over 4 samples."
- Procedural scaffolds: Concise, structured step sequences that guide the solution process. "Structural normalization turns them into concise procedural scaffolds that are easier to match, read, and reuse as inference-time guidance."
- Retrieval-augmented generation (RAG): A paradigm where a generator conditions on retrieved external context to improve outputs. "Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation."
- Reflection: A T3 transformation that reframes traces to highlight mistakes and corrective strategies. "Reflection (Reflect). This strategy rewrites a reasoning trace in a contrastive form focused on mistakes and how to avoid them."
- Retriever: The component that selects the most relevant items from a corpus given a query. "Given $C_T$, a retriever $R$ returns the top-$k$ units $D(q; C_T, k) = \{\tau_1, \ldots, \tau_k\}$."
- Retrieval corpus: The collection of documents or traces against which queries are retrieved. "These transformed traces form the retrieval corpus."
- Retrieval units: The atomic pieces (e.g., full or chunked traces) that are the targets of retrieval. "where each retrieval unit corresponds either to a full or chunked raw trajectory $\tau_i \in T$."
- Sampling temperature: A decoding parameter controlling randomness in token sampling. "we allow up to 16K generation tokens and use a sampling temperature of 0.6 when applicable."
- Semantic Distillation: A T3 transformation that compresses traces to their core ideas, omitting lower-level steps. "Semantic Distillation (Semantic). This strategy keeps the core idea of a reasoning trace while removing lower-level detail."
- Structural Normalization: A T3 transformation that rewrites traces into clean, canonical step-by-step procedures. "Structural Normalization (Struct). This strategy preserves the step-by-step structure of a reasoning trace while rewriting it into a cleaner, more canonical form."
- Thinking traces: Intermediate reasoning trajectories produced during problem-solving, used here as a retrieval corpus. "retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts."
- Top-3: A retrieval setting that returns the three highest-scoring items for a query. "For retrieval, we use e5-base as our primary encoder for both queries and thinking traces to retrieve top-3 documents."
- Top-k: A retrieval setting that returns the k highest-scoring items for a query. "Given $C_T$, a retriever $R$ returns the top-$k$ units $D(q; C_T, k) = \{\tau_1, \ldots, \tau_k\}$."
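The Average@4 and Average@8 entries in this glossary describe averaging correctness over several independent samples per query. A minimal sketch of that computation follows; the function name `average_at_k` and the 0/1 result lists are hypothetical, and the sketch assumes every query has the same number of samples.

```python
def average_at_k(results):
    """Mean correctness over queries, where each query has k sampled attempts.

    results: list of per-query lists containing 1 (correct) or 0 (incorrect).
    """
    if not results:
        raise ValueError("results must contain at least one query")
    per_query = [sum(samples) / len(samples) for samples in results]
    return sum(per_query) / len(per_query)


# Two queries, four samples each (an Average@4 setting).
results = [
    [1, 1, 0, 1],  # query 1: 3 of 4 samples correct
    [0, 0, 1, 0],  # query 2: 1 of 4 samples correct
]
score = average_at_k(results)
print(score)  # 0.5
```

For code benchmarks, each 0/1 entry would record whether a sampled program passed all tests, so the same averaging yields the reported pass@1 averaged over samples.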