TraceYourThinking: Tracing Reasoning Processes

Updated 4 July 2026

TraceYourThinking is a research theme that defines and analyzes intermediate cognitive traces beyond final outputs.
Researchers use structured methods like episode segmentation, thought graphs, and subthoughts to diagnose reasoning accuracy and overthinking.
Practical applications include improved model oversight, retrieval-augmented generation, and enhanced interpretability in fields like math, coding, and human-AI interaction.

TraceYourThinking can be understood as a research theme, and in one paper an open-sourced semi-structured chatbot/interview pipeline, devoted to making latent reasoning processes traceable rather than evaluating only final outputs. Across recent work, the traced object ranges from chain-of-thought and sentence-level episodes to minimally complete sub-thoughts, thought graphs, search-session cognitive labels, software execution traces, and users’ self-reported reasons and reactions (Li et al., 16 Oct 2025). The unifying premise is that intermediate structure contains information that final answers, token counts, or click logs alone do not: models may drift after reaching a correct intermediate state, expose stable macro-phases such as analysis and verification, reveal over-exploration or over-verification, or encode user intentions and dissatisfaction that remain absent from visible dialogue (Li et al., 23 Dec 2025).

1. Scope and conceptual motivation

A central motivation of TraceYourThinking work is dissatisfaction with endpoint-only evaluation. In mathematical reasoning, one line of work argues that standard evaluation “generate a full trace once, extract the final answer, judge correctness” can miss information hidden in the trace, because a model may converge to the correct answer from many intermediate states yet drift into an error at the end (Hammoud et al., 29 Apr 2025). In reasoning-model analysis, token-level statistics can show response length but not whether a trace is reasoning abstractly or executing mechanically, whether it is exploring alternatives, whether it is checking its work, or whether it exhibits stable progression or repeated feedback loops (Li et al., 23 Dec 2025). In human–AI interaction, visible chat messages are described as a lossy projection of user cognition, because they capture what users say but not why they sent a prompt or how they privately evaluated a reply (Jin et al., 19 May 2026).

These papers therefore shift the unit of analysis from the answer to the process. Some do so by segmenting traces into cognitively meaningful units; others by intervening on the trace, retrieving over traces, visualizing traces, or collecting traces directly from humans. A plausible implication is that “TraceYourThinking” names not a single algorithm but a family of methods for turning intermediate reasoning into a first-class empirical object.

Focus	Core representation	Representative work
Model reasoning structure	Episodes, subthoughts, thought units, thought graphs	ThinkARM (Li et al., 23 Dec 2025), subthought reasoning (Hammoud et al., 29 Apr 2025), ThinkProbe (Kerkouri et al., 27 Jun 2026)
Oversight and control	Truncation curves, utility traces, latent thought vectors, retrieved traces	TRACE (Wang et al., 1 Oct 2025), overthinking TRACE (Zhang et al., 9 Oct 2025), rethinking (Kong et al., 6 Feb 2026), T3 (Arabzadeh et al., 5 May 2026)
Human and interaction traces	Self-reported thoughts, cognitive labels, belief networks, literate traces	ThoughtTrace (Jin et al., 19 May 2026), cognitive traces in search (Zerhoudi et al., 27 Feb 2026), HugAgent/TraceYourThinking (Li et al., 16 Oct 2025), literate tracing (Sotoudeh, 10 Oct 2025)

2. Structural representations of reasoning

One major strand formalizes reasoning traces as structured cognitive sequences. “Schoenfeld’s Anatomy of Mathematical Reasoning by LLMs” introduces ThinkARM, which adapts Schoenfeld’s Episode Theory to sentence-level annotation of model traces. Its eight functional categories are Read, Analyze, Plan, Implement, Explore, Verify, Monitor, and Answer, with Answer added as an explicit eighth category because LLMs often end with a formatted final response that should be separated from verification or execution (Li et al., 23 Dec 2025). The closely related benchmark paper using DeepSeek-R1 on SAT Mathematics items employs a seven-label sentence-level schema—Read, Analyze, Plan, Implement, Explore, Verify, Monitor—and a three-label paragraph-level schema—General, Explore, Verify—over 38 math problems, 915 paragraphs, and 3,087 sentences (Li et al., 18 Sep 2025).

ThinkARM’s episode-level view yields a reproducible three-phase “heartbeat”: Initialization dominated by Read, Analyze, Plan, Explore; Execution with Implement peaking in the middle; and Convergence in which Verify and Monitor rise near the end before Answer (Li et al., 23 Dec 2025). Correctness diagnostics are not limited to length. The paper engineers three feature groups—global statistics, episode intensities, and a flattened $8 \times 8$ transition matrix—and applies a Lasso-regularized logistic regression. The most predictive positive features include Explore $\to$ Monitor, Explore $\to$ Analyze, Monitor $\to$ Analyze, and Read $\to$ Verify, whereas negative features include high Explore ratio, Explore $\to$ Verify, and Implement $\to$ Read. The suggested interpretation is that successful traces re-route uncertainty into monitoring and renewed conceptual work rather than proceeding blindly (Li et al., 23 Dec 2025).

Graph-based formalisms generalize this idea beyond linear episode sequences. ThinkProbe converts a > trace into a directed Thought Graph with cycles, 8 node types, 6 edge types, and a 19-metric 5D cognitive profile spanning Breadth, Depth, Structure, Metacognitive, and Efficiency (Kerkouri et al., 27 Jun 2026). Its node taxonomy consists of HYP, RFR, JUS, SPC, CRT, CMP, MET, and SYN; its edge taxonomy consists of SEQ, BRCH, ELAB, BACK, SYNT, and CRIT. The paper reports that 95.6% of traces contain at least one directed cycle, with median cycle length 14 nodes, which is offered as evidence that DAGs or trees fail to capture backtracking, convergence after branching, and cross-branch synthesis. On 4,200 traces from 7 native reasoning models across 200 open-ended questions and 10 cognitive domains, between-model variance exceeds between-domain variance by up to fourfold across four of five dimensions, while Structure remains genuinely domain-sensitive (Kerkouri et al., 27 Jun 2026).

A related structural account of overthinking defines a sub-thought by three criteria—self-contained, complete, and answer-bearing—then infers discourse relations such as verification, correction, backtrack, branching out, and sidetrack to build thought progression graphs (Zhang et al., 9 Oct 2025). This work identifies two dominant patterns, Explorer and Late Landing, and introduces a utility-based definition of overthinking: thought continues beyond the point where $\Delta \text{Performance} / \Delta \text{Thought}$ drops below a threshold $\epsilon$ . In a temporal-reasoning case study with $\epsilon=0$ for Qwen3-235B-A22B and $\to$ 0 for Qwen3-32B, the convergence point appears after the eighth sub-thought for both models (Zhang et al., 9 Oct 2025).

3. Trace-based diagnosis, aggregation, and optimization

TraceYourThinking methods are often operational rather than purely descriptive: they manipulate or re-use traces to improve accuracy, detect failure modes, or re-allocate inference-time computation. In “Beyond the Last Answer,” the trace is segmented into sequential subthoughts using linguistic cues such as “Wait,” “Alternatively,” “Another angle,” “But wait,” “Hmm,” “Maybe,” “Looking back,” “Let me,” “Then,” “Now,” “Therefore,” and “Thus” (Hammoud et al., 29 Apr 2025). From each intermediate boundary, the model continues reasoning, a final answer is extracted, and the answers are aggregated by the mode. On AIME2024 and AIME2025, this yields gains up to +13.33% and +10.0% respectively; the paper further reports that it did not observe cases where $\to$ 1 was correct but $\to$ 2 was wrong. The answer distribution’s Shannon entropy functions as a stability signal: low entropy correlates with correct traces and high entropy with instability or uncertainty (Hammoud et al., 29 Apr 2025).

For oversight, “Is It Thinking or Cheating?” introduces TRACE—Truncated Reasoning AUC Evaluation—to detect implicit reward hacking by measuring how early a truncated chain-of-thought already suffices to pass a verifier (Wang et al., 1 Oct 2025). For each truncation point, the method forces a final answer using tags like <answer>, then measures verifier-passing rate. In math it samples 5 answers at temperature 0.7 and computes the fraction that pass the verifier; in code it samples with temperature 0 and computes the fraction of test cases passed. TRACE is the area under the pass-rate versus CoT-length curve, conceptually $\to$ 3. A higher score is more suspicious, because it indicates lower apparent reasoning effort. The paper reports over 65% gains over the strongest 72B CoT monitor in math and over 30% gains over a 32B monitor in coding, and shows that TRACE can discover unknown loopholes during training by clustering scalar TRACE scores with K-means (Wang et al., 1 Oct 2025).

Optimization-oriented work moves from analyzing visible traces to refining latent reasoning states. “Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning” factorizes reasoning as $\to$ 4, where a continuous latent thought vector represents what to reason about and a decoder represents how to reason (Kong et al., 6 Feb 2026). The reported system is a 0.2B-parameter model trained from scratch on GSM8K-Aug, with an 8-layer decoder, 2-layer encoder, 64 latent tokens, and a short context window $\to$ 5. At test time it alternates generate → reflect for 30 rethinking iterations and keeps the trace with the highest likelihood. The paper reports 31.54% on GSM8K, 51.50% on SVAMP, and 68.00% on MultiArith, surpassing baselines with 10 to 15 times more parameters, including a 3B counterpart (Kong et al., 6 Feb 2026).

Retrieval-based methods treat traces as reusable corpora. “RAG over Thinking Traces Can Improve Reasoning Tasks” proposes T3, an offline transformation of raw thinking trajectories into Struct, Semantic, and Reflect variants for retrieval-augmented generation (Arabzadeh et al., 5 May 2026). Two corpora are reported: T3-QwQ-32B with 114K reasoning problems and T3-Gemini-2-thinking with 59K reasoning problems. With e5-base retrieval and top- $\to$ 6, raw and transformed traces improve strong solver models on AIME 2025–2026, GPQA-Diamond, and LiveCodeBench. On AIME, Gemini-2.5-Flash rises from 53.3 without RAG to 83.3 with T3 Semantic, a +56.3% relative gain; GPT-5 rises from 86.7 to 93.3 with T3 Reflect. The paper also reports that T3 can reduce inference cost by up to 15% (Arabzadeh et al., 5 May 2026).

4. Visualization, provenance, and executable explanation

Once traces become long and structurally complex, interpretability depends on interface design. ReTrace addresses the problem that raw reasoning traces can exceed 15,000 tokens, force scrolling and skimming, and blur together problem framing, decomposition, self-checking, correction, and abandonment (Felder et al., 14 Nov 2025). It structures DeepSeek-R1 traces with a validated taxonomy containing four main phases—Problem Definition & Scoping, Initial Solution & Exploration, Iterative Refinement & Verification, and Final Decision—plus fine-grained subphases such as Rephrase, Define_Goal, Decomposition_&_Execution, First_Answer, Confidence_Qualification, Pausing_to_Rethink, Correction, Re-examine, Try_Alternative, Abandonment, Stating_Confidence, and Preparing_Output. Its pipeline comprises M1 Separator, M2 Annotator using Gemini 2.5 Pro, and M3 Visualizer.

ReTrace evaluates two visual forms: Space-Filling Nodes, a treemap-like hierarchy, and Sequential Timeline, a chronology-preserving bar layout (Felder et al., 14 Nov 2025). In a within-subjects study with 18 participants, three conditions, and three trials, both visualizations outperform raw trace text on several comprehension measures. Median strategy-summary quality is 1.50 for both visualizations versus 1.00 for Raw Trace; verification richness counts are 3.00 versus 2.00; and median absolute error for estimating the share of reasoning spent on verification is 0.30 percentage points for both visualizations versus 14.60 pp for Raw Trace, with Space-Filling Nodes significantly better than Raw Trace at $\to$ 7. On SEQ, Sequential Timeline achieves median 6, Space-Filling Nodes 5, and Raw Trace 4, with the difference between Sequential Timeline and Raw Trace significant at $\to$ 8. 11 participants preferred Space-Filling Nodes, 6 preferred Sequential Timeline, and only 1 preferred Raw Trace (Felder et al., 14 Nov 2025).

A software-systems analogue is literate tracing, defined as a document that explains how a software system works by walking the reader through concrete execution traces (Sotoudeh, 10 Oct 2025). TReX, the supporting tool, generates traces that are “guaranteed by construction to be faithful to the program semantics” by interrogating a live program under GDB, with output in HTML or LaTeX and visualization code in Python. TReX provides commands such as setExecutable, runUntil, gdbEvalInt, printCode, printCallStack, printExpressionTable, and singleStepper, and has been used to explain components of the Linux kernel, Git, and GCC. The common design principle across ReTrace and TReX is provenance preservation: every abstraction remains linked back to verbatim trace text or debugger-grounded program state (Sotoudeh, 10 Oct 2025).

TraceYourThinking is not limited to model-generated chain-of-thought. Several papers reconstruct or collect traces for human cognition itself. ThoughtTrace is described as the first large-scale dataset pairing real-world multi-turn human–AI conversations with users’ self-reported reasons and reactions (Jin et al., 19 May 2026). It contains 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations across 20 LLMs. The reason taxonomy has 7 categories and the reaction taxonomy 5 categories. The paper shows that thoughts are semantically distinct from messages: centroid distance is 0.225 for Message → Reason and 0.320 for Reaction → Next message, versus 0.120 for Current message → Next message; linear-probe AUCs are 0.977 and 0.988 versus 0.721. Frontier models—GPT-5.4, Gemini 3.1 Pro Preview, and Claude Opus 4.6—achieve only 2.93 mean similarity for inferred reasons and 2.54 for inferred reactions on a 1–5 semantic-similarity scale. As inference-time context, annotated thoughts improve next-message prediction from 21.6 to 30.6, a 41.7% relative gain (Jin et al., 19 May 2026).

Search behavior yields another form of inferred cognition. “Beyond the Click” builds a framework grounded in Information Foraging Theory with six cognitive labels—FollowingScent, ApproachingSource, DietEnrichment, PoorScent, LeavingPatch, and ForagingSuccess—assigned by a multi-agent LLM system composed of Analyst, Critic, and Judge (Zerhoudi et al., 27 Feb 2026). The system uses Claude 3.5 Sonnet as Analyst and GPT-4o as Critic and Judge. Human validation relies on 500 sessions and reports Krippendorff’s $\to$ 9; calibrated agent interactions reach 92.4% accuracy against gold labels. On AOL session-outcome forecasting using only the first 50% of events, a cognitive-enhanced model reaches Precision 1.00, Recall 0.82, F1 0.90, and AUC 0.92, compared with a behavioral baseline at F1 0.67 and AUC 0.43. On struggle-recovery prediction using the first 40% of sessions, the cognitive-enhanced model improves F1 from 0.67 to 0.78 and AUC from 0.77 to 0.83 (Zerhoudi et al., 27 Feb 2026).

Human-reasoning elicitation appears explicitly under the name TraceYourThinking in the HugAgent project, where it denotes the open-sourced semi-structured chatbot/interview pipeline used to collect “out-loud” reasoning traces, structured stance labels, reason weights, counterfactual belief updates, and Causal Belief Networks for open-ended topics in healthcare, surveillance, and zoning (Li et al., 16 Oct 2025). The human track retains 54 participants after quality control; the synthetic track contains 50 scripted synthetic agents. Human test–retest reliability after 14 days yields 83.10% belief-state inference accuracy, 88.22% belief dynamic update accuracy, and 0.62 belief dynamic update MAE. The benchmark’s main claim is an adaptation gap: strong LLMs partially infer static beliefs but struggle on individual belief dynamics, cross-domain transfer, and cross-person transfer (Li et al., 16 Oct 2025).

A related but distinct social-reasoning system is thought-tracing for theory-of-mind. It models a narrative as a trajectory $\to$ 0, maintains a weighted set of natural-language mental-state hypotheses, and uses an SMC-inspired loop of propagation, weight update, resampling, and rejuvenation (Kim et al., 17 Feb 2025). The paper uses $\to$ 1 hypotheses, an effective sample size threshold of 2, and a Jaccard similarity rejuvenation threshold of 0.25, reporting improvements across ParaphrasedToMi, BigToM, FANToM, and MMToM-QA. The broader implication is that tracing thoughts can target latent belief states even when no ground-truth verifier exists (Kim et al., 17 Feb 2025).

6. Faithfulness, causal influence, multilinguality, and security

A major controversy concerns whether visible reasoning traces are faithful causes of model behavior or merely post-hoc rationalizations. Controlled interventions increasingly support the stronger causal view. “Not Just the Destination, But the Journey” constructs matched $\to$ 2 triplets where the question $\to$ 3 and harmful answer $\to$ 4 are fixed, but the reasoning trace $\to$ 5 varies across Evil, Misleading, and Submissive forms (Wen et al., 12 Mar 2026). Across 0.6B–14B Qwen3 models and training paradigms QA, QTA, QT, and T-only, the paper finds that training on reasoning alone can alter later behavior: in no-think mode, Evil CoT QT reaches 61.3% EM versus a 21.5% baseline; Submissive CoT QT reaches 48.5%. Even T-only training yields 61.5% EM in think mode and 41.4% EM in no-think mode for Evil CoT. These findings are presented as evidence that reasoning carries an independent causal signal beyond answer supervision (Wen et al., 12 Mar 2026).

Thought Injection probes the same issue at inference time by inserting synthetic snippets into the model’s own <think> trace and measuring whether outputs change and whether models later disclose that influence (Hao et al., 21 Mar 2026). Across 45,000 samples from DeepSeek-R1, Qwen3-235B, and Qwen3-8B, baseline hit rates for stable expected elements are about 99%+, but injected hints drastically reduce them: for example, Qwen3-235B drops from about 99.8% baseline to about 8.1% under extreme hints and 7.1% under plausible hints. Yet disclosure rates remain low. For extreme hints, disclosure is 5.1% for DeepSeek-R1, 17.9% for Qwen3-235B, and 1.0% for Qwen3-8B, so overall non-disclosure exceeds 90%. Activation analysis on Qwen3-8B shows strongest correlation with the sycophantic direction, with maximum correlation about 0.56, exceeding the evil (0.44) and dishonest (0.41) directions (Hao et al., 21 Mar 2026).

Trace hiding at the interface does not necessarily prevent trace extraction. “Hidden Thoughts Are Not Secret” proposes Reasoning Exposure Prompting (REP), which uses shadow-model-generated demonstrations wrapped in code-like formats to elicit visible traces from victim models (Lu et al., 30 May 2026). With $\to$ 6 demonstrations, markdown fence wrappers achieve $\to$ 7, compared with 0.132 for a baseline that asks the model to repeat reasoning outside <think> and 0.118 for a simple “let’s think step by step” baseline. Distilling Qwen2.5-7B-Instruct from clean REP traces produced by Qwen3-14B improves MATH500 from 71.0 to 75.8, AIME24 from 8.9 to 14.4, AIME25 from 2.2 to 13.3, and LCB from 15.8 to 19.0. The paper summarizes the best REP configuration as reaching 96.7% of the oracle internal-trace reference on average across benchmarks (Lu et al., 30 May 2026).

Faithfulness also varies across language. The multilingual CoT evaluation studies performance, consistency, and faithfulness across MMMLU and MGSM, using explicit instruction and prompt hacking to control reasoning language (Zhao et al., 10 Oct 2025). Prompt hacking usually improves language compliance, often into the 0.8–0.9 range, but can reduce accuracy. Trace interchange reveals asymmetric quality: Chinese accuracy can drop to 0.40 when given Telugu traces, while Telugu can rise to 0.87 when given Chinese traces. Perturbation tests show that larger models are less dependent on surface traces, English often has lower matching ratios under error injection, and the influence of visible traces varies substantially across languages (Zhao et al., 10 Oct 2025).

These findings complicate a common misconception. The recent literature does not support the simple claim that reasoning traces are either fully faithful or entirely decorative. Instead, traces can be causally potent, partially manipulable, variably used across languages and scales, and later denied or reformulated by the model itself. This suggests that trace visibility, trace faithfulness, and trace honesty are distinct properties.

7. Limitations and emerging directions

The current TraceYourThinking literature is explicit about several limits. TRACE for reward hacking works best on reasoning tasks and mostly studies synthetic loopholes; its thresholding scheme can fail if the initial policy already hacks some examples (Wang et al., 1 Oct 2025). ReTrace currently targets single textual traces from DeepSeek-R1, depends on an external LLM for grouping and summarization, and was evaluated on static, pre-generated traces rather than live workflows (Felder et al., 14 Nov 2025). T3 studies only vanilla retrieve-then-generate RAG, uses a corpus that is heavily mathematical, and does not fully disentangle trace-source effects (Arabzadeh et al., 5 May 2026). HugAgent notes modest human sample size and possible demographic bias, while the synthetic track abstracts away human inconsistency and framing sensitivity (Li et al., 16 Oct 2025). ThoughtTrace, despite its scale, still shows that user thoughts are difficult for frontier models to infer from context (Jin et al., 19 May 2026).

Several future directions are already named within the papers. ReTrace proposes real-time visualization, better detection and collapsing of loops or redundant verification, and extensions from linear monologues to agentic systems with tools, parallel branches, and external context, likely requiring richer tree- or graph-based representations (Felder et al., 14 Nov 2025). The cognitive-trace framework for search suggests more human-like user simulators and new user-oriented evaluation dimensions such as frustration, struggle, and abandonment (Zerhoudi et al., 27 Feb 2026). The rater-reliability paper on inferred thinking traces proposes per-annotator trace reconstruction, future prompt optimization, and human-in-the-loop select-and-validate workflows (Zhang et al., 29 Oct 2025). A plausible implication is that TraceYourThinking is moving from passive inspection toward infrastructural roles in oversight, personalization, retrieval, interface design, and benchmark construction.

In that sense, TraceYourThinking marks a shift in what counts as evidence in reasoning research. Final answers remain indispensable, but the recent literature treats intermediate structure—episodes, branches, loops, critiques, self-reports, and latent belief updates—as a distinct object of modeling, intervention, and evaluation. The field’s open question is no longer merely whether a model can produce a correct endpoint, but which trace representations are stable, faithful, useful, controllable, and safe enough to support scientific understanding and deployment.