ThinkPO: Optimizing LLM Reasoning

Updated 14 March 2026

ThinkPO is a framework that optimizes LLM reasoning by internalizing preferences to generate more comprehensive and effective chain-of-thought outputs.
It leverages direct preference optimization, heuristic-guided hierarchical planning, and latent thought policy methods to improve reasoning depth and adaptability.
Empirical results show ThinkPO-style methods boost accuracy by up to 15.4 percentage points (Plan of Thoughts on Game of 24) and increase chain-of-thought length substantially, improving performance on diverse benchmarks.

ThinkPO (Thinking Preference Optimization) encompasses a class of methodologies designed to enhance the reasoning capabilities of LLMs, with particular emphasis on optimizing multi-step chain-of-thought (CoT) reasoning. The term ThinkPO is associated with multiple lines of research addressing reasoning at various phases—training time, post-supervised fine-tuning, and test time—through direct preference optimization, hierarchical planning in partially observable Markov decision processes, and test-time policy evolution in both parameter and latent spaces. Collectively, these approaches aim to internalize preferences for richer, more effective reasoning, overcoming performance plateaus and increasing robustness without incurring significant data or computational overhead (Yang et al., 17 Feb 2025, Liu, 2024, Ye et al., 5 Oct 2025, Jiao et al., 28 Jan 2026).

1. Foundations and Motivation

The development of ThinkPO is motivated by two central barriers in LLM reasoning: (i) limited access to high-quality long CoT trajectories due to data annotation cost, and (ii) the inability of static or "frozen" policies to adapt at inference time, which stymies deep reasoning and leads to brittle performance on challenging, out-of-distribution tasks. Repeated supervised fine-tuning (SFT) with finite datasets quickly reaches a plateau in both reasoning accuracy and CoT length. ThinkPO frameworks systematically overcome these limitations by either post-hoc preference optimization between long and short CoTs (training phase) or by enabling instance-specific policy adaptation (test-time), often through principled online optimization protocols.

2. Direct Preference Optimization for Reasoning (DPO)

The canonical ThinkPO method introduced by Yang et al. applies direct preference optimization to extend the benefits of SFT without requiring new annotations (Yang et al., 17 Feb 2025). Let $\theta$ be the model parameters. The method leverages a dataset of "chosen" (long) CoT traces $D_\mathrm{sft}=\{(q_i, o^{\mathrm{long}}_i)\}$ and generates "rejected" (short) counterparts $o^{\mathrm{short}}_i$ using a smaller, less capable model. The objective is to drive the model to prefer the long, more comprehensive CoT outputs via the DPO loss:

$\mathcal{L}_\mathrm{DPO}(\theta) = -\mathbb{E}_{(q, o^\mathrm{long}, o^\mathrm{short})\in D_\mathrm{dpo}}[\log \sigma(\beta\, \Delta_\theta(q, o^\mathrm{long}, o^\mathrm{short}))]$

where $\Delta_\theta$ is the log-probability gap between long and short outputs and $\sigma$ is the logistic function.
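To make the loss concrete, here is a minimal sketch in plain Python. It assumes the standard DPO form, in which the gap $\Delta_\theta$ is measured relative to a frozen reference model (typically the SFT checkpoint); the text above does not spell this out, so treat the reference terms as an assumption. The function names and the default $\beta = 0.1$ are illustrative, and each input is taken to be the summed token log-probability of an output given the question.

```python
import math

def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def logprob_gap(policy_long, policy_short, ref_long, ref_short):
    # Assumed form of Delta: policy log-ratio minus reference log-ratio.
    return (policy_long - ref_long) - (policy_short - ref_short)

def dpo_loss(pairs, beta=0.1):
    """Mean of -log(sigmoid(beta * Delta)) over preference pairs.

    Each pair is (policy_long, policy_short, ref_long, ref_short).
    """
    total = 0.0
    for policy_long, policy_short, ref_long, ref_short in pairs:
        delta = logprob_gap(policy_long, policy_short, ref_long, ref_short)
        # -log(sigmoid(z)) equals softplus(-z).
        total += _softplus(-beta * delta)
    return total / len(pairs)
```

When the policy and the reference agree, the gap is zero and the loss equals $\log 2$; the loss falls as the policy shifts probability mass toward the long trace.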

Empirically, ThinkPO delivers consistent performance gains: on MATH500, SFT alone yields 82.8% accuracy on Qwen-2.5-7B, while ThinkPO lifts this to 83.4% (a 0.7% relative gain, or 0.6 points absolute) and increases average CoT length by 35%. On the highest-difficulty benchmark (AIME2024), ThinkPO achieves a 33.5% relative gain over SFT, and it also boosts public distilled models (e.g., DeepSeek-R1-Distill-Qwen-7B from 87.4% to 91.2% on MATH500) (Yang et al., 17 Feb 2025). The method also generalizes across model scales (3B, 7B, 14B), consistently improving reasoning depth.
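The MATH500 deltas quoted for ThinkPO are consistent with relative rather than absolute improvements. The short check below recomputes both from the reported accuracies; the helper names are illustrative.

```python
def absolute_gain(baseline, improved):
    # Difference in percentage points.
    return improved - baseline

def relative_gain(baseline, improved):
    # Improvement as a percentage of the baseline accuracy.
    return 100.0 * (improved - baseline) / baseline

reported = [
    ("Qwen-2.5-7B", 82.8, 83.4),
    ("DeepSeek-R1-Distill-Qwen-7B", 87.4, 91.2),
]
for name, base, new in reported:
    print(f"{name}: +{absolute_gain(base, new):.1f} points absolute, "
          f"+{relative_gain(base, new):.1f}% relative")
```

The relative figures come out at 0.7% and 4.3%, matching the reported deltas, while the absolute differences are 0.6 and 3.8 points.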

3. Hierarchical Planning: PoT/Plan of Thoughts

"Plan of Thoughts" (abbreviated as PoT or ThinkPO in some literature) recasts reasoning as solving a partially observable Markov decision process (POMDP), integrating LLM self-reflection with heuristic-guided planning (Liu, 2024). The key elements are:

State space: Pairs $(s_\mathrm{sub}, u)$, where $s_\mathrm{sub}$ is the partial solution and $u$ is an unobservable proxy for solution value.
Actions: continue, rollback, and think, permitting forward progress, backtracking, or generation of new thoughts.
Observations: LM-generated value judgments (e.g., "sure", "likely", "impossible") used as admissible heuristics.
Planning: POMCP (Partially Observable Monte Carlo Planning) builds an action-observation tree, propagating Q-values with UCT selection and LLM rollouts as evaluators.
Anytime property: The approach yields incrementally improving solutions with additional computation.
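The selection and backup steps of such a planner can be sketched as follows. This is a simplified, single-node illustration of UCT over the three actions, not the full POMCP tree described by Liu (2024). The numeric mapping from verbal judgments to values and the exploration constant are hypothetical choices made for the example.

```python
import math

ACTIONS = ("continue", "rollback", "think")

# Hypothetical mapping from LM value judgments to numeric heuristic values.
JUDGMENT_VALUE = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}

def uct_select(q_values, visit_counts, exploration=1.4):
    """Return the action maximizing Q + c * sqrt(ln N / n).

    Unvisited actions are tried first, in the order given by ACTIONS.
    """
    total = sum(visit_counts[a] for a in ACTIONS)
    best_action, best_score = None, -math.inf
    for action in ACTIONS:
        n = visit_counts[action]
        if n == 0:
            return action
        score = q_values[action] + exploration * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

def backup(q_values, visit_counts, action, judgment):
    """Fold one rollout's verbal judgment into the running mean Q-value."""
    visit_counts[action] += 1
    value = JUDGMENT_VALUE[judgment]
    q_values[action] += (value - q_values[action]) / visit_counts[action]
```

Because each backup only refines running averages, the loop can be stopped after any number of simulations and still return the currently best action, which is one way to read the anytime property.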

On the "Game of 24" benchmark, PoT achieves 89.4% success, significantly outperforming alternative multi-step planning methods such as Tree of Thoughts (74% with breadth=5) and exceeding zero-shot CoT by more than 20-fold. The algorithm is data-efficient, robust to interruptions, and provides clear epistemic primitives for introspection and search (Liu, 2024).

4. Test-Time Policy Evolution

The Policy of Thoughts framework (Jiao et al., 28 Jan 2026), also abbreviated PoT but distinct from the Plan of Thoughts method in Section 3, advances ThinkPO by transforming test-time reasoning into a per-instance online policy optimization process. This is motivated by the "frozen policy" instability: static policies cannot recover from early errors in long-horizon reasoning tasks. Drawing from Popper's cycle (conjectures and refutations), PoT alternates between:

Exploratory Conjecture: Monte Carlo Tree Search (PUCT selection) generates candidate solutions; each node is a partial reasoning state.
Policy Internalization: Execution feedback (e.g., unit tests) scores each trajectory, and Group Relative Policy Optimization (GRPO) updates a transient LoRA adapter, enabling dynamic, instance-specific refinement of reasoning priors.
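The scoring half of the internalization step can be illustrated with a small sketch of group-relative advantages, the quantity GRPO uses in place of a learned value baseline. It assumes rewards are unit-test pass rates and that normalization uses the population standard deviation of the group; both choices, and the function names, are assumptions for this example rather than details confirmed by the source.

```python
import statistics

def pass_rate(test_results):
    # Fraction of unit tests passed by one candidate trajectory.
    return sum(1 for ok in test_results if ok) / len(test_results)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four candidate programs for the same problem, three unit tests each.
group = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [True, True, False],
]
rewards = [pass_rate(results) for results in group]
advantages = group_relative_advantages(rewards)
```

Trajectories that beat the group average receive positive advantages and are reinforced by the adapter update; if every candidate scores the same, all advantages are zero and the group provides no learning signal.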

The complete process is formalized as an MDP. LoRA adapters inject rank-$r$ updates to the backbone weights, parameterized separately per instance. Experimental results on code reasoning benchmarks (HumanEval, MBPP, LiveCodeBench, ICPC) demonstrate PoT achieving 58.98% accuracy (4B LLM), exceeding commercial closed-source models such as GPT-4o and providing 22%+ absolute gains over search-only baselines. PoT's primary driver is the internalization step (LoRA+GRPO), highlighting the value of closed-loop adaptation at inference (Jiao et al., 28 Jan 2026).
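For readers unfamiliar with LoRA, the rank-$r$ update can be written as $W + (\alpha / r)\, B A$, with $B$ of shape $d_\mathrm{out} \times r$ and $A$ of shape $r \times d_\mathrm{in}$. The sketch below shows this with plain Python lists. It follows the common LoRA convention of scaling by $\alpha / r$ and initializing $B$ to zero, which is an assumption here since the source does not give the adapter configuration.

```python
def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

def fresh_adapter(d_out, d_in, rank):
    """Hypothetical per-instance reset: B is zero, so the backbone is unchanged."""
    b = [[0.0] * rank for _ in range(d_out)]
    a = [[0.01] * d_in for _ in range(rank)]
    return b, a
```

Creating a fresh adapter for each problem and discarding it afterwards mirrors the transient, per-instance parameterization described above; the backbone matrix itself is never modified.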

5. Latent Thought Policy Optimization

Recent directions in ThinkPO leverage latent reasoning spaces instead of explicit token-based CoTs (Ye et al., 5 Oct 2025). In LTPO, the frozen LLM's intermediate "thought" vectors are introduced as special tokens with tunable hidden-state embeddings. At test time, these latent vectors are optimized per instance via policy-gradient (REINFORCE):

State: Current latent thought vectors.
Policy: Gaussian over vector space, enabling perturbations.
Reward: Model's own softmax token prediction confidence (mean negative log-likelihood over top-$k$ outputs).
Update: Ascend in the latent thought-vector space to maximize intrinsic confidence, keeping LLM weights fixed.
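The four components above can be sketched as a single REINFORCE step on a toy vector. This is an illustration under stated assumptions, not the LTPO implementation: `reward_fn` is a hypothetical stand-in for the frozen model's confidence score, the policy is an isotropic Gaussian centered on the current vector, and the mean reward of the sampled batch is used as a baseline to reduce variance.

```python
import random

def ltpo_step(thought, reward_fn, sigma=0.1, lr=0.05, num_samples=8, rng=random):
    """One REINFORCE update of a latent vector; model weights are not involved."""
    samples = []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, sigma) for _ in thought]
        candidate = [h + e for h, e in zip(thought, noise)]
        samples.append((noise, reward_fn(candidate)))

    baseline = sum(reward for _, reward in samples) / num_samples
    grad = [0.0] * len(thought)
    for noise, reward in samples:
        for i, e in enumerate(noise):
            # Gradient of log N(candidate; thought, sigma^2 I) w.r.t. thought
            # is noise / sigma^2; weight it by the baseline-corrected reward.
            grad[i] += (reward - baseline) * e / (sigma ** 2 * num_samples)

    # Gradient ascent on expected reward.
    return [h + lr * g for h, g in zip(thought, grad)]

def toy_confidence(vector):
    # Hypothetical reward: highest when every coordinate equals 1.0.
    return -sum((x - 1.0) ** 2 for x in vector)

rng = random.Random(0)
thought = [0.0, 0.0]
for _ in range(100):
    thought = ltpo_step(thought, toy_confidence, rng=rng)
print(toy_confidence([0.0, 0.0]), toy_confidence(thought))
```

Each step needs only reward evaluations at perturbed vectors, which corresponds to forward passes in the real setting. If the reward is identical for all samples, the baseline-corrected signal is zero and the vector is left unchanged.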

LTPO achieves significant robustness: on challenging benchmarks where static latent reasoning (e.g., SoftCoT) collapses (0% on AIME), LTPO maintains 13–17% accuracy and consistently outperforms both zero-shot CoT and latent baselines on MATH-500, GSM8K, and ASDiv-Aug (Ye et al., 5 Oct 2025). Key efficiency advantages stem from the absence of autoregressive decode in the inner loop, as only forward passes on fixed input length are needed.

6. Analysis, Limitations, and Future Directions

ThinkPO methods collectively overcome major LLM reasoning bottlenecks, but several caveats remain:

Hyperparameter Sensitivity: DPO-based ThinkPO requires careful margin and learning rate tuning; aggressive settings can destabilize training (Yang et al., 17 Feb 2025).
Proxy Bias: Using output length as a proxy for reasoning depth may overemphasize verbosity unless combined with structural or entailment-based preference models.
Reward Alignment: Latent confidence-based rewards in LTPO risk convergence to confidently incorrect solutions, underscoring the need for hybrid objectives leveraging external constraints or weak supervision (Ye et al., 5 Oct 2025).
Limited Generality: PoT and GRPO methods rely on executable feedback (unit tests, verification signals); performance on open-ended or reward-model-based tasks is less pronounced (Jiao et al., 28 Jan 2026).
Transience vs. Persistence: Most test-time adaptations (LoRA, latent vector) are reset between instances; mechanisms for persistent or transferable adaptations remain largely unexplored.

Research horizons include step-wise or hierarchical preference optimization, integration of human-in-the-loop or entailment-based scores, multimodal task extension, and persistent adapter or trajectory evolution mechanisms.

7. Summary of Benchmark Results

Approach / Model	Key Task	SFT / Baseline Acc.	+ThinkPO Acc.	Δ Acc. (relative % or absolute points)	Output Length Δ
Qwen-2.5-7B (+SFT)	MATH500	82.8%	83.4%	+0.7% (relative)	+35.0%
DeepSeek-R1-Distill-Qwen-7B	MATH500	87.4%	91.2%	+4.3% (relative)	+17.2%
Qwen3-4B, PoT	LiveCodeBench v6	37.14% (search-only)	49.71%	+12.57 points	N/A
Plan of Thoughts	Game of 24	74% (ToT breadth=5)	89.4%	+15.4 points	N/A
LTPO (Latent)	AIME2024	10.0% (CoT)	16.7%	+6.7 points	N/A

These results demonstrate that ThinkPO methodologies, spanning preference optimization, planning-based reasoning, and test-time adaptation, deliver substantial, architecture-agnostic improvements across diverse reasoning benchmarks.

References:

(Yang et al., 17 Feb 2025): "Thinking Preference Optimization"
(Liu, 2024): "Plan of Thoughts: Heuristic-Guided Problem Solving with Large Language Models"
(Ye et al., 5 Oct 2025): "Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization"
(Jiao et al., 28 Jan 2026): "Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution"
