
First-thought Prefix in LLMs

Updated 20 November 2025
  • First-thought Prefix is a phenomenon in LLMs where the initial tokens provide a strong unsupervised signal that guides accurate reasoning.
  • UPFT fine-tunes on these early tokens with a standard MLE (next-token) loss, dramatically reducing sampling and tuning costs.
  • Empirical studies show that this approach improves reasoning accuracy by 2–4 points on benchmarks while using far fewer computational resources.

The term "First-thought Prefix" encapsulates a phenomenon and method in contemporary neural reasoning models, particularly LLMs, where the initial segment (“prefix”) of a model’s generated solution trajectory carries a disproportionately strong unsupervised signal for correct reasoning. This principle, formalized as "Prefix Self-Consistency," is operationalized in Unsupervised Prefix Fine-Tuning (UPFT), a framework that exploits high agreement in these early tokens to attain sample-efficient, resource-minimal improvements in LLM reasoning capability. The approach stands in contrast to conventional supervised fine-tuning and other prefix-conditioning paradigms by enabling instruction-following and robust reasoning advances with minimal or no human-verified targets (Ji et al., 4 Mar 2025).

1. Prefixes and Prefix Self-Consistency

Let $x$ denote a task or problem instance (e.g., an arithmetic query). For a model parameterized by $\theta$, a chain-of-thought (CoT) trajectory $r = (r_1, r_2, \ldots, r_T)$ is produced by sampling from $p_\theta(r \mid x)$. The $k$-prefix $r_{<k}$ is the sub-sequence $(r_1, \ldots, r_k)$. Given a set of $M$ trajectories $S_x = \{ r^{(i)} \}_{i=1}^{M}$, the observed $k$-prefixes are $P_k(x) = \{ r_{<k}^{(i)} : i = 1, \ldots, M \}$.

Prefix Self-Consistency denotes that, for small $k$ (8–16 tokens in practice on mathematical reasoning tasks), a single $k$-prefix typically dominates in coverage among $P_k(x)$. Coverage is defined as $\text{cov}_x(p) = |\{ i : r_{<k}^{(i)} = p \}| / M$ for a prefix $p$. Empirically, the most frequent $k$-prefix covers a significant proportion (often tens of trajectories out of a thousand samples), far more than random chance. Notably, the continuation accuracy from shared prefixes is as high as that from correct trajectories, indicating the early tokens encode essential structural reasoning information regardless of final correctness. This self-consistency enables accurate unsupervised gradient signals for fine-tuning with minimal risk of bias towards erroneous continuation patterns (Ji et al., 4 Mar 2025).
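
As a concrete illustration, the dominant prefix and its coverage can be computed directly from sampled trajectories. The sketch below is not from the cited paper's code; it assumes trajectories are given as lists of token IDs.

```python
from collections import Counter

def dominant_prefix_coverage(trajectories, k):
    """Return the most frequent k-prefix among sampled trajectories
    (lists of token IDs) together with its coverage cov_x(p)."""
    prefixes = [tuple(r[:k]) for r in trajectories]
    prefix, count = Counter(prefixes).most_common(1)[0]
    return prefix, count / len(trajectories)
```

For example, with $M = 1000$ sampled trajectories and $k = 8$, a returned coverage of $0.05$ means fifty trajectories share the single most common 8-token opening.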

2. UPFT Training Objective and Data Partitioning

UPFT starts from a variational lower bound on $\log p_\theta(y \mid x)$, decomposing the reasoning trace $r$ into a prefix $r_{<k}$ and a suffix $r_{\geq k}$. The objective separates prefix coverage (the model probability of a prefix) from prefix accuracy (the expected answer likelihood conditioned on that prefix). UPFT sidesteps the need for labeled targets (answers $y$) and instead fine-tunes exclusively on sampled prefixes:

  • Split an unlabeled corpus $D$ into $D_p$ (prefix set) and $D_f$ (full-trace set) according to a structure ratio $p$ (typically $0.1$, the fraction assigned to $D_f$).
  • On $D_p$, sample one $k$-prefix per $x$ and apply an MLE loss over this segment:

$$\mathcal{L}_p(\theta) = -\,\mathbb{E}_{x \in D_p,\ r_{<k} \sim p_\theta} \left[ \log p_\theta(r_{<k} \mid x) \right]$$

  • On $D_f$, sample a full $r$ and apply full-sequence MLE:

$$\mathcal{L}_f(\theta) = -\,\mathbb{E}_{x \in D_f,\ r \sim p_\theta} \left[ \log p_\theta(r \mid x) \right]$$

  • The total UPFT loss is $\mathcal{L}(\theta) = \mathcal{L}_p(\theta) + \mathcal{L}_f(\theta)$; a minimal sketch of this objective follows below.
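
In practice the combined objective reduces to standard next-token MLE on truncated or full sampled traces. The PyTorch-style sketch below assumes a Hugging Face-style causal LM whose forward pass returns the mean cross-entropy when given `labels` (positions set to -100 are ignored); the function name `upft_loss` is illustrative, not the authors' code.

```python
import torch

def upft_loss(model, prompt_ids, trace_ids, is_prefix_example, k):
    """MLE over either the first k sampled trace tokens (prefix set D_p)
    or the whole sampled trace (full-trace set D_f), masking the prompt."""
    target = trace_ids[:, :k] if is_prefix_example else trace_ids
    input_ids = torch.cat([prompt_ids, target], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # no loss on the prompt tokens
    return model(input_ids=input_ids, labels=labels).loss
```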

By construction, no answer labels or rejection filtering is involved, markedly reducing the data and computational requirements relative to best-of-N rejection sampling or full-token supervised fine-tuning (Ji et al., 4 Mar 2025).

3. Training Workflow and Computational Efficiency

UPFT implementation is algorithmically straightforward. Beginning with a pre-trained model and a tokenizer, the corpus is partitioned by the structure ratio $p$, and traces sampled for the prefix set are truncated to the desired prefix length $k$. Prefix fine-tuning proceeds via next-token prediction on sampled $k$-prefixes, while a small fraction of full-trace learning maintains global output structure. Standard ML infrastructure (learning rate, batch size, gradient accumulation, epochs) applies.
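
A sketch of the sampling step, assuming a Hugging Face-style `model.generate` and pairing with the `upft_loss` sketch above (`sample_upft_examples` is an illustrative name, not from the paper):

```python
import random

def sample_upft_examples(model, tokenizer, problems, k=8, p=0.1, max_new=512):
    """Assign each unlabeled problem to D_f with probability p (sample a
    full trace) or to D_p otherwise (sample only the first k tokens,
    which is where UPFT's sampling savings come from)."""
    examples = []
    for x in problems:
        prompt_ids = tokenizer(x, return_tensors="pt").input_ids
        full = random.random() < p
        out = model.generate(prompt_ids, do_sample=True,
                             max_new_tokens=max_new if full else k)
        trace_ids = out[:, prompt_ids.shape[1]:]   # strip the prompt
        examples.append((prompt_ids, trace_ids, not full))  # not full -> prefix example
    return examples
```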

Empirical sampling cost, measured on the PRM-12K reasoning dataset with Llama-8B, illustrates UPFT’s efficiency:

Method                     #Sampling Tokens   #Tuning Tokens   Time Reduction
Rejection Sampling (RFT)   36.9M              2.3M             Baseline
UPFT                       0.2M               0.2M             4–16× faster

This corresponds to an approximately 99% reduction in sampling cost ($0.2\text{M}/36.9\text{M} \approx 0.5\%$ of the RFT budget) and a roughly 90% reduction in tuning tokens ($0.2\text{M}/2.3\text{M} \approx 9\%$), while eliminating the need for any answer filtering or reward models (Ji et al., 4 Mar 2025).

4. Empirical Performance and Comparative Analysis

Across four reasoning benchmarks (GSM8K, MATH500, AIME2024, GPQA), UPFT yields average accuracy improvements of 2–4 points over vanilla supervised fine-tuning in unsupervised settings. The effect is accentuated on higher-difficulty datasets: for example, AIME2024 accuracy for Qwen-Math-7B rises from 6.7% to 20.0%.

Tuning sequence lengths decrease by 80–95%, yielding substantially faster iterations. When compared to standard best-of-N supervised pipelines (RFT, V-STaR), UPFT matches or closely approaches their performance even when gold answers are available, while using 50–100× fewer sampling tokens and 3–20× fewer tuning tokens. This confirms that early tokens—first-thought prefixes—capture the core of model reasoning needed for high-quality performance increments (Ji et al., 4 Mar 2025).

5. Ablation Studies and Analysis

Several critical ablations illuminate UPFT’s robustness and hyperparameter sensitivity:

  • Prefix length $k$: Llama-8B peaks at $k = 8$, degrading beyond $k = 16$. Qwen-Math is robust for $k \in [8, 32]$, while long-context models like DeepSeek perform best near $k = 128$. The trade-off: small $k$ yields high prefix coverage but lower suffix accuracy; large $k$ increases target suffix accuracy but reduces prefix diversity and coverage for training.
  • Structure ratio $p$: best results are found around $p = 0.1$; too small undermines output structure, too large reduces UPFT to full-sequence fine-tuning (losing efficiency and the prefix advantage).
  • Localization of error: divergence between correct and incorrect trajectories appears only at positions $t \geq 32$, confirming that prefix-based training does not bias the model towards "incorrect" completions (a simple way to measure this is sketched after this list).
  • Structural preservation: a small amount of full-trace fine-tuning ($p > 0$) suffices to maintain answer formatting and generalization, preventing collapse to truncated outputs (Ji et al., 4 Mar 2025).
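
The error-localization measurement can be approximated by finding the first position at which a correct and an incorrect sampled trajectory diverge; under Prefix Self-Consistency this index should typically exceed the prefix length used for training. The helper below is illustrative, not the paper's analysis code.

```python
def first_divergence(traj_a, traj_b):
    """Index of the first token where two trajectories (lists of token
    IDs) differ; None if one is a prefix of the other."""
    for t, (a, b) in enumerate(zip(traj_a, traj_b)):
        if a != b:
            return t
    return None
```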

6. Implementation Guidance

For effective UPFT deployment:

  • Prefix length selection: the optimal value depends on model class:
    • Generalist 7–8B models: $k \approx 8$
    • Math specialists: $k = 16$–$32$
    • Long-context models: $k \approx 128$
    • Perform a small grid search on a validation subset (a sketch follows this list).
  • Structure ratio: $p \approx 0.1$; on very small datasets, consider up to $0.3$.
  • Single-sample training: Use a single kk-prefix or full-trace sample per instance per epoch, with no labels or filtered sampling.
  • Hyperparameters: Maintain standard fine-tuning values for learning rate, batch size, warmup, and epochs.
  • Integration: UPFT can be introduced as a pre-fine-tuning “prefix stage” after pre-training or instruction tuning, requiring minimal code adaptation and yielding near-trivial sampling overhead (∼1% of best-of-N setups) (Ji et al., 4 Mar 2025).
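
A minimal shape for that grid search, with `finetune` and `evaluate` as hypothetical callables supplied by the surrounding training code (`finetune(k)` returns a briefly tuned model, `evaluate(model)` returns validation accuracy):

```python
def select_prefix_length(finetune, evaluate, candidates=(8, 16, 32, 128)):
    """Pick the prefix length k whose briefly tuned model scores highest
    on a held-out validation subset."""
    return max(candidates, key=lambda k: evaluate(finetune(k)))
```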

7. Theoretical and Practical Implications

The First-thought Prefix principle, Prefix Self-Consistency, demonstrates that the earliest tokens in a model's reasoning chain are highly informative and shared across plausible solution trajectories. The UPFT regime establishes that this unsupervised signal provides nearly all the benefit obtainable with supervised, sample-intensive pipelines, dramatically reducing the barrier to entry for effective LLM adaptation. Possible extensions include application to other domains where early generative steps possess strong consistency; generalization to tasks beyond chain-of-thought mathematical reasoning appears plausible but is not directly tested in the cited study (Ji et al., 4 Mar 2025).

These findings reframe reasoning model adaptation, highlighting the utility of unsupervised, prefix-based objectives and positioning First-thought Prefix methods as a critical technique for scalable, resource-adaptive LLM deployment.

References

  • Ji et al. (4 Mar 2025). The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models. arXiv:2503.02875.