CoT-RAFT Fine-Tuning Objective

Updated 7 November 2025
  • The paper introduces a composite loss that integrates regression-aware calibration with chain-of-thought cross-entropy to address numeric prediction and reasoning alignment.
  • It employs a two-stage training process: initial fine-tuning with external CoTs followed by alignment using self-generated CoTs to mitigate distribution mismatch.
  • Empirical validation shows improved Pearson correlation over standard cross-entropy and regression-only baselines, with gains that are robust to the choice of weighting coefficient and to limited CoT sampling budgets.

The CoT-RAFT fine-tuning objective denotes a class of training paradigms that synthesize Chain-of-Thought (CoT) reasoning supervision with regression-aware or ranking-based learning signals, with particular relevance for LLM-as-a-judge and robust domain adaptation tasks. Originally motivated by the misalignment between standard cross-entropy (CE) loss and the requirements of numeric prediction or faithful stepwise reasoning, CoT-RAFT objectives combine CE supervision over rationales and scores with regression-calibrated loss terms and staged or contrastive training strategies, validated in empirical and ablation studies (Chiang et al., 6 Mar 2025).

1. Foundations of CoT-RAFT Objective

CoT-RAFT arises as a solution to two distinct but related gaps in LLM fine-tuning for judging, scoring, or complex reasoning tasks:

  • Standard Practice Limitation: Most supervised fine-tuning employs cross-entropy loss to maximize the likelihood of annotated outputs—typically a concatenated CoT rationale and its associated score (e.g., a 1–5 rating). This penalizes all categorical mispredictions equally, disregarding numerical structure (i.e., the cost of predicting 5 when 1 is correct equals that of predicting 2 when 1 is correct).
  • Numeric Calibration Deficiency: Since LLM outputs form distributions over string-encoded scores, ignoring inter-label distances produces poorly calibrated predictions.
  • Reasoning Faithfulness: Existing regression-aware training often omits CoT supervision, undermining the traceability and calibration of model judgments.

The CoT-RAFT objective directly incorporates both the regression-aware and reasoning-aware requirements into a composite loss function, with structured two-stage training to mitigate distributional mismatch between annotator and self-generated reasoning.

2. Mathematical Formulation

Let $x$ be the input, $s$ the CoT reasoning trace, $y \in \mathcal{Y} \subset \mathbb{R}$ the score, and $p(\cdot \mid x)$ the LLM output distribution.

Regression-Aware Loss (RAFT)

The regression-aware loss for numeric prediction utilizes the RAIL predictor:

$$\hat{y}_{\rm RAIL}(x) = \sum_{y \in \mathcal{Y}} p(\text{str}(y) \mid x) \cdot y$$

$$\ell_{\rm RAFT}(y^*, p) = \left(\hat{y}_{\rm RAIL}(x) - y^*\right)^2$$
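A minimal sketch of the RAIL predictor and the regression-aware loss, assuming the probabilities of the string-encoded scores have already been read off the model's output distribution (the tensor names, the 1–5 label set, and the batching convention are illustrative, not from the paper):

```python
import torch

def rail_predict(score_probs: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Regression-aware inference: expected score under p(str(y) | x).

    score_probs: (batch, |Y|) probabilities assigned to each string-encoded score.
    scores:      (|Y|,) numeric values of the label set Y, e.g. [1, ..., 5].
    """
    # Renormalize in case the score tokens do not absorb all of the probability mass.
    probs = score_probs / score_probs.sum(dim=-1, keepdim=True)
    return probs @ scores  # (batch,) expected scores y_hat_RAIL(x)

def raft_loss(score_probs: torch.Tensor, scores: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Squared error between the RAIL prediction and the gold score y*."""
    return ((rail_predict(score_probs, scores) - y_true) ** 2).mean()

# Illustrative usage on a 1-5 rating scale.
scores = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
score_probs = torch.tensor([[0.05, 0.10, 0.20, 0.40, 0.25]])
print(raft_loss(score_probs, scores, torch.tensor([4.0])))  # small squared error
```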

Chain-of-Thought Cross-Entropy Loss

CoT supervision relies on the categorical likelihood of the annotated rationale-plus-score:

$$\ell_{\rm CoT}(y^*, s^*, p) = -\log p([s^*, y^*] \mid x)$$
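In token-level terms, the CoT cross-entropy is the summed negative log-likelihood of the concatenated rationale-and-score sequence given the prompt. A sketch of that computation for a single example, assuming causal-LM logits over the full sequence [x, s*, y*] are available (the shift-and-mask convention shown is the standard one, not paper-specific code):

```python
import torch
import torch.nn.functional as F

def cot_ce_loss(logits: torch.Tensor, token_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """-log p([s*, y*] | x) for one example.

    logits:     (seq_len, vocab) next-token logits over the sequence [x, s*, y*].
    token_ids:  (seq_len,) token ids of the same sequence.
    prompt_len: number of tokens belonging to the input x (excluded from the loss).
    """
    # Position t predicts token t+1, so drop the last logit and the prompt targets.
    pred_logits = logits[prompt_len - 1:-1]   # predictions for the rationale + score tokens
    targets = token_ids[prompt_len:]          # the rationale + score tokens themselves
    # Summed (not mean) NLL corresponds to -log p([s*, y*] | x).
    return F.cross_entropy(pred_logits, targets, reduction="sum")
```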

Composite CoT-RAFT Loss

The core innovation combines both elements into a unified training objective:

$$\ell_{\rm CoT\text{-}RAFT}^{\lambda}(y^*, p_t, p) = \lambda \left( \sum_{y \in \mathcal{Y}} p(\text{str}(y) \mid [x, \hat{s}]) \cdot y - y^* \right)^2 - \log p([\hat{s}, y^*] \mid x), \qquad \hat{s} \sim p_t(\cdot \mid x)$$

where $\lambda$ is a weighting coefficient, and the conditioning on $[x, \hat{s}]$ requires explicit access to the reasoning trace at each training step.
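Combining the two terms gives the composite objective for a single training example. The sketch below reuses cot_ce_loss and the RAIL computation from the sketches above; score_token_probs is assumed to hold p(str(y) | [x, s_hat]) for a rationale s_hat sampled from the frozen generator p_t, and lam plays the role of λ (all names are illustrative):

```python
import torch

def cot_raft_loss(
    score_token_probs: torch.Tensor,  # (|Y|,) p(str(y) | [x, s_hat]) given the sampled CoT
    scores: torch.Tensor,             # (|Y|,) numeric label set Y
    y_true: float,                    # gold score y*
    logits: torch.Tensor,             # (seq_len, vocab) logits over [x, s_hat, y*]
    token_ids: torch.Tensor,          # (seq_len,) token ids of [x, s_hat, y*]
    prompt_len: int,                  # number of tokens in the input x
    lam: float = 1.0,                 # weighting coefficient lambda
) -> torch.Tensor:
    # Regression-aware term: squared error of the expected score given the sampled CoT.
    probs = score_token_probs / score_token_probs.sum()
    reg_term = (probs @ scores - y_true) ** 2
    # Reasoning term: negative log-likelihood of the sampled rationale followed by the gold score.
    ce_term = cot_ce_loss(logits, token_ids, prompt_len)
    return lam * reg_term + ce_term
```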

3. Two-Stage Training Dynamics

The CoT-RAFT framework employs a two-stage procedure:

  • Stage 1 (Annotator CoT Training): Initialize from the base LLM ($p_0$) and fine-tune on externally annotated CoTs (e.g., GPT-4-generated), optimizing the composite loss.
  • Stage 2 (Self-Generated CoT Alignment): Revert to $p_0$, generate CoTs on all $x$ with the Stage 1 model, constructing a dataset where the reasoning aligns with the model's own generation at inference. Fine-tune again with the CoT-RAFT loss, now conditioning on self-generated rationales.

This sequence is motivated by the observation that direct fine-tuning on external CoTs degrades generalization, whereas retraining on self-generated CoTs aligns training/inference distributions, sharply improving calibration (Chiang et al., 6 Mar 2025). Crucially, Stage 2 starts from the seed model, not the Stage 1 checkpoint; ablation studies show performance collapse otherwise.
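A high-level sketch of this schedule; finetune and generate_cots are placeholder helpers standing in for a standard SFT loop with the composite loss and for batched CoT decoding, respectively (they are not APIs from the paper). The two details it encodes are that self-CoTs are generated once, offline, and that Stage 2 restarts from the seed model rather than the Stage 1 checkpoint:

```python
def train_cot_raft(p0, dataset, annotator_cots, lam=1.0):
    """Two-stage CoT-RAFT training sketch (placeholder helpers, not paper code).

    p0:             the seed (base) LLM
    dataset:        list of (x, y_star) input / gold-score pairs
    annotator_cots: external rationales (e.g., GPT-4-generated), one per input x
    """
    # Stage 1: fine-tune the seed model on annotator CoTs with the composite loss.
    p1 = finetune(p0, dataset, cots=annotator_cots, loss="cot-raft", lam=lam)

    # Generate self-CoTs for every x with the Stage 1 model (done once, offline).
    self_cots = generate_cots(p1, [x for x, _ in dataset])

    # Stage 2: restart from p0 (not from p1) and fine-tune on self-generated CoTs,
    # so training-time rationales match those the model produces at inference.
    return finetune(p0, dataset, cots=self_cots, loss="cot-raft", lam=lam)
```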

4. Empirical Validation and Performance Analysis

In LLM-as-a-judge benchmarks, TRACT—implementing the CoT-RAFT objective—delivers state-of-the-art performance:

  • Pearson correlation: On Mistral-7B, TRACT achieves an average $r = 0.650$, surpassing standard CoT decoding (0.557), RAFT (0.557), and strong baselines (Prometheus-2-7B at 0.591).
  • Robustness: Gains persist under limited compute (few CoT samples) and across weightings ($\lambda$), evidencing stable optimization.
  • Ablations:
    • Self-CoT fine-tuning is critical: Training only on annotator CoTs reduces Pearson's $r$ by 0.094.
    • Removing the regression-aware loss decreases $r$ by 0.033.
    • Training on self-CoTs with only CE loss performs worse than external CoTs (0.521 vs. 0.557), highlighting the necessity of simultaneous regression-aware and CoT supervision.

Models generalized equally well to training and test CoTs when trained via the full two-stage process.

5. Relation to Alternative Objectives

CoT-RAFT formalizes the joint supervision of reasoning and numeric calibration in LLM scoring, distinguishing itself from:

  • Pure regression-aware fine-tuning (RAFT): Penalizes numeric error, but omits explanatory traces.
  • Vanilla CoT supervised fine-tuning: Lacks sensitivity to numeric distance in scoring.
  • Contrastive, multi-objective, or iterative reward-ranked RAFT extensions: May add further structure, but TRACT uniquely demonstrates large improvements for CoT+score tasks under task-relevant rubrics (Chiang et al., 6 Mar 2025).

In broader context, similar composite objectives are being explored for translation (Hu et al., 3 Oct 2024), multi-domain adaptation, and code repair.

6. Implementation Considerations

  • Conditioning on Reasoning: At inference, scoring is carried out by sampling a model-generated CoT and applying regression-aware inference (CoT-RAIL); a minimal sketch follows this list.
  • Choice of $\lambda$: The regression-aware and reasoning losses can be balanced; the paper finds performance stable for $\lambda$ near 1.
  • Data Distribution: Self-generated CoTs must be re-aligned each time the model architecture or training corpus shifts, as distribution mismatch degrades calibration.
  • Resource Requirements: TRACT has practical compute requirements, as self-CoT generation is conducted only once for the offline training corpus.
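A sketch of CoT-RAIL scoring at inference time under the same placeholder assumptions: sample one or a few rationales from the fine-tuned judge, read off the probabilities of the string-encoded scores given each rationale, and return the expected score (the two callables passed in are hypothetical hooks into the model, not paper code):

```python
import torch

def cot_rail_score(sample_cot_fn, score_probs_fn, x: str, scores: torch.Tensor, n_samples: int = 1) -> float:
    """CoT-RAIL inference sketch.

    sample_cot_fn(x)         -> a rationale s_hat sampled from the fine-tuned model
    score_probs_fn(x, s_hat) -> (|Y|,) tensor of p(str(y) | [x, s_hat])
    scores                   -> (|Y|,) numeric label set, e.g. [1, ..., 5]
    """
    preds = []
    for _ in range(n_samples):
        s_hat = sample_cot_fn(x)                 # sample a chain of thought
        probs = score_probs_fn(x, s_hat)
        probs = probs / probs.sum()              # renormalize over the score set
        preds.append(float(probs @ scores))      # expected score given this CoT
    return sum(preds) / len(preds)               # average over sampled rationales
```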

7. Summary Table: TRACT Loss Components

Component               | Loss Term                                     | Role
Regression-aware (RAFT) | Squared error on the score, conditioned on CoT | Numeric calibration
CoT Cross-Entropy       | Negative log-likelihood of rationale + score  | Reasoning alignment

Fine-tuning objective:

$$\ell_{\rm CoT\text{-}RAFT}^{\lambda}(y^*, p_t, p) = \lambda \left( \sum_{y \in \mathcal{Y}} p(\text{str}(y) \mid [x, \hat{s}]) \cdot y - y^* \right)^2 - \log p([\hat{s}, y^*] \mid x)$$

8. Concluding Perspective

The CoT-RAFT fine-tuning objective, as implemented in TRACT (Chiang et al., 6 Mar 2025), integrates regression-aware supervision and stepwise reasoning in a principled, empirically validated framework for LLM-based automated judgment. The two-stage process, balancing numeric calibration and reasoning faithfulness, reduces miscalibration, mitigates distributional mismatch between annotator and self-generated reasoning, and achieves superior performance across scoring and explanation metrics. Ablations confirm each component's necessity; TRACT stands as a reference method for applications demanding interpretable, accurately scored model judgments.
