TRACT: Two-stage Regression-Aware Fine-tuning

Updated 7 November 2025
  • The paper demonstrates that combining CoT explanations with a regression-aware loss boosts numerical prediction accuracy, achieving average Pearson r of up to 0.675 on LLM-as-a-judge benchmarks.
  • The methodology uses a two-stage approach—first fine-tuning on annotated CoTs and then on self-generated ones—to resolve training-inference mismatches.
  • Practical implications include robust performance on tasks such as feedback grading and dialogue assessment, supporting scalable model deployment.

Two-stage Regression-Aware Fine-tuning with CoT Reasoning (TRACT) is a training paradigm that integrates Chain-of-Thought (CoT) reasoning supervision with regression-aware objectives, achieving robust and accurate numerical prediction in LLM evaluation tasks. The TRACT framework addresses limitations of standard cross-entropy fine-tuning, which neglects the numeric nature of tasks such as model-as-a-judge, and overcomes key distribution mismatches between training and inference phases.

1. Motivation and Problem Setting

The LLM-as-a-Judge paradigm requires models to output not just stepwise reasoning chains (CoT explanations) but also a numerical score according to specified rubrics. Standard fine-tuning employs cross-entropy (CE) loss to maximize the likelihood of ground-truth responses, including CoT explanations and scores, but this ignores the regression nature of the score prediction. Regression-aware fine-tuning (RAFT) corrects this by using squared error objectives, but does not leverage explicit reasoning via CoT supervision.

Crucially, neither CE with CoT nor regression-aware objectives alone are sufficient for high-fidelity automated evaluation in settings where both nuanced reasoning and precise numeric estimations are required (e.g., feedback grading, model comparison, multi-turn dialogue assessment).

2. TRACT Framework: Two-Stage Regression-Aware CoT Fine-Tuning

TRACT combines both CoT reasoning and regression-aware training, executed in two explicit stages to address distribution mismatches and maximize both reasoning fidelity and numeric accuracy:

  1. Stage 1 – Annotation CoT Fine-tuning:
    • Fine-tune the base LLM using high-quality annotated CoT explanations (typically from GPT-4 or expert annotators) paired with ground-truth scores.
    • Train with a combined loss: CE for CoT reasoning, RAFT squared error for score prediction.
  2. Stage 2 – Self-Generated CoT Fine-tuning:
    • Use the model from Stage 1 to generate new CoT explanations for each training input.
    • Pair these self-generated CoTs with the annotated scores to create a new training set.
    • Fine-tune a fresh copy of the base LLM on these self-generated pairs using the same mixed objective.

The two-stage approach ensures that models learn to produce and utilize their own reasoning traces at inference, thus matching the training and inference distributions and resolving the mismatch inherent in annotation-based training.
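
The control flow of this recipe can be summarized in a short sketch. The snippet below is illustrative only: the helpers `finetune` (a training loop over the CoT-RAFT objective) and `generate_cots` (a CoT decoding routine) are hypothetical placeholders, not code released with the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    x: str        # input context: instruction, rubric, response to judge
    cot: str      # CoT explanation paired with the score
    score: int    # ground-truth score in {1, ..., 5}

def tract_two_stage(
    base_model,                                                # pretrained LLM
    annotated_data: List[Example],                             # annotation CoTs + scores
    finetune: Callable[[object, List[Example]], object],       # hypothetical CoT-RAFT training loop
    generate_cots: Callable[[object, List[str]], List[str]],   # hypothetical CoT sampler
):
    # Stage 1: fine-tune the base model on annotation CoTs + scores
    # with the mixed CoT-RAFT objective.
    stage1_model = finetune(base_model, annotated_data)

    # Stage 2a: let the Stage 1 model write its own CoTs for the same inputs.
    self_cots = generate_cots(stage1_model, [ex.x for ex in annotated_data])

    # Stage 2b: pair self-generated CoTs with the original ground-truth scores.
    self_data = [Example(ex.x, cot, ex.score)
                 for ex, cot in zip(annotated_data, self_cots)]

    # Stage 2c: fine-tune a *fresh copy of the base model* (not the Stage 1
    # model) on the self-generated pairs, again with CoT-RAFT.
    return finetune(base_model, self_data)
```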

3. Mathematical Formulation and Training Objective

Let $x$ denote the input context (instruction, rubric), $s$ the CoT explanation (a token sequence), $y \in \mathcal{Y} = \{1, \ldots, 5\}$ the numeric score, and $p(\cdot \mid x)$ the LLM output distribution.

CoT-RAIL Inference

At test time, for each input $x$:

  1. Sample a CoT explanation: $\hat{s} \sim p(\cdot \mid x)$
  2. Predict the score as the expectation of $y$ under the model's next-token distribution over score tokens:

$$\hat{y}_{\text{CR}}(x) = \sum_{y \in \mathcal{Y}} p(\text{str}(y) \mid [x, \hat{s}]) \cdot y$$
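
A minimal sketch of this inference rule, assuming a Hugging Face-style causal LM and tokenizer (as used for the Mistral and Llama models discussed below); the prompt handling, sampling settings, and digit tokenization are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def cot_rail_score(model, tokenizer, x: str, scores=(1, 2, 3, 4, 5)) -> float:
    # 1. Sample a CoT explanation s_hat ~ p(. | x).
    inputs = tokenizer(x, return_tensors="pt")
    gen = model.generate(**inputs, do_sample=True, max_new_tokens=256)
    cot = tokenizer.decode(gen[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)

    # 2. Next-token distribution after [x, s_hat].
    ctx = tokenizer(x + cot, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ctx).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)

    # 3. Expectation over the score tokens "1".."5", as in the equation above
    #    (the score-token probabilities may optionally be renormalized).
    score_ids = [tokenizer.convert_tokens_to_ids(str(y)) for y in scores]
    return float(sum(probs[i] * y for i, y in zip(score_ids, scores)))

# Example usage (model names as in the paper's setup; loading shown for illustration):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# mdl = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# print(cot_rail_score(mdl, tok, "<instruction + rubric + response to judge>"))
```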

CoT-RAFT Training Objective

TRACT employs a weighted sum of regression-aware squared error loss and cross-entropy for CoT:

$$\ell^{\lambda}_{\text{CoT-RAFT}}(y^*, p_{\mathrm{t}}, p) = \lambda \left( \sum_{y \in \mathcal{Y}} p(\text{str}(y) \mid [x, \hat{s}]) \cdot y - y^* \right)^2 - \log p([\hat{s}, y^*] \mid x)$$

where $\lambda$ is the mixing coefficient, and $\hat{s} \sim p_{\mathrm{t}}(\cdot \mid x)$ is sampled from the annotation LLM during Stage 1 and from the Stage 1 model during Stage 2.
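
A minimal sketch of this objective against generic tensors; the argument layout (per-score probabilities and a separately computed sequence NLL) is an assumption for illustration, not the paper's training code.

```python
import torch

def cot_raft_loss(score_token_probs: torch.Tensor,  # shape (5,): p(str(y) | [x, s_hat]) for y = 1..5
                  seq_nll: torch.Tensor,            # scalar: -log p([s_hat, y*] | x)
                  y_star: float,                    # ground-truth score y*
                  lam: float = 1.0) -> torch.Tensor:
    scores = torch.arange(1, 6, dtype=score_token_probs.dtype)
    y_hat = (score_token_probs * scores).sum()      # regression-aware expected score
    return lam * (y_hat - y_star) ** 2 + seq_nll    # squared error + cross-entropy term

# Example with dummy values:
probs = torch.tensor([0.05, 0.10, 0.20, 0.40, 0.25])
loss = cot_raft_loss(probs, seq_nll=torch.tensor(3.2), y_star=4.0, lam=1.0)
```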

4. Empirical Results and Ablation Studies

TRACT was evaluated across four LLM-as-a-Judge benchmarks (Feedback Bench, FLASK, Vicuna Bench, MT Bench) and with both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct as base models.

  • Performance: TRACT achieves state-of-the-art average Pearson correlation (r = 0.650 with Mistral-7B, r = 0.675 with Llama-3.1-8B), surpassing Prometheus-2-7B (r = 0.591) and pure CE CoT training (r = 0.557) by substantial margins.
  • Robustness: TRACT is robust to the number of sampled CoT traces at inference, performing well even with single trace sampling.
  • Ablations:
    • Omitting Stage 2 (i.e., training only on annotation CoTs) results in a drop of 0.094 in Pearson r.
    • Replacing CoT-RAFT with pure CE loss reduces performance by 0.033.
    • Training on self-generated CoTs with CE loss alone is inferior to annotation-based training—self-generation is effective only when paired with regression-aware objectives.
    • Starting Stage 2 from the Stage 1 model rather than from the base model leads to catastrophic overfitting and performance collapse (r = 0.515).
  • Distribution Matching: TRACT resolves the distribution mismatch between training and inference, as evidenced by diagnostic experiments showing that models trained only on annotation CoTs degrade when scoring their own self-generated traces at inference.

Meta-evaluation via GPT-4 indicates self-generated CoTs are almost as high-quality as annotation CoTs (4.50 vs 4.78/5).

5. Comparative Analysis with Baselines

The following table summarizes the unique assets and performance of TRACT relative to key baselines:

Approach             | Uses CoT | Regression-aware loss | Self-generated CoT | Avg. Pearson r
Standard CE-no-CoT   | No       | No                    | n/a                | 0.488
Standard CE w/ CoT   | Yes      | No                    | n/a                | 0.557
RAFT, no CoT         | No       | Yes                   | n/a                | 0.623
Prometheus-2-7B      | Yes      | No                    | n/a                | 0.591
TRACT (Ours)         | Yes      | Yes                   | Yes                | 0.650

TRACT uniquely combines reasoning supervision and regression objectives in both annotation and self-generative stages, outperforming both token-level and regression-centric alternatives.

6. Practical Trade-Offs, Implementation, and Deployment

  • TRACT produces robust, distribution-matched models ready for deployment in evaluation tasks requiring rated or scored outputs with explicit reasoning justification.
  • The two-stage process is efficient and scalable; self-generated CoTs are high-quality and avoid catastrophic distributional shifts.
  • TRACT remains effective under inference-time compute constraints, allowing for minimal CoT sampling.
  • The mixed loss coefficient λ\lambda can be tuned; results show robustness over a wide range of values.

Unique implementation insights:

  • Re-initializing Stage 2 from the base model, rather than continuing from the Stage 1 model, is necessary to avoid latent overfitting to the annotation distribution.
  • Regression-aware loss is vital to preserve numeric accuracy, not just reasoning fluency.

7. Connections to Broader Regression-Aware and CoT Fine-Tuning Paradigms

TRACT is emblematic of advanced regression-aware and multi-phase CoT fine-tuning strategies, being:

  • Modular with respect to data source (annotation or self-generated),
  • Directly tied to the mechanisms uncovered in studies analyzing the stage-wise alignment of reasoning traces and internal representations (Yao et al., 7 Feb 2025),
  • A functional blueprint for scaling systematic generalization and robust reasoning into scoring and model-evaluation contexts.

8. Summary and Outlook

Two-stage Regression-Aware Fine-tuning with CoT (TRACT) provides an architecture-agnostic recipe for reasoning-intensive tasks where numeric regression and stepwise explanations are required. Its explicit CoT supervision, regression-aware score modeling, and self-distribution matching unlock both in-distribution and out-of-distribution generalization, as well as deployment scalability. TRACT sets a new standard for model-based evaluation, demonstrating the synergy of reasoning and numeric supervision for both research and applied domains.
