TRACT: Two-stage Regression-Aware Fine-tuning with CoT
- The paper demonstrates that combining CoT explanations with a regression-aware loss boosts numerical prediction accuracy, achieving average Pearson correlations of 0.65 (Mistral-7B) and 0.675 (Llama-3.1-8B).
- The methodology uses a two-stage approach—first fine-tuning on annotated CoTs and then on self-generated ones—to resolve training-inference mismatches.
- The practical implications include robust performance on tasks such as feedback grading and dialogue assessment, supporting scalable deployment of LLM-as-a-judge models.
Two-stage Regression-Aware Fine-tuning with CoT Reasoning (TRACT) is a training paradigm that integrates Chain-of-Thought (CoT) reasoning supervision with regression-aware objectives, achieving robust and accurate numerical prediction in LLM evaluation tasks. The TRACT framework addresses limitations of standard cross-entropy fine-tuning, which neglects the numeric nature of tasks such as model-as-a-judge, and overcomes key distribution mismatches between training and inference phases.
1. Motivation and Problem Setting
The LLM-as-a-Judge paradigm requires models to output not just stepwise reasoning chains (CoT explanations) but also a numerical score according to specified rubrics. Standard fine-tuning employs cross-entropy (CE) loss to maximize the likelihood of ground-truth responses, including CoT explanations and scores, but this ignores the regression nature of score prediction: predicting 4 when the gold score is 5 is penalized the same as predicting 1. Regression-aware fine-tuning (RAFT) corrects this with a squared-error objective, but does not leverage explicit reasoning via CoT supervision.
Crucially, neither CE with CoT nor a regression-aware objective alone is sufficient for high-fidelity automated evaluation in settings that require both nuanced reasoning and precise numeric estimates (e.g., feedback grading, model comparison, multi-turn dialogue assessment).
2. TRACT Framework: Two-Stage Regression-Aware CoT Fine-Tuning
TRACT combines both CoT reasoning and regression-aware training, executed in two explicit stages to address distribution mismatches and maximize both reasoning fidelity and numeric accuracy:
- Stage 1 – Annotation CoT Fine-tuning:
  - Fine-tune the base LLM on high-quality annotated CoT explanations (typically from GPT-4 or expert annotators) paired with ground-truth scores.
  - Train with a combined loss: CE for the CoT reasoning, RAFT squared error for the score prediction.
- Stage 2 – Self-Generated CoT Fine-tuning:
  - Use the model from Stage 1 to generate new CoT explanations for each training input.
  - Pair these self-generated CoTs with the annotated scores to create a new training set.
  - Fine-tune a fresh copy of the base LLM on these self-generated pairs using the same mixed objective.
The two-stage approach ensures that models learn to produce and utilize their own reasoning traces at inference, thus matching the training and inference distributions and resolving the mismatch inherent in annotation-based training.
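The data flow of the two stages can be made concrete with a minimal Python sketch. It assumes Hugging Face Transformers for generation and a hypothetical `cot_raft_finetune` routine standing in for the actual training loop (which would implement the objective of Section 3); all names and helpers are illustrative, not the paper's released code.

```python
# Sketch of the TRACT two-stage data flow (illustrative; not the paper's released code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "mistralai/Mistral-7B-Instruct-v0.2"  # one of the base models used in the paper

def build_stage2_dataset(stage1_model, tokenizer, examples, max_new_tokens=512):
    """Pair each training input with a CoT generated by the Stage 1 model and the gold score."""
    stage2 = []
    for ex in examples:  # ex = {"prompt": instruction + rubric, "score": annotated gold score}
        inputs = tokenizer(ex["prompt"], return_tensors="pt").to(stage1_model.device)
        with torch.no_grad():
            out = stage1_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        cot = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        stage2.append({"prompt": ex["prompt"], "cot": cot, "score": ex["score"]})
    return stage2

# Stage 1: fine-tune the base model on annotated CoTs + gold scores with the CoT-RAFT objective.
#   stage1_model = cot_raft_finetune(AutoModelForCausalLM.from_pretrained(BASE), annotated_data)
# Stage 2: regenerate CoTs with the Stage 1 model, then fine-tune a *fresh* copy of the base model.
#   stage2_data  = build_stage2_dataset(stage1_model, AutoTokenizer.from_pretrained(BASE), annotated_data)
#   final_model  = cot_raft_finetune(AutoModelForCausalLM.from_pretrained(BASE), stage2_data)
```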
3. Mathematical Formulation and Training Objective
Let $x$ denote the input context (instruction and rubric), $c$ the CoT explanation (a token sequence), $y$ the numeric score drawn from the set of valid scores $\mathcal{Y}$, and $p_\theta$ the LLM's output distribution.
CoT-RAIL Inference
At test time, for each input $x$:
- Sample a CoT explanation: $\hat{c} \sim p_\theta(\cdot \mid x)$.
- Predict the score as its expectation under the model's posterior conditioned on the sampled CoT: $\hat{y} = \mathbb{E}_{y \sim p_\theta(\cdot \mid x, \hat{c})}[y] = \sum_{y \in \mathcal{Y}} y \, p_\theta(y \mid x, \hat{c})$. When several CoTs are sampled, the per-CoT expected scores are averaged.
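A small PyTorch sketch of the expected-score computation is given below. Restricting and renormalizing the next-token distribution over the score tokens, and the toy token ids and score values, are assumptions of this sketch rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def rail_expected_score(next_token_logits, score_token_ids, score_values):
    """Expected score under the model's next-token distribution, restricted to the score tokens."""
    probs = F.softmax(next_token_logits[score_token_ids], dim=-1)   # renormalize over score tokens
    return (probs * torch.tensor(score_values, dtype=probs.dtype)).sum()

# Toy example: a 10-token vocabulary where ids 5..9 encode the scores 1..5.
next_token_logits = torch.randn(10)            # logits at the position following the sampled CoT
score = rail_expected_score(next_token_logits, torch.arange(5, 10), [1.0, 2.0, 3.0, 4.0, 5.0])
# With several sampled CoTs, average the per-CoT expected scores to form the final prediction.
```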
CoT-RAFT Training Objective
TRACT employs a weighted sum of the regression-aware squared-error loss on the score and the cross-entropy loss on the CoT:

$$\mathcal{L}_{\text{CoT-RAFT}}(\theta; x, c, y) = \lambda \Big( \mathbb{E}_{y' \sim p_\theta(\cdot \mid x, c)}[y'] - y \Big)^2 - \log p_\theta(c \mid x),$$

where $\lambda$ is the mixing coefficient weighting the regression term, and $c$ is sampled from the annotation LLM during Stage 1 and from the Stage 1 model during Stage 2.
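A compact PyTorch sketch of this objective follows; placing $\lambda$ on the regression term and renormalizing the score distribution over the score tokens are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def cot_raft_loss(cot_logits, cot_targets, score_logits, score_token_ids, score_values,
                  gold_score, lam=1.0):
    """Mixed objective: lam * (E[score] - gold)^2 + cross-entropy over the CoT tokens.

    cot_logits:   (n_cot, vocab) next-token logits at the CoT positions
    cot_targets:  (n_cot,)       gold CoT token ids (shifted for next-token prediction)
    score_logits: (vocab,)       logits at the position that predicts the score token
    """
    ce = F.cross_entropy(cot_logits, cot_targets)                    # CoT supervision (CE)
    probs = F.softmax(score_logits[score_token_ids], dim=-1)         # distribution over score values
    expected = (probs * torch.tensor(score_values, dtype=probs.dtype)).sum()
    return lam * (expected - gold_score) ** 2 + ce                   # regression term + CE term
```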
4. Empirical Results and Ablation Studies
TRACT was evaluated across four LLM-as-a-Judge benchmarks (Feedback Bench, FLASK, Vicuna Bench, MT Bench) and with both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct as base models.
- Performance: TRACT achieves state-of-the-art average Pearson correlations (r = 0.650 with Mistral, r = 0.675 with Llama) versus Prometheus-2-7B (r = 0.591) and pure CE CoT training (r = 0.557), surpassing baselines by substantial margins.
- Robustness: TRACT is robust to the number of sampled CoT traces at inference, performing well even with single trace sampling.
- Ablations:
- Omitting Stage 2 (i.e., training only on annotation CoTs) results in a drop of 0.094 in Pearson r.
- Replacing CoT-RAFT with pure CE loss reduces performance by 0.033.
- Training on self-generated CoTs with CE loss alone is inferior to annotation-based training—self-generation is effective only when paired with regression-aware objectives.
- Starting Stage 2 from the Stage 1 model rather than from the base model leads to catastrophic overfitting and performance collapse (r = 0.515).
- Distribution Matching: TRACT resolves the distribution mismatch between training and inference, evidenced by diagnostic experiments showing that annotation-only CoT training degrades at inference on self-generated traces.
Meta-evaluation via GPT-4 indicates that self-generated CoTs are nearly as high in quality as annotation CoTs (4.50 vs. 4.78 on a 5-point scale).
5. Comparative Analysis with Baselines
The following table summarizes the unique assets and performance of TRACT relative to key baselines:
| Approach | Uses CoT | Regression-Aware Loss | Self-Generated CoT | Avg. Pearson r |
|---|---|---|---|---|
| Standard CE-no-CoT | ✗ | ✗ | n/a | 0.488 |
| Standard CE w/ CoT | ✓ | ✗ | n/a | 0.557 |
| RAFT, no CoT | ✗ | ✓ | n/a | 0.623 |
| Prometheus-2-7B | ✓ | ✗ | n/a | 0.591 |
| TRACT (Ours) | ✓ | ✓ | ✓ | 0.650 |
TRACT uniquely combines reasoning supervision and regression objectives in both annotation and self-generative stages, outperforming both token-level and regression-centric alternatives.
6. Practical Trade-Offs, Implementation, and Deployment
- TRACT produces robust, distribution-matched models ready for deployment in evaluation tasks requiring rated or scored outputs with explicit reasoning justification.
- The two-stage process is efficient and scalable; self-generated CoTs are high-quality and avoid catastrophic distributional shifts.
- TRACT remains effective under inference-time compute constraints, allowing for minimal CoT sampling.
- The mixing coefficient $\lambda$ can be tuned; reported results show robustness over a wide range of values.
Unique implementation insights:
- Re-initializing Stage 2 from the base model, rather than continuing from the Stage 1 checkpoint, avoids latent overfitting to the annotation distribution.
- Regression-aware loss is vital to preserve numeric accuracy, not just reasoning fluency.
7. Connections to Broader Regression-Aware and CoT Fine-Tuning Paradigms
TRACT is emblematic of advanced regression-aware and multi-phase CoT fine-tuning strategies, being:
- Modular with respect to data source (annotation or self-generated),
- Directly tied to the mechanisms uncovered in studies analyzing the stage-wise alignment of reasoning traces and internal representations (Yao et al., 7 Feb 2025),
- A functional blueprint for scaling systematic generalization and robust reasoning into scoring and model-evaluation contexts.
8. Summary and Outlook
Two-stage Regression-Aware Fine-tuning with CoT (TRACT) provides an architecture-agnostic recipe for reasoning-intensive tasks where numeric regression and stepwise explanations are required. Its explicit CoT supervision, regression-aware score modeling, and self-distribution matching unlock both in-distribution and out-of-distribution generalization, as well as deployment scalability. TRACT sets a new standard for model-based evaluation, demonstrating the synergy of reasoning and numeric supervision for both research and applied domains.