TRACT: Two-stage Regression-Aware Fine-tuning with CoT
- The paper demonstrates that combining CoT explanations with a regression-aware loss boosts numerical prediction accuracy, achieving average Pearson correlations of 0.65 (Mistral-7B) and 0.675 (Llama-3.1-8B).
- The methodology uses a two-stage approach—first fine-tuning on annotated CoTs and then on self-generated ones—to resolve training-inference mismatches.
- The practical implications include robust performance on tasks such as feedback grading and dialogue assessment, supporting scalable deployment of LLM-as-a-judge models.
Two-stage Regression-Aware Fine-tuning with CoT Reasoning (TRACT) is a training paradigm that integrates Chain-of-Thought (CoT) reasoning supervision with regression-aware objectives, achieving robust and accurate numerical prediction in LLM evaluation tasks. The TRACT framework addresses limitations of standard cross-entropy fine-tuning, which neglects the numeric nature of tasks such as model-as-a-judge, and overcomes key distribution mismatches between training and inference phases.
1. Motivation and Problem Setting
The LLM-as-a-Judge paradigm requires models to output not just stepwise reasoning chains (CoT explanations) but also a numerical score according to specified rubrics. Standard fine-tuning employs cross-entropy (CE) loss to maximize the likelihood of ground-truth responses, including CoT explanations and scores, but this ignores the regression nature of score prediction: predicting 4 when the gold score is 5 is penalized the same as predicting 1. Regression-aware fine-tuning (RAFT) corrects this with a squared-error objective, but does not leverage explicit reasoning via CoT supervision.
Crucially, neither CE with CoT nor a regression-aware objective alone is sufficient for high-fidelity automated evaluation in settings that require both nuanced reasoning and precise numeric estimates (e.g., feedback grading, model comparison, multi-turn dialogue assessment).
2. TRACT Framework: Two-Stage Regression-Aware CoT Fine-Tuning
TRACT combines both CoT reasoning and regression-aware training, executed in two explicit stages to address distribution mismatches and maximize both reasoning fidelity and numeric accuracy:
- Stage 1 – Annotation CoT Fine-tuning:
  - Fine-tune the base LLM on high-quality annotated CoT explanations (typically from GPT-4 or expert annotators) paired with ground-truth scores.
  - Train with a combined loss: CE for the CoT reasoning, RAFT squared error for the score prediction.
- Stage 2 – Self-Generated CoT Fine-tuning:
  - Use the model from Stage 1 to generate new CoT explanations for each training input.
  - Pair these self-generated CoTs with the annotated scores to create a new training set.
  - Fine-tune a fresh copy of the base LLM on these self-generated pairs using the same mixed objective.
The two-stage approach ensures that models learn to produce and utilize their own reasoning traces at inference, thus matching the training and inference distributions and resolving the mismatch inherent in annotation-based training.
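The data flow of the two stages can be made concrete with a minimal Python sketch. It assumes Hugging Face Transformers for generation and a hypothetical `cot_raft_finetune` routine standing in for the actual training loop (which would implement the objective of Section 3); all names and helpers are illustrative, not the paper's released code.

```python
# Sketch of the TRACT two-stage data flow (illustrative; not the paper's released code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "mistralai/Mistral-7B-Instruct-v0.2"  # one of the base models used in the paper

def build_stage2_dataset(stage1_model, tokenizer, examples, max_new_tokens=512):
    """Pair each training input with a CoT generated by the Stage 1 model and the gold score."""
    stage2 = []
    for ex in examples:  # ex = {"prompt": instruction + rubric, "score": annotated gold score}
        inputs = tokenizer(ex["prompt"], return_tensors="pt").to(stage1_model.device)
        with torch.no_grad():
            out = stage1_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        cot = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        stage2.append({"prompt": ex["prompt"], "cot": cot, "score": ex["score"]})
    return stage2

# Stage 1: fine-tune the base model on annotated CoTs + gold scores with the CoT-RAFT objective.
#   stage1_model = cot_raft_finetune(AutoModelForCausalLM.from_pretrained(BASE), annotated_data)
# Stage 2: regenerate CoTs with the Stage 1 model, then fine-tune a *fresh* copy of the base model.
#   stage2_data  = build_stage2_dataset(stage1_model, AutoTokenizer.from_pretrained(BASE), annotated_data)
#   final_model  = cot_raft_finetune(AutoModelForCausalLM.from_pretrained(BASE), stage2_data)
```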
3. Mathematical Formulation and Training Objective
Let $x$ denote the input context (instruction and rubric), $c$ the CoT explanation (a token sequence), $y$ the numeric score drawn from the set of valid scores $\mathcal{Y}$, and $p_\theta$ the LLM's output distribution.
CoT-RAIL Inference
At test time, for each input $x$:
- Sample a CoT explanation: $\hat{c} \sim p_\theta(\cdot \mid x)$.
- Predict the score as its expectation under the model's posterior conditioned on the sampled CoT: $\hat{y} = \mathbb{E}_{y \sim p_\theta(\cdot \mid x, \hat{c})}[y] = \sum_{y \in \mathcal{Y}} y \, p_\theta(y \mid x, \hat{c})$. When several CoTs are sampled, the per-CoT expected scores are averaged.
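A small PyTorch sketch of the expected-score computation is given below. Restricting and renormalizing the next-token distribution over the score tokens, and the toy token ids and score values, are assumptions of this sketch rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def rail_expected_score(next_token_logits, score_token_ids, score_values):
    """Expected score under the model's next-token distribution, restricted to the score tokens."""
    probs = F.softmax(next_token_logits[score_token_ids], dim=-1)   # renormalize over score tokens
    return (probs * torch.tensor(score_values, dtype=probs.dtype)).sum()

# Toy example: a 10-token vocabulary where ids 5..9 encode the scores 1..5.
next_token_logits = torch.randn(10)            # logits at the position following the sampled CoT
score = rail_expected_score(next_token_logits, torch.arange(5, 10), [1.0, 2.0, 3.0, 4.0, 5.0])
# With several sampled CoTs, average the per-CoT expected scores to form the final prediction.
```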
CoT-RAFT Training Objective
TRACT employs a weighted sum of the regression-aware squared-error loss on the score and the cross-entropy loss on the CoT:

$$\mathcal{L}_{\text{CoT-RAFT}}(\theta; x, c, y) = \lambda \Big( \mathbb{E}_{y' \sim p_\theta(\cdot \mid x, c)}[y'] - y \Big)^2 - \log p_\theta(c \mid x),$$

where $\lambda$ is the mixing coefficient weighting the regression term, and $c$ is sampled from the annotation LLM during Stage 1 and from the Stage 1 model during Stage 2.
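A compact PyTorch sketch of this objective follows; placing $\lambda$ on the regression term and renormalizing the score distribution over the score tokens are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def cot_raft_loss(cot_logits, cot_targets, score_logits, score_token_ids, score_values,
                  gold_score, lam=1.0):
    """Mixed objective: lam * (E[score] - gold)^2 + cross-entropy over the CoT tokens.

    cot_logits:   (n_cot, vocab) next-token logits at the CoT positions
    cot_targets:  (n_cot,)       gold CoT token ids (shifted for next-token prediction)
    score_logits: (vocab,)       logits at the position that predicts the score token
    """
    ce = F.cross_entropy(cot_logits, cot_targets)                    # CoT supervision (CE)
    probs = F.softmax(score_logits[score_token_ids], dim=-1)         # distribution over score values
    expected = (probs * torch.tensor(score_values, dtype=probs.dtype)).sum()
    return lam * (expected - gold_score) ** 2 + ce                   # regression term + CE term
```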
4. Empirical Results and Ablation Studies
TRACT was evaluated across four LLM-as-a-Judge benchmarks (Feedback Bench, FLASK, Vicuna Bench, MT Bench) and with both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct as base models.
- Performance: TRACT achieves state-of-the-art average Pearson correlations (r = 0.650 with Mistral, r = 0.675 with Llama) versus Prometheus-2-7B (r = 0.591) and pure CE CoT training (r = 0.557), surpassing baselines by substantial margins.
- Robustness: TRACT is robust to the number of sampled CoT traces at inference, performing well even with single trace sampling.
- Ablations:
- Omitting Stage 2 (i.e., training only on annotation CoTs) results in a drop of 0.094 in Pearson r.
- Replacing CoT-RAFT with pure CE loss reduces performance by 0.033.
- Training on self-generated CoTs with CE loss alone is inferior to annotation-based training—self-generation is effective only when paired with regression-aware objectives.
- Starting Stage 2 from the Stage 1 model rather than from the base model leads to catastrophic overfitting and performance collapse (r = 0.515).
- Distribution Matching: TRACT resolves the distribution mismatch between training and inference, evidenced by diagnostic experiments showing that annotation-only CoT training degrades at inference on self-generated traces.
Meta-evaluation via GPT-4 indicates that self-generated CoTs are nearly as high in quality as annotation CoTs (4.50 vs. 4.78 on a 5-point scale).
5. Comparative Analysis with Baselines
The following table summarizes the unique assets and performance of TRACT relative to key baselines:
| Approach | Uses CoT | Regression-Aware Loss | Self-Generated CoT | Avg. Pearson r |
|---|---|---|---|---|
| Standard CE-no-CoT | ✗ | ✗ | n/a | 0.488 |
| Standard CE w/ CoT | ✓ | ✗ | n/a | 0.557 |
| RAFT, no CoT | ✗ | ✓ | n/a | 0.623 |
| Prometheus-2-7B | ✓ | ✗ | n/a | 0.591 |
| TRACT (Ours) | ✓ | ✓ | ✓ | 0.650 |
TRACT uniquely combines reasoning supervision and regression objectives in both annotation and self-generative stages, outperforming both token-level and regression-centric alternatives.
6. Practical Trade-Offs, Implementation, and Deployment
- TRACT produces robust, distribution-matched models ready for deployment in evaluation tasks requiring rated or scored outputs with explicit reasoning justification.
- The two-stage process is efficient and scalable; self-generated CoTs are high-quality and avoid catastrophic distributional shifts.
- TRACT remains effective under inference-time compute constraints, allowing for minimal CoT sampling.
- The mixing coefficient $\lambda$ can be tuned; reported results show robustness over a wide range of values.
Unique implementation insights:
- Re-initializing Stage 2 from the base model, rather than continuing from the Stage 1 checkpoint, avoids latent overfitting to the annotation distribution.
- Regression-aware loss is vital to preserve numeric accuracy, not just reasoning fluency.
7. Connections to Broader Regression-Aware and CoT Fine-Tuning Paradigms
TRACT is emblematic of advanced regression-aware and multi-phase CoT fine-tuning strategies, being:
- Modular with respect to data source (annotation or self-generated),
- Directly tied to the mechanisms uncovered in studies analyzing the stage-wise alignment of reasoning traces and internal representations (Yao et al., 7 Feb 2025),
- A functional blueprint for scaling systematic generalization and robust reasoning into scoring and model-evaluation contexts.
8. Summary and Outlook
Two-stage Regression-Aware Fine-tuning with CoT (TRACT) provides an architecture-agnostic recipe for reasoning-intensive tasks where numeric regression and stepwise explanations are required. Its explicit CoT supervision, regression-aware score modeling, and self-distribution matching unlock both in-distribution and out-of-distribution generalization, as well as deployment scalability. TRACT sets a new standard for model-based evaluation, demonstrating the synergy of reasoning and numeric supervision for both research and applied domains.