- The paper introduces TRACT, a novel two-stage fine-tuning method that enhances LLM-as-a-judge performance by integrating chain-of-thought reasoning with regression-aware training.
- TRACT employs a CoT-RAFT objective and a two-stage process to align CoT generation with the model's distribution and directly optimize for numerical score prediction accuracy.
- Evaluations show TRACT consistently outperforms baselines, achieving an average Pearson's r of 0.650 across four benchmarks with Mistral-7B, demonstrating improved accuracy for LLM-based judging.
This paper introduces Two-stage Regression-Aware Fine-tuning with CoT reasoning (TRACT), a novel method for enhancing the performance of LLMs in the LLM-as-a-judge paradigm. TRACT addresses limitations in existing approaches that either neglect the numerical nature of score prediction or fail to incorporate CoT reasoning.
The paper posits that directly applying cross-entropy (CE) loss for fine-tuning LLMs as judges is suboptimal for numerical target prediction: CE penalizes vastly different numerical errors equally, failing to account for the inherent ordinality of scoring tasks (see the toy example below). To overcome this, the paper adopts regression-aware fine-tuning (RAFT), which uses a squared-error loss during fine-tuning to directly optimize numerical accuracy. However, RAFT does not incorporate CoT reasoning, which has been shown to improve LLM-as-a-judge performance.
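A toy example (with made-up probabilities) illustrates the point: two predicted score distributions that place the same mass on the correct label, but very different mass near it, incur identical CE loss while differing sharply in expected-score error.

```python
# Toy illustration (made-up probabilities): with ground truth 5, CE only looks
# at p("5"), so these two predictive distributions over {"1",...,"5"} receive
# the same CE loss despite very different expected-score errors.
import math

p_near = {"1": 0.0, "2": 0.0, "3": 0.1, "4": 0.6, "5": 0.3}   # mass close to 5
p_far  = {"1": 0.6, "2": 0.1, "3": 0.0, "4": 0.0, "5": 0.3}   # mass far from 5

for p in (p_near, p_far):
    ce = -math.log(p["5"])                               # identical: ~1.204
    expected = sum(int(y) * q for y, q in p.items())     # 4.2 vs 2.3
    print(f"CE = {ce:.3f}, expected score = {expected:.2f}")
```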
TRACT combines the strengths of CoT reasoning with regression-aware training via a two-stage fine-tuning process. The key innovation is the CoT-RAFT fine-tuning objective, which is a weighted sum of the CE loss for CoT supervision and the RAFT loss for score prediction:
$\ell_{\rm CoT-RAFT}^{\lambda}( y^*, p_{\rm t}, p ) = \lambda \left( \sum_{y \in \mathcal{Y}} p( \mathrm{str}( y ) \,|\,[x, \hat{s}]) \cdot y - y^* \right)^2 - \log {p}( [\hat{s}, y^*] \,|\, x); \quad \hat{s} \sim p_{\rm t}(\cdot \,|\, x)$
where:
- $y^*$ is the ground-truth score
- $p_{\rm t}$ is the target model used to generate CoTs for training
- $p$ is the model being trained
- $x$ is the input
- $\hat{s}$ is the generated CoT
- $\mathcal{Y}$ is the set of possible numerical targets
- $\lambda$ is a weighting coefficient
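A minimal PyTorch sketch of how this objective could be computed for a single example is shown below. It assumes a HuggingFace-style causal LM and that each score in $\mathcal{Y}$ (including $\mathrm{str}(y^*)$) is a single token; all variable names are illustrative, not taken from the paper's code.

```python
# Sketch of the CoT-RAFT loss for one example, assuming a HuggingFace-style
# causal LM whose forward pass returns `.logits`, and that each score in Y is
# a single token. Variable names are illustrative.
import torch
import torch.nn.functional as F

def cot_raft_loss(model, x_ids, cot_score_ids, score_token_ids, score_values,
                  y_star, lam=1.0):
    """
    x_ids:           token ids of the input x                        (LongTensor [Lx])
    cot_score_ids:   token ids of [s_hat, str(y*)], s_hat ~ p_t(.|x) (LongTensor [Lc])
    score_token_ids: token id of str(y) for each y in Y              (LongTensor [|Y|])
    score_values:    the numeric values y in Y                       (FloatTensor [|Y|])
    y_star:          ground-truth score (float)
    lam:             weighting coefficient lambda
    """
    full = torch.cat([x_ids, cot_score_ids]).unsqueeze(0)        # [1, Lx+Lc]
    logits = model(full).logits[0]                               # [Lx+Lc, V]

    # CE term: -log p([s_hat, y*] | x), summed over the CoT and score tokens
    pred_logits = logits[x_ids.numel() - 1 : -1]                 # predicts each token of [s_hat, y*]
    ce = F.cross_entropy(pred_logits, cot_score_ids, reduction="sum")

    # RAFT term: (sum_y p(str(y) | [x, s_hat]) * y  -  y*)^2
    # logits[-2] is the next-token distribution after [x, s_hat], i.e. at the score position
    score_probs = logits[-2].softmax(-1)[score_token_ids]
    expected_score = (score_probs * score_values).sum()
    raft = (expected_score - y_star) ** 2

    return lam * raft + ce
```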
The two stages of TRACT are designed to align the CoT supervision with the model's CoT distribution:
- Stage 1: A seed LLM is fine-tuned using the CoT-RAFT objective, with CoTs generated by an annotation model $p_{\rm a}$ (e.g., GPT-4), yielding the stage-1 model $p_{\rm s}$. This stage imparts initial CoT reasoning capabilities to the model.
- Stage 2: The stage-1 model $p_{\rm s}$ is used to generate its own CoTs. These self-generated CoTs, together with the ground-truth scores, form a new training dataset. A new model, $p_{\rm tract}$, is then fine-tuned from the original seed LLM using the CoT-RAFT objective, but with CoTs sampled from the frozen $p_{\rm s}$. This aligns the CoT distribution used during training with that of the model itself, mitigating distribution shift.
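A high-level sketch of the two stages is below; `finetune_cot_raft` and `generate_cots` are hypothetical helpers standing in for the actual training and sampling code.

```python
# High-level sketch of the two-stage TRACT pipeline. `finetune_cot_raft` and
# `generate_cots` are hypothetical helpers, not functions from the paper.
def tract(seed_lm, inputs, gold_scores, annotation_cots, lam=1.0):
    # Stage 1: fine-tune the seed LLM on CoTs from the annotation model p_a (e.g., GPT-4)
    stage1_data = list(zip(inputs, annotation_cots, gold_scores))
    p_s = finetune_cot_raft(seed_lm, stage1_data, lam=lam)

    # Stage 2: the stage-1 model generates its own CoTs (s_hat ~ p_s(. | x)) ...
    self_cots = generate_cots(p_s, inputs)
    stage2_data = list(zip(inputs, self_cots, gold_scores))

    # ... and a fresh model is fine-tuned from the *original seed LLM* on them
    p_tract = finetune_cot_raft(seed_lm, stage2_data, lam=lam)
    return p_tract
```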
For inference, the paper uses a CoT-RAIL predictor:
$\hat{y}_{\rm CR}(x) = \sum_{y \in \mathcal{Y}} p( \mathrm{str}( y ) \,|\,[x, \hat{s}]) \cdot y; \quad \hat{s} \sim p(\cdot \,|\,x)$
which samples a CoT $\hat{s}$ conditioned on the input $x$ and then applies the RAIL predictor conditioned on both $x$ and $\hat{s}$.
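A minimal sketch of this predictor with a HuggingFace-style model and tokenizer follows; the score strings, generation settings, and the assumption that each score is a single token emitted right after the CoT are illustrative choices, not details from the paper.

```python
# Sketch of CoT-RAIL inference, assuming each score string is a single token and
# that generation stops right before the score is emitted (prompt format assumed).
import torch

@torch.no_grad()
def cot_rail_predict(model, tokenizer, x, score_strings=("1", "2", "3", "4", "5")):
    # 1) Sample a CoT s_hat ~ p(. | x)
    prompt_ids = tokenizer(x, return_tensors="pt").input_ids
    context = model.generate(prompt_ids, do_sample=True, max_new_tokens=512)  # [x, s_hat]

    # 2) RAIL predictor conditioned on [x, s_hat]: sum_y p(str(y) | [x, s_hat]) * y
    next_token_probs = model(context).logits[0, -1].softmax(-1)
    score_ids = [tokenizer(s, add_special_tokens=False).input_ids[0] for s in score_strings]
    probs = next_token_probs[score_ids]
    values = torch.tensor([float(s) for s in score_strings])
    return (probs * values).sum().item()
```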
The paper evaluates TRACT on four LLM-as-a-judge datasets: Feedback Bench, FLASK, Vicuna Bench, and MT Bench. The models used are Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct. The results demonstrate that TRACT consistently outperforms existing methods, including Prometheus-2-7B, a strong baseline of comparable size. Key results include:
- TRACT achieves an average Pearson's r of 0.650 across the four datasets with Mistral-7B, significantly outperforming standard fine-tuning with CoT (0.557) and Prometheus-2-7B (0.591).
- TRACT outperforms RAFT, demonstrating the benefits of integrating CoT reasoning into regression-aware fine-tuning.
- Ablation studies validate the importance of both stages of fine-tuning and the use of the CoT-RAFT objective.
The paper includes several analyses:
- An analysis of the distribution shift between annotation CoTs and self-generated CoTs, showing that TRACT effectively bridges this gap.
- A sensitivity analysis of the λ weighting coefficient in the CoT-RAFT objective, showing that TRACT is robust to a range of λ values.
- A comparison of multi-objective fine-tuning (CoT-RAFT) versus sequential single-objective fine-tuning (CE followed by RAFT), demonstrating the superiority of the former.
- An analysis of the impact of scaling the number of sampled CoTs during inference, revealing that TRACT performs well even with limited inference-time compute.
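One plausible way to scale inference-time compute is to aggregate the CoT-RAIL prediction over k independently sampled CoTs; the simple averaging below (reusing the hypothetical `cot_rail_predict` above) is an assumption, not necessarily the paper's exact aggregation.

```python
# Averaging CoT-RAIL predictions over k sampled CoTs (simple mean; the exact
# aggregation scheme is an assumption). Reuses the hypothetical cot_rail_predict.
def cot_rail_predict_k(model, tokenizer, x, score_strings, k=4):
    preds = [cot_rail_predict(model, tokenizer, x, score_strings) for _ in range(k)]
    return sum(preds) / len(preds)
```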
Overall, the paper presents a comprehensive study of how to effectively combine CoT reasoning with regression-aware fine-tuning for LLM-as-a-judge. The TRACT method and the resulting models represent a significant advance, offering improved accuracy and efficiency compared to existing approaches.