- The paper introduces TRACT, a novel two-stage fine-tuning method that enhances LLM-as-a-judge performance by integrating chain-of-thought reasoning with regression-aware training.
- TRACT employs a CoT-RAFT objective and a two-stage process to align CoT generation with the model's distribution and directly optimize for numerical score prediction accuracy.
- Evaluations show TRACT consistently outperforms baselines, achieving an average Pearson's r of 0.650 across four benchmarks with Mistral-7B, demonstrating improved accuracy for LLM-based judging.
This paper introduces Two-stage Regression-Aware Fine-tuning with CoT reasoning (TRACT), a novel method for enhancing the performance of LLMs in the LLM-as-a-judge paradigm. TRACT addresses limitations in existing approaches that either neglect the numerical nature of score prediction or fail to incorporate CoT reasoning.
The paper posits that directly applying cross-entropy (CE) loss for fine-tuning LLMs as judges is suboptimal for numerical target prediction: CE penalizes vastly different numerical errors equally, failing to account for the inherent ordinality of scoring tasks (see the toy example below). To overcome this, the paper adopts regression-aware fine-tuning (RAFT), which uses a squared-error loss during fine-tuning to directly optimize numerical accuracy. However, RAFT does not incorporate CoT reasoning, which has been shown to improve LLM-as-a-judge performance.
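A toy example (with made-up probabilities) illustrates the point: two predicted score distributions that place the same mass on the correct label, but very different mass near it, incur identical CE loss while differing sharply in expected-score error.

```python
# Toy illustration (made-up probabilities): with ground truth 5, CE only looks
# at p("5"), so these two predictive distributions over {"1",...,"5"} receive
# the same CE loss despite very different expected-score errors.
import math

p_near = {"1": 0.0, "2": 0.0, "3": 0.1, "4": 0.6, "5": 0.3}   # mass close to 5
p_far  = {"1": 0.6, "2": 0.1, "3": 0.0, "4": 0.0, "5": 0.3}   # mass far from 5

for p in (p_near, p_far):
    ce = -math.log(p["5"])                               # identical: ~1.204
    expected = sum(int(y) * q for y, q in p.items())     # 4.2 vs 2.3
    print(f"CE = {ce:.3f}, expected score = {expected:.2f}")
```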
TRACT combines the strengths of CoT reasoning with regression-aware training via a two-stage fine-tuning process. The key innovation is the CoT-RAFT fine-tuning objective, which is a weighted sum of the CE loss for CoT supervision and the RAFT loss for score prediction:
$\ell_{\rm CoT-RAFT}^{\lambda}( y^*, p_{\rm t}, p ) = \lambda \left( \sum_{y \in \mathcal{Y}} p( \mathrm{str}( y ) \,|\,[x, \hat{s}]) \cdot y - y^* \right)^2 - \log {p}( [\hat{s}, y^*] \,|\, x); \quad \hat{s} \sim p_{\rm t}(\cdot \,|\, x)$
where:
- $y^*$ is the ground-truth score
- $p_{\rm t}$ is the target model used to generate CoTs for training
- $p$ is the model being trained
- $x$ is the input
- $\hat{s}$ is the generated CoT
- $\mathcal{Y}$ is the set of possible numerical targets
- $\lambda$ is a weighting coefficient
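A minimal PyTorch sketch of how this objective could be computed for a single example is shown below. It assumes a HuggingFace-style causal LM and that each score in $\mathcal{Y}$ (including $\mathrm{str}(y^*)$) is a single token; all variable names are illustrative, not taken from the paper's code.

```python
# Sketch of the CoT-RAFT loss for one example, assuming a HuggingFace-style
# causal LM whose forward pass returns `.logits`, and that each score in Y is
# a single token. Variable names are illustrative.
import torch
import torch.nn.functional as F

def cot_raft_loss(model, x_ids, cot_score_ids, score_token_ids, score_values,
                  y_star, lam=1.0):
    """
    x_ids:           token ids of the input x                        (LongTensor [Lx])
    cot_score_ids:   token ids of [s_hat, str(y*)], s_hat ~ p_t(.|x) (LongTensor [Lc])
    score_token_ids: token id of str(y) for each y in Y              (LongTensor [|Y|])
    score_values:    the numeric values y in Y                       (FloatTensor [|Y|])
    y_star:          ground-truth score (float)
    lam:             weighting coefficient lambda
    """
    full = torch.cat([x_ids, cot_score_ids]).unsqueeze(0)        # [1, Lx+Lc]
    logits = model(full).logits[0]                               # [Lx+Lc, V]

    # CE term: -log p([s_hat, y*] | x), summed over the CoT and score tokens
    pred_logits = logits[x_ids.numel() - 1 : -1]                 # predicts each token of [s_hat, y*]
    ce = F.cross_entropy(pred_logits, cot_score_ids, reduction="sum")

    # RAFT term: (sum_y p(str(y) | [x, s_hat]) * y  -  y*)^2
    # logits[-2] is the next-token distribution after [x, s_hat], i.e. at the score position
    score_probs = logits[-2].softmax(-1)[score_token_ids]
    expected_score = (score_probs * score_values).sum()
    raft = (expected_score - y_star) ** 2

    return lam * raft + ce
```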
The two stages of TRACT are designed to align the CoT supervision with the model's CoT distribution:
- Stage 1: A seed LLM is fine-tuned using the CoT-RAFT objective, with CoTs generated by an annotation model $p_{\rm a}$ (e.g., GPT-4), yielding the stage-1 model $p_{\rm s}$. This stage imparts initial CoT reasoning capabilities to the model.
- Stage 2: The stage-1 model $p_{\rm s}$ is used to generate its own CoTs. These self-generated CoTs, together with the ground-truth scores, form a new training dataset. A new model, $p_{\rm tract}$, is then fine-tuned from the original seed LLM using the CoT-RAFT objective, but with CoTs sampled from the frozen $p_{\rm s}$. This aligns the CoT distribution used during training with that of the model itself, mitigating distribution shift.
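A high-level sketch of the two stages is below; `finetune_cot_raft` and `generate_cots` are hypothetical helpers standing in for the actual training and sampling code.

```python
# High-level sketch of the two-stage TRACT pipeline. `finetune_cot_raft` and
# `generate_cots` are hypothetical helpers, not functions from the paper.
def tract(seed_lm, inputs, gold_scores, annotation_cots, lam=1.0):
    # Stage 1: fine-tune the seed LLM on CoTs from the annotation model p_a (e.g., GPT-4)
    stage1_data = list(zip(inputs, annotation_cots, gold_scores))
    p_s = finetune_cot_raft(seed_lm, stage1_data, lam=lam)

    # Stage 2: the stage-1 model generates its own CoTs (s_hat ~ p_s(. | x)) ...
    self_cots = generate_cots(p_s, inputs)
    stage2_data = list(zip(inputs, self_cots, gold_scores))

    # ... and a fresh model is fine-tuned from the *original seed LLM* on them
    p_tract = finetune_cot_raft(seed_lm, stage2_data, lam=lam)
    return p_tract
```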
For inference, the paper uses a CoT-RAIL predictor:
$\hat{y}_{\rm CR}(x) = \sum_{y \in \mathcal{Y}} p( \mathrm{str}( y ) \,|\,[x, \hat{s}]) \cdot y; \quad \hat{s} \sim p(\cdot \,|\,x)$
which samples a CoT $\hat{s}$ conditioned on the input $x$ and then applies the RAIL predictor conditioned on both $x$ and $\hat{s}$.
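A minimal sketch of this predictor with a HuggingFace-style model and tokenizer follows; the score strings, generation settings, and the assumption that each score is a single token emitted right after the CoT are illustrative choices, not details from the paper.

```python
# Sketch of CoT-RAIL inference, assuming each score string is a single token and
# that generation stops right before the score is emitted (prompt format assumed).
import torch

@torch.no_grad()
def cot_rail_predict(model, tokenizer, x, score_strings=("1", "2", "3", "4", "5")):
    # 1) Sample a CoT s_hat ~ p(. | x)
    prompt_ids = tokenizer(x, return_tensors="pt").input_ids
    context = model.generate(prompt_ids, do_sample=True, max_new_tokens=512)  # [x, s_hat]

    # 2) RAIL predictor conditioned on [x, s_hat]: sum_y p(str(y) | [x, s_hat]) * y
    next_token_probs = model(context).logits[0, -1].softmax(-1)
    score_ids = [tokenizer(s, add_special_tokens=False).input_ids[0] for s in score_strings]
    probs = next_token_probs[score_ids]
    values = torch.tensor([float(s) for s in score_strings])
    return (probs * values).sum().item()
```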
The paper evaluates TRACT on four LLM-as-a-judge datasets: Feedback Bench, FLASK, Vicuna Bench, and MT Bench. The models used are Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct. The results demonstrate that TRACT consistently outperforms existing methods, including Prometheus-2-7B, a strong baseline of comparable size. Key results include:
- TRACT achieves an average Pearson's r of 0.650 across the four datasets with Mistral-7B, significantly outperforming standard fine-tuning with CoT (0.557) and Prometheus-2-7B (0.591).
- TRACT outperforms RAFT, demonstrating the benefits of integrating CoT reasoning into regression-aware fine-tuning.
- Ablation studies validate the importance of both stages of fine-tuning and the use of the CoT-RAFT objective.
The paper includes several analyses:
- An analysis of the distribution shift between annotation CoTs and self-generated CoTs, showing that TRACT effectively bridges this gap.
- A sensitivity analysis of the λ weighting coefficient in the CoT-RAFT objective, showing that TRACT is robust to a range of λ values.
- A comparison of multi-objective fine-tuning (CoT-RAFT) versus sequential single-objective fine-tuning (CE followed by RAFT), demonstrating the superiority of the former.
- An analysis of the impact of scaling the number of sampled CoTs during inference, revealing that TRACT performs well even with limited inference-time compute.
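One plausible way to scale inference-time compute is to aggregate the CoT-RAIL prediction over k independently sampled CoTs; the simple averaging below (reusing the hypothetical `cot_rail_predict` above) is an assumption, not necessarily the paper's exact aggregation.

```python
# Averaging CoT-RAIL predictions over k sampled CoTs (simple mean; the exact
# aggregation scheme is an assumption). Reuses the hypothetical cot_rail_predict.
def cot_rail_predict_k(model, tokenizer, x, score_strings, k=4):
    preds = [cot_rail_predict(model, tokenizer, x, score_strings) for _ in range(k)]
    return sum(preds) / len(preds)
```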
Overall, the paper presents a comprehensive study of how to effectively combine CoT reasoning with regression-aware fine-tuning for LLM-as-a-judge. The TRACT method and the resulting models represent a significant advance, offering improved accuracy and efficiency compared to existing approaches.