CompassJudger-2: Robust LLM Judge Models
- CompassJudger-2 is a family of generalist LLM judge models designed to deliver robust, multi-domain evaluations using an autoregressive, decoder-only transformer architecture.
- The system leverages task-driven multi-domain data curation and explicit chain-of-thought prompts to generate coherent, verifiable judgments.
- Empirical results demonstrate its superior benchmark scores and improved downstream performance compared to both parameter-matched and larger baseline models.
CompassJudger-2 is a family of generalist LLM judge models designed to deliver robust, wide-coverage evaluations of LLM-generated outputs. Built on the autoregressive, decoder-only Qwen2.5-Instruct transformer, CompassJudger-2 addresses the main limitations of prior judge models, namely narrow specialization and limited robustness, through a combination of task-driven, multi-domain data curation and verifiable-reward-guided optimization. The system incorporates explicit instructional breakdowns via chain-of-thought (CoT) reasoning, a policy gradient framework centered on a margin-based loss, and a benchmark suite intended to set new standards for cross-domain LLM judgment (Zhang et al., 12 Jul 2025).
1. Model Architecture and Training Workflow
CompassJudger-2 comprises end-to-end fine-tuned variants based on the Qwen2.5-Instruct backbone, instantiated at approximately 7B and 32B parameter scales. The full training pipeline encompasses the following stages:
- Construction of a unified supervised fine-tuning (SFT) corpus aggregated from five principal data sources.
- Supervision via prompts eliciting critical thinking, specifically structured chain-of-thought rationale generation.
- For each input, generation of multiple candidate judgments and rigorous filtering via rejection sampling.
- Definition of programmatic, verifiable reward signals tied to explicit tokens in the judgment sequence.
- Optimization of a composite objective, $\mathcal{L} = \mathcal{L}_{\text{SFT}} + \mathcal{L}_{\text{PG}}$, where $\mathcal{L}_{\text{PG}}$ is a margin-policy-gradient loss computed at the designated prediction position.
All model weights are trainable throughout, with no module freezing in either the 7B or 32B configurations.
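The candidate-generation and filtering stage can be sketched as a simple rejection filter. This is a minimal illustration rather than the authors' pipeline: the `Prediction: A` output convention and the helper names `extract_verdict` and `rejection_filter` are assumptions made for the example.

```python
import re
from typing import List, Optional

# Assumed convention: each judgment ends with a line such as "Prediction: A".
VERDICT_RE = re.compile(r"Prediction:\s*([AB])\b")


def extract_verdict(judgment: str) -> Optional[str]:
    """Return the last verdict token found in a judgment, or None if absent."""
    matches = VERDICT_RE.findall(judgment)
    return matches[-1] if matches else None


def rejection_filter(candidates: List[str], gold: str) -> List[str]:
    """Keep only candidate judgments whose verdict equals the ground-truth label."""
    return [c for c in candidates if extract_verdict(c) == gold]


# Hypothetical sampled judgments for one instruction with gold label "A".
samples = [
    "Response A answers the question directly. Prediction: A",
    "Response B is longer, so it seems better. Prediction: B",
    "Both responses have merit.",  # no verdict, so it is rejected
]
kept = rejection_filter(samples, gold="A")
```

Only the first sample survives here; judgments that reach the wrong verdict, or no verdict at all, are discarded before training.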
2. Task-Driven, Multi-Domain Data Curation
The training dataset synthesizes five main sources to maximize both judgment generality and instruction-following breadth:
| Source Type | Description | Sourcing Approach |
|---|---|---|
| Public Judge Data | Critiques + explanations, stratified by “outdated” vs. “up-to-date” | Outdated data is re-labeled by Qwen2.5-72B and verified; up-to-date data diversified using modern prompt templates from ArenaHard, WildBench, MTBench, etc. |
| Public Reward Data | Ground-truth labels only | Qwen2.5-72B generates CoT judgments per label; rejection sampling yields RFT data matching ground truth. |
| Synthetic Knowledge-based | Outputs from benchmarks (MMLU, CMMLU, GSM8K) | Qwen2.5-72B provides verification and rationales; only verified are retained. |
| Synthetic Chat-based | Contrastive response pairs on style (formality, brevity) | Qwen2.5-72B selects superior responses for style prompts. |
| General Instruction (G-SFT) | Broad instruction-following data from CompassJudger-1 | Preserves general LLM capabilities. |
This broad data composition specifically targets the pitfalls of judge over-specialization and enables the derivation of fine-grained, multi-perspective supervision signals.
3. Verifiable Rewards and Critical Chain-of-Thought Supervision
CompassJudger-2 supervision relies on “critical thinking” CoT prompts that partition the judgment process into five segments: understanding the user demand, enumerating the strengths of the candidates, enumerating their weaknesses, aggregating the reasoning, and producing a final prediction. For each data point, the verifiable reward function is constructed as:
$$r(x, y) = \begin{cases} 1 & \text{if } y_k = y^{*} \\ 0 & \text{otherwise} \end{cases}$$
where $x$ is the input, $y$ is the model output, $k$ is the specified judgment token position, and $y^{*}$ denotes the ground-truth label.
The policy-gradient objective is defined by
$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]$$
and can be reduced, via the REINFORCE estimator, to a log-probability at $k$. To widen exploration beyond teacher-forced prefixes, multiple candidate trajectories are sampled for each instruction, and rejection sampling retains only those matching the ground truth at $k$. The policy gradient loss is approximated as:
$$\mathcal{L}_{\text{PG}} \approx -\log \pi_\theta\big(y_k = y^{*} \mid x, y_{<k}\big)$$
This scheme guides the LLM to generate coherent rationale and robust predictions aligned with explicit, verifiable targets.
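As a rough illustration of the reward and loss described in this section, the sketch below assumes a binary indicator reward (1 when the token at the judgment position equals the ground-truth label, 0 otherwise) and a loss equal to the negative log-probability of that label at the same position. The function names, the token-list representation, and the two-label vocabulary are hypothetical choices for the example, not details taken from the paper.

```python
import math
from typing import List, Sequence


def verifiable_reward(output_tokens: Sequence[str], k: int, gold: str) -> int:
    """Return 1 if the token at judgment position k equals the gold label, else 0."""
    return int(0 <= k < len(output_tokens) and output_tokens[k] == gold)


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a small label vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def pg_loss_at_position(logits_at_k: Sequence[float], gold_index: int) -> float:
    """Negative log-probability of the gold label at the judgment position."""
    return -log_softmax(logits_at_k)[gold_index]


# Hypothetical trajectory whose final token is the verdict.
tokens = ["Prediction", ":", "A"]
reward = verifiable_reward(tokens, k=2, gold="A")

# Assumed logits over the labels ["A", "B"] at position k.
loss = pg_loss_at_position([2.0, 0.0], gold_index=0)
```

Because the reward depends only on one checkable token, it can be computed programmatically without a learned reward model, which is what makes it verifiable.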
4. Margin Policy Gradient Loss and Objective Blending
The optimized loss function integrates standard SFT cross-entropy (applied at all tokens except the judgment token position) with a margin-based mapping function at the prediction position. Three candidate mapping functions were evaluated:
- Direct Preference Optimization (DPO) loss,
- Temperature-scaled log-ratio,
- Margin loss (default), which separates the correct target from an incorrect candidate at the prediction position by a margin.
Margin loss was empirically favored, yielding +2.21 percentage points over the SFT-only baseline, considerably exceeding the alternative objective variants.
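To make the idea of a margin objective concrete, the sketch below uses a standard hinge formulation on log-probabilities. The hinge form, the default margin value, and the unweighted sum with the SFT term are all assumptions for illustration; the exact formula and hyperparameters used for CompassJudger-2 are not given here.

```python
def margin_loss(logp_correct: float, logp_incorrect: float, margin: float = 1.0) -> float:
    """Hinge penalty: zero once the correct label leads by at least `margin`."""
    return max(0.0, margin - (logp_correct - logp_incorrect))


def blended_loss(sft_loss: float, logp_correct: float, logp_incorrect: float,
                 margin: float = 1.0) -> float:
    """Assumed blend: SFT cross-entropy plus the margin term at the prediction position."""
    return sft_loss + margin_loss(logp_correct, logp_incorrect, margin)


# Confident and correct: the gap of 2.4 exceeds the margin, so no penalty is added.
confident = margin_loss(logp_correct=-0.1, logp_incorrect=-2.5)

# Barely correct: the gap of 0.2 falls short of the margin, so a penalty remains.
uncertain = margin_loss(logp_correct=-1.0, logp_incorrect=-1.2)
```

A margin term of this kind stops pushing once the correct verdict is separated clearly enough, whereas plain cross-entropy keeps rewarding ever larger probabilities.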
5. JudgerBenchV2 Benchmark and Evaluation Metrics
JudgerBenchV2 was developed to provide a rigorous, large-scale assessment of LLM judge models:
- 10,000 real-user queries, balanced across 10 scenarios, languages, and difficulty bins.
- For each query, responses are generated from 10 candidate models and compared pairwise against GPT-4o-mini.
- Ground-truth judgments are determined by majority (“Mixture of Judgers”) voting among DeepSeek-R1, DeepSeek-V3-0324, and Qwen3-235B-A22B.
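The majority-vote labeling step can be sketched as follows. The verdict strings and the choice to return `None` when no label wins a strict majority are assumptions for the example; how ties or disagreements are actually handled in JudgerBenchV2 is not specified here.

```python
from collections import Counter
from typing import Dict, Optional


def majority_label(votes: Dict[str, str]) -> Optional[str]:
    """Return the verdict backed by a strict majority of judges, else None."""
    if not votes:
        return None
    label, count = Counter(votes.values()).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical verdicts from the three judges for one pairwise comparison.
votes = {
    "DeepSeek-R1": "A",
    "DeepSeek-V3-0324": "B",
    "Qwen3-235B-A22B": "A",
}
ground_truth = majority_label(votes)
```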
The performance metric is computed from the agreement count, the number of comparisons, the win counts, and the derived rankings.
6. Empirical Results and Robustness
CompassJudger-2-7B achieves a JudgerBenchV2 score of 60.52 (vs. Qwen2.5-7B’s 57.14; DeepSeek-V3-0324’s 64.43), with an average judge/reward score of 72.11 (outperforming Qwen2.5-7B’s 57.27). The 32B variant reaches 62.21 on JudgerBenchV2 and an average score of 73.32, exceeding Qwen3-235B’s 71.91. On secondary benchmarks such as JudgeBench, RMB, RewardBench, and challenging tasks including MMLU Pro, GPQA Diamond, and ArenaHard, CompassJudger-2 consistently outperforms both parameter-matched and much larger baselines.
Critique-help experiments on AlignBench and AlpacaEval show that CompassJudger-2-generated critiques improve downstream policy model performance by ≈3 percentage points, while weaker judges occasionally degrade policy outcomes.
Loss ablation results reveal that replacing margin loss with DPO or temperature objectives diminishes gains (+0.2–0.7 pp); margin policy gradient remains critical for maximal improvements. Data source ablation identifies RFT (rejection-filtered reward data) as essential, with its omission causing a significant decline in both judge consistency and general ability—e.g., GPQA and ArenaHard—while removing G-SFT mainly affects general instruction tasks but not pure judgment performance. Under style-prompts, CompassJudger-2 exhibits strong robustness (≤2 pp variation), contrasting with prominent drops (up to 10 pp) for other judge models such as RISE-32B.
7. Significance, Limitations, and Future Directions
The explicit coupling of verifiable reward signals to structured chain-of-thought rationale emerges as a key factor in CompassJudger-2’s improved accuracy and generalization. The integration of a margin-based policy gradient framework advances supervision methodologies beyond conventional RLHF-style objectives. However, drawbacks include the added sampling cost of rejection sampling and potential hallucinations in synthetic rationale data. Further progress is anticipated by extending this foundation to multimodal and interactive judgment scenarios, such as dialogue roll-outs and vision+language tasks, with the aim of a universal LLM “judge” capable of supporting broad ecosystem evaluation requirements (Zhang et al., 12 Jul 2025).