Foundational Automatic Reasoning Evaluators (FARE)

Updated 22 October 2025
  • FARE are multi-task generative models that evaluate reasoning outputs across domains using techniques like pairwise comparisons and step-level error detection.
  • They are trained with an iterative rejection-sampling supervised finetuning (RS-SFT) protocol, which combines rejection sampling with a dynamic batch-level curriculum to scale training to millions of annotated samples.
  • FARE models outperform larger, specialized evaluators on benchmarks in math, code, and dialogue, demonstrating their practical impact in real-world applications.

Foundational Automatic Reasoning Evaluators (FARE) are a class of large-scale, multi-task generative models designed to provide robust, scalable, and high-fidelity evaluation of outputs in reasoning-centric domains. Rather than relying solely on ad hoc evaluators, small-scale classifiers, or string-matching heuristics, FARE leverage data-driven finetuning on multi-domain, multi-format annotated datasets to produce general-purpose evaluators capable of handling pairwise comparisons, step-level error detection, reference-based and reference-free verification, and single-score rating for tasks spanning mathematical reasoning, code, tool-use, dialogue, and more. This paradigm grows out of the increasing demand for scalable evaluation during both model pretraining/finetuning and deployment, and is distinguished by its commitment to data scaling, broad generalization, and integration with real-world application loops.

1. Data Scaling, Curation, and Evaluation Formats

The central driver behind FARE is the systematic scaling of high-quality annotated data for evaluator pretraining and finetuning (Xu et al., 20 Oct 2025). The reference FARE pipeline curates a training corpus of 2.5 million samples, encompassing five distinct evaluation tasks:

  • Pairwise comparisons: Models are trained to pick the superior output among alternatives, critical for subjective and competitive domains.
  • Step-level evaluation: Fine-grained segmentation and assessment of logically ordered steps in outputs, enabling error localization within complex reasoning chains.
  • Reference-based verification: Automatic assessment of candidate outputs given a gold standard, facilitating training when labeled solutions exist.
  • Reference-free verification: Quality judgements in the absence of references, enabling use on open questions and generative tasks.
  • Single rating: Assigning numerical (discrete) marks to candidate responses.

Dataset curation combines aggregation of established human- and LLM-annotated benchmarks with programmatic error injection (e.g., malforming tool calls) and a "generate-then-grade" strategy, in which diverse generations for a ground-truth-verifiable prompt are graded and sorted by correctness. This breadth ensures coverage of multiple reasoning modalities and supports generalization across domains.
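As an illustration of the "generate-then-grade" strategy, the sketch below constructs pairwise evaluator samples from diverse generations for a prompt with a verifiable gold answer. The helper names (`verify_answer`, `generate_then_grade_pairs`) and the sample schema are hypothetical placeholders for illustration, not the released FARE tooling.

```python
import itertools
import random

def verify_answer(generation: str, gold: str) -> bool:
    # Hypothetical checker: compare the generation's final token to the gold answer.
    tokens = generation.strip().split()
    return bool(tokens) and tokens[-1] == gold.strip()

def generate_then_grade_pairs(prompt: str, gold: str, generations: list[str]) -> list[dict]:
    """Sort diverse generations by correctness and emit pairwise evaluator samples."""
    correct = [g for g in generations if verify_answer(g, gold)]
    incorrect = [g for g in generations if not verify_answer(g, gold)]
    samples = []
    for chosen, rejected in itertools.product(correct, incorrect):
        # Randomize A/B order so the evaluator cannot exploit position bias.
        if random.random() < 0.5:
            a, b, label = chosen, rejected, "A"
        else:
            a, b, label = rejected, chosen, "B"
        samples.append({
            "task": "pairwise",
            "prompt": prompt,
            "response_a": a,
            "response_b": b,
            "judgment": label,  # which response is the correct one
        })
    return samples
```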

| Task Type | Annotation Method | Supported Domains |
| --- | --- | --- |
| Pairwise | Human/LLM comparison | Math, code, dialogue, safety |
| Step-level | Segmented correctness | Multi-step math/code |
| Reference-based | Exact/verifiable | Math, tool use, code |
| Reference-free | Judgment only | Chat, open generation |
| Single rating | Numeric (discrete) | All |

This systematic scaling represents a departure from prior evaluator pipelines, which often focused on single-task string-matching heuristics or small-scale domain data.

2. Model Architecture and Training Paradigms

FARE models are constructed atop modern pretrained LLM bases—reference implementations include 8B and 20B parameter versions, leveraging Qwen3-8B or GPT-OSS-20B architectures (Xu et al., 20 Oct 2025). The iterative rejection-sampling supervised finetuning (RS-SFT) protocol is central:

  • For each input, K = 4 candidate judgments are sampled from the current model, and only those matching the ground-truth judgment are retained for parameter updates.
  • Training proceeds in "disjoint rollout batches" with a batch-level curriculum based on the empirical success rate, ordering examples by output diversity and correctness frequency.
  • "Direct judgment" data is optionally mixed, with the LLM being prompted to produce only a judgment rather than a critique.

The objective is to maximize $\log \pi_\theta(y \mid x)$ over accepted $(x, y)$ pairs. This training protocol enables the stable ingestion of millions of data points, encompassing a rich distribution of error signatures and response structures across reasoning domains.
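A minimal sketch of one RS-SFT iteration is given below. The callables `sample`, `judgment_of`, and `sft_update` are hypothetical stand-ins for the policy's sampler, the judgment parser, and a standard SFT gradient step; this illustrates the protocol as described above, not the released training code.

```python
from typing import Callable, Iterable

def rs_sft_step(
    examples: Iterable[dict],                              # each: {"x": input, "gold": ground-truth judgment}
    sample: Callable[[str, int, float], list[str]],        # hypothetical: draw K judgments from the current model
    judgment_of: Callable[[str], str],                     # hypothetical: parse the final judgment from a sample
    sft_update: Callable[[list[tuple[str, str]]], float],  # hypothetical: SFT step maximizing log pi(y | x)
    k: int = 4,
    temperature: float = 1.0,
) -> float:
    """One RS-SFT iteration: K-way sampling, rejection against ground truth, SFT on the survivors."""
    accepted: list[tuple[str, str]] = []
    for ex in examples:
        candidates = sample(ex["x"], k, temperature)
        # Keep only samples whose parsed judgment matches the ground-truth label.
        accepted.extend((ex["x"], y) for y in candidates if judgment_of(y) == ex["gold"])
    # Standard supervised finetuning loss over the accepted (x, y) pairs.
    return sft_update(accepted) if accepted else 0.0
```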

3. Benchmark Performance and Comparative Analysis

FARE models are evaluated across a battery of open and closed benchmarks:

  • JudgeBench, ReasoningJudgeBench, RM-Bench, ProcessBench, When2Call: Covering pairwise and step-level capability in math, code, RL, and tool-use.
  • VerifyBench-Hard, CodingJudgeBench: Focused on evaluating verification and generation quality, especially under hard negatives.

FARE-8B matches or outperforms many contemporaneous RL-finetuned evaluators of substantially larger size. FARE-20B surpasses specialized 70B+ parameter models on core benchmarks, establishing a new standard for open-source evaluators (Xu et al., 20 Oct 2025). In inference-time reranking applications on MATH, FARE-20B achieves near-oracle selection accuracy, able to reliably discriminate subtle differences in solution correctness across diverse LLM generations.

Empirical results show that FARE are effective not only at outcome-level ranking but also at stepwise error detection, with superior performance on ProcessBench and fine-grained verification tasks.

4. Integration into Real-World Model Training and Verification Tasks

FARE evaluators are applied as inference-time rerankers—e.g., selecting the highest-quality solution among multiple candidate generations. On challenging mathematical reasoning benchmarks, FARE-20B matches or approaches oracle selection accuracy, resulting in substantial improvements to system-wide performance even with limited candidate pools.
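A sketch of this best-of-N reranking loop is shown below; `score_solution` is a hypothetical wrapper around a FARE-style reference-free verification or single-rating query, not an actual API.

```python
def rerank_best_of_n(problem: str, candidates: list[str], score_solution) -> str:
    """Return the candidate the evaluator scores highest (ties broken by original order).

    `score_solution(problem, candidate) -> float` is a hypothetical scoring call.
    """
    scores = [score_solution(problem, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: (scores[i], -i))
    return candidates[best_idx]
```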

Beyond static evaluation, FARE is deployed as a verifier within RL fine-tuning regimes. In Group Relative Policy Optimization (GRPO), substituting FARE for string-matching verifiers improves downstream model performance by up to 14.1% (Xu et al., 20 Oct 2025), highlighting the importance of high-fidelity, learned evaluation signals in training robust reward models and mitigating reward hacking.
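The sketch below illustrates, under simplifying assumptions, where such a substitution would sit in a GRPO-style update: the exact-match check is replaced by a learned verifier score before computing group-relative advantages. `fare_verify` is a hypothetical callable returning a correctness score in [0, 1].

```python
import statistics

def group_relative_advantages(prompt: str, completions: list[str], gold: str, fare_verify) -> list[float]:
    """Score each completion with a learned verifier instead of exact-match, then
    normalize rewards within the group as GRPO does.

    `fare_verify(prompt, completion, gold) -> float` is a hypothetical wrapper around a
    FARE reference-based verification query.
    """
    rewards = [fare_verify(prompt, c, gold) for c in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance in uniform groups
    return [(r - mean) / std for r in rewards]
```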

The continual finetuning capability is also emphasized: e.g., FARE-Code is produced by further training FARE-20B on a 15k-sample code-pairwise set, achieving a 65% improvement over gpt-oss-20B in code test-case evaluation on CodingJudgeBench.

5. Methodological Distinctions and Training Innovations

The RS-SFT protocol represents a significant methodological advance, uniting exploration (through temperature-based K-way sampling) with exploitation (by accepting only samples whose judgments align with the ground truth). Rejection sampling at scale ensures that the training signal remains focused on correct judgments, even when the underlying distributions are noisy.

The per-batch curriculum, keyed to empirical success rate, provides dynamic difficulty adjustment, facilitating stable and efficient scaling over millions of diverse samples. Mixing direct-judgment and critique data enhances the evaluator's flexibility in both reference-based and free-form evaluation paradigms.
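One plausible form of such a curriculum is sketched below: disjoint rollout batches are ordered by the empirical success rate observed during rejection sampling. The easy-to-hard direction and the `n_accepted`/`k` bookkeeping fields are assumptions made for illustration, not details confirmed by the source.

```python
def order_rollout_batches(batches: list[list[dict]]) -> list[list[dict]]:
    """Order disjoint rollout batches by empirical success rate (fraction of the K
    sampled judgments accepted during rejection sampling), from easier to harder.

    Each example dict is assumed to carry `n_accepted` and `k` fields recorded when
    its candidates were graded against the ground truth.
    """
    def success_rate(batch: list[dict]) -> float:
        accepted = sum(ex["n_accepted"] for ex in batch)
        total = sum(ex["k"] for ex in batch)
        return accepted / total if total else 0.0

    # Highest success rate first: easier batches precede harder, low-acceptance ones.
    return sorted(batches, key=success_rate, reverse=True)
```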

Training is fully data-driven; no hand-crafted metrics or human-coded heuristics are introduced, which supports both harder negative mining and broader generalization over new domains and task formats.

6. Significance and Future Directions

FARE exemplifies a shift in evaluation research, prioritizing scalable, high-capacity multi-task evaluators built through direct data scaling, rather than manual rule engineering or exclusive RL fine-tuning. The RS-SFT approach, with its combination of rejection sampling, dynamic curriculum, and continual finetuning, produces evaluators that are competitive with, or outperform, much larger specialized architectures.

These results underline the importance of:

  • Diverse coverage in task curation, combining human and synthetic data.
  • Robust, multi-format loss aggregation supporting evidence-based rating.
  • Generalization across reasoning domains, including chat, code, math, tool-use, and open-ended generation.

A plausible implication is that future FARE systems may incorporate further hierarchical curricula, larger base models, and more granular evaluation axes (e.g., explainability, fairness, multi-agent reasoning). The strong empirical findings support the adoption of FARE models as both deployment-time evaluators and as integral components in RLHF and self-improving training flows.

7. Summary Table

| FARE Variant | Size | Key Benchmarks Outperformed | Application Domains |
| --- | --- | --- | --- |
| FARE-8B | 8B | RM-Bench, pairwise RL evaluators | Code, math, chat, RL |
| FARE-20B | 20B | 70B+ RL-finetuned evaluators | Code, math, open dialogue |
| FARE-Code | 20B+ | gpt-oss-20B (CodingJudgeBench) | Test-case/code quality |

These models, their data curation, and their methodological choices mark a substantial advance in the landscape of foundational automatic reasoning evaluators suitable for open-domain, reasoning-heavy evaluation regimes.
