FLAMe-24B: Foundational Autorater Model

Updated 22 April 2026

The paper introduces FLAMe-24B, a decoder-only Transformer fine-tuned from PaLM-2-24B to automatically evaluate LLM outputs with high accuracy.
It employs a two-stage supervised fine-tuning pipeline and a novel tail-patch strategy to achieve near-equivalent RewardBench accuracy with 25× fewer datapoints.
FLAMe-24B demonstrates lower evaluation bias and outperforms leading models such as GPT-4 on multiple held-out benchmarks.

FLAMe-24B is a large-scale foundational autorater model adapted from the PaLM-2-24B Transformer and specialized for robust, generalizable automatic evaluation of LLM outputs through fine-tuning on a large collection of permissively licensed human judgment datasets. The FLAMe-24B family is designed for LLM-as-a-Judge (LaaJ) applications: it outperforms proprietary systems on multiple benchmarks, shows substantially lower evaluation bias, and gains training efficiency from a novel fine-tuning strategy (Vu et al., 2024).

1. Model Architecture

FLAMe-24B is implemented as a decoder-only Transformer, derived by fine-tuning the PaLM-2-24B base (Anil et al., 2023), with the following core architectural traits:

Layers: 64 Transformer blocks
Hidden dimension (“model width”): 6144
Feed-forward (MLP) intermediate size: 16,384
Attention heads per layer: 48
Rotary position embeddings: used in attention mechanisms
Pre-layer normalization: LayerNorm precedes both self-attention and MLP sub-layers
Parameter count: approximately $24 \times 10^9$
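As a rough sanity check on the listed dimensions, the sketch below estimates the per-head width and the weight count of the Transformer blocks alone. It assumes a plain (non-gated) two-matrix MLP and ignores biases, normalization parameters, and embeddings, none of which are specified above; the function name is illustrative.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Approximate weights in one Transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 2 * d_model * d_ff            # up- and down-projection (assumes no gating)
    return attention + mlp

# Dimensions taken from the list above.
N_LAYERS, D_MODEL, D_FF, N_HEADS = 64, 6144, 16_384, 48

head_dim = D_MODEL // N_HEADS
total_block_params = N_LAYERS * block_params(D_MODEL, D_FF)

print(f"head dimension: {head_dim}")
print(f"block parameters: {total_block_params / 1e9:.1f}B")
```

Under these assumptions the blocks account for roughly 22.5 billion weights, with each head 128 dimensions wide. Embeddings, whose size depends on a vocabulary not given here, would contribute to the remainder of the approximately $24 \times 10^9$ total.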

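With pre-layer normalization, a single Transformer block applies, in sequence:

$h = x + \mathrm{Attn}(\mathrm{LayerNorm}(x)), \qquad y = h + \mathrm{MLP}(\mathrm{LayerNorm}(h))$

Layer normalization, which centers and scales each activation vector, is: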

$\mathrm{LayerNorm}(x) = \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \varepsilon}} \odot \gamma + \beta$

where $\mu(x)$ and $\sigma^2(x)$ are the mean and variance of $x$ computed over the feature dimension, $\gamma$ and $\beta$ are learned scale and shift parameters, and $\varepsilon = 10^{-5}$ is added for numerical stability.
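The normalization formula and the pre-normalization residual pattern can be sketched in plain Python as follows. This is a minimal illustration on a single activation vector, not the model's implementation; `pre_ln_sublayer` and the stand-in sub-layer are hypothetical names introduced here.

```python
import math
from typing import Callable, List

Vector = List[float]

def layer_norm(x: Vector, gamma: Vector, beta: Vector, eps: float = 1e-5) -> Vector:
    """Normalize x over its feature dimension, then scale by gamma and shift by beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance, as in the formula
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

def pre_ln_sublayer(x: Vector, sublayer: Callable[[Vector], Vector],
                    gamma: Vector, beta: Vector) -> Vector:
    """Pre-LN residual step: x + sublayer(LayerNorm(x))."""
    update = sublayer(layer_norm(x, gamma, beta))
    return [a + b for a, b in zip(x, update)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
normed = layer_norm(x, ones, zeros)
# Stand-in sub-layer (hypothetical): halves its input instead of running attention or an MLP.
out = pre_ln_sublayer(x, lambda h: [0.5 * v for v in h], ones, zeros)
```

With $\gamma = 1$ and $\beta = 0$, `normed` has zero mean and variance very close to one; the residual connection then adds the sub-layer output back onto the unnormalized input.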

2. Pretraining and Fine-tuning Regimes

FLAMe-24B is produced with a two-stage supervised fine-tuning pipeline, starting from a full PaLM-2-24B initialization:

Stage 1: General Autorater Fine-tuning (FLAMe)

Data: 102 quality assessment tasks, 5.3M human judgments, spanning pairwise, pointwise, classification, and open-ended quality explanations.
Mixture weights: “examples-proportional,” truncated to $2^{16}$ samples/task.
Optimization: Adam, learning rate $1 \times 10^{-4}$ , dropout 0.05.
Schedule: 30,000 steps, batch size 32 on 256 Cloud TPU v5 chips.
Loss: Token-level cross-entropy for text-to-text tasks:

$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^T \log P_\theta(y_t \mid x, y_{<t})$
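Two pieces of the Stage 1 recipe lend themselves to a short sketch: examples-proportional mixing with a per-task cap, and the token-level cross-entropy. The code below assumes the usual reading of capped examples-proportional mixing (sampling rate proportional to $\min(n_i, K)$ with $K = 2^{16}$); the task names and sizes are hypothetical.

```python
import math
from typing import Dict, List

CAP = 2 ** 16  # per-task truncation limit stated above (65,536 examples)

def mixing_rates(task_sizes: Dict[str, int], cap: int = CAP) -> Dict[str, float]:
    """Sample each task in proportion to min(size, cap)."""
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: count / total for name, count in capped.items()}

def token_cross_entropy(target_probs: List[float]) -> float:
    """Sum of -log P(y_t | x, y_<t) over the target tokens."""
    return -sum(math.log(p) for p in target_probs)

# Hypothetical task sizes, chosen only to show the effect of the cap.
rates = mixing_rates({"small_task": 16_384, "large_task": 1_000_000})
# Probabilities the model assigned to each correct target token (illustrative).
loss = token_cross_entropy([0.5, 0.25])
```

Without the cap, the large task would receive about 98% of the samples; with it, the split is 80/20, which keeps very large datasets from dominating the mixture.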

Stage 2: Reward-Model-Specific Fine-tuning (FLAMe-RM)

Initialization: FLAMe Stage 1 checkpoint
Data: Four pairwise preference datasets (HelpSteer, PRM800K, CommitPack, HH-RLHF Harmlessness), mixed equally.
Schedule: 50 fine-tuning steps, batch size 8 on 128 Cloud TPU v5 chips.
Loss: Pairwise logistic loss for human-labeled preferences, which in its standard form is:

$\mathcal{L}_{\mathrm{pair}} = -\log \sigma\big(s_\theta(x, y^{+}) - s_\theta(x, y^{-})\big)$

where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function, $s_\theta(x, y)$ is the model's score for response $y$ to prompt $x$, and $y^{+}$ and $y^{-}$ are the human-preferred and rejected responses.
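A minimal sketch of a pairwise logistic loss in its standard form, $-\log \sigma(\text{margin})$, is shown below. It assumes the model yields a scalar score per response; the function name and the example scores are illustrative, not taken from the paper.

```python
import math

def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """-log(sigmoid(margin)), computed stably as softplus(-margin)."""
    margin = score_preferred - score_rejected
    # softplus(-m) = max(-m, 0) + log1p(exp(-|m|)) avoids overflow for large |m|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

tied = pairwise_logistic_loss(0.0, 0.0)       # equals log(2): no preference expressed
correct = pairwise_logistic_loss(2.0, -1.0)   # preferred response scored higher
wrong = pairwise_logistic_loss(-1.0, 2.0)     # preferred response scored lower
```

The loss shrinks toward zero as the preferred response's score pulls ahead and grows roughly linearly in the margin when the ranking is wrong.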

3. Tail-Patch Fine-tuning Strategy

FLAMe-24B incorporates a “tail-patch” strategy to efficiently reweight mixture components for target evaluation distributions (e.g., RewardBench):

Ablation: For each of 102 training tasks, fine-tune individually for 3000 steps and measure per-category RewardBench effect.
Effect rating: Tasks are labeled as Helpful (+2), Somewhat helpful (+1), No effect (0), or Harmful (−1), and grouped into: one "Generally helpful" bundle, five "Category-specific" bundles, and an "Others" bundle.
Weighting: Assign a fixed weight, expressed in thousands of examples, to each bundle type; the top two tasks in each of the three weakest categories receive a separate fixed weight.
Aggregate: For each task, sum the weights of all bundles that contain it.
Fine-tune: Train on this reweighted mixture for only 5,000 steps.

This yields the FLAMe-Opt-RM variant, which requires $25\times$ fewer datapoints than standard FLAMe training to reach near-equivalent RewardBench accuracy (87.0% vs. 86.0%).
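The aggregation step of the tail-patch procedure can be sketched as follows. The bundle weights, bundle membership, and task names are all hypothetical placeholders, and the five category-specific bundles are collapsed into one for brevity.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical bundle weights; placeholders for illustration only.
BUNDLE_WEIGHTS: Dict[str, int] = {
    "generally_helpful": 10_000,
    "category_specific": 5_000,
    "others": 1_000,
}

def aggregate_task_weights(bundles: Dict[str, List[str]],
                           bundle_weights: Dict[str, int]) -> Dict[str, int]:
    """Sum, for every task, the weights of all bundles that contain it."""
    totals: Dict[str, int] = defaultdict(int)
    for bundle, tasks in bundles.items():
        for task in tasks:
            totals[task] += bundle_weights[bundle]
    return dict(totals)

# Hypothetical bundle membership; a task may appear in more than one bundle.
bundles = {
    "generally_helpful": ["task_a", "task_b"],
    "category_specific": ["task_b", "task_c"],
    "others": ["task_d"],
}
weights = aggregate_task_weights(bundles, BUNDLE_WEIGHTS)
total = sum(weights.values())
mixture = {task: w / total for task, w in weights.items()}  # normalized sampling rates
```

Here `task_b` belongs to two bundles and so accumulates both weights, which is how a task that helps in general and in a specific category ends up sampled more often.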

4. RewardBench and Generalization Performance

FLAMe-RM-24B demonstrates leading performance on the RewardBench suite of 23 pairwise tasks, across Chat, Chat Hard, Reasoning, and Safety categories. Key statistics:

Model	Overall accuracy (%)
FLAMe-RM-24B	87.8
GPT-4-0125	85.9
GPT-4o	84.7

For $n$ test cases, the approximate $(1-\alpha)$ confidence interval for an observed accuracy $\hat{p}$ is

$\hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

A two-proportion $z$-test is reported to confirm that the 1.9 pp difference over GPT-4-0125 is highly significant.
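The interval and the test statistic can be computed with the standard normal-approximation formulas, sketched below. The inputs are illustrative only; the number of test cases $n$ is an assumption the reader must supply, and the function names are introduced here.

```python
import math
from typing import Tuple

def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z statistic for H0: p1 == p2, using the pooled proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / std_err

def two_sided_p_value(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Illustrative inputs only (not the benchmark's figures).
low, high = wald_interval(0.5, 100)
z_stat = two_proportion_z(0.6, 100, 0.5, 100)
p_value = two_sided_p_value(z_stat)
```

Both quantities depend strongly on $n$: the interval narrows and the $z$ statistic grows in proportion to $\sqrt{n}$. The two-proportion test also treats the two samples as independent, which is an approximation when both models are scored on the same test cases.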

5. Autorater Bias Evaluation (CoBBLEr)

FLAMe-24B and its RM/Opt variants are evaluated on the CoBBLEr benchmark, which assesses six bias categories. The table below reports absolute bias scores (lower is better) for the five categories listed, together with the reported average:

Autorater	Avg.	Order	Compassion	Length	Egocentric	Bandwagon
GPT-4	0.31	0.23	0.79	0.06	0.78	0.00
FLAMe-24B	0.13	0.08	0.09	0.03	0.38	0.18
FLAMe-RM-24B	0.13	0.11	0.08	0.02	0.40	0.17
FLAMe-Opt-RM-24B	0.15	0.15	0.14	0.00	0.41	0.17

The average bias score of FLAMe-24B (0.13) is less than half that of GPT-4 (0.31), indicating greater resistance to biases associated with ordering, input length, egocentrism, and model name cues. The exception is bandwagon bias, where GPT-4 scores lower (0.00 vs. 0.18).
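The comparison can be checked directly against the table. The short sketch below copies the GPT-4 and FLAMe-24B rows and computes the ratio of average scores and the categories in which each model scores lower; the variable names are illustrative.

```python
# Scores copied from the GPT-4 and FLAMe-24B rows of the table above.
CATEGORIES = ["Order", "Compassion", "Length", "Egocentric", "Bandwagon"]
SCORES = {
    "GPT-4": {"Avg.": 0.31, "Order": 0.23, "Compassion": 0.79,
              "Length": 0.06, "Egocentric": 0.78, "Bandwagon": 0.00},
    "FLAMe-24B": {"Avg.": 0.13, "Order": 0.08, "Compassion": 0.09,
                  "Length": 0.03, "Egocentric": 0.38, "Bandwagon": 0.18},
}

flame, gpt4 = SCORES["FLAMe-24B"], SCORES["GPT-4"]

# Ratio of average bias scores (lower is better, so below 1 favors FLAMe-24B).
ratio = flame["Avg."] / gpt4["Avg."]

# Categories in which each autorater has the lower (better) score.
flame_lower = [c for c in CATEGORIES if flame[c] < gpt4[c]]
gpt4_lower = [c for c in CATEGORIES if gpt4[c] < flame[c]]

print(f"average-score ratio: {ratio:.2f}")
print("FLAMe-24B lower on:", flame_lower)
print("GPT-4 lower on:", gpt4_lower)
```

The ratio comes out below one half, and FLAMe-24B has the lower score in four of the five listed categories, with bandwagon bias as the one case where GPT-4 scores lower.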

6. Comparative Results on Held-Out Autorater Benchmarks

FLAMe-24B achieves consistently strong results across eleven held-out LLM evaluation benchmarks, often outperforming leading closed models (e.g., GPT-4-0125) despite training exclusively on permissively licensed public data.

In eight of eleven held-out tasks, FLAMe-24B exceeds GPT-4-0125, demonstrating robust cross-domain generalization for LLM evaluation and model comparison.

References

Vu et al. (2024). Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation.

All data, terminology, and results are reported verbatim following (Vu et al., 2024).
