
FLAMe-24B: Foundational Autorater Model

Updated 22 April 2026
  • The paper introduces FLAMe-24B, a decoder-only Transformer fine-tuned from PaLM-2-24B to automatically evaluate LLM outputs with high accuracy.
  • It employs a two-stage supervised fine-tuning pipeline and a novel tail-patch strategy to achieve near-equivalent RewardBench accuracy with 25× fewer datapoints.
  • FLAMe-24B demonstrates lower evaluation bias and outperforms leading models such as GPT-4 on multiple held-out benchmarks.

FLAMe-24B is a large-scale foundational autorater model adapted from the PaLM-2-24B Transformer and specialized for robust, generalizable automatic evaluation of LLM outputs through fine-tuning on a large collection of permissively licensed human-judgment datasets. The FLAMe-24B family targets LLM-as-a-Judge (LaaJ) applications: it outperforms proprietary systems on multiple benchmarks, shows lower measured evaluation bias, and reaches comparable accuracy with far fewer training datapoints through a tail-patch fine-tuning strategy (Vu et al., 2024).

1. Model Architecture

FLAMe-24B is implemented as a decoder-only Transformer, derived by fine-tuning the PaLM-2-24B base (Anil et al., 2023), with the following core architectural traits:

  • Layers: 64 Transformer blocks
  • Hidden dimension (“model width”): 6144
  • Feed-forward (MLP) intermediate size: 16,384
  • Attention heads per layer: 48
  • Rotary position embeddings: used in attention mechanisms
  • Pre-layer normalization: LayerNorm precedes both self-attention and MLP sub-layers
  • Parameter count: approximately $24 \times 10^9$
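
As a rough consistency check on the dimensions listed above, the following sketch counts only the weight matrices inside the Transformer blocks. It assumes bias-free linear layers, four $d \times d$ attention projections, and a two-matrix MLP; none of these details are stated in the source, and embeddings and normalization parameters are left out.

```python
# Back-of-the-envelope parameter count from the listed dimensions.
# Assumptions (not from the source): no bias terms, four d x d attention
# projections (Q, K, V, output), a two-matrix MLP, and no embedding or
# normalization parameters.

def block_params(d_model: int, d_ff: int) -> int:
    """Weights in one Transformer block under the assumptions above."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 2 * d_model * d_ff            # up-projection and down-projection
    return attention + mlp


def stack_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Weights in the full stack of identical blocks."""
    return n_layers * block_params(d_model, d_ff)


N_LAYERS, D_MODEL, D_FF, N_HEADS = 64, 6144, 16_384, 48

head_dim = D_MODEL // N_HEADS
total = stack_params(N_LAYERS, D_MODEL, D_FF)

print(f"per-head dimension: {head_dim}")
print(f"block weights only: {total / 1e9:.2f}B")
```

Under these assumptions the blocks account for roughly 22.5 billion weights, which is in line with the stated total of about 24 billion once embeddings and other uncounted parameters are included.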

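The canonical single Transformer block sequence, in pre-layer-normalization form, is:

$$h = x + \mathrm{Attn}(\mathrm{LayerNorm}(x)), \qquad y = h + \mathrm{MLP}(\mathrm{LayerNorm}(h))$$

Layer normalization (centering and scaling over each token's features) is: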

$$\mathrm{LayerNorm}(x) = \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \varepsilon}} \odot \gamma + \beta$$

where $\mu(x)$ and $\sigma^2(x)$ are the mean and variance of $x$ taken over the feature dimension, $\gamma$ and $\beta$ are learned per-feature scale and shift parameters, and $\varepsilon = 10^{-5}$ is added for numerical stability.
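
The normalization formula and the pre-layer-normalization ordering listed among the architectural traits can be illustrated with a minimal pure-Python sketch. It operates on a single token's feature vector and treats attention and the MLP as opaque callables, so it shows only where normalization and residual additions sit, not a real attention mechanism. All function names here are illustrative.

```python
import math
from typing import Callable, List

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, gamma: Vector, beta: Vector, eps: float = 1e-5) -> Vector:
    """Center and scale one feature vector, then apply per-feature gamma and beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def pre_ln_block(x: Vector, attn: SubLayer, mlp: SubLayer,
                 norm1: SubLayer, norm2: SubLayer) -> Vector:
    """Pre-LN residual block: normalize before each sub-layer, add the result back."""
    h = [a + b for a, b in zip(x, attn(norm1(x)))]
    return [a + b for a, b in zip(h, mlp(norm2(h)))]


x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
normed = layer_norm(x, ones, zeros)


def norm(v: Vector) -> Vector:
    return layer_norm(v, ones, zeros)


def double(v: Vector) -> Vector:
    # Toy stand-in for attention or the MLP (hypothetical, for illustration only).
    return [2.0 * t for t in v]


y = pre_ln_block(x, attn=double, mlp=double, norm1=norm, norm2=norm)
print(normed)
print(y)
```

With unit $\gamma$ and zero $\beta$, the normalized vector has mean close to zero and variance close to one; the small deviation from one comes from $\varepsilon$.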

2. Pretraining and Fine-tuning Regimes

FLAMe-24B is produced with a two-stage supervised finetuning pipeline following a full PaLM-2-24B initialization:

Stage 1: General Autorater Fine-tuning (FLAMe)

  • Data: 102 quality assessment tasks, 5.3M human judgments, spanning pairwise, pointwise, classification, and open-ended quality explanations.
  • Mixture weights: “examples-proportional,” truncated to $2^{16}$ samples/task.
  • Optimization: Adam, learning rate $1 \times 10^{-4}$, dropout 0.05.
  • Schedule: 30,000 steps, batch size 32 on 256 Cloud TPU v5 chips.
  • Loss: Token-level cross-entropy for text-to-text tasks:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid x, y_{<t})$$
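
Two pieces of Stage 1 lend themselves to a short sketch: examples-proportional mixing with a per-task cap of $2^{16}$, and the token-level cross-entropy. The task names and sizes below are hypothetical, and the probabilities passed to the loss stand in for whatever the model assigns to each target token.

```python
import math
from typing import Dict, List


def mixing_rates(task_sizes: Dict[str, int], cap: int = 2 ** 16) -> Dict[str, float]:
    """Examples-proportional sampling rates, with each task size truncated at `cap`."""
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}


def token_cross_entropy(target_probs: List[float]) -> float:
    """Sum over target tokens of -log P(y_t | x, y_<t)."""
    return -sum(math.log(p) for p in target_probs)


# Hypothetical task sizes: one very large task, one at the cap, one small.
sizes = {"task_a": 1_000_000, "task_b": 65_536, "task_c": 16_384}
rates = mixing_rates(sizes)

# Hypothetical per-token probabilities of a three-token target sequence.
loss = token_cross_entropy([0.5, 0.25, 0.5])

print(rates)
print(loss)
```

The cap keeps the largest task from dominating: `task_a` is sampled at the same rate as `task_b` even though it is far larger.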

Stage 2: Reward-Model–Specific Finetuning (FLAMe-RM)

  • Initialization: FLAMe Stage 1 checkpoint
  • Data: Four pairwise preference datasets (HelpSteer, PRM800K, CommitPack, HH-RLHF Harmlessness), mixed equally.
  • Schedule: 50 finetuning steps, batch size 8 on 128 Cloud TPU v5 chips.
  • Loss: Pairwise logistic loss for human-labeled preferences:

$$\mathcal{L}_{\mathrm{pair}} = -\log \sigma\big(s_\theta(x, y^{+}) - s_\theta(x, y^{-})\big)$$

where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function, and $s_\theta(x, y^{+})$ and $s_\theta(x, y^{-})$ denote the model's scores for the human-preferred and the rejected response.
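
A pairwise logistic loss is conventionally the negative log-sigmoid of the score margin between the preferred and the rejected response. The sketch below assumes the model exposes one scalar score per response, which the source does not specify; the function names are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)) for large positive or negative z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_logistic_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(score_chosen - score_rejected); small when the margin is large."""
    return -log_sigmoid(score_chosen - score_rejected)


# Hypothetical scores: tied, correctly ordered, and incorrectly ordered pairs.
print(pairwise_logistic_loss(0.0, 0.0))
print(pairwise_logistic_loss(2.0, 0.0))
print(pairwise_logistic_loss(0.0, 2.0))
```

A tied pair costs $\log 2$, and the loss shrinks toward zero as the preferred response's score pulls ahead.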

3. Tail-Patch Fine-tuning Strategy

FLAMe-24B incorporates a “tail-patch” strategy to efficiently reweight mixture components for target evaluation distributions (e.g., RewardBench):

  • Ablation: For each of 102 training tasks, fine-tune individually for 3000 steps and measure per-category RewardBench effect.
  • Effect rating: Tasks are labeled as Helpful (+2), Somewhat helpful (+1), No effect (0), or Harmful (−1), and grouped into: one "Generally helpful" bundle, five "Category-specific" bundles, and an "Others" bundle.
  • Weighting: Assign fixed per-bundle weights (…K, …K, …K); the top two tasks in the three weakest categories get …K.
  • Aggregate: Sum bundle weights/task across bundles.
  • Finetune: On this reweighted mixture for only 5000 steps.

This yields the FLAMe-Opt-RM variant, which requires $25\times$ fewer datapoints than standard FLAMe training to reach near-equivalent RewardBench accuracy (87.0% vs. 86.0%).
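
The weighting and aggregation steps can be sketched as a sum of bundle weights per task followed by normalization. The bundle names, task names, and weight values below are hypothetical placeholders, not the paper's figures, and modeling the extra weight for the weakest categories as one more bundle is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_task_weights(bundles: Dict[str, List[str]],
                           bundle_weights: Dict[str, float]) -> Dict[str, float]:
    """Sum, for every task, the weights of all bundles it belongs to."""
    weights: Dict[str, float] = defaultdict(float)
    for bundle, tasks in bundles.items():
        for task in tasks:
            weights[task] += bundle_weights[bundle]
    return dict(weights)


def to_mixture(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize aggregated weights into sampling probabilities."""
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


# Hypothetical bundles: a task may appear in more than one.
bundles = {
    "generally_helpful": ["t1", "t2"],
    "category_chat": ["t2", "t3"],
    "others": ["t4"],
    "weak_category_boost": ["t3"],   # assumption: the boost is just another bundle
}
# Hypothetical weights, chosen only to make the arithmetic easy to follow.
bundle_weights = {
    "generally_helpful": 4.0,
    "category_chat": 2.0,
    "others": 1.0,
    "weak_category_boost": 3.0,
}

task_weights = aggregate_task_weights(bundles, bundle_weights)
mixture = to_mixture(task_weights)
print(task_weights)
print(mixture)
```

A task that sits in several bundles, such as `t2` here, accumulates weight from each of them, which is what lets the reweighted mixture emphasize tasks that help across categories.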

4. RewardBench and Generalization Performance

FLAMe-RM-24B demonstrates leading performance on the RewardBench suite of 23 pairwise tasks, across Chat, Chat Hard, Reasoning, and Safety categories. Key statistics:

Model           Overall accuracy (%)
FLAMe-RM-24B    87.8
GPT-4-0125      85.9
GPT-4o          84.7

For $N$ test cases, the approximate (normal-approximation) confidence interval for an observed accuracy $\hat{p}$ at confidence level $1 - \alpha$ is

$$\hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{N}}$$

with $z_{1-\alpha/2} \approx 1.96$ at the 95% level.

A two-proportion $z$-test is reported to confirm that the 1.9 pp difference over GPT-4-0125 is highly significant ($z = …$, $p$ …).
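
The interval and the test can be reproduced with the standard library alone. In the sketch below the test-set size is a hypothetical placeholder (the value used in the source is not recoverable here), and the two-proportion test treats the two accuracies as independent samples, which ignores the fact that both models are scored on the same prompts.

```python
import math
from typing import Tuple


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    return z, 2.0 * (1.0 - normal_cdf(abs(z)))


N_HYPOTHETICAL = 3000   # placeholder test-set size, NOT taken from the source

low, high = wald_interval(0.878, N_HYPOTHETICAL)
z_stat, p_value = two_proportion_z(0.878, N_HYPOTHETICAL, 0.859, N_HYPOTHETICAL)

print(f"interval: [{low:.3f}, {high:.3f}]")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```

Both the interval width and the test statistic scale with $\sqrt{N}$, so the actual test-set size should be substituted before reading anything into the output.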

5. Autorater Bias Evaluation (CoBBLEr)

FLAMe-24B and its RM/Opt variants are evaluated on the CoBBLEr benchmark, which assesses six bias categories. Below are the absolute bias scores (lower is better):

Autorater         Avg.  Order  Compassion  Length  Egocentric  Bandwagon  Attention
GPT-4             0.31  0.23   0.79        0.06    0.78        0.00       0.00
FLAMe-24B         0.13  0.08   0.09        0.03    0.38        0.18       0.00
FLAMe-RM-24B      0.13  0.11   0.08        0.02    0.40        0.17       0.00
FLAMe-Opt-RM-24B  0.15  0.15   0.14        0.00    0.41        0.17       0.00

The average bias score of FLAMe-24B (0.13) is less than half that of GPT-4 (0.31), driven by lower scores on the Order, Compassion, Length, and Egocentric categories. The advantage is not uniform: GPT-4 scores lower on Bandwagon bias (0.00 vs. 0.18), and both models score 0.00 on Attention.
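
The Avg. column can be checked against the six per-category scores, assuming it is their unweighted mean (the source does not state how it is computed). The sketch below recomputes the means from the table values and lists the categories in which FLAMe-24B scores lower or higher than GPT-4.

```python
from typing import Dict, List, Tuple

CATEGORIES = ["Order", "Compassion", "Length", "Egocentric", "Bandwagon", "Attention"]

# (reported average, per-category scores), copied from the table above.
SCORES: Dict[str, Tuple[float, List[float]]] = {
    "GPT-4":            (0.31, [0.23, 0.79, 0.06, 0.78, 0.00, 0.00]),
    "FLAMe-24B":        (0.13, [0.08, 0.09, 0.03, 0.38, 0.18, 0.00]),
    "FLAMe-RM-24B":     (0.13, [0.11, 0.08, 0.02, 0.40, 0.17, 0.00]),
    "FLAMe-Opt-RM-24B": (0.15, [0.15, 0.14, 0.00, 0.41, 0.17, 0.00]),
}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


recomputed = {name: mean(per_cat) for name, (_, per_cat) in SCORES.items()}

gpt4 = SCORES["GPT-4"][1]
flame = SCORES["FLAMe-24B"][1]
flame_lower = [c for c, f, g in zip(CATEGORIES, flame, gpt4) if f < g]
flame_higher = [c for c, f, g in zip(CATEGORIES, flame, gpt4) if f > g]

for name, (reported, _) in SCORES.items():
    print(f"{name}: reported {reported:.2f}, recomputed {recomputed[name]:.4f}")
print("FLAMe-24B lower than GPT-4 on:", flame_lower)
print("FLAMe-24B higher than GPT-4 on:", flame_higher)
```

Each recomputed mean agrees with the reported average to within rounding at two decimal places.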

6. Comparative Results on Held-Out Autorater Benchmarks

FLAMe-24B achieves consistently strong results across eleven held-out LLM evaluation benchmarks, often outperforming leading closed models (e.g., GPT-4-0125) despite training exclusively on permissively licensed public data.

In eight of eleven held-out tasks, FLAMe-24B exceeds GPT-4-0125, demonstrating robust cross-domain generalization for LLM evaluation and model comparison.

References

  • Foundational Autoraters: Taming LLMs for Better Automatic Evaluation, (Vu et al., 2024)

All data, terminology, and results are reported verbatim following (Vu et al., 2024).
