FLAMe-24B: Foundational Autorater Model
- The paper introduces FLAMe-24B, a decoder-only Transformer fine-tuned from PaLM-2-24B to automatically evaluate LLM outputs with high accuracy.
- It employs a two-stage supervised fine-tuning pipeline and a novel tail-patch strategy to achieve near-equivalent RewardBench accuracy with 25× fewer datapoints.
- FLAMe-24B demonstrates lower evaluation bias and outperforms leading models such as GPT-4 on multiple held-out benchmarks.
FLAMe-24B is a large-scale foundational autorater model adapted from the PaLM-2-24B Transformer. It is specialized for robust, generalizable automatic evaluation of LLM outputs through fine-tuning on extensive, permissively licensed human judgment datasets. The FLAMe-24B family targets LLM-as-a-Judge (LaaJ) applications: it outperforms proprietary systems on multiple benchmarks, shows markedly lower evaluation bias, and reaches comparable accuracy with far less training data when trained with a tail-patch fine-tuning strategy (Vu et al., 2024).
1. Model Architecture
FLAMe-24B is implemented as a decoder-only Transformer, derived by fine-tuning the PaLM-2-24B base (Anil et al., 2023), with the following core architectural traits:
- Layers: 64 Transformer blocks
- Hidden dimension (“model width”): 6144
- Feed-forward (MLP) intermediate size: 16,384
- Attention heads per layer: 48
- Rotary position embeddings: used in attention mechanisms
- Pre-layer normalization: LayerNorm precedes both self-attention and MLP sub-layers
- Parameter count: approximately 24 billion
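As a rough sanity check on the dimensions listed above, the weight-matrix count can be estimated directly. This is a back-of-the-envelope sketch, not the published parameter accounting: it assumes a non-gated two-matrix MLP, attention projections of size `d_model × d_model`, tied input/output embeddings, and a hypothetical `vocab_size` of 256,000 that does not come from the text.

```python
def transformer_block_params(d_model: int, d_ff: int) -> int:
    """Weight-matrix parameters in one block (biases and norm gains ignored)."""
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 2 * d_model * d_ff           # up- and down-projection (non-gated MLP assumed)
    return attention + mlp


def estimate_params(n_layers: int, d_model: int, d_ff: int, vocab_size: int) -> int:
    """All blocks plus one token-embedding table (tied embeddings assumed)."""
    return n_layers * transformer_block_params(d_model, d_ff) + vocab_size * d_model


# Layer count and widths come from the list above; vocab_size is a hypothetical placeholder.
blocks_only = 64 * transformer_block_params(6144, 16_384)
total = estimate_params(64, 6144, 16_384, vocab_size=256_000)
print(f"blocks only: {blocks_only / 1e9:.1f}B, with embeddings: {total / 1e9:.1f}B")
# prints: blocks only: 22.5B, with embeddings: 24.1B
```

Under these assumptions the estimate lands close to the 24B in the model name; a gated MLP or a different vocabulary size would shift it.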
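The canonical single Transformer block sequence, in pre-LN form, is:

$$h' = h + \mathrm{SelfAttn}(\mathrm{LN}(h)), \qquad h_{\text{out}} = h' + \mathrm{MLP}(\mathrm{LN}(h'))$$

Layer normalization (with per-feature centering and scaling) is:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$, $\sigma^2$: mean/variance of $x$ over the feature dimension, $\gamma$, $\beta$: learned per-feature parameters, and $\epsilon$: a small constant for numerical stability.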
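The normalization and residual ordering described above can be illustrated on a single token vector. This is a minimal sketch in plain Python: `attn` and `mlp` are stand-in callables (real self-attention mixes information across positions), the default `eps` is an illustrative value, and none of the names come from the paper.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
NormParams = Tuple[Vector, Vector]  # (gamma, beta)


def layer_norm(x: Vector, gamma: Vector, beta: Vector, eps: float = 1e-6) -> Vector:
    """Center and scale one token vector over its features, then apply gamma and beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def pre_ln_block(h: Vector,
                 attn: Callable[[Vector], Vector],
                 mlp: Callable[[Vector], Vector],
                 norm1: NormParams,
                 norm2: NormParams) -> Vector:
    """Pre-LN ordering: normalize, apply the sub-layer, add the result to the residual."""
    a = attn(layer_norm(h, *norm1))
    h = [hi + ai for hi, ai in zip(h, a)]
    m = mlp(layer_norm(h, *norm2))
    return [hi + mi for hi, mi in zip(h, m)]


# With gamma = 1 and beta = 0 the output has (near) zero mean and unit variance.
x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 3) for v in y])
```

Because normalization sits inside each residual branch, the residual stream itself is never rescaled, which is the usual motivation for the pre-LN layout.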
2. Pretraining and Fine-tuning Regimes
FLAMe-24B is produced by a two-stage supervised fine-tuning pipeline, starting from the full PaLM-2-24B checkpoint:
Stage 1: General Autorater Fine-tuning (FLAMe)
- Data: 102 quality assessment tasks, 5.3M human judgments, spanning pairwise, pointwise, classification, and open-ended quality explanations.
- Mixture weights: “examples-proportional,” with each task's contribution capped at a fixed maximum number of samples.
- Optimization: Adam, dropout 0.05 (learning rate as reported in Vu et al., 2024).
- Schedule: 30,000 steps, batch size 32 on 256 Cloud TPU v5 chips.
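- Loss: Token-level cross-entropy for text-to-text tasks: $\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t})$, where $x$ is the input prompt and $y_1, \dots, y_T$ are the target tokens.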
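Examples-proportional mixing with a per-task cap, as named in the list above, can be sketched as follows. The task names, dataset sizes, and the `cap` value are hypothetical placeholders chosen for illustration; only the rule itself (sample in proportion to dataset size, truncated at a maximum) comes from the text.

```python
from typing import Dict


def mixture_rates(task_sizes: Dict[str, int], cap: int) -> Dict[str, float]:
    """Sampling probability per task, proportional to min(dataset size, cap)."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}


# Hypothetical task names and sizes, for illustration only.
sizes = {"pairwise_a": 1_000_000, "pointwise_b": 20_000, "classification_c": 5_000}
rates = mixture_rates(sizes, cap=50_000)
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Without the cap, the largest task in this toy example would receive about 98% of the samples; with it, the share drops to two thirds, which is the oversampling protection the truncation is meant to provide.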
Stage 2: Reward-Model–Specific Finetuning (FLAMe-RM)
- Initialization: FLAMe Stage 1 checkpoint
- Data: Four pairwise preference datasets (HelpSteer, PRM800K, CommitPack, HH-RLHF Harmlessness), mixed equally.
- Schedule: 50 finetuning steps, batch size 8 on 128 Cloud TPU v5 chips.
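- Loss: Pairwise logistic loss for human-labeled preferences:

$$\mathcal{L}_{\text{pair}} = -\log \sigma\big(s_\theta(x, y^{+}) - s_\theta(x, y^{-})\big)$$

where $\sigma(z) = 1/(1 + e^{-z})$, $y^{+}$ is the human-preferred response, $y^{-}$ is the rejected response, and $s_\theta$ is the model's score for a response to prompt $x$.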
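The two losses named in this section can be written compactly. The sketch below uses the standard textbook forms; how scalar scores are extracted from a text-to-text model is not specified here, so `score_preferred` and `score_rejected` are simply assumed inputs, and the function names are illustrative.

```python
import math
from typing import Sequence


def token_cross_entropy(target_log_probs: Sequence[float]) -> float:
    """Negative sum of the log-probabilities the model assigns to the reference tokens."""
    return -sum(target_log_probs)


def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """-log(sigmoid(margin)), computed stably as softplus(-margin)."""
    margin = score_preferred - score_rejected
    # softplus(z) = max(z, 0) + log1p(exp(-|z|)), evaluated at z = -margin
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Two target tokens predicted with probabilities 0.5 and 0.25.
ce = token_cross_entropy([math.log(0.5), math.log(0.25)])

# The loss shrinks as the preferred response is scored further above the rejected one.
for margin in (-2.0, 0.0, 2.0):
    print(margin, round(pairwise_logistic_loss(margin, 0.0), 4))
```

At zero margin the pairwise loss equals $\log 2$, the value for a model that cannot tell the two responses apart.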
3. Tail-Patch Fine-tuning Strategy
FLAMe-24B incorporates a “tail-patch” strategy to efficiently reweight mixture components for target evaluation distributions (e.g., RewardBench):
- Ablation: For each of 102 training tasks, fine-tune individually for 3000 steps and measure per-category RewardBench effect.
- Effect rating: Tasks are labeled as Helpful (+2), Somewhat helpful (+1), No effect (0), or Harmful (−1), and grouped into: one "Generally helpful" bundle, five "Category-specific" bundles, and an "Others" bundle.
- Weighting: Assign fixed per-bundle weights (2K, 3K, 4K); the top two tasks in three weakest categories get 5K.
- Aggregate: For each task, sum the weights of every bundle that contains it.
- Finetune: On this reweighted mixture for only 5000 steps.
This yields the FLAMe-Opt-RM variant, which requires approximately 25× fewer datapoints than standard FLAMe training to reach near-equivalent RewardBench accuracy (87.0% vs. 86.0%).
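The weighting and aggregation steps above amount to summing bundle weights per task and normalizing. The sketch below is illustrative only: the text does not say which bundle receives which of the 2K/3K/4K weights, so the mapping, the task names, and the treatment of the 5K boost as one more bundle are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_task_weights(bundles: Dict[str, List[str]],
                           bundle_weight: Dict[str, int]) -> Dict[str, int]:
    """Sum, for each task, the weight of every bundle that contains it."""
    weights: Dict[str, int] = defaultdict(int)
    for bundle, tasks in bundles.items():
        for task in tasks:
            weights[task] += bundle_weight[bundle]
    return dict(weights)


# Hypothetical bundle membership and weight assignment.
bundles = {
    "generally_helpful": ["task_a", "task_b"],
    "category_chat_hard": ["task_b", "task_c"],
    "others": ["task_d"],
    "weak_category_boost": ["task_c"],  # top tasks in the weakest categories
}
bundle_weight = {
    "generally_helpful": 4000,
    "category_chat_hard": 3000,
    "others": 2000,
    "weak_category_boost": 5000,
}

weights = aggregate_task_weights(bundles, bundle_weight)
total = sum(weights.values())
sampling_rates = {task: w / total for task, w in weights.items()}
print(weights)
# prints: {'task_a': 4000, 'task_b': 7000, 'task_c': 8000, 'task_d': 2000}
```

A task that appears in several bundles accumulates weight from each, so tasks that help broadly and also target a weak category end up sampled most often.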
4. RewardBench and Generalization Performance
FLAMe-RM-24B demonstrates leading performance on the RewardBench suite of 23 pairwise tasks, across Chat, Chat Hard, Reasoning, and Safety categories. Key statistics:
| Model | Overall accuracy (%) |
|---|---|
| FLAMe-RM-24B | 87.8 |
| GPT-4-0125 | 85.9 |
| GPT-4o | 84.7 |
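For $n$ test cases, the approximate 95% confidence interval for an observed accuracy $\hat{p}$ is

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

A two-proportion $z$-test is used to assess the 1.9pp difference over GPT-4-0125, which is reported as highly significant.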
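The interval and test named above can be computed with the standard library alone. This is a sketch under the usual normal approximation; the number of test cases is not stated in this text, so `n_hypothetical` below is a placeholder and the printed values should not be read as the paper's figures.

```python
import math
from typing import Tuple


def normal_ci(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z statistic and its two-sided p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


n_hypothetical = 3000  # placeholder: the true test-set size is not given here
low, high = normal_ci(0.878, n_hypothetical)
z_stat, p_val = two_proportion_z(0.878, n_hypothetical, 0.859, n_hypothetical)
print(f"CI: [{low:.3f}, {high:.3f}]  z = {z_stat:.2f}  p = {p_val:.3f}")
```

The outcome depends strongly on $n$. Since both autoraters are scored on the same test items, a paired test such as McNemar's would be a natural alternative; the unpaired form here simply mirrors the test named above.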
5. Autorater Bias Evaluation (CoBBLEr)
FLAMe-24B and its RM/Opt variants are evaluated on the CoBBLEr benchmark, which assesses six bias categories. Below are the absolute bias scores (lower is better):
| Autorater | Avg. | Order | Compassion | Length | Egocentric | Bandwagon | Attention |
|---|---|---|---|---|---|---|---|
| GPT-4 | 0.31 | 0.23 | 0.79 | 0.06 | 0.78 | 0.00 | 0.00 |
| FLAMe-24B | 0.13 | 0.08 | 0.09 | 0.03 | 0.38 | 0.18 | 0.00 |
| FLAMe-RM-24B | 0.13 | 0.11 | 0.08 | 0.02 | 0.40 | 0.17 | 0.00 |
| FLAMe-Opt-RM-24B | 0.15 | 0.15 | 0.14 | 0.00 | 0.41 | 0.17 | 0.00 |
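The overall bias score of FLAMe-24B (0.13) is less than half that of GPT-4 (0.31). The largest reductions are in compassion-fade bias (model name cues; 0.09 vs. 0.79) and egocentric bias (0.38 vs. 0.78), with smaller gains on order and length bias. The exception is bandwagon bias, where GPT-4 scores 0.00 against 0.17 to 0.18 for the FLAMe variants.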
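The averages in the table can be cross-checked against the per-category scores. The sketch below assumes "Avg." is the unweighted mean of the six categories, which is consistent with the tabulated values but is not stated explicitly in the text.

```python
CATEGORIES = ("order", "compassion", "length", "egocentric", "bandwagon", "attention")

# Per-category scores copied from the table above.
SCORES = {
    "GPT-4":            (0.23, 0.79, 0.06, 0.78, 0.00, 0.00),
    "FLAMe-24B":        (0.08, 0.09, 0.03, 0.38, 0.18, 0.00),
    "FLAMe-RM-24B":     (0.11, 0.08, 0.02, 0.40, 0.17, 0.00),
    "FLAMe-Opt-RM-24B": (0.15, 0.14, 0.00, 0.41, 0.17, 0.00),
}


def average_bias(scores):
    """Unweighted mean over the six bias categories."""
    return sum(scores) / len(scores)


averages = {name: average_bias(s) for name, s in SCORES.items()}

# Categories where FLAMe-24B scores strictly lower (less biased) than GPT-4.
flame_lower = [c for c, f, g in zip(CATEGORIES, SCORES["FLAMe-24B"], SCORES["GPT-4"]) if f < g]

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
print("FLAMe-24B lower than GPT-4 on:", flame_lower)
```

The comparison makes the per-category picture explicit: FLAMe-24B is lower on four categories, tied on attention, and higher on bandwagon.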
6. Comparative Results on Held-Out Autorater Benchmarks
FLAMe-24B achieves consistently strong results across eleven held-out LLM evaluation benchmarks, often outperforming leading closed models (e.g., GPT-4-0125) despite training exclusively on permissively licensed public data.
In eight of eleven held-out tasks, FLAMe-24B exceeds GPT-4-0125, demonstrating robust cross-domain generalization for LLM evaluation and model comparison.
References
- Vu et al. (2024). Foundational Autoraters: Taming LLMs for Better Automatic Evaluation.
All data, terminology, and results are reported verbatim following (Vu et al., 2024).