FLAMe-Opt-RM-24B: Efficient Tail-Patch Fine-Tuning
- The paper presents a tail-patch fine-tuning strategy that uses approximately 25x fewer training datapoints while maintaining near-identical RewardBench performance (87.0% vs 86.0%).
- The model is a 24B-parameter, decoder-only Transformer with 64 layers and rotary position embeddings, finetuned to evaluate language model outputs.
- The model shows reduced autorater bias and outperforms GPT-4 on 8 out of 11 benchmarks, highlighting its generalization and adaptability.
FLAMe-24B refers to a large-scale, supervised-finetuned Transformer autorater model (“Foundational Large Autorater Model”) developed as part of the FLAMe family, targeting robust, generalizable automatic evaluation of LLM outputs. FLAMe-24B is constructed by further finetuning the PaLM-2-24B base model, itself a 24B-parameter decoder-only Transformer, on an extensive and highly diverse corpus of 5.3 million human judgments spanning 102 standardized quality assessment tasks. Its design emphasizes permissively licensed data sources and strong generalization to held-out tasks, and it is reported to achieve state-of-the-art results against both open and closed-source LLM-based “judge models” across multiple evaluation regimes (Vu et al., 2024).
1. Model Architecture
FLAMe-24B builds upon the PaLM-2-24B architecture, featuring a decoder-only Transformer stack with the following specifications:
- Layers (Transformer blocks): 64
- Hidden dimension: 6,144
- Feed-forward intermediate size: 16,384
- Attention heads per layer: 48
- Rotary position embeddings in attention
- Pre-layer-normalization: LayerNorm precedes both self-attention and MLP sub-layers
- Total parameters: ≈ 24 billion (24B)
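As a rough sanity check on the listed dimensions, the sketch below estimates the parameters held in the Transformer blocks alone. It rests on assumptions not stated in the text: the attention inner dimension equals the hidden dimension (48 heads of size 128), the MLP uses two weight matrices (no gating), and biases, layer-norm parameters, and embeddings are ignored. The function name `estimate_block_params` is illustrative.

```python
def estimate_block_params(d_model: int, d_ff: int, n_layers: int) -> int:
    """Approximate weight count of the Transformer blocks only.

    Assumptions (not from the source): attention uses four d_model x d_model
    projections (Q, K, V, output); the MLP uses one up- and one
    down-projection; biases, LayerNorm, and embeddings are excluded.
    """
    attention = 4 * d_model * d_model
    mlp = 2 * d_model * d_ff
    return n_layers * (attention + mlp)


# Dimensions listed in the specification above.
total = estimate_block_params(d_model=6144, d_ff=16384, n_layers=64)
print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
```

Under these assumptions the blocks account for roughly 22.5B parameters, which is consistent with a total near 24B once embeddings and any additional projections are included.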
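A typical block follows the pre-layer-normalization pattern:

$$h' = h + \mathrm{SelfAttention}(\mathrm{LayerNorm}(h)), \qquad h_{\text{out}} = h' + \mathrm{MLP}(\mathrm{LayerNorm}(h'))$$

Layer normalization is mathematically expressed as:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ denote the mean and variance computed over the feature dimension, $\gamma$ and $\beta$ are learned parameters, and $\epsilon$ ensures numerical stability.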
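The pre-layer-normalization block and the layer normalization step can be illustrated with a minimal pure-Python sketch on a single feature vector. The `attention` and `mlp` arguments are stand-in callables supplied by the caller (real self-attention mixes information across tokens, which is not modeled here), and the default `eps` is an assumed value.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def layer_norm(x: Vector, gamma: Vector, beta: Vector, eps: float = 1e-6) -> Vector:
    """Normalize x over its feature dimension, then scale and shift."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def pre_ln_block(
    h: Vector,
    attention: Callable[[Vector], Vector],
    mlp: Callable[[Vector], Vector],
    ln1: Tuple[Vector, Vector],
    ln2: Tuple[Vector, Vector],
) -> Vector:
    """Pre-LN residual block: LayerNorm is applied before each sub-layer."""
    a = attention(layer_norm(h, *ln1))
    h = [hi + ai for hi, ai in zip(h, a)]      # residual around attention
    m = mlp(layer_norm(h, *ln2))
    return [hi + mi for hi, mi in zip(h, m)]   # residual around MLP


# Usage with identity-like stand-ins for the sub-layers (illustrative only).
ones, zeros = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
normed = layer_norm([1.0, 2.0, 3.0], ones, zeros)
out = pre_ln_block([1.0, 2.0, 3.0], lambda v: v, lambda v: v, (ones, zeros), (ones, zeros))
print(normed, out)
```

Because normalization sits inside the residual branch, the residual stream `h` itself is never rescaled, which is the usual motivation for the pre-LN arrangement.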
2. Training Regimes and Supervised Finetuning
FLAMe-24B is trained via a two-stage, supervised, multitask finetuning pipeline:
Stage 1: General-purpose Autorater Finetuning (FLAMe)
- Initialization: PaLM-2-24B
- Data: 102 quality assessment tasks (5.3M human judgments); data covers pairwise comparisons, pointwise ratings, classification, and open-ended explanations
- Mixture weights: Proportional to the number of examples per task, capped at a maximum per task to prevent oversampling large datasets
- Training: 30,000 steps, batch size 32 (256 Cloud TPU v5 chips)
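- Optimizer: Adam, dropout 0.05
- Loss: Token-level cross-entropy in a text-to-text formulation: $\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t})$, where $x$ is the input text and $y_1, \dots, y_T$ are the target tokens.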
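The capped, examples-proportional mixture weights from the Stage 1 list can be sketched as follows. The task names, dataset sizes, and the cap value of 100,000 are hypothetical placeholders chosen for illustration; the actual cap is not given in this text.

```python
from typing import Dict


def capped_mixture_weights(task_sizes: Dict[str, int], cap: int) -> Dict[str, float]:
    """Sampling weights proportional to task size, with each size capped.

    The cap limits how much any single large dataset can dominate the mixture.
    """
    capped = {task: min(n, cap) for task, n in task_sizes.items()}
    total = sum(capped.values())
    return {task: n / total for task, n in capped.items()}


# Hypothetical task sizes and cap (assumptions, not values from the paper).
sizes = {"large_task": 1_000_000, "medium_task": 50_000, "small_task": 10_000}
weights = capped_mixture_weights(sizes, cap=100_000)
for task, w in weights.items():
    print(f"{task}: {w:.4f}")
```

Without the cap, `large_task` would receive about 94% of the sampling mass; with it, the share falls to 62.5%, leaving more room for the smaller tasks.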
Stage 2: Reward-Model–Specific Finetuning (FLAMe-RM)
- From FLAMe checkpoint: 4 pairwise-preference datasets (HelpSteer, PRM800K, CommitPack, HH-RLHF Harmlessness), equally mixed
- Training: 50 steps, batch size 8 (128 Cloud TPU v5 chips)
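- Loss: Pairwise logistic loss for each labeled comparison $(x, y^{+}, y^{-})$, with preferred response $y^{+}$ and rejected response $y^{-}$:

$$\mathcal{L}_{\text{pair}} = -\log \sigma\big(s_\theta(x, y^{+}) - s_\theta(x, y^{-})\big)$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic function and $s_\theta(x, y)$ is the model's score for response $y$ to prompt $x$.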
This two-phase regimen enables both generalization (broad task exposure) and specialization (reward-model tuning).
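A pairwise logistic loss of the kind named in Stage 2 can be sketched in a few lines. This is a generic, numerically stable formulation operating on two scalar scores; how FLAMe-RM derives those scores from the model is not specified here, so the function and argument names are assumptions.

```python
import math


def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    Uses log(1 + exp(-margin)) in a form that avoids overflow for large
    negative margins.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response scores higher, large otherwise.
print(pairwise_logistic_loss(2.0, 0.0))
print(pairwise_logistic_loss(0.0, 2.0))
```

When both scores are equal the loss is $\log 2$, the value for a model that cannot distinguish the two responses.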
3. Tail-Patch Fine-tuning Strategy
A novel “tail-patch” ablation strategy allows targeted reweighting of the training distribution to enhance domain-specific evaluation — exemplified by RewardBench tuning (“FLAMe-Opt-RM”):
- A partially trained checkpoint is selected.
- For each of the 102 tasks, the checkpoint is fine-tuned exclusively on that task (“patching”) for 3,000 steps, and the resulting change in RewardBench performance is measured.
- Tasks are rated as Helpful (+2), Somewhat helpful (+1), No effect (0), or Harmful (–1) for each RewardBench category.
- Tasks are bundled (one “generally helpful”, five “category-specific”, one “others”). Each of the three bundle types is assigned its own mixture weight, and the top 2 tasks in three underperforming categories are assigned a separate weight.
- The final re-weighted mixture is used for a short (5,000 steps) finetune.
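This mechanism requires approximately 25 times fewer training datapoints than the original 30,000-step run, using a 5,000-step finetune, while achieving near-identical RewardBench performance (87.0% vs. 86.0%). The step counts alone differ by a factor of 6, so the 25x figure refers to datapoints rather than steps.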
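The rating, bundling, and reweighting steps can be sketched as below. Only the rating scale (+2, +1, 0, -1) and the three bundle types come from the text; the bundling rules, the task and category names, and the numeric bundle weights are all assumptions made for illustration, and the extra weight for the top 2 tasks in underperforming categories is omitted.

```python
from collections import defaultdict
from typing import Dict, List

# Rating scale from the text.
HELPFUL, SOMEWHAT_HELPFUL, NO_EFFECT, HARMFUL = 2, 1, 0, -1


def bundle_tasks(ratings: Dict[str, Dict[str, int]]) -> Dict[str, List[str]]:
    """Group tasks into bundles from per-category tail-patch ratings.

    Assumed rules (not from the source):
      - "general": at least somewhat helpful in every category
      - "category:<c>": otherwise, rated Helpful (+2) in category c
      - "others": everything else
    """
    bundles: Dict[str, List[str]] = defaultdict(list)
    for task, per_category in ratings.items():
        if all(r >= SOMEWHAT_HELPFUL for r in per_category.values()):
            bundles["general"].append(task)
            continue
        helpful = [c for c, r in per_category.items() if r == HELPFUL]
        if helpful:
            for c in helpful:
                bundles[f"category:{c}"].append(task)
        else:
            bundles["others"].append(task)
    return dict(bundles)


def mixture_weights(bundles: Dict[str, List[str]], bundle_weight: Dict[str, float]) -> Dict[str, float]:
    """Give each task the largest weight among its bundles, then normalize."""
    raw: Dict[str, float] = {}
    for name, tasks in bundles.items():
        kind = name.split(":")[0]  # "general", "category", or "others"
        for task in tasks:
            raw[task] = max(raw.get(task, 0.0), bundle_weight[kind])
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}


# Hypothetical ratings for three made-up tasks.
ratings = {
    "task_a": {"Chat": 2, "Reasoning": 1, "Safety": 1},
    "task_b": {"Chat": 0, "Reasoning": 2, "Safety": -1},
    "task_c": {"Chat": 0, "Reasoning": 0, "Safety": 0},
}
bundles = bundle_tasks(ratings)
weights = mixture_weights(bundles, {"general": 4.0, "category": 2.0, "others": 1.0})
print(bundles)
print(weights)
```

The resulting weights would then define the sampling mixture for the short final finetune.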
4. RewardBench and Comparative Evaluation
Evaluation on RewardBench (23 pairwise tasks grouped into categories including Chat, Chat Hard, Reasoning, and Safety) demonstrates FLAMe-RM-24B’s competitive performance.
A binomial 95% confidence interval for an accuracy $\hat{p}$ measured on $n$ examples is:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}$$

The gap to GPT-4 (1.9 percentage points) is reported as highly significant.
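The binomial (normal-approximation) confidence interval can be computed with a short helper. The accuracy and sample size in the usage lines are hypothetical inputs chosen for illustration, since the number of evaluated examples is not given in this text.

```python
import math
from typing import Tuple


def binomial_ci(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion.

    accuracy: observed fraction correct, in [0, 1]
    n: number of evaluated examples
    z: critical value (1.96 for a 95% interval)
    """
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Hypothetical example: 87% accuracy on an assumed 1,000 examples.
low, high = binomial_ci(0.87, 1000)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

The interval narrows with $\sqrt{n}$, so whether a gap of a couple of percentage points is significant depends heavily on the size of the evaluation set.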
5. Autorater Bias: CoBBLEr Benchmark Results
Bias is assessed via the CoBBLEr benchmark, which quantifies six forms of judgment bias: Order, Compassion, Length, Egocentric, Bandwagon, and Attention. Lower average values reflect lower bias.
| Autorater | Avg. | Order | Compassion | Length | Egocentric | Bandwagon | Attention |
|---|---|---|---|---|---|---|---|
| GPT-4 | 0.31 | 0.23 | 0.79 | 0.06 | 0.78 | 0.00 | 0.00 |
| FLAMe-24B | 0.13 | 0.08 | 0.09 | 0.03 | 0.38 | 0.18 | 0.00 |
| FLAMe-RM-24B | 0.13 | 0.11 | 0.08 | 0.02 | 0.40 | 0.17 | 0.00 |
| FLAMe-Opt-RM-24B | 0.15 | 0.15 | 0.14 | 0.00 | 0.41 | 0.17 | 0.00 |
FLAMe-24B achieves less than half the average bias of GPT-4 (0.13 vs 0.31), indicating marked robustness against ordering, length, model name, and other confounds. The one exception is Bandwagon bias, where GPT-4 scores lower (0.00 vs 0.18).
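The averages in the table can be reproduced from the six per-bias columns. The sketch below copies the values from the table above and recomputes each row's mean; the dictionary layout and function name are illustrative.

```python
from typing import Dict, List, Tuple

# (reported average, [Order, Compassion, Length, Egocentric, Bandwagon, Attention])
cobbler: Dict[str, Tuple[float, List[float]]] = {
    "GPT-4": (0.31, [0.23, 0.79, 0.06, 0.78, 0.00, 0.00]),
    "FLAMe-24B": (0.13, [0.08, 0.09, 0.03, 0.38, 0.18, 0.00]),
    "FLAMe-RM-24B": (0.13, [0.11, 0.08, 0.02, 0.40, 0.17, 0.00]),
    "FLAMe-Opt-RM-24B": (0.15, [0.15, 0.14, 0.00, 0.41, 0.17, 0.00]),
}


def average_bias(scores: List[float]) -> float:
    """Unweighted mean over the six bias dimensions."""
    return sum(scores) / len(scores)


for name, (reported, scores) in cobbler.items():
    print(f"{name}: recomputed {average_bias(scores):.3f}, reported {reported:.2f}")

# Ratio of FLAMe-24B's reported average bias to GPT-4's.
ratio = cobbler["FLAMe-24B"][0] / cobbler["GPT-4"][0]
print(f"FLAMe-24B / GPT-4 average bias: {ratio:.2f}")
```

The recomputed means agree with the reported averages to within rounding, and the ratio of about 0.42 supports the "less than half" comparison.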
6. Aggregated Benchmark Performance Across Held-out Tasks
Comparative results were collected across eleven held-out evaluation suites. For each suite, accuracy is reported; in balanced pairwise settings this is a proxy for both precision and recall.
FLAMe-24B outperforms GPT-4-0125 in 8 out of 11 benchmarks. Its strong relative performance—despite relying solely on permissively licensed data for human evaluations—highlights its generalization capacity and adaptability for diverse evaluation settings.
Reference: "Foundational Autoraters: Taming LLMs for Better Automatic Evaluation" (Vu et al., 2024)