FLAMe-Opt-RM-24B: Efficient Tail-Patch Fine-Tuning

Updated 22 April 2026

The paper presents a novel tail-patch fine-tuning strategy that cuts the number of training datapoints by roughly 25x while maintaining near-identical RewardBench performance (87.0% vs 86.0%).
The paper leverages a 24B-parameter, decoder-only Transformer with 64 layers and rotary position embeddings to robustly evaluate language model outputs.
The paper demonstrates reduced autorater bias and outperforms competitors like GPT-4 in 8 out of 11 benchmarks, highlighting its generalization and adaptability.

FLAMe-24B is a large-scale, supervised-finetuned Transformer autorater (“Foundational Large Autorater Model”) from the FLAMe family, built for robust, generalizable automatic evaluation of LLM outputs. It is obtained by further finetuning the PaLM-2-24B base model, a 24B-parameter decoder-only Transformer, on a large and diverse corpus of 5.3 million human judgments spanning 102 standardized quality assessment tasks. Its design emphasizes permissively licensed data sources and strong generalization to held-out tasks, and it achieves state-of-the-art results against both open and closed-source LLM-based “judge models” across multiple evaluation regimes (Vu et al., 2024).

1. Model Architecture

FLAMe-24B builds upon the PaLM-2-24B architecture, featuring a decoder-only Transformer stack with the following specifications:

Layers (Transformer blocks): 64
Hidden dimension: 6,144
Feed-forward intermediate size: 16,384
Attention heads per layer: 48
Rotary position embeddings in attention
Pre-layer-normalization: LayerNorm precedes both self-attention and MLP sub-layers
Total parameters: ≃ $24 \times 10^9$ (24B)
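As a rough sanity check on the dimensions listed above, the sketch below counts the weights in the attention and feed-forward matrices. It assumes four $d \times d$ attention projections and a two-matrix (non-gated) MLP per block, and it ignores biases, normalization parameters, and embeddings. None of these details are stated above, so the result is only an order-of-magnitude check.

```python
def approx_block_params(hidden: int, ffn: int) -> int:
    """Weights in one Transformer block under the assumptions stated above."""
    attention = 4 * hidden * hidden   # Q, K, V and output projections (assumed)
    mlp = 2 * hidden * ffn            # up- and down-projection (assumed non-gated)
    return attention + mlp


# Values taken from the specification list.
LAYERS, HIDDEN, FFN, HEADS = 64, 6144, 16384, 48

per_block = approx_block_params(HIDDEN, FFN)
total = LAYERS * per_block
head_dim = HIDDEN // HEADS

print(f"head dimension: {head_dim}")
print(f"non-embedding parameters: {total / 1e9:.1f}B")
```

Under these assumptions the blocks account for about 22.5B weights, leaving the remainder of the quoted 24B to embeddings and other parameters not modeled here.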

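A typical block follows the pre-normalization residual form:

$h = x + \mathrm{Attn}(\mathrm{LayerNorm}(x)), \qquad y = h + \mathrm{MLP}(\mathrm{LayerNorm}(h))$

Layer normalization is mathematically expressed as: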

$\mathrm{LayerNorm}(x) = \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \varepsilon}} \odot \gamma + \beta$

where $\mu(x)$ and $\sigma^2(x)$ denote the mean and variance of $x$ taken over its feature dimension, $\gamma$ and $\beta$ are learned per-feature scale and shift parameters, and $\varepsilon = 10^{-5}$ ensures numerical stability.
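The layer normalization formula, and the pre-normalization placement described in the specification list, can be illustrated with a minimal pure-Python sketch. The toy sub-layer below is a hypothetical stand-in for self-attention or the MLP; only the normalization and the residual wiring are the point here.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, gamma: Vector, beta: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one token's feature vector, then scale and shift it."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def pre_ln_sublayer(x: Vector, sublayer: Callable[[Vector], Vector],
                    gamma: Vector, beta: Vector) -> Vector:
    """Residual connection around a sub-layer applied to the normalized input."""
    update = sublayer(layer_norm(x, gamma, beta))
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

normalized = layer_norm(x, ones, zeros)
# Hypothetical sub-layer: halves its input (stands in for attention or the MLP).
out = pre_ln_sublayer(x, lambda h: [0.5 * v for v in h], ones, zeros)
```

With $\gamma = 1$ and $\beta = 0$ the normalized vector has zero mean and close to unit variance, and the residual path carries the unnormalized input through unchanged.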

2. Training Regimes and Supervised Finetuning

FLAMe-24B is trained via a two-stage, supervised, multitask finetuning pipeline:

Stage 1: General-purpose Autorater Finetuning (FLAMe)

Initialization: PaLM-2-24B
Data: 102 quality assessment tasks (5.3M human judgments); data covers pairwise comparisons, pointwise ratings, classification, and open-ended explanations
Mixture weights: Proportional to number of examples per task, capped at $2^{16}$ per task to prevent oversampling large datasets
Training: 30,000 steps, batch size 32 (256 Cloud TPU v5 chips)
Optimizer: Adam (learning rate $1\times 10^{-4}$), dropout 0.05
Loss: Token-level cross-entropy in a text-to-text formulation:

$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^T \log P_\theta(y_t|x, y_{<t})$
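Two items from the Stage 1 list lend themselves to a short sketch: the capped proportional mixture and the token-level cross-entropy. The code below reads "capped at $2^{16}$" as taking the minimum of the task size and the cap before normalizing, which is an assumption about the exact rule. The task names, sizes, and token probabilities are hypothetical.

```python
import math
from typing import Dict, List

CAP = 2 ** 16  # per-task cap on the example count used for mixing


def mixture_rates(task_sizes: Dict[str, int], cap: int = CAP) -> Dict[str, float]:
    """Sampling rate per task, proportional to min(size, cap)."""
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: count / total for name, count in capped.items()}


def sequence_cross_entropy(token_probs: List[float]) -> float:
    """Sum of -log P(y_t | x, y_<t) over the target tokens."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical task sizes, for illustration only.
sizes = {"small_task": 1_000, "medium_task": 50_000, "huge_task": 2_000_000}
rates = mixture_rates(sizes)

# Hypothetical probabilities the model assigns to each gold target token.
loss = sequence_cross_entropy([0.9, 0.8, 0.95])
```

Without the cap, the largest task here would receive over 97% of the samples; with it, the share drops to roughly 56%, which is the oversampling protection the list describes.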

Stage 2: Reward-Model–Specific Finetuning (FLAMe-RM)

Initialization: FLAMe checkpoint
Data: 4 pairwise-preference datasets (HelpSteer, PRM800K, CommitPack, HH-RLHF Harmlessness), equally mixed
Training: 50 steps, batch size 8 (128 Cloud TPU v5 chips)
Loss: Pairwise logistic loss for each labeled comparison $(x, y^{+}, y^{-})$, where $y^{+}$ is the preferred response:

$\mathcal{L}_{\mathrm{pair}} = -\log \sigma\big(s_\theta(x, y^{+}) - s_\theta(x, y^{-})\big)$

where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function and $s_\theta$ is the model's score for a response.
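A pairwise logistic loss of this kind can be sketched in a few lines. The function below assumes each response has already been reduced to a scalar score; the argument names are illustrative and are not taken from the paper.

```python
import math


def pairwise_logistic_loss(score_preferred: float, score_rejected: float) -> float:
    """-log(sigmoid(margin)), computed in a numerically stable way."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch so exp never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


tie = pairwise_logistic_loss(0.0, 0.0)       # no preference expressed
correct = pairwise_logistic_loss(2.0, 0.0)   # preferred response scored higher
wrong = pairwise_logistic_loss(0.0, 2.0)     # preferred response scored lower
```

The loss equals $\log 2$ when the two scores tie, shrinks as the preferred response is scored further above the rejected one, and grows roughly linearly when the ordering is wrong.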

This two-phase regimen enables both generalization (broad task exposure) and specialization (reward-model tuning).

3. Tail-Patch Fine-tuning Strategy

A novel “tail-patch” ablation strategy allows targeted reweighting of the training distribution to enhance domain-specific evaluation, as exemplified by RewardBench tuning (“FLAMe-Opt-RM”):

A partially trained checkpoint is selected.
For each of the 102 tasks, the checkpoint is finetuned exclusively on that task (“patched”) for 3,000 steps, and the resulting change in RewardBench performance is measured.
Tasks are rated as Helpful (+2), Somewhat helpful (+1), No effect (0), or Harmful (–1) for each RewardBench category.
Tasks are bundled (one “generally helpful”, five “category-specific”, one “others”). Each bundle type is assigned its own fixed mixture weight, and the top 2 tasks in three underperforming categories receive a separate weight; see Vu et al. (2024) for the numeric values.
The final re-weighted mixture is used for a short (5,000 steps) finetune.
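The steps above can be sketched as a small bundling routine. The rating scale comes from the list, but the text does not say how ratings map to bundles or what the weights are, so the assignment rule and every numeric weight below are hypothetical placeholders rather than the paper's procedure.

```python
from typing import Dict, List

RATING_SCORE = {"helpful": 2, "somewhat_helpful": 1, "no_effect": 0, "harmful": -1}


def bundle_tasks(ratings: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Group tasks into bundles from their per-category tail-patch ratings.

    Hypothetical rule: a task is "generally helpful" if it helps every
    category; otherwise it joins the bundle of each category it helps;
    tasks that help no category go to "others".
    """
    bundles: Dict[str, List[str]] = {"generally_helpful": [], "others": []}
    for task, per_category in ratings.items():
        scores = {cat: RATING_SCORE[label] for cat, label in per_category.items()}
        helped = [cat for cat, score in scores.items() if score > 0]
        if len(helped) == len(scores):
            bundles["generally_helpful"].append(task)
        elif helped:
            for cat in helped:
                bundles.setdefault(f"category:{cat}", []).append(task)
        else:
            bundles["others"].append(task)
    return bundles


def task_weights(bundles: Dict[str, List[str]],
                 bundle_weight: Dict[str, float]) -> Dict[str, float]:
    """Give each task the largest weight among the bundles that contain it."""
    weights: Dict[str, float] = {}
    for bundle, tasks in bundles.items():
        key = "category" if bundle.startswith("category:") else bundle
        for task in tasks:
            weights[task] = max(weights.get(task, 0.0), bundle_weight[key])
    return weights


# Hypothetical ratings for three tasks over two RewardBench categories.
ratings = {
    "task_a": {"Chat": "helpful", "Safety": "somewhat_helpful"},
    "task_b": {"Chat": "no_effect", "Safety": "helpful"},
    "task_c": {"Chat": "harmful", "Safety": "no_effect"},
}
bundles = bundle_tasks(ratings)
# Placeholder weights; the real values are not given in this text.
weights = task_weights(bundles, {"generally_helpful": 3.0, "category": 2.0, "others": 1.0})
```

The resulting per-task weights would then define the re-weighted mixture used for the final short finetune.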

This mechanism requires approximately 25 times fewer training datapoints, using a 5,000-step finetune in place of the 30,000-step Stage 1 run, to achieve near-identical RewardBench performance (87.0% vs. 86.0%).
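The step counts alone differ by a factor of 6, so the 25x figure presumably refers to the number of examples processed (steps times batch size). The arithmetic below makes that explicit. The batch size of the tail-patch run is not stated in this text; the code only derives what it would have to be for the 25x claim to hold.

```python
full_steps, full_batch = 30_000, 32   # Stage 1 values quoted in Section 2
patch_steps = 5_000                   # length of the tail-patch finetune
claimed_reduction = 25                # "approximately 25 times fewer datapoints"

full_examples = full_steps * full_batch                      # examples seen in Stage 1
step_ratio = full_steps / patch_steps                        # reduction from steps alone
implied_patch_examples = full_examples / claimed_reduction   # examples the claim implies
implied_patch_batch = implied_patch_examples / patch_steps   # batch size the claim implies

print(f"Stage 1 examples: {full_examples:,}")
print(f"reduction from step count alone: {step_ratio:.0f}x")
print(f"implied tail-patch batch size: {implied_patch_batch:.2f}")
```

The implied batch size of about 7.7 is close to the batch size of 8 quoted for Stage 2, which would give 40,000 examples and a 24x reduction. This is an inference from the numbers in this text, not a stated fact.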

4. RewardBench and Comparative Evaluation

Evaluation on RewardBench (23 pairwise tasks: Chat, Chat Hard, Reasoning, Safety) demonstrates FLAMe-RM-24B’s competitive performance:

Model	Overall Accuracy
FLAMe-RM-24B	87.8%
GPT-4 (0125)	85.9%
GPT-4o	84.7%

A binomial 95% confidence interval for an accuracy $\hat{p}$ measured on $n$ examples is:

$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
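A normal-approximation (Wald) interval for a binomial accuracy can be computed as in the sketch below. The sample size used in the example is a hypothetical round number, not the actual RewardBench evaluation-set size, which is not given in this text.

```python
import math
from typing import Tuple


def wald_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a binomial proportion."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# 0.878 is the FLAMe-RM-24B accuracy from the table; n = 3000 is hypothetical.
low, high = wald_interval(0.878, 3000)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Because the half-width shrinks with $\sqrt{n}$, whether a gap of a couple of percentage points is distinguishable from noise depends strongly on the true number of evaluation examples.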

The gap to GPT-4 (1.9 percentage points) is reported as highly significant (test statistic and p-value not shown).

5. Autorater Bias: CoBBLEr Benchmark Results

Bias is assessed via the CoBBLEr benchmark, which quantifies six forms of judgment bias: Order, Compassion, Length, Egocentric, Bandwagon, and Attention. Lower values indicate less bias. The table below lists the average alongside five of the six categories; no Attention column is shown.

Autorater	Avg.	Order	Compassion	Length	Egocentric	Bandwagon
GPT-4	0.31	0.23	0.79	0.06	0.78	0.00
FLAMe-24B	0.13	0.08	0.09	0.03	0.38	0.18
FLAMe-RM-24B	0.13	0.11	0.08	0.02	0.40	0.17
FLAMe-Opt-RM-24B	0.15	0.15	0.14	0.00	0.41	0.17

FLAMe-24B achieves less than half the average bias of GPT-4 (0.13 vs 0.31), indicating marked robustness against ordering, length, model name, or other confounds.
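The comparison can be checked directly against the table. The sketch below copies the GPT-4 and FLAMe-24B rows as printed (five categories plus the reported average) and computes the ratio of averages and the categories in which FLAMe-24B shows lower bias.

```python
# Rows copied from the CoBBLEr table above.
bias = {
    "GPT-4": {"Order": 0.23, "Compassion": 0.79, "Length": 0.06,
              "Egocentric": 0.78, "Bandwagon": 0.00},
    "FLAMe-24B": {"Order": 0.08, "Compassion": 0.09, "Length": 0.03,
                  "Egocentric": 0.38, "Bandwagon": 0.18},
}
reported_avg = {"GPT-4": 0.31, "FLAMe-24B": 0.13}

# Categories where FLAMe-24B has strictly lower bias than GPT-4.
lower_in = [cat for cat in bias["GPT-4"]
            if bias["FLAMe-24B"][cat] < bias["GPT-4"][cat]]
avg_ratio = reported_avg["FLAMe-24B"] / reported_avg["GPT-4"]

print(f"FLAMe-24B lower in: {lower_in}")
print(f"ratio of reported averages: {avg_ratio:.2f}")
```

FLAMe-24B is lower in four of the five listed categories; Bandwagon is the exception, where GPT-4 scores 0.00 against 0.18.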

6. Aggregated Benchmark Performance Across Held-out Tasks

Comparative results across eleven held-out evaluation suites are summarized below. For each, accuracy is reported; in balanced pairwise settings this is a proxy for both precision and recall.

[Table: accuracy of each autorater on the eleven held-out benchmarks]

FLAMe-24B outperforms GPT-4-0125 in 8 out of 11 benchmarks. This strong relative performance, achieved while relying solely on permissively licensed data for human evaluations, highlights its generalization capacity and adaptability to diverse evaluation settings.

Reference: "Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation" (Vu et al., 2024)

References

Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation (2024)
