FLAMe-Opt-RM-24B: Efficient Tail-Patch Fine-Tuning

Updated 22 April 2026
  • The paper presents a tail-patch fine-tuning strategy that needs roughly 25x fewer training datapoints while keeping RewardBench performance nearly unchanged (87.0% vs. 86.0%).
  • The model is a 24B-parameter, decoder-only Transformer with 64 layers and rotary position embeddings, finetuned to evaluate language model outputs.
  • The model shows reduced autorater bias and outperforms GPT-4 on 8 of 11 held-out benchmarks, indicating good generalization and adaptability.

FLAMe-24B is a large-scale, supervised-finetuned Transformer autorater (“Foundational Large Autorater Model”) from the FLAMe family, built for robust, generalizable automatic evaluation of LLM outputs. It is obtained by further finetuning the PaLM-2-24B base model, itself a 24B-parameter decoder-only Transformer, on a diverse corpus of 5.3 million human judgments spanning 102 standardized quality assessment tasks. Its design emphasizes permissively licensed data sources and strong generalization to held-out tasks, and it is reported to achieve state-of-the-art results against both open and closed-source LLM-based “judge models” across multiple evaluation regimes (Vu et al., 2024).

1. Model Architecture

FLAMe-24B builds upon the PaLM-2-24B architecture, featuring a decoder-only Transformer stack with the following specifications:

  • Layers (Transformer blocks): 64
  • Hidden dimension: 6,144
  • Feed-forward intermediate size: 16,384
  • Attention heads per layer: 48
  • Rotary position embeddings in attention
  • Pre-layer-normalization: LayerNorm precedes both self-attention and MLP sub-layers
  • Total parameters: $24 \times 10^9$ (24B)
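
As a rough sanity check on the figures listed above, the sketch below estimates the parameter count of the Transformer blocks alone. It assumes bias-free projections, a two-matrix (non-gated) MLP, and a per-head dimension of 6,144 / 48 = 128; none of these details are stated in the text, and embeddings and normalization parameters are left out, so the result is only an approximation.

```python
# Rough parameter estimate for the decoder stack listed above.
# ASSUMPTIONS (not stated in the text): bias-free projections, a
# two-matrix (non-gated) MLP, embeddings and norm parameters excluded.

def block_params(d_model: int, d_ff: int) -> int:
    attn = 4 * d_model * d_model  # Q, K, V and output projections
    mlp = 2 * d_model * d_ff      # up-projection and down-projection
    return attn + mlp

def stack_params(n_layers: int, d_model: int, d_ff: int) -> int:
    return n_layers * block_params(d_model, d_ff)

total = stack_params(n_layers=64, d_model=6144, d_ff=16384)
print(f"{total / 1e9:.2f}B parameters in the Transformer blocks")
```

Under these assumptions the blocks account for about 22.5B parameters, which is in the same range as the stated 24B total once embeddings are added.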

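A typical block follows the pre-LN residual pattern, $h = x + \mathrm{Attn}(\mathrm{LayerNorm}(x))$ followed by $y = h + \mathrm{MLP}(\mathrm{LayerNorm}(h))$. Layer normalization is mathematically expressed as: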

$$\mathrm{LayerNorm}(x) = \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \varepsilon}} \odot \gamma + \beta$$

where $\mu(x)$ and $\sigma^2(x)$ denote the mean and variance of $x$ computed over the feature dimension, $\gamma$ and $\beta$ are learned scale and shift parameters, and $\varepsilon = 10^{-5}$ ensures numerical stability.

2. Training Regimes and Supervised Finetuning

FLAMe-24B is trained via a two-stage, supervised, multitask finetuning pipeline:

Stage 1: General-purpose Autorater Finetuning (FLAMe)

  • Initialization: PaLM-2-24B
  • Data: 102 quality assessment tasks (5.3M human judgments); data covers pairwise comparisons, pointwise ratings, classification, and open-ended explanations
  • Mixture weights: Proportional to number of examples per task, capped at $2^{16}$ per task to prevent oversampling large datasets
  • Training: 30,000 steps, batch size 32 (256 Cloud TPU v5 chips)
  • Optimizer: Adam (learning rate $1 \times 10^{-4}$), dropout 0.05
  • Loss: Token-level cross-entropy in a text-to-text formulation:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid x, y_{<t})$$
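
The token-level cross-entropy above can be computed directly from per-step logits. The sketch below is a minimal illustration; the function names and the three-token vocabulary in the example are hypothetical, and a real training setup would use a framework's batched loss instead.

```python
import math
from typing import List, Sequence

def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_cross_entropy(step_logits: Sequence[Sequence[float]],
                           target_ids: Sequence[int]) -> float:
    """Sum of -log P(y_t | x, y_<t) over the target tokens."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    return -sum(log_softmax(logits)[t]
                for logits, t in zip(step_logits, target_ids))

# Hypothetical 3-token vocabulary with uniform predictions at both steps.
loss = sequence_cross_entropy([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
```

With uniform predictions over three tokens, each step contributes $\log 3$, so the two-token target has loss $2 \log 3$.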

Stage 2: Reward-Model–Specific Finetuning (FLAMe-RM)

  • From FLAMe checkpoint: 4 pairwise-preference datasets (HelpSteer, PRM800K, CommitPack, HH-RLHF Harmlessness), equally mixed
  • Training: 50 steps, batch size 8 (128 Cloud TPU v5 chips)
  • Loss: Pairwise logistic loss, in its standard form, for each labeled comparison $(x, y^{+}, y^{-})$:

$$\mathcal{L}_{\mathrm{pair}} = -\log \sigma\big(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\big)$$

where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function, $y^{+}$ is the preferred response, and $r_\theta$ denotes the model's scalar score for a response.
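
A pairwise logistic loss in its standard form can be sketched as follows. This assumes each response receives a scalar score; how FLAMe-RM derives such scores is not described here, so the function and its arguments are illustrative only.

```python
import math

def pairwise_logistic_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores: the loss shrinks as the preferred response
# is scored further above the rejected one.
tied = pairwise_logistic_loss(0.0, 0.0)
correct = pairwise_logistic_loss(2.0, 0.0)
wrong = pairwise_logistic_loss(0.0, 2.0)
```

A tied pair gives a loss of $\log 2$; the loss falls toward zero when the preferred response is scored higher and grows roughly linearly when the ordering is wrong.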

This two-phase regimen enables both generalization (broad task exposure) and specialization (reward-model tuning).
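
The Stage 1 mixture rule (weights proportional to example counts, capped at $2^{16}$ per task) can be sketched as below. The task names and sizes are hypothetical and chosen only to show the effect of the cap.

```python
from typing import Dict

CAP = 2 ** 16  # per-task cap from the Stage 1 description

def mixture_weights(task_sizes: Dict[str, int], cap: int = CAP) -> Dict[str, float]:
    """Sampling weights proportional to min(task size, cap), normalized to 1."""
    capped = {name: min(n, cap) for name, n in task_sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}

# Hypothetical task names and sizes, for illustration only.
sizes = {"task_a": 1_000_000, "task_b": 65_536, "task_c": 16_384}
weights = mixture_weights(sizes)
```

In this example the million-example task is sampled no more often than the task sitting exactly at the cap, which is the oversampling protection the rule is meant to provide.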

3. Tail-Patch Fine-tuning Strategy

A “tail-patch” ablation strategy allows targeted reweighting of the training distribution to improve domain-specific evaluation, exemplified by RewardBench tuning (“FLAMe-Opt-RM”):

  • A partially trained checkpoint is selected.
  • For each of the 102 tasks, the checkpoint is fine-tuned exclusively on that task (“patched”) for 3,000 steps, and the resulting change in RewardBench performance is measured.
  • Tasks are rated as Helpful (+2), Somewhat helpful (+1), No effect (0), or Harmful (–1) for each RewardBench category.
  • Tasks are bundled (one “generally helpful”, five “category-specific”, one “others”). Each bundle type is assigned its own mixture weight, and the top 2 tasks in three underperforming categories receive a separate weight; the numeric weight values are missing from this text.
  • The final re-weighted mixture is used for a short (5,000 steps) finetune.
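
The bundling and reweighting steps above can be sketched as follows. The bundle-assignment rule, the task names, the ratings, and the bundle weights in this example are all hypothetical: the text does not give the exact rule or the weight values, so this only illustrates the general shape of the procedure.

```python
from typing import Dict

def assign_bundle(ratings: Dict[str, int]) -> str:
    """Map per-category ratings (+2, +1, 0, -1) to a bundle name.

    ASSUMED rule (hypothetical, not from the source): helpful in every
    category -> generally helpful; helpful in at least one -> category
    specific; otherwise -> others.
    """
    if all(score >= 1 for score in ratings.values()):
        return "generally_helpful"
    if any(score >= 1 for score in ratings.values()):
        return "category_specific"
    return "others"

def reweighted_mixture(task_ratings: Dict[str, Dict[str, int]],
                       bundle_weights: Dict[str, float]) -> Dict[str, float]:
    """Give each task its bundle's weight, then normalize to sum to 1."""
    raw = {task: bundle_weights[assign_bundle(r)]
           for task, r in task_ratings.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}

# Hypothetical tasks, ratings, and weights, for illustration only.
task_ratings = {
    "task_a": {"chat": 2, "reasoning": 1, "safety": 2},
    "task_b": {"chat": 0, "reasoning": 2, "safety": -1},
    "task_c": {"chat": 0, "reasoning": 0, "safety": -1},
}
bundle_weights = {"generally_helpful": 4.0, "category_specific": 2.0, "others": 1.0}
mixture = reweighted_mixture(task_ratings, bundle_weights)
```

The resulting mixture would then drive the short final finetune, with tasks that helped RewardBench sampled more often than neutral or harmful ones.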

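This mechanism requires approximately 25 times fewer training datapoints while achieving near-identical RewardBench performance (87.0% vs. 86.0%). In step counts, the final finetune runs for 5,000 steps versus 30,000 for Stage 1, a 6x reduction, so the 25x figure refers to datapoints rather than steps.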

4. RewardBench and Comparative Evaluation

Evaluation on RewardBench (23 pairwise tasks: Chat, Chat Hard, Reasoning, Safety) demonstrates FLAMe-RM-24B’s competitive performance:

| Model | Overall Accuracy |
|---|---|
| FLAMe-RM-24B | 87.8% |
| GPT-4 (0125) | 85.9% |
| GPT-4o | 84.7% |

A binomial 95% confidence interval (normal approximation) for an accuracy $\hat{p}$ measured on $n$ examples is:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

The gap to GPT-4 (1.9 percentage points) is described as highly significant; the test statistic and p-value are missing from this text.
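
The interval and the significance check can be reproduced with a few lines of Python. The example size `N_EXAMPLES` below is an assumption, since the number of evaluation examples is not stated here, and the unpooled two-proportion z-test treats the two accuracies as independent samples; for two models scored on the same examples a paired test such as McNemar's would be more appropriate.

```python
import math
from typing import Tuple

def binomial_ci(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def two_proportion_z(p1: float, p2: float, n: int) -> float:
    """Unpooled z statistic for two accuracies, each measured on n examples."""
    se = math.sqrt(p1 * (1.0 - p1) / n + p2 * (1.0 - p2) / n)
    return (p1 - p2) / se

def two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

N_EXAMPLES = 3000  # ASSUMPTION: evaluation set size is not given in the text

low, high = binomial_ci(0.878, N_EXAMPLES)
z_stat = two_proportion_z(0.878, 0.859, N_EXAMPLES)
p_value = two_sided_p(z_stat)
print(f"95% CI: [{low:.3f}, {high:.3f}], z = {z_stat:.2f}, p = {p_value:.3f}")
```

Because both the interval width and the z statistic scale with $\sqrt{n}$, the strength of the conclusion depends directly on the true number of evaluation examples.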

5. Autorater Bias: CoBBLEr Benchmark Results

Bias is assessed via the CoBBLEr benchmark, which quantifies six forms of judgment bias: Order, Compassion, Length, Egocentric, Bandwagon, and Attention. Lower average values reflect lower bias.

| Autorater | Avg. | Order | Compassion | Length | Egocentric | Bandwagon | Attention |
|---|---|---|---|---|---|---|---|
| GPT-4 | 0.31 | 0.23 | 0.79 | 0.06 | 0.78 | 0.00 | 0.00 |
| FLAMe-24B | 0.13 | 0.08 | 0.09 | 0.03 | 0.38 | 0.18 | 0.00 |
| FLAMe-RM-24B | 0.13 | 0.11 | 0.08 | 0.02 | 0.40 | 0.17 | 0.00 |
| FLAMe-Opt-RM-24B | 0.15 | 0.15 | 0.14 | 0.00 | 0.41 | 0.17 | 0.00 |

FLAMe-24B achieves less than half the average bias of GPT-4 (0.13 vs 0.31), indicating marked robustness against ordering, length, model name, and other confounds. The one exception is Bandwagon bias, where GPT-4 scores lower (0.00 vs 0.18).
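
The reported averages can be checked against the six per-bias scores in the table. The sketch below simply recomputes each mean; the dictionary names are illustrative.

```python
from typing import Dict, List

# Per-bias scores in table order:
# Order, Compassion, Length, Egocentric, Bandwagon, Attention.
COBBLER: Dict[str, List[float]] = {
    "GPT-4": [0.23, 0.79, 0.06, 0.78, 0.00, 0.00],
    "FLAMe-24B": [0.08, 0.09, 0.03, 0.38, 0.18, 0.00],
    "FLAMe-RM-24B": [0.11, 0.08, 0.02, 0.40, 0.17, 0.00],
    "FLAMe-Opt-RM-24B": [0.15, 0.14, 0.00, 0.41, 0.17, 0.00],
}
REPORTED_AVG: Dict[str, float] = {
    "GPT-4": 0.31,
    "FLAMe-24B": 0.13,
    "FLAMe-RM-24B": 0.13,
    "FLAMe-Opt-RM-24B": 0.15,
}

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

for name, scores in COBBLER.items():
    print(f"{name}: computed {mean(scores):.3f}, reported {REPORTED_AVG[name]:.2f}")
```

Each recomputed mean agrees with the reported average to within rounding at two decimal places.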

6. Aggregated Benchmark Performance Across Held-out Tasks

Comparative results across eleven held-out evaluation suites are summarized below. For each, accuracy is reported; in balanced pairwise settings this is a proxy for both precision and recall.

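[Table: accuracy on the eleven held-out evaluation suites; the table contents are not recoverable from this text.]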

FLAMe-24B outperforms GPT-4-0125 in 8 out of 11 benchmarks. This strong relative performance, achieved while training only on permissively licensed human evaluation data, highlights its generalization capacity and adaptability to diverse evaluation settings.


Reference: "Foundational Autoraters: Taming LLMs for Better Automatic Evaluation" (Vu et al., 2024)
