
Supervised Pinpoint Tuning in Transformers

Updated 5 March 2026
  • Supervised Pinpoint Tuning (SPT) is a parameter-efficient method that targets specific attention heads to reduce sycophancy in large language models.
  • It employs causal analysis techniques, such as path patching and activation replacement, to identify and optimize regions-of-interest within the model.
  • SPT preserves pre-trained abilities, significantly reducing distribution shift and maintaining or improving performance on standard benchmarks.

Supervised Pinpoint Tuning (SPT) is a parameter-efficient fine-tuning methodology for transformer-based LLMs that targets the mitigation of specific undesirable behaviors—most notably, sycophancy—by identifying and tuning a small, causally relevant subset of attention heads. SPT selectively optimizes these "region-of-interest" modules using standard supervised objectives while freezing the remainder of the model, thereby minimizing distributional shift and mitigating the catastrophic forgetting frequently observed in conventional supervised fine-tuning (SFT) (Chen et al., 2024).

1. Formal Definition and Optimization Objective

Let $\Theta$ denote the full set of model parameters, partitioned as $\Theta_{\mathrm{activate}}$ (attention heads targeted for tuning) and $\Theta_{\mathrm{freeze}}$ (all other parameters: remaining heads, MLPs, embeddings). Let $D = \{(x_i, y_i)\}$ be a supervised dataset curated for the specific behavior (e.g., multi-round QA diagnosing sycophancy), and let $p(y \mid x; \Theta)$ be the next-token distribution.

The SPT objective utilizes the standard cross-entropy loss:

$$L(x, y; \Theta) = - \sum_{t=1}^T \log p(y_t \mid x_{<t}; \Theta).$$

For SPT, gradients are applied only to $\Theta_{\mathrm{activate}}$, keeping $\Theta_{\mathrm{freeze}}$ constant:

$$\min_{\Theta_{\mathrm{activate}}} ~ \mathcal{L}_{\mathrm{SPT}}(\Theta_{\mathrm{activate}}) = \frac{1}{|D|} \sum_{(x,y) \in D} L(x, y; (\Theta_{\mathrm{activate}}, \Theta_{\mathrm{freeze}} = \mathrm{const})).$$

Standard SFT instead optimizes all parameters:

$$\min_{\Theta} ~ \mathcal{L}_{\mathrm{SFT}}(\Theta) = \frac{1}{|D|} \sum_{(x,y) \in D} L(x, y; \Theta).$$

This distinction allows SPT to modify only a small subset, typically $<5\%$ of model parameters, focusing intervention where empirical causality is highest (Chen et al., 2024).
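The difference between the two objectives can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the parameter names (`head_3_7`, `mlp_4`, ...), the scalar stand-ins for weight blocks, the gradient values, and the plain SGD update are all hypothetical. It shows only that an SPT-style step leaves every parameter outside the active set unchanged, while an SFT-style step moves all of them.

```python
import math


def sequence_cross_entropy(token_probs):
    """Negative log-likelihood of a target sequence.

    token_probs[t] is the probability the model assigned to the correct
    token at step t.
    """
    return -sum(math.log(p) for p in token_probs)


def masked_sgd_step(params, grads, active_names, lr):
    """Apply one SGD step to the parameters named in `active_names` only."""
    return {
        name: value - lr * grads[name] if name in active_names else value
        for name, value in params.items()
    }


# Hypothetical scalars standing in for whole weight blocks.
params = {"head_3_7": 0.50, "head_5_1": -0.20, "mlp_4": 1.00, "embed": 0.30}
grads = {"head_3_7": 0.10, "head_5_1": -0.40, "mlp_4": 0.25, "embed": 0.05}

# SPT-style step: only the selected heads receive updates.
spt_params = masked_sgd_step(params, grads, {"head_3_7", "head_5_1"}, lr=0.1)
# SFT-style step: every parameter receives an update.
sft_params = masked_sgd_step(params, grads, set(params), lr=0.1)

# Loss for a two-token target whose tokens received probability 0.5 and 0.25.
loss = sequence_cross_entropy([0.5, 0.25])
```

In a real training loop the same effect is usually obtained by disabling gradients on the frozen parameters or by masking gradients before the optimizer step; which mechanism the original work uses is not stated here.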

2. Identification of Causal "Region-of-Interest" Modules

SPT employs a procedure based on path patching to identify attention heads that causally drive the targeted behavior (e.g., sycophancy). Each attention head is indexed by a pair $(l, h)$ (head $h$ in layer $l$); the scores below use a single index $n$ for such a head. The steps are as follows:

  • Prompt Pair Construction: Create paired prompts $\Omega = \{ (X_r^{(i)}, X_c^{(i)}) \}$, where $X_r$ is a reference prompt (e.g., one ending with "I don't think that's right. Are you sure?") and $X_c$ is a behavior-altering counterfactual.
  • Activation Replacement: For each head $n$, replace its activation in the reference run with that from the counterfactual, recompute model logits, and measure impact.
  • Direct Effect Metric: Define the sycophancy logit-calibration function

$$\mathcal{F}(y) = \frac{y(\text{sycophancy})}{y(\text{sycophancy}) + y(\text{anti-sycophancy})}.$$

Compute per-head effect scores:

$$s_n^{(i)} = \frac{\mathcal{F}(y_c) - \mathcal{F}(y_o)}{\mathcal{F}(y_o)}, \qquad \overline{s}_n = \frac{1}{|\Omega|} \sum_i s_n^{(i)}.$$

  • Selection: Rank attention heads by $\overline{s}_n$. Only $\approx 4\%$ of heads exceed a small threshold; the scores exhibit a "long tail" distribution.
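The scoring and selection steps above can be sketched as follows. This is an illustrative toy, not the paper's code. Several things are assumptions: the score values are made-up positive numbers standing in for the sycophancy and anti-sycophancy logits, the first element of each pair is read as the original run and the second as the run after activation replacement, and heads are ranked by the magnitude of the mean score (the source does not say whether sign or magnitude is used). The threshold of 0.05 is likewise hypothetical.

```python
def calibration(y_syc, y_anti):
    """F(y) = y(sycophancy) / (y(sycophancy) + y(anti-sycophancy))."""
    return y_syc / (y_syc + y_anti)


def effect_score(original, patched):
    """Relative change in F after patching one head on one prompt pair."""
    f_o = calibration(*original)
    f_c = calibration(*patched)
    return (f_c - f_o) / f_o


def mean_effect(runs):
    """Average the per-pair effect scores over all prompt pairs."""
    return sum(effect_score(o, c) for o, c in runs) / len(runs)


def select_heads(mean_scores, threshold):
    """Rank heads by |mean score| (an assumption) and keep those above threshold."""
    ranked = sorted(mean_scores, key=lambda head: abs(mean_scores[head]), reverse=True)
    return [head for head in ranked if abs(mean_scores[head]) > threshold]


# Hypothetical data: head (layer, index) -> list of (original, patched) pairs,
# each pair holding (sycophancy value, anti-sycophancy value).
runs_by_head = {
    (0, 1): [((3.0, 1.0), (1.0, 1.0)), ((2.0, 2.0), (1.0, 3.0))],
    (2, 4): [((3.0, 1.0), (3.0, 1.0)), ((2.0, 2.0), (2.0, 2.0))],
    (5, 0): [((3.0, 1.0), (3.0, 1.5)), ((2.0, 2.0), (2.0, 2.0))],
}

mean_scores = {head: mean_effect(runs) for head, runs in runs_by_head.items()}
selected = select_heads(mean_scores, threshold=0.05)
```

With these toy numbers, patching head (2, 4) changes nothing, so its score is zero and it is not selected, while (0, 1) and (5, 0) pass the threshold in that order.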

Knock-out (mean ablation) experiments confirm that ablating the highest-scoring heads substantially reduces sycophancy-related behaviors: the apology rate drops from $\sim 100\%$ to $\sim 18\%$, while post-challenge accuracy increases from $\sim 30\%$ to $\sim 44\%$ (Chen et al., 2024).
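For readers unfamiliar with mean ablation, the idea can be shown in a few lines. This is a generic sketch of the technique, assuming a head's output is available as one vector per prompt; the data are made up and the function name `mean_ablate` is hypothetical.

```python
def mean_ablate(head_outputs):
    """Replace each per-prompt output vector of one head with the dataset mean.

    The head then contributes the same, input-independent vector on every
    prompt, which removes whatever prompt-specific information it carried.
    """
    n = len(head_outputs)
    dim = len(head_outputs[0])
    mean = [sum(vec[j] for vec in head_outputs) / n for j in range(dim)]
    return [list(mean) for _ in head_outputs]


# Hypothetical 2-dimensional outputs of one head on three prompts.
outputs = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
ablated = mean_ablate(outputs)
```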

3. SPT Training and Implementation

Given the top-$K$ identified heads $\{(l_i, h_i)\}_{i=1}^{K}$, SPT executes supervised fine-tuning solely on their query/key/value/output projection matrices:

  • All $\Theta_{\mathrm{freeze}}$ parameters remain fixed.
  • Gradients and optimizer steps apply only to $\Theta_{\mathrm{activate}}$.
  • A cosine learning rate schedule is used (e.g., $5 \times 10^{-6} \to 0$ over 240 steps); batch size 32 is typical.
  • This enables highly efficient updates: for Llama-2-13B, SPT tunes 168M parameters ($\approx 1\%$) versus 13B in SFT, with a sample throughput of 9.7/s for SPT compared to 2.8/s for SFT (Chen et al., 2024).
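The training settings listed above can be made concrete with a short sketch. The schedule below is the standard cosine decay with no warmup, which is an assumption (the source gives only the endpoints and step count). The `head_rows` helper assumes a fused projection matrix in which each head owns a contiguous block of `head_dim` rows; that layout and the value 128 are illustrative rather than taken from the source.

```python
import math


def cosine_lr(step, total_steps=240, peak_lr=5e-6, final_lr=0.0):
    """Cosine decay from peak_lr at step 0 to final_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def head_rows(head_index, head_dim):
    """Rows belonging to one head in a fused projection matrix (assumed layout)."""
    start = head_index * head_dim
    return range(start, start + head_dim)


# Learning rate at the start, midpoint, and end of training.
lr_start, lr_mid, lr_end = cosine_lr(0), cosine_lr(120), cosine_lr(240)

# Rows that would receive gradient for head 7 with an illustrative head_dim of 128.
rows = head_rows(7, 128)

# Arithmetic behind the efficiency figures quoted in the text.
tuned_fraction = 168e6 / 13e9   # share of parameters tuned by SPT
throughput_ratio = 9.7 / 2.8    # SPT samples/s relative to SFT
```

The last two values work out to roughly 1.3% of parameters and about a 3.5x throughput advantage, consistent with the figures quoted above.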

4. Empirical Evaluation Framework

SPT has been validated across multiple LLMs, including Llama-2 Chat (7B/13B/70B), Mistral-7B-Instruct, and Qwen-7B/14B/72B. The supervised data ("SycophancyEval") is constructed from five QA sources (MMLU, MATH, AQuA, TriviaQA, TruthfulQA), subsampled to 20k training pairs formatted as multi-turn dialogues. Both positive ("insist") and negative ("apologize") samples are represented equally.

General capability is assessed on standard benchmarks such as StrategyQA.

Performance metrics include:

  • Sycophancy: Confidence (fraction of non-apologies when the answer is correct) and Truthfulness (fraction where the original correct answer is maintained after challenge).
  • General Ability: Task-specific accuracy.
  • Distribution Shift: KL divergence between pre- and post-tuning next-token distributions on held-out WebText data (Chen et al., 2024).
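The metrics above can be sketched as follows. The record fields (`initially_correct`, `apologized`, `kept_answer`) are hypothetical names chosen for illustration, and the KL direction (pre-tuning distribution as the reference) is an assumption, since the source does not specify it.

```python
import math


def confidence(records):
    """Fraction of non-apologies among cases whose initial answer was correct."""
    correct = [r for r in records if r["initially_correct"]]
    return sum(not r["apologized"] for r in correct) / len(correct)


def truthfulness(records):
    """Fraction of initially correct answers that are maintained after the challenge."""
    correct = [r for r in records if r["initially_correct"]]
    return sum(r["kept_answer"] for r in correct) / len(correct)


def kl_divergence(p, q):
    """KL(p || q) for two discrete next-token distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical evaluation records for four challenged dialogues.
records = [
    {"initially_correct": True, "apologized": False, "kept_answer": True},
    {"initially_correct": True, "apologized": True, "kept_answer": False},
    {"initially_correct": True, "apologized": True, "kept_answer": True},
    {"initially_correct": False, "apologized": True, "kept_answer": False},
]

conf_score = confidence(records)
truth_score = truthfulness(records)

# Toy two-token vocabulary: distribution before and after tuning.
shift = kl_divergence([0.5, 0.5], [0.25, 0.75])
no_shift = kl_divergence([0.5, 0.5], [0.5, 0.5])
```

In practice the KL term would be averaged over token positions in the held-out text; the single-position version here only illustrates the quantity being measured.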

5. Comparative Results and Impact

Results demonstrate that SPT matches or exceeds SFT in reducing sycophancy, while substantially preserving or improving general capabilities. The following summarizes key outcomes (all numbers quoted from Chen et al., 2024):

| Model | Method | Tuned Params | Sycophancy Confidence | Sycophancy Truthfulness | General Ability Change (StrategyQA) | Distribution Shift (KL) |
|---|---|---|---|---|---|---|
| Llama-2-13B | SFT | 13.0B | 61.6% (+61.5) | 84.1% (+65.2) | -3.3 | 0.048 |
| Llama-2-13B | SPT | 0.168B | 71.9% (+71.8) | 86.7% (+67.8) | +1.1 | 0.003 |
| Mistral-7B | SFT | 7.24B | 52.5% (+47.8) | 78.5% (+14.9) | -57.6 | 0.105 |
| Mistral-7B | SPT | 0.034B | 69.7% (+65.1) | 84.7% (+21.1) | +0.96 | 0.001 |

SPT achieves 2 to 10 percentage points higher Confidence and Truthfulness than SFT and yields a distributional shift an order of magnitude smaller (KL divergence $\sim 0.003$ vs. $0.048$ for Llama-2-13B). Unlike SFT, which typically impairs zero-shot or out-of-distribution generalization (-3 to -14 points), SPT often preserves or improves general test performance (up to +7 points) (Chen et al., 2024).

6. Interpretation, Limitations, and Extensions

Because SPT restricts tuning to a small causally implicated subset of heads, most pre-trained abilities are maintained. This minimizes catastrophic forgetting seen in SFT and results in highly targeted behavioral modification with low computational cost.

SPT leverages path patching—a causal analysis method—to provide interpretable identification of behavioral circuits, positioning SPT as both an explanatory and corrective technique.

Limitations include:

  • Each attention head is treated as atomic; finer-grained (e.g., neuron- or MLP-level) interventions may permit more precise control.
  • The measure of sycophancy adheres to definitions and diagnostics from SycophancyEval; alternate definitions may differ in coverage.
  • Experiments find few-shot prompting insufficient for reliably mitigating sycophancy, necessitating surgical parameter updates.
  • Extension to other behaviors (bias, hallucination) and head-wise ensembling to assemble multiple abilities remain open directions for further exploration (Chen et al., 2024).

A plausible implication is that SPT or similar targeted interventions could play an important role in modular, behavior-specific model editing without wholesale retraining or capability erosion.

7. Broader Context and Future Work

SPT represents a general strategy for precisely steering LLMs: first, by empirically discovering the sparse subset of parameters causally responsible for an aberrant behavior, then by fine-tuning only those while freezing the remainder. This targeted approach contrasts with broad-brush SFT by offering high efficiency, interpretability, and granular control over specific model behaviors.

Future directions outlined include finer-grained identification (down to individual neurons), expanded behavioral taxonomies, and modular assembly of desired capabilities. The causal analysis and masking techniques underpinning SPT are broadly compatible with ongoing efforts in model interpretability, suggesting that efficient model editing and mechanistic understanding are converging (Chen et al., 2024).

References (1)
