HCNR: Restoring Honesty in Fine-Tuned LLMs

Updated 20 November 2025
  • HCNR is a parameter-efficient technique that restores honesty-critical neurons in large language models post-supervised fine-tuning.
  • It utilizes Fisher-based scoring and Hessian-guided compensation to surgically adjust neurons without disrupting domain-specific tasks.
  • Experiments reveal that HCNR recovers over 33% of honesty loss with minimal data and compute, outperforming traditional global retraining methods.

Honesty-Critical Neurons Restoration (HCNR) is a parameter-efficient approach for recovering the honesty trait in LLMs that have undergone supervised fine-tuning (SFT). HCNR leverages the empirical insight that, post-SFT, the model’s internal representations still encode knowledge about what the model does or does not know; however, this knowledge is no longer faithfully expressed in outputs. By surgically identifying and repairing only the neurons critical for honesty expression—and aligning them with downstream task-oriented neurons via Hessian-guided compensation—HCNR restores honesty with minimal impact on downstream task accuracy, substantially reducing data and computational requirements compared to prior methods (Shi et al., 17 Nov 2025).

1. Theoretical Motivation

Empirical analysis demonstrates that SFT, while effective for domain specialization, often leads to a collapse in “honesty”—the model’s willingness to admit uncertainty or refuse to answer unanswerable queries. Importantly, linear probes reveal that the internal knowledge-boundary signals in pre-trained LLMs remain largely intact even after SFT; it is the capacity to express this awareness that is degraded. This suggests that SFT masks, rather than destroys, the model’s self-knowledge, reframing post-SFT dishonesty as an “expression” rather than a “knowledge” problem. Consequently, honesty restoration does not require global parameter retraining but can be accomplished by addressing a small subset of “expression-governing” neurons (Shi et al., 17 Nov 2025).
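A minimal sketch of how such a probing check might look is given below. The hidden states are simulated here (a synthetic offset stands in for the knowledge-boundary signal) and the probe is an off-the-shelf logistic regression, so only the general methodology, not the data, model, or numbers, reflects the paper.

```python
# Minimal linear-probe sketch (illustrative; hidden states are simulated).
# A real probe would use hidden states extracted from the SFT model on
# answerable vs. unanswerable queries (e.g. KUQ/SelfAware-style prompts).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 256                      # hidden size (placeholder)
n_per_class = 500

# Simulated hidden states: a consistent offset stands in for the
# knowledge-boundary signal that survives SFT.
boundary_direction = rng.normal(size=d_model)
h_known   = rng.normal(size=(n_per_class, d_model)) + 0.5 * boundary_direction
h_unknown = rng.normal(size=(n_per_class, d_model)) - 0.5 * boundary_direction

X = np.vstack([h_known, h_unknown])
y = np.array([1] * n_per_class + [0] * n_per_class)   # 1 = model knows the answer

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")  # high accuracy -> signal is linearly decodable
```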

2. Identification of Honesty-Critical Neurons

HCNR targets neurons that meet three criteria: (a) high influence on honesty loss, (b) minimal effect on the domain task, and (c) substantial perturbation during SFT. The selection procedure is two-fold (a code sketch follows the two steps below):

  • Fisher-based intra-layer importance: For weight tensor $W_j$ in layer $j$ and neuron $k$:

$$s^{\text{hon}}_{j,k} = \mathbb{E}_{(x,y)\sim D^{\text{hon}}} \left[ \left( \partial_{W_{j,k}} \mathcal{L}_{\text{hon}}(x,y) \right)^2 \right], \qquad s^{\text{task}}_{j,k} = \mathbb{E}_{(x,y)\sim D^{\text{task}}} \left[ \left( \partial_{W_{j,k}} \mathcal{L}_{\text{task}}(x,y) \right)^2 \right]$$

The priority score is defined as:

$$r_{j,k} = s^{\text{hon}}_{j,k} \times \log\left( \frac{s^{\text{hon}}_{j,k}}{s^{\text{task}}_{j,k} + \epsilon} \right)$$

Top-ranked neurons per layer (the top $R_{IW}$ fraction) are candidates.

  • Cross-layer perturbation: For candidates, the SFT-induced shift per layer is:

$$d_j = \frac{\| (W_j^{\text{sft}} - W_j^{\text{orig}}) \odot M_j \|_2}{\| W_j^{\text{orig}} \odot M_j \|_2}$$

where $M_j$ masks the candidate neurons selected in the previous step.

Layers are ranked by $d_j$, and the top $R_{CW}$ fraction are designated "honesty-critical". The final honesty-critical set comprises the top neurons within these layers:

$$A^{\text{hc}} = \{ (j,k) \mid j \in A^{\text{layer}},\; k \in A_j^{\text{neuron}} \}$$
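As a concrete illustration of the two selection steps, the sketch below scores toy linear layers and selects honesty-critical neurons. The forward pass, losses, data, and the fractions $R_{IW}$ and $R_{CW}$ are placeholders, and per-neuron Fisher scores are approximated by averaging squared gradients over each neuron's incoming weights, so this is a schematic of the selection logic rather than the paper's implementation.

```python
# Schematic of HCNR Stage 1 (neuron recognition) on toy linear layers.
# Losses, data, and hyperparameters are placeholders, not the paper's setup.
import torch

torch.manual_seed(0)
d, n_layers = 32, 4
R_IW, R_CW = 0.2, 0.5      # fraction of neurons kept per layer / fraction of layers kept (illustrative)
eps = 1e-8

# Toy "pre-trained" and "SFT" weights, one square matrix per layer.
W_orig = [torch.randn(d, d) for _ in range(n_layers)]
W_sft  = [w + 0.05 * torch.randn(d, d) for w in W_orig]

def fisher_per_neuron(W_layers, data, loss_fn):
    """Average squared gradient of the loss, aggregated over each neuron's incoming weights."""
    scores = [torch.zeros(d) for _ in W_layers]
    for x, y in data:
        Ws = [w.clone().requires_grad_(True) for w in W_layers]
        h = x
        for w in Ws:
            h = torch.tanh(h @ w.T)                    # toy forward pass through stacked layers
        grads = torch.autograd.grad(loss_fn(h, y), Ws)
        for j, g in enumerate(grads):
            scores[j] += (g ** 2).sum(dim=1)           # per-output-neuron Fisher estimate
    return [s / len(data) for s in scores]

mse = lambda out, tgt: ((out - tgt) ** 2).mean()
D_hon  = [(torch.randn(8, d), torch.randn(8, d)) for _ in range(4)]   # stand-in honesty batches
D_task = [(torch.randn(8, d), torch.randn(8, d)) for _ in range(4)]   # stand-in task batches

s_hon  = fisher_per_neuron(W_sft, D_hon,  mse)
s_task = fisher_per_neuron(W_sft, D_task, mse)

# Priority score r_{j,k}; keep the top-R_IW candidate neurons per layer.
candidates = []
for j in range(n_layers):
    r = s_hon[j] * torch.log((s_hon[j] + eps) / (s_task[j] + eps))
    candidates.append(torch.topk(r, max(1, int(R_IW * d))).indices)

# Cross-layer perturbation d_j over candidate neurons; keep the top-R_CW layers.
d_shift = []
for j in range(n_layers):
    M = torch.zeros(d, d)
    M[candidates[j]] = 1.0                             # mask rows belonging to candidate neurons
    num = torch.norm((W_sft[j] - W_orig[j]) * M)
    den = torch.norm(W_orig[j] * M) + eps
    d_shift.append((num / den).item())

A_layer = sorted(range(n_layers), key=lambda j: -d_shift[j])[: max(1, int(R_CW * n_layers))]
A_hc = {(j, int(k)) for j in A_layer for k in candidates[j]}
print(f"{len(A_hc)} honesty-critical neurons selected in layers {sorted(A_layer)}")
```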

3. Hessian-Guided Compensation Mechanism

Directly resetting honesty-critical neurons to their pre-trained values can misalign these neurons with task-adapted circuitry. HCNR addresses this by optimizing a small, neuron-specific compensation using second-order (Hessian) information from an honesty calibration loss:

  • Discrepancy loss: For activations $X_{\text{hon}}$ on an honesty dataset,

$$d_{\text{hon}} = \| W^{\text{hc}} X_{\text{hon}} - W^{\text{orig}} X_{\text{hon}} \|_2^2$$

  • Layerwise Hessian computation: Compute (or approximate) the Hessian with respect to pre-trained weights:

$$H_j = \nabla^2_{W^{\text{orig}}_j} d_{\text{hon}} \Big|_{W^{\text{orig}}_j}$$

Given the SFT perturbation $\delta w_{j,k} = W^{\text{sft}}_{j,k} - W^{\text{orig}}_{j,k}$, the optimal compensation is

$$c_{j,k} = [H_j^{-1} \delta w_j]_k$$

The restored weight is set as $W^{\text{hc}}_{j,k} = W^{\text{orig}}_{j,k} + c_{j,k}$.

This mechanism ensures restored neurons are locally consistent with the Hessian geometry of the fine-tuned network, harmonizing honesty-related and task-related functionalities.
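Because $d_{\text{hon}}$ is quadratic in the weights, the Hessian with respect to any single weight row has the closed form $2\, X_{\text{hon}} X_{\text{hon}}^{\top}$, so the compensation reduces to a linear solve per restored neuron. The sketch below works through this for one toy layer; the row-wise treatment, the averaging over calibration samples, and the damping term are assumptions made for the illustration, not the paper's exact procedure.

```python
# Schematic of HCNR Stage 2 (Hessian-guided compensation) for one toy layer.
# The closed-form Hessian 2 * X X^T / n and the damping term are illustrative
# assumptions, not the paper's exact procedure.
import torch

torch.manual_seed(0)
d_in, d_out, n_cal = 32, 32, 256
damping = 1e-3

W_orig = torch.randn(d_out, d_in)
W_sft  = W_orig + 0.05 * torch.randn(d_out, d_in)
X_hon  = torch.randn(d_in, n_cal)          # calibration activations on the honesty set
A_hc_rows = [3, 7, 11]                     # honesty-critical neurons from Stage 1 (placeholder)

# Hessian of d_hon = ||W X - W_orig X||_F^2 w.r.t. any single weight row is 2 X X^T.
H = 2.0 * (X_hon @ X_hon.T) / n_cal
H += damping * torch.eye(d_in)             # damping keeps the solve well-conditioned

W_hc = W_sft.clone()
for k in A_hc_rows:
    delta_w = W_sft[k] - W_orig[k]                   # SFT-induced perturbation of neuron k
    c = torch.linalg.solve(H, delta_w)               # compensation c = H^{-1} delta_w
    W_hc[k] = W_orig[k] + c                          # restore toward pre-trained value, Hessian-aligned

print("max change vs. SFT weights on restored rows:",
      (W_hc[A_hc_rows] - W_sft[A_hc_rows]).abs().max().item())
```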

4. HCNR Algorithmic Workflow

The HCNR procedure comprises two algorithmic stages:

  1. Neuron Recognition:
    • Compute intra-layer honesty and task Fisher scores ($s^{\text{hon}}, s^{\text{task}}$), form priority scores ($r_{j,k}$), and select top candidates ($A_j^{\text{neuron}}$).
    • Compute per-layer SFT perturbation ($d_j$) and select the top honesty-critical layers ($A^{\text{layer}}$).
    • Aggregate honesty-critical neuron indices ($A^{\text{hc}}$) and initialize honesty-restored weights by resetting each $(j,k) \in A^{\text{hc}}$ to its pre-trained value.
  2. Hessian-Guided Compensation:
    • Compute layerwise Hessians $H_j$ on $D^{\text{hon}}$.
    • For each $(j,k) \in A^{\text{hc}}$, derive and apply the scalar compensation $c_{j,k}$.

The result is a set of model weights $W^{\text{hc}}$ with restored honesty expression and preserved domain task capacity.

5. Experimental Validation and Comparative Analysis

HCNR has been empirically validated across four QA tasks and five LLM families. Key findings include:

  • Honesty recovery: On average, HCNR restores 33.25% of the honesty lost to SFT.
  • Pareto optimality: The approach lies strictly above baseline task-honesty trade-off curves (evaluated with KUQ/SelfAware benchmarks).
  • Efficiency: For Llama-3.1-8B-Instruct on HotpotQA and MedMCQA, baselines (RAIT, DPO, ORPO) require 4,000–9,000 “IDK” samples and 100% of parameters, while HCNR uses only 256 honest + 128 task samples and modifies ≈20% of weights. Wall-clock improvements range from 2.2× to 10×.
  • Ablation analysis: Randomization or omission of importance scoring (Stage 1) reduces accuracy; skipping Hessian compensation (Stage 2) reduces honesty recovery by ~40%. Combining both ablations eliminates recovery.
  • Data sensitivity: Performance saturates at 128 samples for honesty and task calibration, confirming minimal data demands.

6. Context, Significance, and Implications

HCNR introduces a novel paradigm based on mechanistic understanding: post-SFT dishonesty primarily affects expression, not internal knowledge estimation. By restricting intervention to a sparse, mechanistically identified subset of neurons and leveraging Hessian-guided compensation for alignment, HCNR achieves a favorable blend of restoration effectiveness and efficiency. A plausible implication is that the approach generalizes to other fine-tuning-induced behavioral degradations where internal model knowledge is preserved but suppressed in output. This suggests new directions for model repair that prioritize targeted, data- and compute-light interventions over global retraining (Shi et al., 17 Nov 2025).

References

  • Shi et al., 17 Nov 2025.
