
Negative-Aware Fine-Tuning (NFT)

Updated 11 August 2025
  • Negative-aware Fine-Tuning (NFT) is a set of techniques that incorporate negative signals such as errors, adversarial examples, and noise to enhance model generalization and safety.
  • NFT methods modify training objectives by using explicit penalization, dynamic feedback loops, and selective forgetting to unlearn misleading or harmful features.
  • Empirical studies show that NFT often outperforms conventional fine-tuning methods in robustness, convergence speed, and reducing harmful outputs across various modalities.

Negative-aware Fine-Tuning (NFT) encompasses a range of techniques designed to improve the generalization, robustness, and safety of deep neural networks by explicitly incorporating, penalizing, or unlearning negative information during the fine-tuning phase. Negative information may refer to incorrect model generations, adversarial samples, misleading tokens, hardware-induced noise, or underperforming feature representations. NFT subsumes methodologies for both vision models and LLMs, and recent advances connect NFT to broader efforts in robust optimization, causal inference, and preference alignment. Below are the major principles, representative algorithms, mathematical formulations, and empirical findings underlying NFT.

1. Conceptual Foundations

NFT is characterized by its treatment of negative signals, which are intentionally leveraged to improve model properties that are unattainable via conventional fine-tuning. The central distinction is that, rather than ignoring misclassified, adversarial, noisy, or misleading samples, NFT methods integrate these negatives for penalization or corrective learning. Prominent paradigms include:

  • Explicit penalization: Negative outputs (e.g., hallucinated answers, adversarial perturbations) are directly penalized in the loss or likelihood (e.g. discriminative fine-tuning (Guo et al., 25 Feb 2025), forgetting (Ghahrizjani et al., 6 Aug 2025)).
  • Negative feedback loops: Multi-scale noisy signals are fed back into model optimization to combat device variation or epistemic uncertainty (e.g. Negative Feedback Training (Qin et al., 2023)).
  • Fine-grained segment emphasis: Error-sensitive regions or rare features receive dynamic weighting for more discriminative learning (e.g. fault-aware (Fan et al., 21 Mar 2025), concept-wise (Yang et al., 2023)).
  • Safety restoration and patching: Post-hoc restoration of safety-critical neurons via neuron transplantation to mitigate harmful fine-tuning (e.g. NLSR (Yi et al., 17 Dec 2024)).
  • Masking and unlearning: Selective attenuation or forgetting of negative (uninformative or misleading) features or neurons to sharpen the knowledge boundary and improve diversity (e.g. neural masks (Karim et al., 14 Jul 2024), token-level forgetting (Ghahrizjani et al., 6 Aug 2025)).

NFT thus unifies diverse avenues for using negative information—contrasts, errors, adversarial noise, or detrimental features—for model alignment and robustification.

2. Mathematical Formulations and Algorithms

NFT frameworks are typically formalized by extending the loss function or training objective to interact directly with negative data or representations. Representative formulations:

  • Discriminative Likelihood (DFT):

$$P_d(y \mid x) = \frac{\exp(s_\theta(y, x)/\tau)}{\sum_{y' \in Y} \exp(s_\theta(y', x)/\tau)}$$

$$F(\theta) = -\frac{1}{n} \sum_{i} s_\theta(y_i, x_i) + \frac{\tau}{n} \sum_{i} \log\left[\sum_{y' \in Y} \exp(s_\theta(y', x_i)/\tau)\right]$$

(Guo et al., 25 Feb 2025)
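
A minimal PyTorch sketch of the objective F(θ) above; the function name, tensor layout, and the use of sequence log-likelihoods as scores s_θ are illustrative assumptions, not the authors' implementation:

```python
import torch

def discriminative_ft_loss(pos_scores, cand_scores, tau=1.0):
    """Discriminative fine-tuning objective F(theta) -- a minimal sketch.

    pos_scores:  (n,) score s_theta(y_i, x_i) of the reference answer for
                 each prompt, e.g. its sequence log-likelihood.
    cand_scores: (n, k) scores of the k candidates in the set Y per prompt
                 (assumed to include y_i plus negatives sampled from the LLM).
    """
    # -1/n * sum_i s(y_i, x_i): raise the score of the positive answer.
    positive_term = -pos_scores.mean()
    # tau/n * sum_i log sum_{y'} exp(s(y', x_i)/tau): push down the
    # aggregate likelihood of the candidate pool, negatives included.
    partition_term = tau * torch.logsumexp(cand_scores / tau, dim=-1).mean()
    return positive_term + partition_term
```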

  • Implicit Negative Policy (NFT for reasoning): the negative policy is constructed implicitly from the positive policy,

$$\pi_\theta^{-}(A \mid q) = \frac{\pi(A \mid q) - \bar{r} \cdot \pi_\theta^{+}(A \mid q)}{1 - \bar{r}}$$

and the objective uses both positive and negative answers:

$$L_{\mathrm{NFT}} = -\sum_{q,A,r} \omega(q) \sum_{t} \left[ r \log R_\theta^{(t)}(A \mid q) + (1-r) \log\left( \max\left( \frac{1 - \bar{r}\, R_\theta^{(t)}(A \mid q)}{1 - \bar{r}},\ \epsilon \right) \right) \right]$$

(Chen et al., 23 May 2025)
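
A minimal sketch of L_NFT under the assumption that R_θ^(t)(A|q) is a per-token likelihood ratio between the current and the data-generating policy; names and tensor shapes are illustrative, not the reference code:

```python
import torch

def nft_loss(logratio, reward, rbar, weight, eps=1e-3):
    """Per-token NFT objective -- a minimal sketch.

    logratio: (B, T) log R_theta^(t)(A|q), e.g. log pi_theta - log pi_old
              for each answer token.
    reward:   (B,) binary correctness r in {0, 1} per answer A.
    rbar:     scalar average positive-answer rate defining the implicit
              negative policy.
    weight:   (B,) prompt weights omega(q).
    """
    R = logratio.exp()                                    # R_theta^(t)(A|q)
    pos = reward.unsqueeze(-1) * logratio                 # r * log R
    # (1 - r) * log(max((1 - rbar*R)/(1 - rbar), eps)): credit is routed
    # through the implicit negative policy for incorrect answers.
    neg_inner = torch.clamp((1.0 - rbar * R) / (1.0 - rbar), min=eps)
    neg = (1.0 - reward).unsqueeze(-1) * neg_inner.log()
    per_answer = (pos + neg).sum(dim=-1)                  # sum over tokens t
    return -(weight * per_answer).sum()
```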

  • Token-Level Forgetting Mechanism:

$$\mathcal{L}(\theta) = \mathcal{L}_p - \lambda(\text{step}) \cdot \mathcal{L}_n$$

$$\mathcal{L}_p = \frac{1}{|\mathbb{P}|} \sum_{(i,j) \in \mathbb{P}} \ell(y_{i,j} \mid x_{i,:j}; \theta), \qquad \mathcal{L}_n = \frac{1}{|\mathbb{N}|} \sum_{(i,j) \in \mathbb{N}} \ell(y_{i,j} \mid x_{i,:j}; \theta)
$$

(Ghahrizjani et al., 6 Aug 2025)
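
A minimal sketch of the forgetting objective; the linear λ(step) ramp is an assumption, as the actual schedule is a design choice of the method:

```python
import torch
import torch.nn.functional as F

def forgetting_loss(logits, labels, neg_mask, step, total_steps, lam_max=0.1):
    """L(theta) = L_p - lambda(step) * L_n -- a minimal sketch.

    logits:   (B, T, V) next-token logits; labels: (B, T) target ids.
    neg_mask: (B, T) boolean, True where a token was flagged as negative
              (e.g. via influence scores); the remaining tokens form P.
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none")   # (B, T)
    l_p = per_token[~neg_mask].mean()        # average loss over P
    l_n = per_token[neg_mask].mean()         # average loss over N
    lam = lam_max * step / total_steps       # assumed linear ramp
    # Subtracting the negative-token loss performs gradient ascent on N,
    # actively unlearning those tokens.
    return l_p - lam * l_n
```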

Table: Major NFT Algorithmic Elements

| Approach | Negative Data Usage | Loss Mechanism |
|---|---|---|
| Discriminative FT | Penalize model likelihood of negatives | Softmax over positives/negatives |
| Token Forgetting | Explicitly unlearn negative tokens | Subtract scaled negative loss |
| Neuron Masking/Patch | Attenuate or restore negative neurons | Mask/patch neuron weights |
| Feedback Training | Multi-scale negative/noisy feedback | Feedback-augmented cross-entropy |
| Concept-wise FT | Maximize MI for rare/negative features | Patch-level contrastive, KL bottleneck |

3. Methodologies and Implementation Strategies

NFT implementations span vision, language, and multi-modal domains, each tailoring the negative feedback or penalization process:

  • Token-wise negative identification: Influence functions are computed on tokens by comparing changes in cross-entropy loss pre- and post-fine-tuning; negative tokens are actively repelled while positive tokens are attracted (Ghahrizjani et al., 6 Aug 2025).
  • Neural mask fine-tuning: A scalar mask is assigned to each neuron's activation; only mask values are optimized to purge neuron-level backdoor triggers or overfitting, with MixUp data augmentation supporting robust mask inference (Karim et al., 14 Jul 2024); a minimal sketch follows this list.
  • Safety-critical patching: Low-rank SVD identifies safety-critical neurons; similarity scores determine which neurons must be patched using weights from a safety reference model; probability-based layer pruning enables efficient correction (Yi et al., 17 Dec 2024).
  • Concept-wise patch separation: Feature representations (patches) are maximally disentangled between positive and negative samples, maximizing MI for rare features and bottlenecking spurious ones through KL regularization (Yang et al., 2023).
  • Negative feedback outputs: Multi-scale noisy outputs (full inference passes, intermediate snapshots) serve as negative feedback, regularizing backbone output to resist hardware-induced stochasticity (Qin et al., 2023).
  • Discriminative or fault-aware losses: Dynamic importance weighting or preference-style scoring penalizes high-probability negative outputs (functionally incorrect code, undesirable generations) (Fan et al., 21 Mar 2025, Guo et al., 25 Feb 2025).
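
A minimal sketch of the neural-mask idea referenced above: base weights are frozen and one trainable scalar per output neuron is optimized. The wrapper class and its name are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Wrap a frozen layer with a trainable per-neuron scalar mask."""

    def __init__(self, layer: nn.Linear):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False            # base weights stay fixed
        # One scalar per output neuron, initialized to identity (1.0).
        self.mask = nn.Parameter(torch.ones(layer.out_features))

    def forward(self, x):
        # Driving a neuron's mask toward 0 suppresses it, e.g. a neuron
        # carrying a backdoor trigger; mask values near 1 leave it intact.
        return self.layer(x) * self.mask

# During purification, only the mask parameters are optimized, e.g.:
# opt = torch.optim.Adam([m.mask for m in masked_layers], lr=1e-2)
```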

All methods utilize modifications to the objective function or model architecture to integrate negative feedback for improved generalization and robustness.

4. Empirical Results and Performance

NFT methods consistently report improvements over vanilla SFT and established baselines in both accuracy and robustness. Representative findings include:

  • On math reasoning datasets, NFT outperforms rejection sampling SFT (by margins of several percent) and matches or surpasses RL-based policy optimizers (GRPO, DAPO), with convergence speed and final accuracy comparable across 7B and 32B models (Chen et al., 23 May 2025).
  • Discriminative likelihood NFT matches or exceeds preference optimization pipelines (SFT → PO) in mathematical and instruction-following evaluations, without requiring external reward models or human-labeled preference data (Guo et al., 25 Feb 2025).
  • Fault-aware NFT for code generation demonstrates 6.9% pass@1 gains over conventional SFT in one epoch, and coordinated multi-granularity error segment identification yields up to 19.1% peak improvements across LLMs (Fan et al., 21 Mar 2025).
  • Token-level forgetting mechanism improves performance by 5.28% versus standard fine-tuning on LLaMA-3.2-3B, and maintains diversity—crucially, avoiding the discarding of entire sequences and enabling higher stability (Ghahrizjani et al., 6 Aug 2025).
  • Negative feedback training (NFT) with Oriented Variational Forward (OVF) and Intermediate Representation Snapshot (IRS) feedback yields top-1 accuracy improvements of up to 46.71% (MNIST/CIFAR-10), suppresses epistemic uncertainty, and enhances convergence stability under large NVM-induced device variation (Qin et al., 2023).
  • Neural mask NFT efficiently removes backdoor triggers, with >90–99% attack success rate reduction while minimally affecting clean test accuracy—applicable across image, video, 3D, and NLP tasks (Karim et al., 14 Jul 2024).
  • Neuron-level safety realignment demonstrates >30-point reductions in harmfulness scores (SST2, AGNEWS) while maintaining baseline accuracy; dynamic neuron patching can be performed post-hoc without additional model training (Yi et al., 17 Dec 2024).

NFT approaches are typically more sample- and runtime-efficient, as demonstrated by neural mask tuning, negative policy parameterization, and selective patching.

5. Comparison with Related Fine-Tuning Paradigms

NFT diverges from standard SFT, RL, and preference optimization in several respects:

  • Conventional SFT optimizes all tokens regardless of contribution, leading to overfitting or perpetuation of spurious patterns; NFT introduces explicit negative feedback for unlearning misleading tokens (Ghahrizjani et al., 6 Aug 2025).
  • RL operates via reward-driven policy gradients, requiring an external verifier or reward model; NFT in math reasoning achieves theoretical gradient equivalence with on-policy RL, yet performs updates using purely supervised losses (Chen et al., 23 May 2025).
  • Preference optimization (PO/DPO/etc.) requires extra data/human labels to guide learning; discriminative NFT frameworks achieve similar effect via likelihood suppression of negatives sampled directly from LLMs (Guo et al., 25 Feb 2025).
  • Adversarial fine-tuning (AFT) enhances robustness to perturbations via adversarial schedules (e.g., slow start/fast decay) and does not explicitly integrate negative feedback in the loss function, whereas NFT may combine negative example penalization and feedback-driven regularization (Jeddi et al., 2020).

NFT may introduce additional complexity in loss design, negative data identification, or patch selection, but these are offset by gains in generalization, robustness, and safety observed across experiments.

6. Practical Applications and Broader Implications

NFT has demonstrated practical value in the following areas:

  • Vision models: Out-of-distribution detection is strengthened via negative feature tuning and knowledge-regularized adaptation, achieving improved discrimination with low FPR95 and capacity for few-shot adaptation (Zhu et al., 26 Jul 2025).
  • LLMs: Alignment and safety are enhanced, reducing harmful outputs after adversarial fine-tuning by dynamically patching safety-related neurons (Yi et al., 17 Dec 2024).
  • Math/coding reasoning: Fault-aware and negative policy NFT improves precision on reasoning benchmarks and code generation, outperforming closed-source LLMs and classic SFT (Chen et al., 23 May 2025, Fan et al., 21 Mar 2025).
  • Backdoor defense: Neural mask NFT enables model purification with single-sample-per-class efficiency, outperforming prior adversarial and mask-based defenses (Karim et al., 14 Jul 2024).
  • Robust hardware adaptation: Negative feedback training ensures stability (zero convergence failures) in NVCIM accelerators under high device noise (Qin et al., 2023).
  • Diversity and knowledge boundary formation: Selective forgetting fosters a sharper knowledge boundary and more varied outputs without sacrificing coverage (Ghahrizjani et al., 6 Aug 2025).

NFT approaches are broadly applicable across modalities and domains, and future work may extend negative-aware frameworks to more dynamic feedback integration, layer-wise adaptation, and multi-modal regularization.

7. Future Directions

Emerging research directions for NFT include:

  • Extending implicit negative policy construction to masked LMs and dialogue models, enabling broader self-improvement without reliance on external teachers (Chen et al., 23 May 2025).
  • Exploring fine-grained neuron selection, suppression strategies, or hybrid mask/patch mechanisms for precision safety realignment after malicious fine-tuning (Yi et al., 17 Dec 2024, Xu et al., 18 Mar 2024).
  • Developing knowledge regularization methods to support negative feature separation while retaining generalizable pre-trained knowledge for OOD detection (Zhu et al., 26 Jul 2025).
  • Investigating dynamic or adaptive penalization schedules to balance exploration and exploitation (entropy preservation) in NFT (Ghahrizjani et al., 6 Aug 2025).
  • Applying multi-granularity NFT to new domains, such as factual QA, sentiment, and structural reasoning, capitalizing on dynamic error segment identification (Fan et al., 21 Mar 2025).
  • Theoretical analysis of off-policy behavior and gradient clipping for NFT, especially as models increase in scale and complexity (Chen et al., 23 May 2025).

In conclusion, NFT provides principled, empirically validated mechanisms for integrating and exploiting negative information at the data, feature, and parameter levels, catalyzing advances in robust, safe, and generalizable deep learning across vision, language, and multi-modal settings.