NSC-SFT: Domain Adaptation for LLMs

Updated 17 March 2026

NSC-SFT is a fine-tuning method that integrates a KL divergence penalty to mitigate catastrophic forgetting while adapting LLMs to specialized, data-scarce domains.
It uses a dual-model setup in which a frozen reference model constrains a trainable target model; only the target model's parameters are tuned, which preserves general-domain knowledge while reaching high task-specific accuracy (85.04% on AMSBench-TQA).
Empirical results, as seen in AnalogSeeker for analog circuit design, demonstrate that NSC-SFT outperforms traditional SFT and reasoning models by maintaining robust generalization.

Neighborhood Self-Constrained Supervised Fine-Tuning (NSC-SFT) is a fine-tuning methodology developed to address the challenges of maintaining general-domain capabilities while adapting LLMs to highly specialized, data-scarce domains. This approach introduces a regularization constraint—specifically, a Kullback–Leibler (KL) divergence penalty between the current and reference (pre-trained) model output distributions—to mitigate catastrophic forgetting and overfitting during supervised adaptation. NSC-SFT was proposed and validated in the context of AnalogSeeker, an open-source foundation LLM for analog circuit design, yielding significant empirical improvements in task-specific accuracy and domain generalization (Chen et al., 14 Aug 2025).

1. Rationale and Core Principles

NSC-SFT arises from the need to balance the acquisition of new domain-specific knowledge with the retention of broad, general linguistic and problem-solving capabilities in LLMs. Traditional supervised fine-tuning (SFT) approaches risk catastrophic forgetting of prior knowledge, especially when the target corpus is small and highly domain-specific. NSC-SFT constrains the magnitude of output distribution perturbations from the reference model—typically the pre-trained instruct model—by augmenting the loss function with a KL divergence term.

The loss function optimized in NSC-SFT is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(y_\mathrm{pred}, y_\mathrm{label}) + \lambda \, D_{\mathrm{KL}}\left(p_\mathrm{pred} \parallel p_\mathrm{ref}\right)$$

where $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy loss, $\lambda$ is the regularization weight (empirically set to $0.1$), $p_\mathrm{pred}$ is the current model’s output distribution, and $p_\mathrm{ref}$ is the frozen reference model’s output distribution. This explicit constraint keeps the fine-tuned model's predictions in a neighborhood of the reference model's, so that fine-tuning primarily updates behavior relevant to the new tasks while limiting undesirable shifts in broader knowledge representations (Chen et al., 14 Aug 2025).
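As an illustration of how the two terms combine, the following is a minimal pure-Python sketch of the loss above. It is not the AnalogSeeker training code: the function names (`log_softmax`, `nsc_sft_loss`), the toy logits, and the choice to average both terms over token positions are assumptions made for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def nsc_sft_loss(pred_logits, ref_logits, labels, lam=0.1):
    """Return (total, ce, kl) with total = ce + lam * KL(p_pred || p_ref).

    pred_logits, ref_logits: one list of logits per token position.
    labels: one target token id per position.
    Averaging over positions is an assumption of this sketch.
    """
    ce_total, kl_total = 0.0, 0.0
    for pred, ref, y in zip(pred_logits, ref_logits, labels):
        log_p = log_softmax(pred)  # trainable target model
        log_q = log_softmax(ref)   # frozen reference model
        ce_total += -log_p[y]
        # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
        kl_total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    n = len(labels)
    ce, kl = ce_total / n, kl_total / n
    return ce + lam * kl, ce, kl

# Toy example: two token positions, vocabulary of three tokens (made-up values).
pred = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
ref = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
total, ce, kl = nsc_sft_loss(pred, ref, labels=[0, 2])
print(f"total={total:.4f} ce={ce:.4f} kl={kl:.4f}")
```

When the target and reference logits coincide, the KL term is zero and the loss reduces to plain cross-entropy, which is the situation at the start of fine-tuning if both models are initialized from the same checkpoint.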

2. Implementation Framework and Computational Considerations

NSC-SFT is implemented over a decoder-only transformer architecture. The AnalogSeeker instance is initialized from Qwen2.5-32B-Instruct and uses the same vocabulary and tokenization pipeline as the rest of the Qwen2.5 family; position and rotary embeddings are preserved without architectural alteration.

During fine-tuning, a reference copy of the original pre-trained model resides on each GPU to enable efficient KL calculation. Only the parameters of the target model are sharded, improving memory utilization. Empirical resource requirements on 8× H200 GPUs were reported as follows:

Forward + KL divergence: theoretical peak $\approx 92$ GB per GPU
Backward pass: theoretical peak $\approx 110$ GB, actual usage $\approx 101$ GB
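To see why the replicated reference copy dominates these figures, here is a rough back-of-the-envelope sketch of the weight memory alone. It assumes a nominal 32 billion parameters, 2 bytes per parameter (BF16), and 1 GB = 10^9 bytes; the helper `weight_memory_gb` is hypothetical. Gradients, optimizer states, activations, and logits are deliberately left out, so this is not a reconstruction of the reported peaks.

```python
def weight_memory_gb(n_params, bytes_per_param=2, shards=1):
    """Per-GPU weight memory in GB (1 GB = 1e9 bytes) when split over `shards` GPUs."""
    return n_params * bytes_per_param / shards / 1e9

N_PARAMS = 32e9  # nominal "32B"; assumption, the exact parameter count differs slightly
N_GPUS = 8

ref_gb = weight_memory_gb(N_PARAMS)                 # frozen reference, full copy on every GPU
tgt_gb = weight_memory_gb(N_PARAMS, shards=N_GPUS)  # target weights, sharded across GPUs

print(f"reference weights per GPU: {ref_gb:.0f} GB")
print(f"target weight shard per GPU: {tgt_gb:.0f} GB")
print(f"weights only, per GPU: {ref_gb + tgt_gb:.0f} GB")
```

Under these assumptions the unsharded reference model accounts for most of the weight memory on each GPU, which is consistent with the text's point that only sharding the target model keeps the dual-model setup within a single GPU's capacity.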

Other key hyperparameters include:

Maximum learning rate: $2\times10^{-6}$, cosine annealing with 10% warmup
Batch size: 1 sequence per GPU × 8 GPUs, global batch size 64 (consistent with 8 gradient-accumulation steps, which are not stated explicitly here)
Maximum sequence length: 8,192 tokens (packed)
BF16 precision; ZeRO-3 optimization; flash-attention disabled for KL stability
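The learning-rate schedule in the list above can be sketched as follows. Only the peak value ($2\times10^{-6}$) and the 10% warmup fraction come from the text; the linear warmup shape, the decay floor of zero, the step count, and the function name `lr_at` are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, max_lr=2e-6, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup to max_lr over the first warmup_frac of training (assumed shape),
    then cosine annealing down to min_lr (assumed floor).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical run length of 1,000 optimizer steps.
for s in (0, 50, 99, 100, 550, 999):
    print(s, f"{lr_at(s, 1000):.3e}")
```

With such a small peak rate, each update moves the weights only slightly, which complements the KL penalty in keeping the fine-tuned model close to the reference.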

This setup enables tractable training even with dual 32B-parameter model instantiation (Chen et al., 14 Aug 2025).

3. Empirical Performance and Domain Adaptation

NSC-SFT was validated on AMSBench-TQA, a benchmark for analog circuit Q&A. The results substantiate its efficacy:

Model (Start Point)	Method	Accuracy (%)
Qwen2.5-32B-Instruct (AnalogSeeker)	NSC-SFT	85.04
Qwen2.5-32B-Instruct	CPT+NSC-SFT	84.49
DeepSeek-v3 (671B)	—	84.41
Qwen2.5-32B-Instruct	CPT+SFT	82.74
Qwen2.5-32B-Instruct	SFT	82.34
QwQ-32B (reasoning model)	—	81.54
QwQ-32B	SFT	74.94
GPT-4o	—	73.99
Qwen2.5-32B-Instruct	CPT	71.20
Qwen2.5-32B-Instruct	—	69.37

AnalogSeeker, trained with NSC-SFT, outperformed its base model by +15.67 percentage points and exceeded GPT-4o by +11.05 percentage points. The KL-regularized regime demonstrably mitigates catastrophic forgetting: pure SFT on a reasoning model (QwQ-32B) degraded accuracy from 81.54% to 74.94%, an effect absent under NSC-SFT.
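The percentage-point differences discussed here follow directly from the table. The short check below re-derives them; the accuracy values are copied from the table, while the dictionary keys and the `delta` helper are naming choices made for this example.

```python
# AMSBench-TQA accuracies (%) copied from the results table.
acc = {
    "AnalogSeeker (NSC-SFT)": 85.04,
    "Qwen2.5-32B-Instruct": 69.37,
    "Qwen2.5-32B-Instruct + SFT": 82.34,
    "GPT-4o": 73.99,
    "QwQ-32B": 81.54,
    "QwQ-32B + SFT": 74.94,
}

def delta(a, b):
    """Difference acc[a] - acc[b] in percentage points, rounded to two decimals."""
    return round(acc[a] - acc[b], 2)

print("NSC-SFT vs base model:", delta("AnalogSeeker (NSC-SFT)", "Qwen2.5-32B-Instruct"))
print("NSC-SFT vs GPT-4o:", delta("AnalogSeeker (NSC-SFT)", "GPT-4o"))
print("NSC-SFT vs plain SFT:", delta("AnalogSeeker (NSC-SFT)", "Qwen2.5-32B-Instruct + SFT"))
print("SFT on QwQ-32B vs QwQ-32B:", delta("QwQ-32B + SFT", "QwQ-32B"))
```

The last line gives a negative value, reflecting the accuracy drop when a reasoning model is fine-tuned with unconstrained SFT.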

4. Domain-Specific Corpus and Fine-Tuning Workflow

Given the scarcity of online analog-circuit text, corpus construction involved curating and cleaning 20 canonical textbooks, stratified into four knowledge stages: circuit theory, analog circuit fundamentals, integrated-circuit design, and advanced topics. This pipeline yielded:

Clean Markdown corpus: 7.26M tokens (unlabeled)
Labeled SFT dataset: 15,310 QTSA (Question–Trace–Solution–Answer) entries, 112.65M tokens

Granular knowledge distillation used multi-agent decomposition into learning nodes, with chain-of-thought and structured solution step extraction. Supervised fine-tuning utilized a 1:1 mix of these analog-domain samples and 20,000 OpenThoughts samples (general-domain), maximizing the utility of limited specialty data without incurring excessive domain overfitting.

5. Catastrophic Forgetting Mitigation and Regularization Dynamics

The inclusion of a KL divergence constraint is central to the mitigation of catastrophic forgetting under SFT. Empirical results show that NSC-SFT prevents the trade-off between generalist and specialist task competency observed in prior approaches. When SFT was applied without this constraint, models targeting logically complex domains exhibited significant regression on general knowledge benchmarks; NSC-SFT stably preserves the base model’s capabilities while facilitating robust domain adaptation.

The regularization coefficient ($\lambda = 0.1$) was selected empirically for optimal balance. Memory efficiency is conditional on careful GPU resource allocation, as both the reference and target models must be maintained in memory throughout training. This imposes a practical limitation for even larger model scales or multi-domain joint adaptation.

6. Downstream Application Evidence and Practical Insights

AnalogSeeker’s NSC-SFT training protocol enabled effective participation in multi-agent design tasks (e.g., operational amplifier design in the Atelier framework). The model executed iterative topology optimization—selecting nested Miller compensation, refining phase margin with nulling resistors, and converging on industry-standard figures of merit—while providing chain-of-thought rationale and generating netlists.

Best practices arising from this work highlight the value of starting from an instruct model (versus base or reasoning models), emphasizing high-quality SFT data, and enforcing KL-based output distribution constraints to promote stable, high-fidelity domain adaptation.

7. Availability and Impact Prospects

AnalogSeeker, trained with NSC-SFT, is released under a research-only license at https://huggingface.co/analogllm/analogseeker. The methodology is anticipated to inform subsequent efforts to equip LLMs for specialized engineering domains, particularly where data is limited and consistent generalization is essential (Chen et al., 14 Aug 2025).

References (1)

Chen et al. (14 Aug 2025). AnalogSeeker: An Open-source Foundation Language Model for Analog Circuit Design.
