
NSC-SFT: Domain Adaptation for LLMs

Updated 17 March 2026
  • NSC-SFT is a fine-tuning method that integrates a KL divergence penalty to mitigate catastrophic forgetting while adapting LLMs to specialized, data-scarce domains.
  • It employs a dual-model setup in which a frozen reference model supplies the KL target and only the target model's parameters are tuned, preserving general-domain knowledge while reaching high task-specific accuracy (85.04% on AMSBench-TQA).
  • Empirical results, as seen in AnalogSeeker for analog circuit design, demonstrate that NSC-SFT outperforms traditional SFT and reasoning models by maintaining robust generalization.

Neighborhood Self-Constrained Supervised Fine-Tuning (NSC-SFT) is a fine-tuning methodology developed to address the challenges of maintaining general-domain capabilities while adapting LLMs to highly specialized, data-scarce domains. This approach introduces a regularization constraint—specifically, a Kullback–Leibler (KL) divergence penalty between the current and reference (pre-trained) model output distributions—to mitigate catastrophic forgetting and overfitting during supervised adaptation. NSC-SFT was proposed and validated in the context of AnalogSeeker, an open-source foundation LLM for analog circuit design, yielding significant empirical improvements in task-specific accuracy and domain generalization (Chen et al., 14 Aug 2025).

1. Rationale and Core Principles

NSC-SFT arises from the need to balance the acquisition of new domain-specific knowledge with the retention of broad, general linguistic and problem-solving capabilities in LLMs. Traditional supervised fine-tuning (SFT) approaches risk catastrophic forgetting of prior knowledge, especially when the target corpus is small and highly domain-specific. NSC-SFT constrains the magnitude of output distribution perturbations from the reference model—typically the pre-trained instruct model—by augmenting the loss function with a KL divergence term.

The loss function optimized in NSC-SFT is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(y_\mathrm{pred}, y_\mathrm{label}) + \lambda \, D_{\mathrm{KL}}\left(p_\mathrm{pred} \parallel p_\mathrm{ref}\right)$$

where $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy loss, $\lambda$ is the regularization weight (empirically set to $0.1$), $p_\mathrm{pred}$ is the current model's output distribution, and $p_\mathrm{ref}$ is the frozen reference model's output distribution. This explicit constraint ensures that fine-tuning primarily updates parameters relevant to new tasks while limiting undesirable shifts in broader knowledge representations (Chen et al., 14 Aug 2025).
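To make the loss concrete, the following is a minimal per-token sketch in plain Python, not the authors' implementation. The helper names (`softmax`, `kl_divergence`, `nsc_sft_loss`), the toy logits, and the choice to evaluate a single token position are all illustrative assumptions; a real training loop would compute the same quantities over batched tensors and reduce them over tokens.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def nsc_sft_loss(pred_logits, ref_logits, label_index, lam=0.1):
    """Cross-entropy on the label plus lam * KL(p_pred || p_ref).

    ref_logits come from the frozen reference model and carry no gradient
    in a real setup; here they are plain numbers.
    """
    p_pred = softmax(pred_logits)
    p_ref = softmax(ref_logits)
    ce = -math.log(p_pred[label_index])
    kl = kl_divergence(p_pred, p_ref)
    return ce + lam * kl, ce, kl

# Toy 4-token vocabulary (hypothetical values).
ref_logits = [2.0, 1.0, 0.1, -1.0]

# Target identical to the reference: the KL penalty vanishes.
same_total, same_ce, same_kl = nsc_sft_loss(ref_logits, ref_logits, label_index=1)

# Target shifted toward the label: cross-entropy falls, KL penalty rises.
drift_total, drift_ce, drift_kl = nsc_sft_loss(
    [0.5, 3.0, 0.1, -1.0], ref_logits, label_index=1
)
```

The two toy calls illustrate the trade-off the loss encodes: moving probability mass toward the label lowers the cross-entropy term, while the KL term charges for every step away from the reference distribution.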

2. Implementation Framework and Computational Considerations

NSC-SFT is implemented over a decoder-only transformer architecture. The AnalogSeeker instance is initialized from Qwen2.5-32B-Instruct and uses the same vocabulary and tokenization pipeline as the rest of the Qwen2.5 family; position and rotary embeddings are preserved without architectural alteration.

During fine-tuning, a reference copy of the original pre-trained model resides on each GPU to enable efficient KL calculation. Only the parameters of the target model are sharded, improving memory utilization. Empirical resource requirements on 8× H200 GPUs were reported as follows:

  • Forward + KL divergence: theoretical peak ≈ 92 GB per GPU
  • Backward pass: theoretical peak ≈ 110 GB, actual usage ≈ 101 GB
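A rough weight-only estimate helps explain why these peaks are plausible. The sketch below assumes a nominal 32e9 parameters, 2 bytes per BF16 weight, a fully replicated reference model, and evenly sharded target weights; it deliberately ignores gradients, optimizer states, and activations, so it is a lower bound rather than a reconstruction of the reported figures.

```python
# Back-of-the-envelope weight memory per GPU, in decimal GB (1 GB = 1e9 bytes).
# All constants are assumptions for illustration, not values from the paper.
PARAMS = 32e9          # nominal parameter count of a "32B" model
BYTES_PER_PARAM = 2    # BF16
NUM_GPUS = 8

def weight_memory_gb(params, bytes_per_param, shards=1):
    """Memory taken by model weights on one GPU when split across `shards`."""
    return params * bytes_per_param / shards / 1e9

reference_gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM)                # full replica
target_gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM, shards=NUM_GPUS)  # sharded
weights_gb = reference_gb + target_gb

# Remainder under the reported ~92 GB forward peak, available for
# activations, logits, and KL buffers under these assumptions.
forward_headroom_gb = 92 - weights_gb

print(f"reference replica: {reference_gb:.0f} GB")
print(f"target shard:      {target_gb:.0f} GB")
print(f"weights total:     {weights_gb:.0f} GB")
print(f"forward headroom:  {forward_headroom_gb:.0f} GB")
```

Under these assumptions the replicated reference model dominates per-GPU memory, which is consistent with the text's point that sharding applies only to the target model.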

Other key hyperparameters include:

  • Maximum learning rate: $2\times10^{-6}$, cosine annealing with 10% warmup
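  • Batch size: 1 sequence per GPU × 8 GPUs, with a reported global batch size of 64 (the additional factor of 8, presumably gradient accumulation, is not stated)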
  • Maximum sequence length: 8,192 tokens (packed)
  • BF16 precision; ZeRO-3 optimization; flash-attention disabled for KL stability

This setup enables tractable training even with dual 32B-parameter model instantiation (Chen et al., 14 Aug 2025).

3. Empirical Performance and Domain Adaptation

NSC-SFT was validated on AMSBench-TQA, a benchmark for analog circuit Q&A. The results substantiate its efficacy:

| Model (start point) | Method | Accuracy (%) |
|---|---|---|
| Qwen2.5-32B-Instruct (AnalogSeeker) | NSC-SFT | 85.04 |
| Qwen2.5-32B-Instruct | CPT+NSC-SFT | 84.49 |
| DeepSeek-v3 (671B) | — | 84.41 |
| Qwen2.5-32B-Instruct | CPT+SFT | 82.74 |
| Qwen2.5-32B-Instruct | SFT | 82.34 |
| QwQ-32B (reasoning model) | — | 81.54 |
| QwQ-32B | SFT | 74.94 |
| GPT-4o | — | 73.99 |
| Qwen2.5-32B-Instruct | CPT | 71.20 |
| Qwen2.5-32B-Instruct | — | 69.37 |

AnalogSeeker, trained with NSC-SFT, outperformed its base model by 15.67 percentage points (85.04% vs. 69.37%) and exceeded GPT-4o by 11.05 percentage points (85.04% vs. 73.99%). The KL-regularized regime mitigates catastrophic forgetting: pure SFT on a reasoning model (QwQ-32B) is reported to degrade accuracy from 81.54% to 60.50%, an effect absent under NSC-SFT. Note that the table above lists 74.94% for QwQ-32B with SFT; the source does not reconcile the two figures.
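The quoted margins can be checked directly against the table values. The snippet below is only an arithmetic check; the dictionary keys are shorthand labels introduced here, not names used in the source.

```python
# AMSBench-TQA accuracies (%) copied from the table above.
accuracy = {
    "analogseeker_nsc_sft": 85.04,
    "qwen2.5_32b_instruct_sft": 82.34,
    "gpt_4o": 73.99,
    "qwen2.5_32b_instruct_base": 69.37,
}

def gain(model, baseline):
    """Difference in percentage points, rounded to two decimals."""
    return round(accuracy[model] - accuracy[baseline], 2)

gain_over_base = gain("analogseeker_nsc_sft", "qwen2.5_32b_instruct_base")
gain_over_gpt4o = gain("analogseeker_nsc_sft", "gpt_4o")
gain_over_plain_sft = gain("analogseeker_nsc_sft", "qwen2.5_32b_instruct_sft")

print(gain_over_base, gain_over_gpt4o, gain_over_plain_sft)
```

The third value shows the margin of NSC-SFT over plain SFT from the same starting checkpoint, which isolates the contribution of the KL term on this benchmark.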

4. Domain-Specific Corpus and Fine-Tuning Workflow

Given the scarcity of online analog-circuit text, corpus construction involved curating and cleaning 20 canonical textbooks, stratified into four knowledge stages: circuit theory, analog circuit fundamentals, integrated-circuit design, and advanced topics. This pipeline yielded:

  • Clean Markdown corpus: 7.26M tokens (unlabeled)
  • Labeled SFT dataset: 15,310 QTSA (Question–Trace–Solution–Answer) entries, 112.65M tokens

Granular knowledge distillation used multi-agent decomposition into learning nodes, with chain-of-thought and structured solution step extraction. Supervised fine-tuning utilized a 1:1 mix of these analog-domain samples and 20,000 OpenThoughts samples (general-domain), maximizing the utility of limited specialty data without incurring excessive domain overfitting.
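As an illustration of the data side, the sketch below shows one way a QTSA record and a 1:1 mix could be represented. Everything here is hypothetical: the field names, the `source` tag, and the reading of "1:1" as drawing equal numbers of samples from each pool are assumptions, since the source does not specify the schema or the sampling procedure.

```python
import random
from dataclasses import dataclass

@dataclass
class QTSAEntry:
    """One Question-Trace-Solution-Answer training sample (hypothetical schema)."""
    question: str
    trace: str      # chain-of-thought reasoning
    solution: str   # structured solution steps
    answer: str
    source: str = "analog"

def mix_one_to_one(domain, general, seed=0):
    """Draw equal numbers of samples from both pools and shuffle them together."""
    rng = random.Random(seed)
    n = min(len(domain), len(general))
    mixed = rng.sample(domain, n) + rng.sample(general, n)
    rng.shuffle(mixed)
    return mixed

# Toy pools standing in for the analog-domain and general-domain datasets.
domain_pool = [QTSAEntry(f"q{i}", "t", "s", "a") for i in range(5)]
general_pool = [QTSAEntry(f"g{i}", "t", "s", "a", source="general") for i in range(8)]

mixed = mix_one_to_one(domain_pool, general_pool)
counts = {
    "analog": sum(e.source == "analog" for e in mixed),
    "general": sum(e.source == "general" for e in mixed),
}
```

Fixing the seed keeps the mix reproducible across runs, which matters when comparing fine-tuning variants on the same data.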

5. Catastrophic Forgetting Mitigation and Regularization Dynamics

The inclusion of a KL divergence constraint is central to the mitigation of catastrophic forgetting under SFT. Empirical results show that NSC-SFT prevents the trade-off between generalist and specialist task competency observed in prior approaches. When SFT was applied without this constraint, models targeting logically complex domains exhibited significant regression on general knowledge benchmarks; NSC-SFT stably preserves the base model’s capabilities while facilitating robust domain adaptation.

The regularization coefficient ($\lambda = 0.1$) was selected empirically for optimal balance. Memory efficiency is conditional on careful GPU resource allocation, as both the reference and target models must be maintained in memory throughout training. This imposes a practical limitation for even larger model scales or multi-domain joint adaptation.

6. Downstream Application Evidence and Practical Insights

AnalogSeeker’s NSC-SFT training protocol enabled effective participation in multi-agent design tasks (e.g., operational amplifier design in the Atelier framework). The model executed iterative topology optimization—selecting nested Miller compensation, refining phase margin with nulling resistors, and converging on industry-standard figures of merit—while providing chain-of-thought rationale and generating netlists.

Best practices arising from this work highlight the value of starting from an instruct model (versus base or reasoning models), emphasizing high-quality SFT data, and enforcing KL-based output distribution constraints to promote stable, high-fidelity domain adaptation.

7. Availability and Impact Prospects

AnalogSeeker, trained with NSC-SFT, is released under a research-only license at https://huggingface.co/analogllm/analogseeker. The methodology is anticipated to inform subsequent efforts to equip LLMs for specialized engineering domains, particularly where data is limited and consistent generalization is essential (Chen et al., 14 Aug 2025).
