AnalogSeeker: Analog Circuit Design LLM
- AnalogSeeker is an open-source language model specifically designed for analog circuit design using a curated corpus and multi-agent QTSA framework.
- It employs granular knowledge distillation by decomposing textbook content into exam-style Q–A pairs to enhance training effectiveness.
- The model utilizes Neighborhood Self-Constrained Supervised Fine-Tuning (NSC-SFT) to balance domain adaptation with retention of foundational LLM capabilities, achieving 85.04% accuracy on the AMSBench-TQA benchmark.
AnalogSeeker is an open-source foundation LLM developed specifically to address the unique data scarcity, knowledge complexity, and automation requirements of analog circuit design. Built atop a large-scale, high-quality textual corpus curated from canonical analog circuit textbooks, AnalogSeeker employs a multi-agent granular knowledge distillation method and a principled fine-tuning-centric training paradigm, introducing innovations in both training methodology and dataset construction. The model achieves state-of-the-art accuracy on dedicated analog knowledge benchmarks and demonstrates practical downstream utility in complex analog design tasks. AnalogSeeker is freely available for research at https://huggingface.co/analogLLM/analogseeker (Chen et al., 14 Aug 2025).
1. Domain-Specific Corpus Collection
AnalogSeeker’s core is rooted in a systematically assembled “textual domain corpus” that comprehensively represents the analog circuit body of knowledge. The framework for corpus collection is explicitly structured into four ascending stages:
- Circuit theory: Covers foundational passive network laws, network analysis, and both time and frequency domain analysis.
- Analog circuit basis: Encompasses device characteristics (e.g., MOSFET, BJT), fundamental amplifiers, feedback theory, and stability analysis.
- Analog integrated circuits: Focuses on concrete modules such as operational amplifiers, comparators, and integrated CMOS design practices.
- Advanced circuit topics: Addresses specialized modules such as phase-locked loops (PLLs).
Twenty canonical textbooks spanning at least a dozen key analog circuit types were curated and segmented both by learning stage and circuit class. A commercial OCR and structure extraction pipeline (Mathpix) yielded a Markdown corpus of 7.26 million tokens, hierarchically structured by chapter and subsection, with explicit extraction and anonymization of mathematical expressions and figures.
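The segmentation step can be illustrated with a minimal sketch: the snippet below splits one Mathpix-style Markdown book into chapter/subsection "learning nodes". The heading regex, the `LearningNode` fields, and the `segment_markdown` helper are illustrative assumptions, not the paper's released tooling.

```python
import re
from dataclasses import dataclass

@dataclass
class LearningNode:
    """One subsection of the textbook corpus (~2,000 tokens each in the paper)."""
    book: str
    chapter: str
    subsection: str
    text: str

# Hypothetical heading pattern for Mathpix-style Markdown: '#' = chapter, '##'/'###' = subsection.
HEADING = re.compile(r"^(#{1,3})\s+(.*)$")

def segment_markdown(book_title: str, markdown: str) -> list[LearningNode]:
    """Split one book's Markdown into chapter/subsection learning nodes."""
    nodes, chapter, subsection, buffer = [], "", "", []

    def flush():
        # Emit the buffered subsection text as a learning node, if any.
        if subsection and buffer:
            nodes.append(LearningNode(book_title, chapter, subsection, "\n".join(buffer).strip()))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level, title = len(match.group(1)), match.group(2).strip()
            if level == 1:          # chapter heading
                chapter, subsection = title, ""
            else:                   # subsection heading starts a new learning node
                subsection = title
        else:
            buffer.append(line)
    flush()
    return nodes
```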
2. Granular Knowledge Distillation via Multi-Agent Framework
To translate the dense, multifaceted textbook information into a machine-learnable supervision source, AnalogSeeker introduces a “learning node”–driven decomposition. Each subsection of the corpus (typically ~2000 tokens; 2,698 nodes in total) is transformed by a multi-agent framework into a dataset of question–answer pairs that explicitly encode reasoning steps.
The QTSA (Question–Thinking–Solution–Answer) format is employed:
- Q(i): Agent Ψ_Q generates an exam-style question directly from the learning node.
- T(i) and S(i): Agent Ψ_A, given the node and Q(i), outputs a detailed reasoning trace (<think>…</think>) followed by sequential solution steps (<solution>…</solution>).
- A(i): Agent Ψ_A concludes with the final answer (<answer>…</answer>).
Each node is sampled Nₛ = 5 times, and the resulting records are standardized by a post-processing agent Ψ_P to remove formatting inconsistencies and over-specific textbook references. This yields a fine-grained, explicit, and high-quality supervised training dataset of 112.65M tokens.
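A hedged sketch of this distillation loop follows: three agent calls per sample realize Ψ_Q, Ψ_A, and Ψ_P, and each node is sampled Nₛ = 5 times. The `chat` helper, the prompt wording, and the record schema are hypothetical placeholders for whatever LLM backend the actual pipeline uses.

```python
N_SAMPLES = 5  # N_s in the paper: QTSA records drawn per learning node

def chat(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a single LLM call; wire this to whatever backend is available."""
    raise NotImplementedError

def distill_node(node_text: str) -> list[dict]:
    """Turn one learning node into N_SAMPLES QTSA records via three agent roles."""
    records = []
    for _ in range(N_SAMPLES):
        # Psi_Q: exam-style question grounded in the learning node
        question = chat(
            "You write exam-style questions for analog circuit design courses.",
            f"Write one exam question based on this material:\n{node_text}")
        # Psi_A: reasoning trace, solution steps, and final answer in QTSA tags
        qtsa = chat(
            "You are an analog circuit expert. Respond in "
            "<think>...</think><solution>...</solution><answer>...</answer> format.",
            f"Reference material:\n{node_text}\n\nQuestion:\n{question}")
        # Psi_P: standardize tags and strip over-specific textbook references
        cleaned = chat(
            "You normalize QTSA records: fix malformed tags and remove page/figure references.",
            qtsa)
        records.append({"question": question, "response": cleaned})
    return records
```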
3. Model Architecture and Training Paradigm
AnalogSeeker fine-tunes the Qwen2.5-32B-Instruct LLM, adopting a training strategy informed by both corpus size and domain complexity:
- Fine-tuning–centric paradigm: Given the modest size of the textbook-derived unsupervised corpus, the approach eschews continual pre-training (CPT) in favor of focused supervised fine-tuning (SFT) on the distilled QTSA data; the difference in supervision is sketched after this list. Experiments show <1% absolute improvement from CPT+SFT versus SFT alone, indicating that classical pre-training is not cost-effective at this corpus scale.
- Instruct Model Preference: “Instruct” LLMs (e.g., Qwen2.5-32B-Instruct) proved more robust to further fine-tuning than “reasoning models,” with reward-optimized parameters exhibiting fragility during domain adaptation.
- Empirical and theoretical validation: Ablations confirm that SFT is the main contributor to analog circuit knowledge transfer, and that instruct models balance adaptability and capacity retention.
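The practical difference between the two supervision regimes can be made concrete: CPT would apply next-token loss over every token of the raw corpus, whereas SFT masks the prompt so that only the distilled QTSA response is predicted. The label-masking convention below follows standard Hugging Face practice (ignore index −100) and is an implementation assumption, not a detail taken from the paper.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

def build_sft_labels(prompt_ids: list, response_ids: list) -> dict:
    """SFT supervision: the model is trained to predict only the QTSA response tokens.
    CPT, by contrast, would apply next-token loss over every token of the raw corpus."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }
```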
4. Neighborhood Self-Constrained Supervised Fine-Tuning (NSC-SFT)
To ensure effective domain adaptation without catastrophic forgetting of foundational LLM capabilities, AnalogSeeker employs NSC-SFT, which regularizes the fine-tuning trajectory:
- Loss formulation (a minimal implementation sketch follows this list):

  $$\mathcal{L}_{\text{NSC-SFT}} = \mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\text{KL}}$$

  - $\mathcal{L}_{\text{CE}}$: cross-entropy loss for standard supervised learning on the distilled QTSA data.
  - $\mathcal{L}_{\text{KL}}$: Kullback–Leibler divergence between the current (fine-tuned) output distribution $\pi_\theta$ and that of the frozen reference (pre-trained) model $\pi_{\text{ref}}$, computed over the entire output vocabulary $\mathcal{V}$ at each input position $t$:

  $$\mathcal{L}_{\text{KL}} = \sum_{t}\,\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x_{<t}) \log \frac{\pi_\theta(v \mid x_{<t})}{\pi_{\text{ref}}(v \mid x_{<t})}$$

  - $\lambda$ is a tunable hyperparameter controlling the strength of the self-constraint.
- Engineering realization: Memory-peak optimization is required for 32B models with long contexts (8192 tokens), accomplished by asymmetric memory management (reference model resident on every GPU, target model distributed with DeepSpeed ZeRO-3) and careful tensor deletion.
- Convergence guarantee: Standard analysis of smooth composite losses guarantees convergence to stationary points for sufficiently small learning rates (on the order of $1/L$, where $L$ is the Lipschitz constant of the loss gradient).
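Below is a minimal PyTorch sketch of one NSC-SFT loss evaluation, assuming Hugging Face-style causal LMs for both the trainable target and the frozen reference. The λ value and the masking of the KL term to supervised positions are illustrative choices; the memory layout described above (reference replicated per GPU, target sharded with DeepSpeed ZeRO-3) is left to the training framework.

```python
import torch
import torch.nn.functional as F

def nsc_sft_loss(target_model, ref_model, input_ids, attention_mask, labels, lam=0.1):
    """One NSC-SFT loss evaluation: supervised cross-entropy plus a KL self-constraint
    that keeps the fine-tuned distribution close to the frozen reference model.
    `lam` (lambda) is illustrative; the paper treats it as a tunable hyperparameter."""
    out = target_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    ce_loss = out.loss                                  # standard SFT cross-entropy

    with torch.no_grad():                               # reference model stays frozen
        ref_logits = ref_model(input_ids=input_ids, attention_mask=attention_mask).logits

    # KL(pi_theta || pi_ref) over the full vocabulary at every supervised position
    log_p = F.log_softmax(out.logits, dim=-1)                 # fine-tuned distribution
    log_q = F.log_softmax(ref_logits, dim=-1)                 # reference distribution
    token_kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)    # [batch, seq_len]
    mask = (labels != -100).float()
    kl_loss = (token_kl * mask).sum() / mask.sum().clamp(min=1.0)

    return ce_loss + lam * kl_loss
```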
5. Performance and Evaluation Benchmarks
AnalogSeeker establishes state-of-the-art performance on AMSBench-TQA, a benchmark designed for textual QA in analog circuit knowledge:
- Accuracy: 85.04%, an absolute improvement of 15.67 percentage points over the Qwen2.5-32B-Instruct baseline; AnalogSeeker also outperforms both reasoning LLMs (QwQ-32B, 81.54%) and commercial models (GPT-4o, 73.99%; DeepSeek-v3, 84.41%).
On downstream operational amplifier design within the Atelier framework:
- Capabilities: Iterative topology design, topology modification (e.g., introducing series nulling resistors to improve phase margin), and expert-level circuit analysis in natural language.
- Trajectory documentation: Full design trajectories detail the agent’s reasoning and decision-making steps, including handling of phase margin and output swing.
6. Open Research Resource and Impact
AnalogSeeker is open-sourced for research use on HuggingFace (https://huggingface.co/analogLLM/analogseeker). This public availability:
- Facilitates reproducibility by providing access to the model weights, the QTSA dataset construction protocol, and training recipes (a minimal loading sketch follows this list).
- Enables domain adaptation and integration, e.g., fine-tuning for particular analog subfields or integration with frameworks such as Atelier.
- Accelerates research in analog EDA automation, especially in data-constrained domains where classical LLMs lack sufficient prior.
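For example, the released checkpoint can be loaded with standard Hugging Face Transformers. The sketch below assumes the model follows the usual Qwen2.5-Instruct chat template; the prompt text is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "analogLLM/analogseeker"  # open-source release referenced in the text

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

# Illustrative prompt; the chat template is assumed to follow the Qwen2.5-Instruct convention.
messages = [{"role": "user",
             "content": "Explain how a series nulling resistor improves the phase margin "
                        "of a two-stage Miller-compensated op-amp."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```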
This approach represents a marked advance in the methodological rigor of analog circuit LLM development, coupling fine-grained textbook data, explicit multi-agent knowledge distillation, purposefully constrained fine-tuning, and benchmarked validation.
7. Summary Table: AnalogSeeker Foundation Model
| Feature | Description | Quantitative Result/Value |
|---|---|---|
| Corpus size | Curated textbook corpus; Markdown, 7.26M tokens | 20 books, 2,698 learning nodes |
| Knowledge distillation format | Multi-agent QTSA (Question–Thinking–Solution–Answer) | 112.65M tokens of distilled SFT data |
| Model base | Qwen2.5-32B-Instruct | 32B parameters, 8192-token context |
| Fine-tuning method | NSC-SFT (SFT with a KL divergence self-constraint) | |
| AMSBench-TQA accuracy | Analog circuit QA benchmark | 85.04% |
| Improvement over baselines | vs. Qwen2.5-32B-Instruct, DeepSeek-v3, and GPT-4o | +15.67, +0.63, and +11.05 points |
| Availability | Open-source release on HuggingFace | https://huggingface.co/analogLLM/analogseeker |
The AnalogSeeker initiative provides both a rigorous engineering exemplar and a practical, open tool for the analog circuit design research community, demonstrating that domain-specific LLMs can close performance gaps in fields with structural data scarcity and compositional complexity.