
AnalogSeeker: Analog Circuit Design LLM

Updated 16 August 2025
  • AnalogSeeker is an open-source language model specifically designed for analog circuit design using a curated corpus and multi-agent QTSA framework.
  • It employs granular knowledge distillation by decomposing textbook content into exam-style Q–A pairs to enhance training effectiveness.
  • The model utilizes Neighborhood Self-Constrained Supervised Fine-Tuning to balance domain adaptation with foundational LLM capabilities, achieving 85.04% accuracy on benchmarks.

AnalogSeeker is an open-source foundation LLM developed specifically to address the unique data scarcity, knowledge complexity, and automation requirements of analog circuit design. Built atop a large-scale, high-quality textual corpus curated from canonical analog circuit textbooks, AnalogSeeker employs a multi-agent granular knowledge distillation method and a principled fine-tuning-centric training paradigm, introducing innovations in both training methodology and dataset construction. The model achieves state-of-the-art accuracy on dedicated analog knowledge benchmarks and demonstrates practical downstream utility in complex analog design tasks. AnalogSeeker is freely available for research at https://huggingface.co/analogLLM/analogseeker (Chen et al., 14 Aug 2025).

1. Domain-Specific Corpus Collection

AnalogSeeker is grounded in a systematically assembled “textual domain corpus” that comprehensively represents the analog circuit body of knowledge. Corpus collection is explicitly structured into four ascending stages:

  • Circuit theory: Covers foundational passive network laws, network analysis, and both time and frequency domain analysis.
  • Analog circuit basis: Encompasses device characteristics (e.g., MOSFET, BJT), fundamental amplifiers, feedback theory, and stability analysis.
  • Analog integrated circuits: Focuses on concrete modules such as operational amplifiers, comparators, and integrated CMOS design practices.
  • Advanced circuit topics: Addresses specialized modules such as phase-locked loops (PLLs).

Twenty canonical textbooks spanning at least a dozen key analog circuit types were curated and segmented both by learning stage and circuit class. A commercial OCR and structure extraction pipeline (Mathpix) yielded a Markdown corpus of 7.26 million tokens, hierarchically structured by chapter and subsection, with explicit extraction and anonymization of mathematical expressions and figures.
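
The paper does not publish its segmentation code, but a minimal sketch of the subsection-level splitting step might look like the following, assuming the OCR output is plain Markdown with `#`-style headings and using a crude whitespace-based length estimate (both are illustrative assumptions, not details from the paper):

```python
import re


def split_into_learning_nodes(markdown_text):
    """Split an OCR'd Markdown textbook into subsection-level "learning nodes".

    The heading depth (up to ###) and the whitespace-based length estimate are
    assumptions; the paper only states that nodes are subsection-sized,
    around 2,000 tokens each.
    """
    # Break the document immediately before each chapter or subsection heading.
    chunks = re.split(r"\n(?=#{1,3}\s)", markdown_text)

    nodes = []
    for chunk in chunks:
        lines = chunk.strip().splitlines()
        if not lines:
            continue
        title = lines[0].lstrip("#").strip()
        body = "\n".join(lines[1:]).strip()
        if body:
            nodes.append({
                "title": title,
                "text": body,
                "approx_tokens": len(body.split()),  # rough size estimate only
            })
    return nodes
```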

2. Granular Knowledge Distillation via Multi-Agent Framework

To translate the dense, multifaceted textbook information into a machine-learnable supervision source, AnalogSeeker introduces a “learning node”–driven decomposition. Each subsection of the corpus (typically ~2000 tokens; 2,698 nodes in total) is transformed by a multi-agent framework into datasets of question–answer pairs, explicitly encoding reasoning steps.

The QTSA (Question–Thinking–Solution–Answer) format is employed:

  • Q(i): Agent Ψ_Q generates an exam-style question directly from the learning node.
  • T(i) and S(i): Agent Ψ_A, given the node and Q(i), produces a detailed reasoning trace followed by sequential solution steps (<solution>…</solution>).
  • A(i): Agent Ψ_A concludes with the final answer (<answer>…</answer>).

Each node is sampled Nₛ = 5 times, and the generated samples are standardized by a post-processing agent Ψ_P, which removes formatting inconsistencies and over-specific references to the source text. This yields a fine-grained, explicit, and high-quality supervised training dataset of 112.65M tokens.
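
As an illustration of how the Ψ_Q, Ψ_A, and Ψ_P agents could be chained, here is a minimal Python sketch; the `llm` callable and the prompt wording are hypothetical stand-ins, not the paper's actual agents or prompt templates:

```python
N_SAMPLES = 5  # N_s in the paper: each learning node is sampled five times


def distill_node(node_text, llm):
    """Generate QTSA-style training samples for one learning node.

    `llm` is any text-in/text-out callable standing in for the teacher model;
    the prompts below are paraphrased illustrations only.
    """
    samples = []
    for _ in range(N_SAMPLES):
        # Psi_Q: write an exam-style question grounded in the node.
        question = llm(f"Write one exam-style question based on this material:\n{node_text}")

        # Psi_A: reason through the problem, then give tagged solution steps
        # and a final answer in the format used for training.
        qtsa = llm(
            "Using the material and question below, reason step by step, then "
            "output <solution>...</solution> followed by <answer>...</answer>.\n"
            f"Material:\n{node_text}\n\nQuestion:\n{question}"
        )

        # Psi_P: standardize formatting and strip over-specific references
        # (e.g., 'as shown in Fig. 3.2 of the textbook').
        cleaned = llm(f"Standardize the formatting of this training sample:\n{qtsa}")

        samples.append({"question": question, "response": cleaned})
    return samples
```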

3. Model Architecture and Training Paradigm

AnalogSeeker fine-tunes the Qwen2.5-32B-Instruct LLM, adopting a training strategy informed by both corpus size and domain complexity:

  • Fine-tuning–centric paradigm: Given the modest size of the textbook-derived unsupervised corpus, the approach eschews continual pre-training (CPT) in favor of focused supervised fine-tuning (SFT) on the distilled QTSA data. Experiments show <1% absolute improvement from CPT+SFT over SFT alone, indicating that classical pre-training is not cost-effective at this corpus scale.
  • Instruct Model Preference: “Instruct” LLMs (e.g., Qwen2.5-32B-Instruct) proved more robust to further fine-tuning than “reasoning models,” with reward-optimized parameters exhibiting fragility during domain adaptation.
  • Empirical and theoretical validation: Ablations confirm that SFT is the main contributor to analog circuit knowledge transfer, and that instruct models balance adaptability and capacity retention.

4. Neighborhood Self-Constrained Supervised Fine-Tuning (NSC-SFT)

To ensure effective domain adaptation without catastrophic forgetting of foundational LLM capabilities, AnalogSeeker employs NSC-SFT, which regularizes the fine-tuning trajectory:

  • Loss formulation:

$$L = L_{CE}(y_{\text{predict}}, y_{\text{label}}) + \lambda \cdot D_{KL}\!\left(p_{\text{predict}} \,\|\, p_{\text{ref}}\right)$$

  • $L_{CE}$: cross-entropy loss for standard supervised learning.
  • $D_{KL}(p_{\text{predict}} \,\|\, p_{\text{ref}})$: Kullback–Leibler divergence between the current (fine-tuned) output distribution and that of the reference (pre-trained) model, computed over the entire output vocabulary $V$ at each input $x$:

$$D_{KL}\!\left(p_{\theta}(v \mid x) \,\|\, p_{\theta_0}(v \mid x)\right) = \sum_{v \in V} p_{\theta}(v \mid x) \,\log\frac{p_{\theta}(v \mid x)}{p_{\theta_0}(v \mid x)}$$

  • $\lambda$: a tunable hyperparameter weighting the self-constraint term.
  • Engineering realization: Peak-memory optimization is required for a 32B model with long contexts (8192 tokens), accomplished by asymmetric memory management (reference model resident on every GPU, target model sharded with DeepSpeed ZeRO-3) and careful tensor deletion.
  • Convergence guarantee: Standard analysis of smooth composite losses guarantees convergence to stationary points for learning rates in $(0, 1/L)$, where $L$ is the Lipschitz constant.
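
A minimal PyTorch sketch of the NSC-SFT objective defined above is given below; the label masking, KL reduction, and default λ are illustrative assumptions, and the memory-management details (ZeRO-3 sharding, tensor deletion) are omitted:

```python
import torch
import torch.nn.functional as F


def nsc_sft_loss(model, ref_model, input_ids, labels, lam=0.1):
    """NSC-SFT objective: cross-entropy plus a KL self-constraint that keeps
    the fine-tuned distribution close to the frozen reference model.

    The full-vocabulary KL at every position and the additive form follow the
    paper's loss; masking and reduction choices here are assumptions.
    """
    logits = model(input_ids=input_ids).logits              # [B, T, V]
    with torch.no_grad():
        ref_logits = ref_model(input_ids=input_ids).logits  # frozen reference

    # Standard next-token cross-entropy on the supervised labels.
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # KL( p_theta || p_theta0 ) summed over the whole vocabulary at each position.
    log_p = F.log_softmax(logits, dim=-1)
    log_p_ref = F.log_softmax(ref_logits, dim=-1)
    kl = F.kl_div(log_p_ref, log_p, log_target=True, reduction="batchmean")

    return ce + lam * kl
```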

5. Performance and Evaluation Benchmarks

AnalogSeeker establishes state-of-the-art performance on AMSBench-TQA, a benchmark designed for textual QA in analog circuit knowledge:

  • Accuracy: 85.04%, an absolute improvement of 15.67 percentage points over the Qwen2.5-32B-Instruct baseline and outperforming both reasoning LLMs (QwQ-32B, 81.54%) and commercial models (GPT-4o, 73.99%; DeepSeek-v3, 84.41%).

On downstream operational amplifier design within the Atelier framework:

  • Capabilities: Iterative topology design, topology modification (e.g., introducing series nulling resistors to improve phase margin), and expert-level circuit analysis in natural language.
  • Trajectory documentation: Full design trajectories detail the agent’s reasoning and decision-making steps, including handling of phase margin and output swing.

6. Open Research Resource and Impact

AnalogSeeker is open-sourced for research use on HuggingFace (https://huggingface.co/analogLLM/analogseeker). This public availability:

  • Facilitates reproducibility by providing access to model weights, QTSA dataset construction protocol, and training recipes.
  • Enables domain adaptation and integration, e.g., fine-tuning for particular analog subfields or integration with frameworks such as Atelier.
  • Accelerates research in analog EDA automation, especially in data-constrained domains where classical LLMs lack sufficient prior.

This approach represents a marked advance in the methodological rigor of analog circuit LLM development, coupling fine-grained textbook data, explicit multi-agent knowledge distillation, purposefully constrained fine-tuning, and benchmarked validation.

7. Summary Table: AnalogSeeker Foundation Model

| Feature | Description | Quantitative Result/Value |
|---|---|---|
| Corpus size | Curated textbook corpus (Markdown, 7.26M tokens) | 20 books, 2,698 learning nodes |
| Knowledge distillation format | Multi-agent QTSA (Question–Thinking–Solution–Answer) | 112.65M tokens of distilled SFT data |
| Model base | Qwen2.5-32B-Instruct | 32B parameters, 8192-token context |
| Fine-tuning method | NSC-SFT (adds KL divergence constraint) | $L = L_{CE} + \lambda D_{KL}$ |
| AMSBench-TQA accuracy | Analog circuit QA benchmark | 85.04% |
| Improvement over baselines | vs. Qwen2.5-32B-Instruct, DeepSeek-v3, GPT-4o | +15.67, +0.63, +11.05 points |
| Availability | Open-source release on HuggingFace | https://huggingface.co/analogLLM/analogseeker |

The AnalogSeeker initiative provides both a rigorous engineering exemplar and a practical, open tool for the analog circuit design research community, demonstrating that domain-specific LLMs can close performance gaps in fields with structural data scarcity and compositional complexity.
