LLM-Based Specification Inference
- LLM-Based Specification Inference is a technique that automatically extracts formal hardware and software specifications by translating natural language and multimodal inputs via LLMs and symbolic reasoning.
- By leveraging modality-aware preprocessing and chain-of-thought prompting, the approach delivers high syntax validity and functional coverage, with reported metrics like 97.8% syntax accuracy in hardware verification.
- Hybrid neuro-symbolic pipelines and mutation-based evaluation loops enable iterative refinement and robust error mitigation, addressing potential issues such as oversimplification and hallucination.
LLM-Based Specification Inference refers to the automatic extraction or synthesis of formal and semi-formal specifications for hardware and software systems using large language models (LLMs), often augmented by symbolic or algorithmic reasoning. These methods have recently demonstrated significant progress in translating natural-language documentation, multimodal design artifacts, or imperative code into machine-verifiable specifications, assertions, and contracts. Applications span assertion-based hardware verification, formal specification of APIs, code-level contract inference, and logic-based knowledge extraction. Recent frameworks combine chain-of-thought (CoT) prompting, semantic preprocessing, mutation-driven refinement, and neuro-symbolic integration to scale beyond previous rule-based and template-driven approaches.
1. Modality-Aware and Semantic Preprocessing
State-of-the-art frameworks such as AssertCoder systematically decompose heterogeneous artifacts—including text, tables, diagrams (FSMs, timing waveforms), and formulas—into atomic content blocks with precisely annotated modality and semantic categories. Each document is segmented into blocks, each tagged by modality and semantic class (e.g., Architecture, Timing Behavior). LLMs are prompted to perform modality-sensitive normalization, converting:
- text spans to standardized UTF-8;
- tables to structured JSON arrays;
- formulas to operator trees;
- diagrams to labeled graphs extracted via OCR and structured as edges/nodes.
These normalized blocks are then dispatched to specialized analyzers: text blocks feed interface extraction, diagram paths yield formal transitions or timing logic, formulas are parsed into canonical forms, and tables yield parameter lists. A final merge step reconstructs signal-level semantic records, mapping every observed register or signal to a tuple of attributes, behavioral semantics, temporal constraints, and explicit source traceability, yielding a set of structured semantic entities (Tian et al., 14 Jul 2025).
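A minimal Python sketch of this dispatch-and-merge step follows; the `Block`/`SignalRecord` types and the analyzer stubs are illustrative assumptions, not AssertCoder's actual data model.

```python
# Minimal sketch of modality-aware dispatch-and-merge; all names illustrative.
from dataclasses import dataclass, field

@dataclass
class Block:
    modality: str        # "text" | "table" | "formula" | "diagram"
    semantic_class: str  # e.g., "Architecture", "Timing Behavior"
    content: object      # normalized payload (UTF-8 text, JSON array, tree, graph)
    source: str          # traceability back to the original document span

@dataclass
class SignalRecord:
    name: str
    attributes: dict = field(default_factory=dict)
    behavior: list = field(default_factory=list)  # behavioral semantics
    timing: list = field(default_factory=list)    # temporal constraints
    sources: list = field(default_factory=list)   # provenance of each fact

def analyze_text(block):
    # Interface extraction; one hard-coded illustrative fact for exposition.
    return [("ack", {"attributes": {"width": 1}, "behavior": ["handshake response"]})]

def analyze_table(block):   return []  # parameter lists (omitted)
def analyze_formula(block): return []  # canonical operator trees (omitted)
def analyze_diagram(block): return []  # FSM transitions / timing edges (omitted)

ANALYZERS = {"text": analyze_text, "table": analyze_table,
             "formula": analyze_formula, "diagram": analyze_diagram}

def build_semantic_records(blocks):
    """Dispatch each normalized block to its modality analyzer and merge
    the per-signal facts into unified, source-traceable records."""
    records = {}
    for block in blocks:
        for signal, facts in ANALYZERS[block.modality](block):
            rec = records.setdefault(signal, SignalRecord(name=signal))
            rec.attributes.update(facts.get("attributes", {}))
            rec.behavior += facts.get("behavior", [])
            rec.timing += facts.get("timing", [])
            rec.sources.append(block.source)
    return records
```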
2. Prompt Engineering and Chain-of-Thought-Driven Synthesis
LLM-based specification inference leverages sophisticated prompting strategies, typically in multi-stage or chain-of-thought (CoT) formats, to drive specification materialization from intermediate representations. In AssertCoder, assertion synthesis follows a four-step sequence:
- Semantic decomposition: extract {precondition, consequence, timing window} for each signal’s intent.
- Pattern selection: map semantic structure to assertion types (implication, stability, invariant).
- Temporal binding: translate loosely specified timing constraints into formal property-language constructs (e.g., SystemVerilog `##[k:l]` delay windows, `[*n]` hold/repetition semantics).
- Syntax generation: programmatically assemble the formal property in assertion language.
This CoT mechanism, frequently augmented by retrieval-augmented generation (RAG) to provide domain exemplars, achieves higher syntax validity and functional coverage compared to zero-shot methods. Empirical ablations confirm that removing CoT logic reduces the generation of functionally valid assertions by a significant margin (Tian et al., 14 Jul 2025).
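As a concrete illustration, the following Python sketch stages these four steps as successive LLM calls; `query_llm`, the prompt wording, and the pattern table are assumptions for exposition, not AssertCoder's published interface.

```python
# Hedged sketch of the four-step CoT synthesis. `query_llm` is a hypothetical
# wrapper assumed to return parsed JSON for step 1 and one of the pattern
# names for step 2.
PATTERNS = {
    "implication": "{pre} |-> {delay} {post}",
    "stability":   "{pre} |-> $stable({post})",
    "invariant":   "{pre}",
}

def synthesize_assertion(record, query_llm):
    # 1. Semantic decomposition: {precondition, consequence, timing_window}.
    intent = query_llm(f"Extract precondition, consequence, and timing window "
                       f"from this signal semantics record: {record}")
    # 2. Pattern selection: map the decomposed intent to an assertion type.
    pattern = query_llm(f"Choose one of {list(PATTERNS)} for: {intent}")
    # 3. Temporal binding: render the timing window as an SVA delay, e.g. a
    #    window of 1..3 cycles becomes '##[1:3]'.
    lo, hi = intent["timing_window"]
    delay = f"##[{lo}:{hi}]"
    # 4. Syntax generation: assemble the concrete SystemVerilog property.
    body = PATTERNS[pattern].format(pre=intent["precondition"], delay=delay,
                                    post=intent["consequence"])
    return f"assert property (@(posedge clk) {body});"
```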
3. Mutation-Based and Proof-Driven Evaluation Loops
Evaluation and refinement are typically realized via mutation analysis or integration with deductive verification tools. In AssertCoder, assertions are systematically evaluated against a set of mutated designs. Key metrics include:
- Mutation Detection Rate (MDR): fraction of mutants detected by any assertion.
- AvgMutationScore: mean number of mutants detected per assertion.
- False Positive Rate (FPR): fraction of candidate assertions that misfire on the original, unmutated design or are rejected during semantic review.
Any assertion failing to detect mutants is iteratively discarded or refined, with undetected mutants’ contexts injected back into the CoT prompt as negative examples. This closed-loop “generate → check → refine” converges when no new mutants are identified or a resource budget is exhausted (Tian et al., 14 Jul 2025).
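A compact sketch of this loop, with `generate_assertions` and `simulate` as hypothetical stand-ins for the CoT synthesis stage and the simulation harness:

```python
def refine_by_mutation(design, mutants, generate_assertions, simulate,
                       max_rounds=5):
    """Closed-loop generate -> check -> refine. `simulate(d, a)` is assumed
    to return True iff assertion `a` passes on design `d`."""
    assertions = generate_assertions(design, negatives=[])
    feedback, mdr = [], 0.0
    for _ in range(max_rounds):
        # Discard assertions that misfire on the golden design (controls FPR).
        assertions = [a for a in assertions if simulate(design, a)]
        # A mutant is "undetected" if every surviving assertion passes on it.
        undetected = [m for m in mutants
                      if all(simulate(m, a) for a in assertions)]
        mdr = 1 - len(undetected) / len(mutants)  # Mutation Detection Rate
        if not undetected:
            break  # converged: every mutant is caught
        # Inject undetected mutants' contexts as negative CoT examples.
        feedback += undetected
        assertions += generate_assertions(design, negatives=feedback)
    return assertions, mdr
```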
In software verification, similar iterative refinement emerges in LLM-driven JML inference. KeY or Frama-C deductive engines filter generated loop invariants and contract annotations, with error feedback (open proof branches or syntax errors) provided to the LLM for guided correction or fresh resampling. This mixed-strategy meta-algorithm increases the synthesis success rate to ≈90% for loop invariants (Teuber et al., 3 Feb 2025).
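The mixed strategy can be sketched as follows, with `query_llm` and `prove_with_key` as hypothetical wrappers around the LLM endpoint and the KeY prover (neither is the actual tool API):

```python
def infer_invariant(method_src, query_llm, prove_with_key,
                    repair_rounds=3, resamples=5):
    """Guided repair with deductive feedback, falling back to fresh samples."""
    for _ in range(resamples):
        candidate = query_llm(f"Propose a JML loop invariant for:\n{method_src}")
        for _ in range(repair_rounds):
            ok, feedback = prove_with_key(method_src, candidate)
            if ok:
                return candidate  # all proof branches closed
            # Guided correction: feed open branches / syntax errors back.
            candidate = query_llm(
                f"The invariant {candidate!r} failed with: {feedback}. "
                f"Revise it for:\n{method_src}")
    return None  # resampling budget exhausted
```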
4. Hybrid Neuro-Symbolic Specification Synthesis
The combination of LLMs with formal and symbolic pre-analysis structures a hybrid pipeline that enhances both abstraction and correctness:
- Concrete test cases (from PathCrawler) are embedded as prompt context, enabling the LLM to produce context-aware `ensures` clauses abstracted over observed I/O behaviors.
- Static-analysis alarms (from EVA) guide the LLM toward precise preconditions that eliminate runtime errors (e.g., by forbidding overflow and out-of-bounds access).
- Neuro-symbolic pipelines allow explicit control over intent- or implementation-oriented specification targets. For an ambiguous or buggy function, prompting can select between an implementation-oriented specification (permissive of all observed runtime behavior) and an intent-oriented specification (formalizing the intended semantics), with bug-detection recall of up to 0.93 in favorable cases (Granberry et al., 29 Apr 2025, Granberry et al., 2024); a prompt-assembly sketch follows this list.
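A hedged sketch of how such symbolic results might be packed into a prompt; `run_pathcrawler` and `run_eva` are stand-ins for the actual tool invocations, and the prompt phrasing is illustrative:

```python
def build_spec_prompt(c_function, run_pathcrawler, run_eva, target="intent"):
    """Assemble a specification prompt from symbolic pre-analysis results."""
    tests = run_pathcrawler(c_function)   # concrete per-path I/O pairs
    alarms = run_eva(c_function)          # potential runtime errors to forbid
    goal = ("Formalize the intended semantics; treat runtime errors as bugs."
            if target == "intent"
            else "Formalize the implementation as-is, permitting all observed behavior.")
    return (f"Function under analysis:\n{c_function}\n\n"
            f"Observed I/O behavior: {tests}\n"
            f"Static-analysis alarms to rule out via preconditions: {alarms}\n\n"
            f"Write ACSL requires/ensures clauses. {goal}")
```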
Logical deletion—a lightweight LLM-based self-verification step—further prunes candidate annotations by asking the LLM to validate whether each invariant or contract holds on the relevant code slice, increasing specification relevance while avoiding the overpruning of strictly proof-obligation-based filtering (Chen et al., 12 Sep 2025).
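A minimal sketch of logical deletion, assuming a hypothetical `query_llm` wrapper that answers yes/no:

```python
def logical_deletion(slices_to_candidates, query_llm):
    """Keep an annotation only if the LLM judges it to hold on its slice."""
    kept = []
    for code_slice, candidates in slices_to_candidates.items():
        for annotation in candidates:
            verdict = query_llm(
                f"Does the annotation `{annotation}` hold on every execution "
                f"of this code slice?\n{code_slice}\nAnswer yes or no.")
            if verdict.strip().lower().startswith("yes"):
                kept.append(annotation)  # retained as relevant and plausible
    return kept
```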
5. Application Domains and Quantitative Performance
LLM-based specification inference has been validated across diverse targets:
| Domain | Framework | Target Spec | Functional Coverage/Accuracy | Notable Features |
|---|---|---|---|---|
| Hardware RTL | AssertCoder | SVA (SystemVerilog) | 97.8% syntax, 87.6% func. corr., 85.6% MDR, FPR 3.2% (Tian et al., 14 Jul 2025) | Multimodal, mutation-guided, CoT |
| Hardware RTL | AssertLLM | SVA | 89% syntax+func. | Multi-agent, RAG, waveform-image support |
| C Verification | SLD-Spec | ACSL | 90.9% program pass, 95.1% assertion verified (Chen et al., 12 Sep 2025) | Slicing, logical deletion |
| C Verification | Deepseek-R1+Symb | ACSL | Bug-detection recall up to 0.93, clause counts modulated by context (Granberry et al., 29 Apr 2025) | Neuro-symbolic, configurable intent |
| Java Verification | (KeY-based) | JML | ≈90% invariant generation on benchmarks (Teuber et al., 3 Feb 2025) | Feedback-based, deductive oracle |
| LTL Extraction | Two-stage LLM | LTL | Up to 71.6% ACC (two-stage), 14% FP (Li et al., 2 Apr 2025) | Annotation-then-conversion, empirical dataset |
| API Inference | RESTSpecIT | OpenAPI | 85.05% route recall, 81.05% param recall, <$0.01/API (Decrop et al., 2024) | Masking loop, black-box validation |
These results establish that LLM-based approaches can materially outperform vanilla LLM prompting (e.g., GPT-4o zero-shot) and compete with or surpass manually guided baselines.
6. Limitations, Error Modes, and Future Directions
Known limitations include:
- Oversimplification: LLMs tend to collapse complex temporal/logical structure into simple boundary-check constraints or fabricate spurious conditions in under-specified domains (Li et al., 2 Apr 2025).
- Hallucination: Unverifiable or “invented” specifications can arise, especially in single-stage or end-to-end settings, with false-positive rates of up to 24.3% reported for LTL extraction (Li et al., 2 Apr 2025).
- Modality/context-window limitations: Long, highly graphical specifications may exceed prompt-processing limits; hybrid pre-filtering or document chunking partially alleviates this.
- Domain transfer: Most existing experiments focus on C, Java, and hardware; generalizing to richer specification languages (e.g., separation logic or security SLAs) and multi-language codebases remains ongoing work (Chen et al., 12 Sep 2025).
Moving forward, active research areas include retrieval-augmented context, system-level feedback loops, fine-tuning for domain transfer, multi-stage neuro-symbolic pipelines, and richer intra-prompt fact-checking or consistency validation mechanisms.
7. Theoretical and Methodological Foundations
Formally, the LLM-based specification inference problem can be expressed as the synthesis of a function mapping (possibly multimodal) code artifacts to a space of well-formed specifications, with correctness defined in terms of soundness and adequacy of pre/postconditions, temporal logic formulas, or assertion statements. Verification conditions are typically established via weakest-precondition reasoning (software), model checking (hardware), or mutation analysis (both).
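One way to render this statement formally, under notation assumed here for illustration (a behavior set per artifact and an intended property), is:

```latex
% Notation assumed for illustration; not drawn from any single cited paper.
% Infer maps an artifact a (code, docs, diagrams) to a specification phi.
\[
  \mathsf{Infer} : \mathcal{A} \to \mathcal{S}, \qquad \varphi = \mathsf{Infer}(a)
\]
% Soundness: every behavior sigma in the artifact's behavior set B(a)
% satisfies the inferred specification.
\[
  \forall \sigma \in \mathcal{B}(a).\ \sigma \models \varphi
\]
% Adequacy: the inferred specification entails the intended property.
\[
  \varphi \Rightarrow \varphi_{\mathrm{intent}}
\]
```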
A plausible thesis is that LLMs, equipped with localized context, symbolic guides, and error-driven refinement, close the human-in-the-loop bottleneck in specification mining, but are not yet universally reliable in unconstrained, high-noise regimes. Integration with symbolic verifiers, neuro-symbolic feedback, and iterative correction pipelines remains essential for production-grade accuracy and functional soundness (Tian et al., 14 Jul 2025, Granberry et al., 29 Apr 2025, Teuber et al., 3 Feb 2025).