LLMs in Requirements Engineering
- LLMs in Requirements Engineering are transformer-based models that automate activities like elicitation, traceability, and specification, reducing manual workload.
- Techniques such as prompt engineering, multi-agent strategies, and fine-tuning enable these models to generate high-quality, compliant requirement artifacts.
- Empirical studies indicate LLMs drastically cut drafting time, improve document completeness, and streamline compliance checking in RE processes.
LLMs in Requirements Engineering are transforming how stakeholders elicit, analyze, specify, verify, and maintain software requirements. LLMs are pre-trained neural architectures—typically transformer-based—capable of sophisticated natural language understanding and generation. Their integration into requirements engineering processes leverages large-scale data-driven learning to automate or augment many tasks that previously required laborious manual analysis. LLMs are now being systematically adapted through specialized guidelines, instruction tuning, agent-based collaboration, and prompt engineering, addressing both generic and domain-specific requirements activities. Their contribution spans drafting, structural analysis, traceability, prioritization, specification formalization, compliance checking, qualitative data analysis, and education, with the field converging on a hybrid paradigm where LLMs accelerate and standardize routine work while human experts ensure technical depth and contextual alignment.
1. Foundational Roles and Architectural Selection
LLMs support a spectrum of requirements engineering (RE) activities, including classification, completion, traceability, and specification generation. They overcome limitations of traditional NLP pipelines by requiring fewer labeled examples and by offering pre-trained contextual embeddings that can be rapidly repurposed for task-specific RE applications (Vogelsang et al., 21 Feb 2024). Model selection is informed by two principal criteria, illustrated by the decision sketch after the list below:
- Task type: “Understanding” tasks (e.g., classification, traceability) are best addressed by encoder-only models (such as BERT), especially when labeled data is available. “Generation” tasks (e.g., summarization, test case generation) favor decoder-only models (such as GPT) or encoder-decoder models (like T5).
- Operational mode: The choice between autonomous LLM-led automation and support roles dictates data quality and ground-truth requirements. Encoder-only models typically excel when reliable annotated data is accessible; decoder-oriented models via API are used when labeled data is scarce or privacy constraints restrict corpus size.
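A minimal decision sketch of these criteria in Python; the task names, labeled-data threshold, and return strings are illustrative assumptions rather than values prescribed in the cited guidelines:

```python
def select_model_family(task: str, labeled_examples: int, data_may_leave_premises: bool) -> str:
    """Heuristic sketch of the understanding-vs-generation selection logic."""
    understanding_tasks = {"classification", "traceability", "defect detection"}

    if task in understanding_tasks and labeled_examples >= 500:
        # Reliable annotated data is available: fine-tune an encoder-only model such as BERT.
        return "encoder-only, fine-tuned locally"
    if not data_may_leave_premises:
        # Privacy constraints rule out hosted APIs: prompt a locally deployed open model instead.
        return "local decoder-only or encoder-decoder model, prompted"
    # Generation tasks, or scarce labels: prompt a hosted decoder-only model via API (optionally with RAG).
    return "decoder-only via API, prompt engineering / RAG"

print(select_model_family("traceability", 1200, data_may_leave_premises=False))
# -> encoder-only, fine-tuned locally
```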
Fine-tuning large models for RE is performed through:
- Repurposing via output layers (e.g., dense heads added to BERT).
- Domain adaptation on unlabeled RE corpora.
- Supervised full fine-tuning (jointly updating output layers and select transformer blocks).
- Advanced prompt engineering and Retrieval-Augmented Generation (RAG) for decoder-only models. Regularization and early stopping are standard practices to avoid overfitting (Vogelsang et al., 21 Feb 2024). A minimal sketch of the head-repurposing approach follows this list.
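A minimal repurposing sketch, assuming the Hugging Face transformers and torch packages; the label set, model checkpoint, and layer-freezing choice are illustrative, and the training loop (with regularization and early stopping) is omitted:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

labels = ["functional", "quality", "constraint"]              # illustrative RE label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)               # attaches a fresh dense classification head
)

# Supervised fine-tuning would jointly update the head and selected transformer blocks;
# here only the head and the last encoder layer are left trainable.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier") or "encoder.layer.11" in name

requirement = "The system shall encrypt all stored user data."
batch = tokenizer(requirement, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**batch).logits                            # head is still untrained, so this is a placeholder
print(labels[int(logits.argmax())])
```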
2. LLM-Driven Requirements Artifacts and Quality Assurance
LLMs have demonstrated the capacity to automatically generate requirements artifacts that approach or, in specific use cases, exceed the output quality of entry-level engineers. GPT-4 and CodeLlama were found to produce Software Requirements Specification (SRS) documents with high completeness, consistency, correctness, clarity, feasibility, traceability, modularity, and compliance, scoring on par with human benchmarks on a composite metric $S = \frac{1}{n}\sum_{i=1}^{n} s_i$, where $s_i$ is the score per evaluation criterion (Krishna et al., 27 Apr 2024).
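Read as an unweighted mean, the composite score can be computed as below; the per-criterion values are illustrative, and the cited study's exact rubric and weighting may differ:

```python
# Hypothetical per-criterion scores on a 1-5 scale.
criteria_scores = {
    "completeness": 4, "consistency": 5, "correctness": 4, "clarity": 5,
    "feasibility": 4, "traceability": 3, "modularity": 4, "compliance": 5,
}
composite = sum(criteria_scores.values()) / len(criteria_scores)   # S = (1/n) * sum of s_i
print(f"Composite SRS score: {composite:.2f} / 5")                 # 4.25 / 5
```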
Moreover, LLMs can identify problems within existing requirements (such as inconsistency, ambiguity, or missing elements) and iterate corrective feedback cycles—particularly with GPT-4, which outperformed CodeLlama for rectification tasks. Time efficiency improvements are striking; LLMs have reduced SRS drafting times by approximately 60–70% relative to entry-level engineers, with error detection and document refinement cycles correspondingly accelerated (Krishna et al., 27 Apr 2024).
In requirements quality assurance under ISO 29148, LLMs such as Llama 2 (70B) provide binary evaluations per quality characteristic (e.g., completeness, singularity, verifiability), generate rationales for their judgments, and propose actionable improvements. These outputs undergo human review, and while initial LLM precision may be low, recall is very high; a combined assessment workflow leveraging both independent and “bound” (LLM-aware) evaluation phases led to stronger reviewer agreement (Lubos et al., 20 Aug 2024).
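A minimal sketch of such a per-characteristic assessment loop; the `llm` callable, the JSON output contract, and the characteristic subset are assumptions for illustration, not the cited study's exact prompt design:

```python
import json

ISO29148_CHARACTERISTICS = ["completeness", "singularity", "verifiability"]  # illustrative subset

def assess_requirement(requirement: str, llm) -> dict:
    """Collect a binary verdict, rationale, and suggested improvement per quality characteristic.
    `llm` is an assumed callable mapping a prompt string to the model's text response."""
    results = {}
    for characteristic in ISO29148_CHARACTERISTICS:
        prompt = (
            f'Requirement: "{requirement}"\n'
            f"Does this requirement satisfy the ISO 29148 characteristic '{characteristic}'?\n"
            'Respond as JSON: {"verdict": "yes|no", "rationale": "...", "improvement": "..."}'
        )
        results[characteristic] = json.loads(llm(prompt))  # outputs are vetted by human reviewers downstream
    return results
```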
3. Elicitation, Specification, and Prioritization Methodologies
LLMs are directly compared with human experts in elicitation tasks: in controlled studies, ChatGPT-4 produced requirements rated +1.12 higher for alignment with stakeholder intent and trended 10.2% more complete, with generation speeds 720x faster and costs at just 0.06% of human output (Hymel et al., 31 Jan 2025). Despite this, end-users tend to attribute higher alignment to “human” authorship, underscoring the need for hybrid workflows.
For specification, LLMs can automate the creation of detailed artifacts such as user stories, epics, and full Functional Design Specification (FDS) documents by ingesting summarized inputs and company-conformant templates (Pasquale et al., 25 Jul 2025). Empirical assessment against analyst-curated specifications shows that while LLMs excel in producing structurally consistent, high-quality drafts, final outputs still require expert review to ensure technical completeness and contextual specificity, especially when source knowledge is fragmented or omitted during summarization.
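A minimal sketch of the kind of structural check that keeps generated drafts aligned with a company template before expert review; the section names and draft contents are illustrative assumptions:

```python
# Template-mandated FDS sections (hypothetical).
TEMPLATE_SECTIONS = ["Purpose", "Scope", "Functional Overview", "Interfaces", "Acceptance Criteria"]

def missing_sections(generated_fds: dict) -> list:
    """Return template sections that are absent or empty in an LLM-generated draft,
    flagging the document for analyst review."""
    return [s for s in TEMPLATE_SECTIONS if not generated_fds.get(s, "").strip()]

draft = {"Purpose": "Automate intake triage.", "Scope": "Web portal only.", "Interfaces": "REST API."}
print(missing_sections(draft))  # ['Functional Overview', 'Acceptance Criteria']
```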
Prioritization is another key domain: web-based tools driven by LLMs implement established methodologies such as the Analytic Hierarchy Process (AHP), MoSCoW, and the “100 Dollar Test,” converting free-text inputs into formalized user stories and ranked requirement backlogs (Sami et al., 5 Apr 2024). Prompt engineering drives the mapping from natural-language input to actionable, prioritized artifacts.
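As one concrete example, the aggregation step of the “100 Dollar Test” reduces to summing each stakeholder's budget allocation per story and ranking the totals; the stakeholders, story IDs, and allocations below are illustrative:

```python
from collections import defaultdict

votes = {  # stakeholder -> {user story id: dollars}; each allocation sums to 100
    "alice": {"US-1": 50, "US-2": 30, "US-3": 20},
    "bob":   {"US-1": 20, "US-2": 60, "US-3": 20},
}

totals = defaultdict(int)
for allocation in votes.values():
    assert sum(allocation.values()) == 100, "each stakeholder spends exactly $100"
    for story, dollars in allocation.items():
        totals[story] += dollars

backlog = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(backlog)  # [('US-2', 90), ('US-1', 70), ('US-3', 40)]
```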
4. Specialized and Advanced LLM Applications in RE
LLMs are increasingly employed in sophisticated and domain-adapted RE tasks:
- Safety and Compliance: In autonomous driving, task-specific LLM pipelines automate portions of hazard analysis and risk assessment (HARA), decomposing item definitions into scenarios, malfunctions, hazardous events, and safety goals, with subsequent redundancy and consistency checks enforced via tailored prompts and rule-based post-processing (Nouri et al., 24 Mar 2024). Industry integration showed HARA cycles reduced from months to one day, with expert teams maintaining verification oversight.
- Non-Functional Requirements (NFR) Generation: Fine-grained frameworks prompt LLMs to infer ISO/IEC 25010:2023-aligned NFRs from functional requirements. Evaluation by industry experts found median validity and applicability scores of 5.0/5, with 80.4% exact attribute classification agreement and high model-dependent performance (Almonte et al., 19 Mar 2025).
- Formalization: LLMs assist in translating ambiguous natural language requirements into formal specifications. Surveyed approaches include prompt-only methods, chain-of-thought-augmented few-shot prompts, fine-tuned neuro-symbolic pipelines (LLM plus SMT solver), and collaborative human-in-the-loop refinement. Assertion generation for domains like Dafny yields better results than full contract synthesis for languages such as Java Modeling Language; verification-in-loop methods drive further refinement (Beg et al., 13 Jun 2025).
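A minimal sketch of the solver-backed half of such a neuro-symbolic pipeline, assuming the z3-solver package; the two candidate formalizations are hard-coded stand-ins for what an LLM might emit from natural-language requirements:

```python
from z3 import Real, Solver, sat

response_time = Real("response_time")  # milliseconds

# Candidate formalizations an LLM might propose for two natural-language requirements:
#   R1: "The system shall respond within 200 ms."
#   R2: "Responses shall take at least 250 ms to allow for auditing."  (deliberately conflicting)
r1 = response_time < 200
r2 = response_time >= 250

solver = Solver()
solver.add(r1, r2)
if solver.check() == sat:
    print("Formalized requirements are consistent; witness:", solver.model())
else:
    print("Inconsistency detected: route the conflict back to the LLM / analyst for refinement.")
```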
5. Prompt Engineering, Multi-Agent Strategies, and Qualitative Data Analysis
Prompt engineering is both foundation and bottleneck in LLM-RE integration:
- Systematic reviews categorize guidelines into nine themes: context, persona, templates, disambiguation, reasoning, analysis, keywords, wording, and few-shot prompting (Ronanki et al., 4 Jul 2025). Chain-of-thought templates, contextual anchoring, and persona conditioning are particularly relevant for RE.
- Expert interviews underscore that context, role simulation, and stakeholder-focused prompt patterns yield improved creative exploration and disambiguation, but challenges persist in multi-faceted reasoning and avoiding overly narrow scope (Ronanki et al., 4 Jul 2025).
- Multi-Agent Debate (MAD) strategies assign agents adversarial roles: “debaters” argue for opposing classes of a requirement (functional vs. non-functional), and a “judge” selects the final classification; a minimal debate sketch follows this list. MAD has been shown to improve F1-scores (from 0.726 to 0.841 with one interaction round) over single-pass LLMs, although at the cost of order-of-magnitude increases in token usage, time, and financial outlay (Oriol et al., 8 Jul 2025).
- Qualitative Data Analysis (QDA) for requirements tasks—using context-rich, few-shot GPT-4 or LLaMA-2—achieves substantial agreement with human analysts as measured by Cohen’s $\kappa$. Automated annotation promotes traceability, domain model class definition, and round-trip verification while reducing manual overhead (Shah et al., 27 Apr 2025).
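A minimal sketch of one MAD round for binary requirement classification; the `llm` callable and the prompt wording are assumptions, simplified relative to the cited setup:

```python
def classify_with_debate(requirement: str, llm, rounds: int = 1) -> str:
    """Debaters argue opposing labels, then a judge issues the final verdict.
    `llm` is an assumed callable mapping a prompt string to the model's text response."""
    stances = {"functional": "", "non-functional": ""}
    for _ in range(rounds):
        for label in stances:
            opponent = next(l for l in stances if l != label)
            stances[label] = llm(
                f"You argue that the requirement below is {label.upper()}.\n"
                f"Requirement: {requirement}\n"
                f"Opposing argument so far: {stances[opponent] or '(none yet)'}\n"
                "Give your strongest counter-argument."
            )
    verdict = llm(
        "You are the judge. Decide whether the requirement is 'functional' or 'non-functional'.\n"
        f"Requirement: {requirement}\n"
        f"Case for functional: {stances['functional']}\n"
        f"Case for non-functional: {stances['non-functional']}\n"
        "Answer with exactly one label."
    )
    return verdict.strip().lower()
```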
6. Evaluation, Verification, and Failure Modes
Automated verification with LLMs yields promising but qualified results. For example, when provided with system specifications, advanced LLMs such as GPT-4o and Claude 3.5 Sonnet achieved F1-scores of 79%–94% in matching requirements to system descriptions, where $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, though still lagging behind formal rule-based benchmarks (Reinpold et al., 18 Nov 2024). Performance drops as task and prompt complexity increase, and few-shot prompting consistently outperforms baseline and chain-of-thought strategies. Well-structured, concise system descriptions further boost accuracy.
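For reference, a worked instance of the reported metric; the link counts are illustrative, not taken from the cited evaluation:

```python
tp, fp, fn = 47, 6, 7                      # matched, spurious, and missed requirement-to-description links
precision = tp / (tp + fp)                 # ≈ 0.887
recall = tp / (tp + fn)                    # ≈ 0.870
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")   # F1 ≈ 0.88
```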
However, systematic failures exist. When LLMs are prompted to judge code compliance against natural language requirements, more complex “explain and fix” prompting often induces an over-correction bias, increasing false negatives, i.e., misidentifying correct implementations as defective. The Requirement Conformance Recognition Rate (RCRR) is highest with direct, binary prompts. Proposed mitigations include phase-separated contract extraction and behavior-comparison prompting (Jin et al., 17 Aug 2025).
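A minimal sketch of the direct, binary prompting setup, together with one plausible reading of RCRR as the share of conformance verdicts recognized correctly over a labeled benchmark; the `llm` callable, prompt wording, and metric definition are assumptions and may differ from the cited paper:

```python
def judge(requirement: str, code: str, llm) -> bool:
    """Direct binary prompt: no explanation or fix is requested, which reduces over-correction bias."""
    answer = llm(
        f"Requirement: {requirement}\nImplementation:\n{code}\n"
        "Does the implementation conform to the requirement? Answer only YES or NO."
    )
    return answer.strip().upper().startswith("YES")

def rcrr(samples, llm) -> float:
    """Fraction of (requirement, code, ground-truth verdict) triples judged correctly."""
    correct = sum(judge(req, code, llm) == truth for req, code, truth in samples)
    return correct / len(samples)
```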
7. Practical Impact, Education, and Ongoing Challenges
LLMs are augmenting RE education by providing scalable support in tasks such as elicitation, documentation, and stakeholder analysis. Empirical studies indicate enhanced concept comprehension and task motivation, especially in individually guided settings. Concerns remain about academic integrity, prompt engineering proficiency, and overreliance, leading to recommendations for clear usage guidelines, reflection-based assignments, and further research on collaborative AI–human learning models (Guardado et al., 7 Sep 2025).
Across domains, recurring challenges and frontiers include:
- Domain specificity and adaptation to niche contexts;
- Integration with existing tools and CASE environments;
- Managing model bias and factual correctness in critical applications;
- Cost, scalability, and explainability, especially in multi-agent or debate-based constructs.
Future research is converging on domain-specific prompting and tuning methodologies, hybrid neuro-symbolic frameworks combining LLMs and formal verification, dynamic retrieval and evidence augmentation, and systematic benchmarking on both realistic and adversarial RE scenarios.
In conclusion, LLMs are fundamentally reshaping requirements engineering by automating routine document drafting, enhancing traceability and prioritization, enabling more precise quality control, and supporting rapid iteration. Their real-world impact is tightly coupled to advances in prompt engineering, instruction tuning, and multi-agent debate, bridging gaps between creative human analysis and scalable, standardized automation. The evidence base robustly supports a synergistic model in which LLMs accelerate and structure the RE process, while experienced engineers provide domain expertise, contextual sensitivity, and oversight.