AI in Software Engineering

Updated 8 December 2025
  • AI in software engineering is the integration of machine learning and automation to enhance design, coding, testing, and maintenance processes, exemplified by prompt-driven code generation and autonomous agents.
  • Iterative decision frameworks and human-AI collaboration drive rapid artifact acceptance by balancing productivity gains with quality control and risk management.
  • Emerging challenges include ensuring code safety, maintaining explainability, and developing robust verification techniques as systems evolve towards autonomous, self-evolving architectures.

AI in software engineering (SE) encompasses a spectrum of technologies and practices in which machine learning, large language models (LLMs), and automated agents augment or automate the design, implementation, testing, and maintenance of software. The integration of AI transforms how software artifacts are conceived, constructed, and evolved, shifting traditional human-centered workflows toward more dynamic, exploratory, and collaborative processes. Key advances include prompt-driven code generation (“vibe coding”), multi-dimensional decision frameworks for artifact acceptance, autonomous agents, and post-hoc model recovery to mitigate the fragility and opacity of AI-generated systems. AI in SE is characterized by both unprecedented productivity gains and novel risks (e.g., hallucinations, security vulnerabilities, overtrust), demanding rigorous oversight and continuous adaptation of engineering practices (Garousi et al., 23 Jul 2025, Cito et al., 4 Nov 2025, Navneet et al., 15 Aug 2025, Hassan et al., 8 Oct 2024).

1. Process Models and Decision Frameworks in AI-Assisted Software Engineering

Contemporary AI-augmented SE workflows are best framed as iterative, micro-decision-driven processes rather than linear pipelines. A four-phase pragmatic model—prompt design, inspection, fallback, and refinement—captures real-world engagement with tools such as Copilot and ChatGPT (Garousi et al., 23 Jul 2025). The iterative loop (Prompt Design → Inspection → [Refine or Fallback] → Inspection) is characterized by continuous cycling until an artifact meets the cost–quality constraints imposed by project and team standards.
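This loop can be expressed as a simple control structure. The sketch below is an illustrative rendering only; the callables `draft`, `inspect`, and `refine` are hypothetical stand-ins for the developer's prompting, review, and revision activities, which the model describes as human-driven phases rather than API calls.

```python
from typing import Callable, Optional

def ai_assisted_iteration(
    draft: Callable[[], str],            # Prompt Design: produce an initial artifact
    inspect: Callable[[str], bool],      # Inspection: does it meet cost-quality constraints?
    refine: Callable[[str], str],        # Refinement: adjust the prompt/artifact and retry
    max_rounds: int = 5,
) -> Optional[str]:
    """Hypothetical sketch of the Prompt Design -> Inspection -> [Refine or Fallback] loop."""
    artifact = draft()
    for _ in range(max_rounds):
        if inspect(artifact):
            return artifact              # artifact accepted
        artifact = refine(artifact)
    return None                          # Fallback: revert to manual development
```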

A quantitative two-dimensional decision framework supports artifact acceptance by plotting expected quality ($Q$) of the AI-generated output against the effort saved ($E$) relative to manual development. The acceptance region is governed by a threshold function, typified as $Q \geq \alpha E + \beta$, where $\alpha$ encodes risk tolerance and $\beta$ is the minimum quality baseline. Practical application of this framework leads to rapid, explicit decisions: accept with minimal edits in the high-E/high-Q quadrant, reject or refine in the low-E/low-Q zone (Garousi et al., 23 Jul 2025).
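A minimal sketch of the acceptance rule, assuming $Q$ and $E$ are normalized to $[0, 1]$; the $\alpha$ and $\beta$ defaults below are illustrative placeholders, not values reported in the study:

```python
def accept_artifact(quality: float, effort_saved: float,
                    alpha: float = 0.5, beta: float = 0.4) -> bool:
    """Accept an AI-generated artifact iff Q >= alpha * E + beta.

    quality (Q) and effort_saved (E) are assumed normalized to [0, 1];
    alpha encodes risk tolerance and beta the minimum quality baseline.
    Default parameter values are illustrative only.
    """
    return quality >= alpha * effort_saved + beta

# High-E/high-Q quadrant: accept with minimal edits.
assert accept_artifact(quality=0.9, effort_saved=0.8)
# Low-E/low-Q zone: reject or fall back to manual development.
assert not accept_artifact(quality=0.3, effort_saved=0.2)
```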

Field studies in Türkiye and Azerbaijan demonstrate the model’s effectiveness. For example, AI-generated boilerplate code for REST controllers was directly accepted (high E, high Q), whereas business-rule–intensive artifacts often fell below acceptance thresholds, requiring manual fallback or prompt refinement. Complex modules are frequently decomposed (divide-and-conquer) such that AI can assist with isolated logic but human oversight retains control of architectural cohesion (Garousi et al., 23 Jul 2025).

2. Human–AI Collaboration: Roles, Oversight, and Adoption Dynamics

AI-driven SE is not monolithic; developers assign divergent roles to AI—a spectrum that empirically splits into “tool” (inanimate utility, demanding high determinism and zero hallucination) and “teammate” (collaborator, assistant, or expert with allowance for imperfection and iterative dialogue) (Zakharov et al., 29 Apr 2025). These mental models directly affect tolerance for error, acceptance behavior, and adoption rates. Empirical analysis shows a positive association between conceptualizing AI in multiple roles and both perceived usefulness and ease of use.

Effective collaboration demands explicit allocation of roles and responsibilities. Workshops show that clear specification of AI “persona” (e.g., “Act as Senior Python Developer”), iterative prompt refinement, and the maintenance of a communal prompt library foster efficiency and shared learning. However, human oversight is indispensable: for complex algorithms, ambiguous requirements, and especially for security-critical code, peer review and validation must remain mandatory (Hamza et al., 2023).

Adoption and integration of AI in SE are driven more by compatibility with existing workflows than by perceived usefulness or social influence. Empirical models (HACAF) show that seamless integration—custom APIs, IDE plug-ins, and transparent human-in-the-loop controls—overrides generic efficiency gains in motivating developers to embrace AI4SE tools (Russo, 2023).

3. Taxonomies and Risk: Dimensions of AI Application

A systematic taxonomy such as AI-SEAL classifies AI-SE applications along three orthogonal axes: Point of Application (process, product, runtime), Type of AI (symbolist, connectionist, evolutionary, Bayesian, analogizer), and Level of Automation (ranging from purely assistive to fully autonomous) (Feldt et al., 2018). Risk is a monotonic function of both application point and autonomy. Process-level, low-automation uses (suggestions, analytics) pose minimal risk, while runtime implementations with high autonomy (self-modifying code, live AI agents) entail maximal risk and require extensive testing, monitoring, and real-time oversight.
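The axes and the monotonic risk relation can be made concrete with simple enumerations. The ordering below is a simplified sketch of the "risk grows with application point and autonomy" intuition; the three-step automation scale and the product heuristic are this sketch's simplifications, not AI-SEAL's own scoring.

```python
from enum import IntEnum

class PointOfApplication(IntEnum):      # AI-SEAL axis 1
    PROCESS = 1
    PRODUCT = 2
    RUNTIME = 3

class LevelOfAutomation(IntEnum):       # AI-SEAL axis 3, simplified here to three steps
    ASSISTIVE = 1
    SUPERVISED = 2
    AUTONOMOUS = 3

# Axis 2 (Type of AI: symbolist, connectionist, evolutionary, Bayesian, analogizer)
# is omitted for brevity.

def risk_rank(point: PointOfApplication, autonomy: LevelOfAutomation) -> int:
    """Monotonic heuristic: risk grows with both application point and autonomy.
    The product is an illustrative ordering, not AI-SEAL's own scoring."""
    return int(point) * int(autonomy)

# Process-level assistive suggestions rank far below runtime autonomous agents.
assert risk_rank(PointOfApplication.PROCESS, LevelOfAutomation.ASSISTIVE) < \
       risk_rank(PointOfApplication.RUNTIME, LevelOfAutomation.AUTONOMOUS)
```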

Best practices stipulate gradual escalation of automation and runtime application only after establishing confidence at lower-risk quadrants. In high-stakes domains, organizations are advised to favor interpretable AI mechanisms and deploy continuous post-mortem analysis, real-time logging, and rollback capabilities to mitigate the inherent opacity and unpredictability of advanced AI systems.

4. Challenges: Quality, Safety, Explainability, and Verification

AI-assisted SE yields both productivity gains and a spectrum of technical and organizational risks. Key challenges include vulnerability inheritance from training data, hallucinated or fabricated code, overtrust, prompt misinterpretation, and irreversibility of destructive actions. The SAFE-AI framework proposes countermeasures: (1) Safety via guardrails and sandboxing, (2) Auditability through immutable logs and deviation metrics, (3) Feedback using in-IDE reporting and prompt versioning, and (4) Explainability through model-native explanations and human-in-the-loop gates (Navneet et al., 15 Aug 2025).
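As one concrete illustration of the auditability pillar, AI actions can be written to an append-only, hash-chained log so that later deviation analysis can detect tampering or gaps. This is a generic sketch of the "immutable logs" idea, not the SAFE-AI reference implementation; class and field names are invented.

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only, hash-chained record of AI actions (illustrative sketch only)."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def record(self, action: str, detail: dict) -> str:
        prev_hash = self._entries[-1]["hash"] if self._entries else "0" * 64
        payload = {"ts": time.time(), "action": action, "detail": detail, "prev": prev_hash}
        entry_hash = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        self._entries.append({**payload, "hash": entry_hash})
        return entry_hash

log = AuditLog()
log.record("suggestion_accepted", {"file": "app.py", "tool": "assistant"})
```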

Autonomy–reversibility taxonomies further classify behaviors from suggestive (code completions, fully reversible) to destructive (irreversible file operations), guiding when stringent oversight must override AI suggestions. Formal code verification through proof tools (e.g., AutoProof, Dafny) is rapidly becoming essential as conventional test metrics fail to guarantee correctness in large-$N$ modular systems: theoretical modeling shows that if each of $N$ modules is correct with probability $p$, the system-level correctness probability $p_{\mathrm{system}} = p^N$ drops exponentially with module count, becoming prohibitively small in large modern codebases (Meyer, 28 Nov 2025).
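A quick numeric illustration of the compounding effect (values chosen for illustration, not taken from the paper):

```python
# p_system = p ** N: per-module confidence does not compose at scale.
p, N = 0.999, 1000            # 99.9% per-module correctness, 1,000 modules (illustrative)
p_system = p ** N
print(f"{p_system:.3f}")      # ~0.368: the whole system is more likely wrong than right
```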

A hybrid process of “vibe-contracting” and “vibe-coding”—AI-assisted specification plus formal proof-guided code refinement—effectively closes the gap from “probably correct” to “provably correct” software. AI cycles back and forth between proposing specification/code and responding to failed proof obligations until all verification conditions clear, containing hallucinations and aligning outputs with strict semantics (Meyer, 28 Nov 2025).
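A minimal sketch of this refinement loop, assuming the prover (e.g., Dafny or AutoProof) is wrapped behind a `verify` callable that returns the failed proof obligations; both callables are hypothetical stand-ins, not the tools' actual APIs:

```python
from typing import Callable, List, Tuple

def proof_guided_refinement(
    propose: Callable[[str, List[str]], Tuple[str, str]],   # (requirement, failures) -> (spec, code)
    verify: Callable[[str, str], List[str]],                 # (spec, code) -> failed proof obligations
    requirement: str,
    max_rounds: int = 10,
) -> Tuple[str, str]:
    """Iterate 'vibe-contracting'/'vibe-coding' until every verification condition discharges."""
    spec, code = propose(requirement, [])
    for _ in range(max_rounds):
        failures = verify(spec, code)
        if not failures:
            return spec, code            # provably correct with respect to the contract
        spec, code = propose(requirement, failures)
    raise RuntimeError("verification did not converge; fall back to manual proof or coding")
```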

5. The Impact of Generative AI: Paradigm Shifts and Model Recovery

The rise of generative AI redefines SE by collapsing the boundary between prototype and production code (“vibe coding”) (Cito et al., 4 Nov 2025). This democratization enables rapid creation of SaaS and end-user applications from natural-language prompts, but also yields fragile, non-robust, and opaque systems that lack explicit software models. Key pathologies include untested edge behavior, frequent security vulnerabilities (e.g., missing authentication, data leakage), and severe maintainability deficits due to the absence of architectural scaffolds (e.g., class diagrams, state machines).

Model recovery strategies address these deficiencies: post-hoc extraction of structural and behavioral models from code (a mapping $f: \text{Code} \to \text{Model}$) enables quantitative risk assessment ($R(M) = \sum_i w_i v_i(M)$), targeted refinement operations ($\mathcal{R}(M, C) = C'$), and long-term co-evolution of code and documentation. Models serve as mediators between human intent, AI-originated source artifacts, and future system modification, restoring transparency and sustainability to the AI-driven SE lifecycle (Cito et al., 4 Nov 2025).
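A hedged sketch of the weighted risk score over a recovered model; the violation checks and weights below are invented for illustration:

```python
from typing import Any, Callable, Dict

Model = Dict[str, Any]   # recovered structural/behavioral model (placeholder representation)

def risk_score(model: Model,
               violations: Dict[str, Callable[[Model], float]],
               weights: Dict[str, float]) -> float:
    """R(M) = sum_i w_i * v_i(M), with each v_i scoring one deficiency class in [0, 1]."""
    return sum(weights[name] * check(model) for name, check in violations.items())

# Invented example checks over a recovered model:
violations = {
    "missing_auth": lambda m: 0.0 if m.get("endpoints_authenticated") else 1.0,
    "untested_states": lambda m: m.get("untested_state_fraction", 0.0),
}
weights = {"missing_auth": 0.7, "untested_states": 0.3}
model = {"endpoints_authenticated": False, "untested_state_fraction": 0.4}
print(f"{risk_score(model, violations, weights):.2f}")   # 0.82
```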

6. Towards Agentic and Self-Evolving Software

Agentic AI marks a transition from prompt-driven autocompletion to autonomous, context-aware software agents that maintain internal program representations, perform multi-stage reasoning about developer intent, and invoke analysis tools in micro-decision loops (Roychoudhury, 24 Aug 2025). Core to this capability is specification inference, operationalized as $f(I, C, n) = S_n$, where an agent infers the desired behavior $S_n$ of a node $n$ in the codebase $C$ from an issue $I$.
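As an interface sketch (with invented type names, since the paper defines the function abstractly), specification inference maps an issue, the codebase, and a focal node to that node's intended behavior:

```python
from dataclasses import dataclass
from typing import Dict, Protocol

@dataclass
class Issue:          # I: a natural-language issue or feature request
    text: str

@dataclass
class CodeNode:       # n: a function, class, or module within the codebase C
    path: str
    name: str

class SpecificationInferrer(Protocol):
    def infer(self, issue: Issue, codebase: Dict[str, str], node: CodeNode) -> str:
        """Return S_n, the inferred intended behavior of `node` given `issue`.
        An agent would combine LLM reasoning with program analysis here."""
        ...
```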

AI-driven self-evolving software architectures prototype this paradigm: multiple agents coordinate to interpret natural-language requirements, generate implementations, validate candidate code through cross-execution and majority-vote, and integrate or repair as needed. Case studies show such systems independently handle feature accretion, aggregation, stateful operations, and complex file manipulations with no human intervention. Challenges remain in scaling to large systems, enforcing strong type and specification constraints, and extending to proactive evolution (e.g., autonomous refactoring, performance optimization) (Cai et al., 1 Oct 2025).
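A minimal sketch of the cross-execution, majority-vote validation step, with candidate implementations represented as callables; voting on exact output equality is an assumption of this sketch, not necessarily the paper's exact protocol:

```python
from collections import Counter
from typing import Any, Callable, List, Sequence

def majority_vote_validate(
    candidates: List[Callable[..., Any]],   # candidate implementations from different agents
    test_inputs: Sequence[tuple],
) -> Callable[..., Any]:
    """Cross-execute all candidates and return one whose outputs match the majority."""
    outputs = [tuple(c(*args) for args in test_inputs) for c in candidates]
    majority_output, _ = Counter(outputs).most_common(1)[0]
    for candidate, out in zip(candidates, outputs):
        if out == majority_output:
            return candidate
    raise RuntimeError("no candidate matched the majority output")

# Two of three candidates agree, so the divergent one is rejected.
winner = majority_vote_validate(
    [lambda x: x + 1, lambda x: x + 1, lambda x: x + 2],
    [(1,), (2,), (3,)],
)
assert winner(10) == 11
```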

7. Future Directions: Open Problems and Research Opportunities

While AI-augmented SE is accelerating, open challenges are substantial. Prominent research directions include (a) scaling intent inference models and agentic micro-decision reasoning to open-ended and ambiguous requirements, (b) advancing hybrid verification that fuses statistical monitors with formal proof obligations, (c) developing semantic and intent-aware guardrails, and (d) establishing proactive governance and feedback dashboards tracking aggregated risk and auditability (Navneet et al., 15 Aug 2025, Roychoudhury, 24 Aug 2025, Cito et al., 4 Nov 2025). Self-evolving and agentic software systems demand benchmarks for fully autonomous end-to-end evolution, formal V&V methods for AI-generated code, and trust calibration for human–AI co-development.

The field is shifting towards a genuinely symbiotic human–AI partnership, underpinned by models and frameworks that maximize productivity, transparency, and robustness while containing risk and preserving strategic human judgment. The ultimate horizon is SE 3.0: intent-first, conversation-oriented, multi-objective optimized, and SLA-aware development where AI is not only an assistant but a proactive, contextually aware software engineer (Hassan et al., 8 Oct 2024).
