A Survey of Vibe Coding with Large Language Models (2510.12399v1)
Abstract: The advancement of LLMs has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed "Vibe Coding" where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with LLMs, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from a systematic analysis of over 1000 research papers, we survey the entire Vibe Coding ecosystem, examining critical infrastructure components including LLMs for coding, LLM-based coding agents, development environments for coding agents, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.
Explain it Like I'm 14
Overview: What is this paper about?
This paper explains a new way of building software with AI called “Vibe Coding.” Instead of humans carefully reading every line of code, developers describe what they want in plain language, let an AI “coding agent” try it, watch what happens, and then give feedback. The paper surveys over 1000 research studies to describe how Vibe Coding works, what tools it needs, and what challenges it faces.
Key objectives and questions
The paper tries to answer simple but important questions:
- What exactly is Vibe Coding, and how does it change how people build software?
- How do AI coding agents work, and what do they need to be successful?
- What kinds of development styles are people using with these agents?
- Why do some teams get great results while others don’t?
- What problems still need to be solved (like security, human-AI teamwork, and practical tools)?
How did the researchers study this?
The authors did a “survey,” which means they read and organized a huge number of papers to build a big-picture view of the field. They also created a simple model to explain how Vibe Coding works:
- Imagine a video game:
- The human developer is the player who says the goal and the rules (what to build and what counts as “good”).
- The software project is the game world (the codebase, data, and documents).
- The AI agent is the character that makes moves (writes code, runs tests, fixes errors).
- The system keeps looping: the agent tries something, the developer watches the result, gives feedback, and the agent improves.
They call this setup a “Constrained Markov Decision Process.” In everyday terms, it’s a way to describe decision-making with rules and feedback, just like playing by the rules in a strategy game.
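To make the analogy more concrete, here is a minimal, illustrative Python sketch of that loop. None of these names (ProjectState, propose_change, violates_constraints) come from the paper; they only stand in for the three roles and for the constraint-and-feedback cycle the Constrained Markov Decision Process describes.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectState:
    """The 'game world': the codebase plus a crude progress signal."""
    code: dict = field(default_factory=dict)   # filename -> contents
    failing_tests: int = 1

def propose_change(state: ProjectState, goal: str) -> ProjectState:
    """Stand-in for the coding agent's action: edit code toward the goal."""
    new_state = ProjectState(code=dict(state.code),
                             failing_tests=max(0, state.failing_tests - 1))
    new_state.code["main.py"] = f"# work toward: {goal}"
    return new_state

def violates_constraints(state: ProjectState) -> bool:
    """Stand-in for the human's rules (e.g., 'never touch the deploy script')."""
    return "deploy.sh" in state.code

def human_accepts(state: ProjectState) -> bool:
    """Outcome observation: the developer accepts when the result looks right."""
    return state.failing_tests == 0

goal = "add a login endpoint"
state = ProjectState()
for step in range(10):                       # the loop the survey formalizes
    candidate = propose_change(state, goal)  # the agent acts
    if violates_constraints(candidate):      # constraints prune unacceptable actions
        continue
    state = candidate                        # the project transitions
    if human_accepts(state):                 # feedback closes the loop
        print(f"accepted after {step + 1} step(s)")
        break
```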
Main findings and why they matter
The paper organizes Vibe Coding into a clear “ecosystem” with four key parts. This helps people see the full picture, not just the AI model:
- LLMs for coding: the brains that can write and understand code.
- Coding agents: the AI workers that plan, remember, use tools, run code, and fix mistakes.
- Development environments: safe places where agents can run code, talk to tools, and work with humans.
- Feedback mechanisms: ways the system learns if it’s right (compiler errors, test results, human feedback, and self-checking).
The authors also found five common development models used in practice. These are like different playstyles depending on what the project needs:
- Unconstrained Automation: “Let the agent do most of the work” with minimal human control.
- Iterative Conversational Collaboration: “Talk it out”—humans and agents go back and forth step by step.
- Planning-Driven: The agent makes a plan first, then executes it.
- Test-Driven: Write tests up front, then use them to guide the agent’s coding and fixes.
- Context-Enhanced: Carefully feed the agent the right background info (code, docs, examples) so it doesn’t get confused.
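As a concrete illustration of the Test-Driven model above, here is a minimal Python sketch. The generate_candidate function is a hypothetical stand-in for a call to a coding agent; a real workflow would feed the failing-test output back into the agent's prompt.

```python
# Tests are written up front; the agent iterates until they pass ("red/green").
CANDIDATES = [
    "def add(a, b):\n    return a - b",   # first attempt: fails the tests
    "def add(a, b):\n    return a + b",   # second attempt: passes
]

def run_tests(namespace) -> bool:
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0   # the up-front test suite

def generate_candidate(attempt, last_failure):
    """Hypothetical agent call; here it just returns canned attempts."""
    return CANDIDATES[min(attempt, len(CANDIDATES) - 1)]

last_failure = None
for attempt in range(5):
    namespace = {}
    exec(generate_candidate(attempt, last_failure), namespace)  # run the agent's code
    if run_tests(namespace):
        print(f"tests passed on attempt {attempt + 1}")
        break
    last_failure = "add(2, 3) returned the wrong value"
```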
Why this matters:
- Powerful agents alone aren’t enough. Success depends on clear instructions, the right context, good tools, and smart teamwork between humans and AI.
- Surprisingly, some teams reported getting slower when using AI without good structure or clear prompts. This shows that “just add AI” isn’t a guaranteed productivity boost.
- The field is moving fast, but it needs better security, safer environments, and human-centered design to be trustworthy and widely useful.
Implications and potential impact
If done well, Vibe Coding could:
- Give solo developers “team-level” powers. An agent can set up servers, write tests, and fix bugs while the human focuses on ideas and quality.
- Speed up development while keeping quality high, thanks to continuous testing and automated fixes.
- Open software creation to more people. Non-programmers (like doctors, teachers, or designers) can describe what they want in plain language and guide the agent by checking results.
However, the paper also warns that teams need good practices:
- Plan how humans and agents share work.
- Use strong test suites and safe environments.
- Invest in “context engineering” (feeding the agent the right parts of the project).
- Build security and guardrails from the start.
In short, this survey gives both a map of the current landscape and a practical guide for using AI coding agents responsibly. It shows that the future of software isn’t just “AI writes code,” but “humans and AI work together”—with clear goals, smart feedback, and the right tools to make ideas real.
Knowledge Gaps
Unresolved Knowledge Gaps, Limitations, and Open Questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper’s treatment of Vibe Coding with LLMs.
- Formalization validity: The Constrained MDP formalization is not empirically validated; mapping real development artifacts to concrete state, action, transition, and reward definitions remains unspecified and needs experimental grounding across diverse projects.
- Reward design: The paper does not define operational reward functions that capture multi-objective software quality (correctness, performance, security, maintainability); methods to combine or trade off these objectives are absent.
- Context orchestration algorithms: While “optimal context engineering” is framed as an objective, no concrete retrieval, filtering, ranking, or budgeting algorithms (nor their complexity or approximation guarantees) are proposed or benchmarked for large repositories.
- Theoretical guarantees: There are no convergence guarantees or bounds for the iterative human-agent loop (e.g., conditions under which refinement terminates, bounds on cycles, or regret analyses under constrained contexts).
- Model-to-practice mapping: The triadic “human-project-agent” components lack standardized schemas for instrumentation and logging to enable reproducible measurements of interactions and outcomes in real environments.
- Development model selection: The taxonomy of five development models is not accompanied by decision criteria (task characteristics, codebase size, risk profile) or comparative evidence that guides when to use each model.
- Benchmarking gaps: There is no Vibe Coding-specific benchmark that captures requirement evolution, environment setup, long-horizon tasks, and outcome-driven validation beyond pass/fail unit tests (e.g., repository-level, multi-iteration workflows).
- Metrics design: The survey does not propose standardized metrics for Vibe Coding (e.g., agent cycles to acceptance, human time vs. agent time, context-token budget, test stability, rollout reproducibility, maintainability indices).
- Productivity and UX evidence: Human-factor claims (e.g., observed productivity losses) lack standardized protocols and measures (cognitive load, trust calibration, effort distribution, error oversight), and no controlled studies compare development models.
- Feedback interplay: The relative effectiveness, sequencing, and weighting of compiler, execution, human, and self-refinement feedback are not studied; there is no framework for conflict resolution among heterogeneous signals.
- Test-driven dynamics: The paper does not examine how test quality, flakiness, and coverage influence agent behavior; methods for LLM-assisted test generation that avoid “teaching to the test” remain under-specified.
- Multi-agent coordination: Protocols for role assignment, consensus, conflict resolution, and token/cost budgeting in multi-agent coding remain unstandardized and untested at scale.
- Memory safety and consistency: Agent memory mechanisms lack guidelines for preventing stale or contradictory context, ensuring provenance, and enforcing retention policies (privacy, compliance); no consistency models are proposed.
- Tool reliability: Function calling and tool integration do not include systematic error classification, recovery strategies, or self-calibration protocols for unreliable external tools and APIs.
- Environment drift: The survey does not address how agents detect and adapt to environment changes (dependencies, OS differences, CI/CD pipelines), nor standards for sandboxing and reproducibility across platforms.
- Cost and energy efficiency: There is no analysis of latency, throughput, compute cost, or energy footprint of agent workflows; cost-aware planning and caching strategies are not explored.
- Security threat modeling: A comprehensive threat model (prompt injection, dependency supply chain attacks, exfiltration via tools, privilege escalation) and systematically evaluated defenses (isolation, policy enforcement, attestations) are missing.
- Governance and accountability: The paper does not specify audit logging, review workflows, acceptance criteria, rollback mechanisms, or assignment of responsibility for agent-made changes in regulated contexts.
- Licensing and compliance: Code license compatibility, attribution, and legal risks from training data and generated code (including synthetic rewriting) are not analyzed or operationalized in agent toolchains.
- Silent failures and subtle bugs: Detection strategies for non-crashing, semantic bugs and performance regressions are unaddressed; risk scoring and triage methods for agent outputs remain open.
- Domain generalization: The survey does not assess Vibe Coding performance in specialized domains (embedded, real-time, safety-critical, hardware description languages, GPU kernels) or propose domain adaptation protocols.
- Maintainability trade-offs: Long-term code health implications of agent-generated code (style drift, technical debt, documentation gaps) are unmeasured; refactoring policies and maintainability metrics are absent.
- Data quality auditing: Pipeline-level auditing for training corpora (duplication, contamination, license provenance, domain coverage) and for synthetic instruction/preference data is not standardized or evaluated.
- Long-context strategies: No comparative study of retrieval vs. extended-context models for large repositories; chunking, linking, and code-aware segmentation strategies lack empirical guidance.
- Error taxonomy: A standardized taxonomy for agent coding errors (specification misunderstanding, context misuse, tool misuse, environment misconfiguration) and associated mitigation playbooks is missing.
- Human-in-the-loop protocols: Best practices for prompt structuring, context curation, and oversight cadence are not codified into repeatable operating procedures with empirical validation.
- Outcome-based validation: The paper does not operationalize “result-oriented review” (what outcomes to measure, acceptable variance, validation pipelines) beyond unit tests.
- Reproducibility controls: Methods to achieve deterministic agent outputs (seed control, environment snapshotting, fixed tool versions) and auditability are not proposed.
- Lifecycle integration: CI/CD and DevOps integration patterns (gating policies, automated rollbacks, staged deploys with agent participation) are not addressed.
- Educational scaffolding: Training curricula or onboarding frameworks for developers transitioning to Vibe Coding (skills, pitfalls, mental models) are not provided or evaluated.
- Ethical use and misuse: Guardrails for misuse (mass code generation of insecure patterns, automated exploitation, plagiarism) and societal impacts are not analyzed with concrete mitigation strategies.
- Evaluation transparency: Many referenced systems are reported via heterogeneous metrics; a transparent, unified evaluation protocol with open datasets, seeds, and logs is not established.
- Cross-organizational adoption: Organizational readiness, policy changes, and socio-technical factors for adopting Vibe Coding at scale remain unexplored; case studies and longitudinal evidence are absent.
Practical Applications
Immediate Applications
Below are concrete use cases that can be deployed today by leveraging the survey’s formalization (triadic human–project–agent CM-DP), the five development models (Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, Context-Enhanced), and the ecosystem of tools (e.g., OpenHands, AutoGen, MetaGPT, AgentCoder, TestGen-LLM, ProjectTest, Git Context Controller, AutoSafeCoder).
- Software engineering: repository-level bug triage and patching
- What: Use coding agents to localize faults, propose patches, run tests, and open PRs on existing repos (SWE-bench-style tasks).
- Best-fit models: Test-Driven + Context-Enhanced (RAG over repo + automated tests).
- Tools/workflows: OpenHands, AutoCodeRover, Agentless; issue-to-PR pipelines with CI.
- Sector: Software/DevOps.
- Assumptions/dependencies: Adequate unit/integration tests; sandboxed execution; CI gatekeeping; secrets isolation; human-in-the-loop code review.
- Continuous test generation and coverage boosting in CI/CD
- What: Auto-generate missing unit/property tests, regenerate flaky tests, and enforce failing-first red/green cycles (a minimal sketch follows this list).
- Best-fit models: Test-Driven.
- Tools/workflows: TestGen-LLM, ProjectTest, TypeTest, Execution feedback loops; GitHub Actions/GitLab CI.
- Sector: Software Quality Engineering.
- Assumptions/dependencies: Clear acceptance criteria; stable CI infrastructure; test data management; deterministic environments.
- Context-aware code assistance via Git-aware retrieval
- What: RAG pipelines retrieve relevant files/issues/docs for precise code edits and refactors (a minimal sketch follows this list).
- Best-fit models: Context-Enhanced + Iterative Conversational.
- Tools/workflows: Git Context Controller, CodeRAG, MemoryBank; IDE integrations (LSP, Cursor/VS Code).
- Sector: Software development.
- Assumptions/dependencies: Accurate repo indexing; permissions + data governance; long-context LLMs or chunking strategies.
- Planning-driven scaffolding for new microservices
- What: Generate project skeletons, API contracts, tests, and deployment manifests from high-level specs.
- Best-fit models: Planning-Driven + Test-Driven.
- Tools/workflows: MetaGPT, ChatDev, CrewAI, TOSCA/TosKer for deployment descriptors.
- Sector: Software/Cloud/DevOps.
- Assumptions/dependencies: Organizational templates/standards (linting, security baselines); IaC and container runtime availability.
- Conversational pair programming in IDEs (agent-in-the-loop)
- What: Iterative “vibe” loops to add features, debug, and refactor without line-by-line human reading.
- Best-fit models: Iterative Conversational Collaboration.
- Tools/workflows: LSP-based IDEs, Cursor, code review bots; Self-Refine/Reflexion loops.
- Sector: Software development; Education.
- Assumptions/dependencies: Logging/audit of agent actions; coding conventions; code ownership policies.
- Automated code review and documentation drift repair
- What: Multi-agent reviewer/commenter that enforces checklists, generates review suggestions, and updates docs to match code.
- Best-fit models: Planning-Driven + Context-Enhanced.
- Tools/workflows: AutoGen multi-agent patterns; Doc2Agent; PR comment bots; API diffs with RAIT/Seeker.
- Sector: Software engineering.
- Assumptions/dependencies: Review guidelines as prompts/policies; repo-level RAG; protected branches and required checks.
- API migration and large-scale refactoring assistance
- What: Identify deprecated APIs, generate safe replacements, and run/refine tests.
- Best-fit models: Planning-Driven + Test-Driven.
- Tools/workflows: RAIT, Code search + spectrum-based localization; automated PR batching by subsystem.
- Sector: Software modernization.
- Assumptions/dependencies: Up-to-date dependency graph; reliable test harness; staged rollouts.
- DevSecOps: security scanning with auto-remediation proposals
- What: Integrate static/dynamic scanning and propose patches; run in isolated sandboxes before PRs.
- Best-fit models: Test-Driven + Context-Enhanced.
- Tools/workflows: AutoSafeCoder, Secure SDLC checklists; isolated runtimes; policy-as-code gates.
- Sector: Security/Compliance.
- Assumptions/dependencies: Vulnerability feeds; SBOM availability; secrets handling; human approval for merges.
- On-call runbooks as executable agent workflows
- What: Incident bots that parse alerts, run diagnostics, apply safe fixes, and file postmortems (a minimal sketch follows this list).
- Best-fit models: Planning-Driven + Iterative Conversational.
- Tools/workflows: OpenHands for terminal actions; MCP/Toolformer-style function calling; structured SOP prompts.
- Sector: SRE/Operations.
- Assumptions/dependencies: Least-privilege credentials; rollback mechanisms; guardrail policies; audit trails.
- Performance tuning suggestions from execution feedback
- What: Agents profile hot paths, propose optimizations, generate microbenchmarks, and validate speedups.
- Best-fit models: Execution Feedback + Test-Driven.
- Tools/workflows: PerfCodeGen; benchmark harnesses; A/B in CI; canary deployments.
- Sector: Software performance engineering.
- Assumptions/dependencies: Representative workloads; safe performance counters; no regression to correctness.
- Data and analytics: auto-validated SQL and data checks
- What: Generate SQL/ETL with schema-aware validation and tests; detect data quality issues (a minimal sketch follows this list).
- Best-fit models: Test-Driven + Context-Enhanced.
- Tools/workflows: SQLucid; type/schema retrieval; dbt tests; CI on data pipelines.
- Sector: Analytics/BI/Finance/Healthcare.
- Assumptions/dependencies: Reliable schema metadata; non-production data sandboxes; governance for PHI/PII.
- Education: agent-supported labs and grading in sandboxes
- What: Students interact with agents to build/repair code; automated grading via execution-based benchmarks.
- Best-fit models: Iterative Conversational + Test-Driven.
- Tools/workflows: InterCode, SandboxEval, SWE-bench subsets; per-student isolated runtime.
- Sector: Education/EdTech.
- Assumptions/dependencies: Academic integrity policies; visible provenance of AI support; reproducible sandboxes.
- Research: reproducible agent evaluations and ablations
- What: Run agent studies with standard environments, feedback channels, and metrics.
- Best-fit models: Any (as experimental variables).
- Tools/workflows: SWE-bench, SandboxEval; AutoGen orchestration; clear prompts/policies.
- Sector: Academia/ML systems.
- Assumptions/dependencies: Fixed datasets; versioned prompts; hardware budget for repeated runs.
- Policy/compliance operations for AI-assisted coding
- What: Institute guardrails: sandboxing, action logging, code provenance tags, and license checks for training/eval data (a minimal sketch follows this list).
- Best-fit models: Test-Driven (policy-as-tests).
- Tools/workflows: The Stack-derived curation pipelines; commit provenance labeling; Secure SDLC gates.
- Sector: Policy/Compliance/Legal.
- Assumptions/dependencies: Organizational buy-in; legal guidance on IP/licensing; change management.
- No-/low-code productivity for domain experts
- What: Build internal tools, forms, dashboards, or automation scripts with intent-first conversational workflows.
- Best-fit models: Iterative Conversational + Planning-Driven.
- Tools/workflows: ChatDev/MetaGPT starter templates; agent-scaffolded UIs/DB schemas; guided test harnesses.
- Sector: Line-of-business apps (marketing, HR, operations), Daily life.
- Assumptions/dependencies: Role-based access; guardrails to prevent data exfiltration; curated component libraries.
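To make a few of the workflows above concrete, the sketches below are illustrative only: helper names are hypothetical and the listed tools' real APIs are not used. First, for the continuous test generation entry, an execution-based filter that keeps only candidate tests that actually run and pass against the current code (it assumes pytest is installed):

```python
import pathlib
import subprocess
import sys
import tempfile

# In practice CANDIDATE_TESTS would come from an LLM call; these are canned examples.
CANDIDATE_TESTS = [
    "def test_upper():\n    assert 'abc'.upper() == 'ABC'\n",
    "def test_wrong():\n    assert 'abc'.upper() == 'abc'\n",   # bad test: gets dropped
]

def passes_in_isolation(test_source: str) -> bool:
    """Run one candidate test file in a throwaway directory and report success."""
    with tempfile.TemporaryDirectory() as tmp:
        test_file = pathlib.Path(tmp) / "test_candidate.py"
        test_file.write_text(test_source)
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", str(test_file)],
            capture_output=True, text=True,
        )
        return result.returncode == 0

kept = [t for t in CANDIDATE_TESTS if passes_in_isolation(t)]
print(f"kept {len(kept)} of {len(CANDIDATE_TESTS)} candidate tests")
```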
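For the context-aware code assistance entry, a sketch of repository-level retrieval under a context budget. A real pipeline (such as the Git-aware tools listed above) would use embeddings and an index; the keyword scorer here is only a stand-in to show the retrieve, rank, and budget steps:

```python
from pathlib import Path

def score(query: str, text: str) -> int:
    """Toy relevance score: keyword overlap (a real system would use embeddings)."""
    return sum(text.lower().count(term) for term in set(query.lower().split()))

def retrieve_context(repo_root: str, query: str, top_k: int = 3, budget_chars: int = 4000) -> str:
    files = [p for p in Path(repo_root).rglob("*.py") if p.is_file()]
    ranked = sorted(files, key=lambda p: score(query, p.read_text(errors="ignore")), reverse=True)
    pieces, used = [], 0
    for path in ranked[:top_k]:
        snippet = path.read_text(errors="ignore")[: budget_chars - used]
        pieces.append(f"# file: {path}\n{snippet}")
        used += len(snippet)
        if used >= budget_chars:            # enforce the context budget
            break
    return "\n\n".join(pieces)              # prepended to the agent's prompt

if __name__ == "__main__":
    print(retrieve_context(".", "token refresh in the auth client")[:500])
```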
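For the on-call runbooks entry, a sketch of a runbook expressed as guarded steps with an approval gate and an audit trail. The commands and the approval rule are illustrative, not a recommended procedure:

```python
import subprocess

# Low-risk steps run automatically; risky ones wait for a human (least privilege + audit).
RUNBOOK = [
    {"step": "collect disk diagnostics", "command": ["df", "-h"], "needs_approval": False},
    {"step": "restart web service", "command": ["systemctl", "restart", "web"], "needs_approval": True},
]

def human_approves(step: str) -> bool:
    """Stand-in for a chat-ops or paging approval flow."""
    return input(f"approve '{step}'? [y/N] ").strip().lower() == "y"

audit_log = []
for entry in RUNBOOK:
    if entry["needs_approval"] and not human_approves(entry["step"]):
        audit_log.append((entry["step"], "skipped: not approved"))
        continue
    result = subprocess.run(entry["command"], capture_output=True, text=True)
    audit_log.append((entry["step"], f"exit={result.returncode}"))

for step, outcome in audit_log:
    print(f"{step} -> {outcome}")
```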
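For the auto-validated SQL entry, a sketch that compiles generated SQL against an empty copy of the schema using Python's built-in sqlite3, so that wrong table or column names surface before any real data is touched. The schema and queries are illustrative:

```python
import sqlite3

SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, created_at TEXT);
"""

def validate_sql(query: str):
    """Return None if the query compiles against the schema, else the error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.execute(f"EXPLAIN QUERY PLAN {query}")   # plans the query without touching data
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

print(validate_sql("SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"))  # None
print(validate_sql("SELECT customer_name FROM orders"))  # "no such column" error text
```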
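For the policy/compliance operations entry, a sketch of policy-as-tests: organizational rules written as ordinary test functions that CI can run against agent-made changes. The provenance trailer and license-header conventions below are assumptions for illustration, not standards from the paper:

```python
# Inputs a CI job might assemble from a pull request; contents here are canned examples.
CHANGED_FILES = {
    "service/api.py": "# SPDX-License-Identifier: Apache-2.0\n\ndef handler(): ...\n",
}
COMMIT_MESSAGE = "Fix null check in handler\n\nAssisted-by: coding-agent-v1\n"

def test_changed_files_carry_license_header():
    for path, content in CHANGED_FILES.items():
        assert content.startswith("# SPDX-License-Identifier:"), path

def test_agent_changes_carry_provenance_trailer():
    assert "Assisted-by:" in COMMIT_MESSAGE

if __name__ == "__main__":
    test_changed_files_carry_license_header()
    test_agent_changes_carry_provenance_trailer()
    print("policy checks passed")
```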
Long-Term Applications
These require further research, scaling, formal assurance, or organizational/policy development before broad deployment.
- Autonomous repository evolution with minimal oversight
- What: Agents that monitor issues, plan roadmaps, implement features, and maintain tests.
- Best-fit models: Planning-Driven + Context-Enhanced + Test-Driven (combined).
- Tools/workflows: Multi-agent role systems (MetaGPT/CrewAI), long-term memory, scheduling/orchestration.
- Sector: Software engineering at scale.
- Assumptions/dependencies: Robust alignment, reliable long-context memory, strong safety/audit controls, mature evaluation metrics.
- Safety-/mission-critical code synthesis with formal verification
- What: Integrate compilers, model checkers, and proof tools into vibe loops for provably correct code.
- Best-fit models: Test-Driven + Compiler/Execution Feedback.
- Tools/workflows: Verified toolchains; property-based specs; proof-carrying code pipelines.
- Sector: Healthcare devices, Automotive/Avionics, Energy/Industrial control.
- Assumptions/dependencies: Formal specs; certification pathways; liability frameworks; very low defect tolerance.
- Org-scale multi-agent SDLC orchestration (from product to ops)
- What: Product-manager, architect, developer, QA, SecOps agents coordinating across portfolios.
- Best-fit models: Planning-Driven + Agent Collaboration.
- Tools/workflows: AutoGen/CrewAI/MetaGPT with resource managers; enterprise tool mesh (MCP/ScaleMCP).
- Sector: Large enterprises, software platforms.
- Assumptions/dependencies: Interoperable tool APIs; governance of cross-agent communication; programmatic budgets/quotas.
- Self-improving agents learning from production telemetry
- What: Agents that mine logs and user feedback to auto-generate tests and patches, closing quality gaps continuously.
- Best-fit models: Self-Refinement + Execution Feedback.
- Tools/workflows: Reflexion/Self-Refine-class loops; telemetry-to-test generation; rollout safeties.
- Sector: Consumer SaaS, Mobile, Web platforms.
- Assumptions/dependencies: Privacy-preserving data capture; drift detection; safe canaries; rollback strategies.
- Massive cross-repo refactoring and dependency upgrades
- What: Consistent API and policy changes across thousands of services with automated validation.
- Best-fit models: Planning-Driven + Test-Driven.
- Tools/workflows: Hierarchical planning; staged migration waves; batch PR creation and merge queues.
- Sector: Big tech monorepos/microservices estates.
- Assumptions/dependencies: Uniform test baselines; dependency graphs; change risk modeling.
- Sector-specific agentic SDLCs with regulatory constraints
- What: Finance risk-model code, healthcare EHR integrations, energy SCADA adapters built with compliance-first agents.
- Best-fit models: Context-Enhanced (domain knowledge) + Test-Driven.
- Tools/workflows: Domain RAG over standards (HIPAA, SOX); test policies as constraints; audit-by-design.
- Sector: Finance, Healthcare, Energy.
- Assumptions/dependencies: Domain corpora; regulator-accepted evidences; compliance automation.
- Standardized function-calling and tool ecosystems
- What: Interoperable tool schemas enabling portable agent workflows across vendors and edges.
- Best-fit models: Action Execution (function calling) + Planning-Driven.
- Tools/workflows: MCP/ScaleMCP-like standards; marketplace of validated tools.
- Sector: Software, Robotics, Enterprise automation.
- Assumptions/dependencies: Open standards; tool certification; security isolation by default.
- Privacy-preserving on-device developer agents
- What: Tiny agents that match server capabilities locally for sensitive codebases.
- Best-fit models: Iterative Conversational + Context-Enhanced (local RAG).
- Tools/workflows: TinyAgent-class models; efficient indexing; secure enclaves.
- Sector: Defense, IP-sensitive industries; Daily life (local automation).
- Assumptions/dependencies: Hardware acceleration; efficient long-context; energy constraints.
- Automated security patch management at ecosystem scale
- What: Continuous ingestion of CVEs, impact analysis, patch authoring, and staged rollout across fleets.
- Best-fit models: Planning-Driven + Test-Driven + Execution Feedback.
- Tools/workflows: Vulnerability C2; simulated exploit tests; compliance gates; enterprise orchestration.
- Sector: Security/Platform engineering.
- Assumptions/dependencies: Legal approvals for automated changes; accurate impact modeling; emergency brake controls.
- Education at scale: curricula around vibe coding
- What: Studio-style courses where students manage agents; evaluation by execution and reflective reports.
- Best-fit models: Iterative Conversational + Test-Driven + Self-Refinement.
- Tools/workflows: Standardized sandboxes; benchmarks (SWE-bench); plagiarism-resistant assessment.
- Sector: Academia/EdTech.
- Assumptions/dependencies: Pedagogical standards; assessment validity; accessibility.
- Policy and certification for AI-generated code
- What: Labels for AI-authored diffs, audit requirements, SBOM extensions, and certification regimes.
- Best-fit models: Test-Driven (policy-as-tests) integrated with CM-DP constraints.
- Tools/workflows: Provenance tags; audit logs of agent actions; license/compliance scanners.
- Sector: Policy/Regulatory.
- Assumptions/dependencies: Harmonized international standards; enforcement mechanisms; industry consortia.
- Human-centered agent UX and governance dashboards
- What: Role-aware interfaces exposing intent, context, constraints, and reversible plans to reduce cognitive overhead.
- Best-fit models: Iterative Conversational + Planning-Driven.
- Tools/workflows: MultiMind/PairBuddy-style UIs; decision logs; capability scoping controls.
- Sector: Product/DevTools.
- Assumptions/dependencies: Usability research; explainability features; org policy alignment.
- Verified multi-modal agent systems (code + UI + data)
- What: Agents that reason across code, GUIs, and datasets with formal guardrails and task proofs.
- Best-fit models: Planning-Driven + Execution/Compiler Feedback.
- Tools/workflows: Multimodal LLMs; UI automation + sandboxed OS agents; formal specs for cross-modal tasks.
- Sector: Robotics, Healthcare IT, Enterprise apps.
- Assumptions/dependencies: Robust multimodal reasoning; secure GUI automation; formal verification maturity.
- Economic and workforce transition programs
- What: Reskilling for vibe coding workflows, new roles (agent wrangler, context engineer), and AI-driven SDLC practices.
- Best-fit models: Organizational adoption of Iterative Conversational + Test-Driven governance.
- Tools/workflows: Training curricula; competency frameworks; change management toolkits.
- Sector: Policy/Workforce development.
- Assumptions/dependencies: Public-private partnerships; funding; measurable outcome tracking.
In practice, the survey’s key insight—that success depends as much on context engineering, robust environments, and human-agent collaboration models as on raw model capability—implies that even “Immediate Applications” should be implemented with explicit constraints, sandboxes, tests, and governance. “Long-Term Applications” become feasible as foundations mature: longer/more reliable context, standardized tool ecosystems, formal methods integration, and clear regulatory frameworks.
Glossary
- Alignment (outer and inner alignment): Research and methods to ensure model behavior matches intended objectives; outer alignment concerns specified goals, inner alignment concerns the model’s learned optimization. "Alignment research categorizes methods into outer and inner alignment with adversarial considerations, while exploring training-free alignment and personalized alignment techniques"
- Auto-regressive generation: A sequence modeling approach where each token is generated conditioned on previously generated tokens (the factorization is written out after the glossary). "the Agent generates code sequence Y = (y_1, \ldots, y_T) in an auto-regressive manner"
- Chain-of-Thought (CoT): A prompting technique that encourages models to produce step-by-step reasoning to improve problem solving. "Chain-of-Thought (CoT) reasoning has proven particularly effective"
- Constrained Markov Decision Process (Constrained MDP): A decision-making framework like an MDP but with explicit constraints on policies or costs. "a Constrained Markov Decision Process"
- Context engineering: The systematic construction, retrieval, filtering, and ranking of contextual information to optimize LLM outputs. "Effective human-AI collaboration demands systematic prompt engineering and context engineering"
- Direct Preference Optimization (DPO): An RL-free method that optimizes models directly from pairwise preference data to align outputs with desired choices. "DPO emerges as a reinforcement learning (RL)-free alternative to RLHF"
- Edge deployment: Running models or agents on local or edge devices to reduce latency and reliance on centralized infrastructure. "enable edge deployment with compact models matching large model capabilities locally"
- Execution feedback: Information from running code (e.g., tests, runtime logs) used to guide refinement or learning. "execution feedback obtained from running o_k in environment \mathcal{E}"
- Fill-in-the-middle objectives: Training objectives where the model infers a missing middle segment of code given surrounding context. "with fill-in-the-middle objectives and PII redaction"
- Function calling: A mechanism for LLMs to invoke external tools or APIs via structured calls. "Function calling frameworks teach LLMs to self-supervise tool use with simple APIs requiring minimal demonstrations"
- Group Relative Policy Optimization (GRPO): A reinforcement learning method that optimizes policies using relative advantages computed across groups. "Group Relative Policy Optimization with compiler feedback to achieve competitive performance"
- In-context learning: Using prompts and examples at inference time to adapt model behavior without updating parameters. "Prompt engineering and in-context learning have emerged as fundamental techniques"
- Instruction tuning: Fine-tuning models on instruction-following datasets to improve adherence to user directives. "Instruction tuning and supervised fine-tuning methodologies are reviewed covering dataset construction and training strategies"
- Language Server Protocol (LSP): A standardized protocol that connects editors/IDEs to language-specific tooling (e.g., completion, diagnostics). "Language server protocol"
- Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning technique that adds low-rank matrices to adapt large models with minimal additional parameters (see the sketch after the glossary). "Low-Rank Adaptation (LoRA)"
- Monte Carlo Tree Search (MCTS): A search algorithm that explores decision trees via random sampling to guide planning. "integrate Monte Carlo Tree Search with external feedback for deliberate problem-solving"
- Multi-agent systems: Architectures where multiple agents coordinate and communicate to solve complex tasks collaboratively. "Multi-agent systems are examined covering agent profiling, communication protocols, and collaborative workflows across complex task-solving scenarios"
- Parameter-efficient methods: Techniques to adapt large models using a small number of additional parameters (e.g., LoRA, adapters). "parameter-efficient methods including Low-Rank Adaptation (LoRA) and adapters"
- PII redaction: The removal or masking of personally identifiable information from datasets to protect privacy. "PII redaction"
- Proximal Policy Optimization (PPO): A popular policy gradient RL algorithm that stabilizes training via clipped objectives. "Execution-based methods leverage PPO with compiler feedback for real-time refinement"
- Repository-level pretraining: Pretraining on entire code repositories (including issues and documentation) to improve long-context code understanding. "Foundation models for code employ repository-level pretraining with extended context windows"
- Retrieval-Augmented Generation (RAG): Augmenting generation by retrieving relevant external knowledge or context at inference time. "tool use with Retrieval-Augmented Generation (RAG) and feedback learning"
- Reinforcement Learning from AI Feedback (RLAIF): Alignment via RL that uses feedback provided by AI models rather than humans. "Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Direct Preference Optimization (DPO)"
- Reinforcement Learning from Human Feedback (RLHF): Alignment via RL using human preference signals to steer model outputs. "Reinforcement Learning from Human Feedback (RLHF)"
- Spectrum-based fault localization: Debugging technique that ranks code elements by their correlation with failing versus passing tests (see the sketch after the glossary). "spectrum-based fault localization"
- Test-driven development (TDD): A methodology where tests are written before code, guiding implementation through incremental passes. "investigate test-driven development principles"
- Triadic relationship: The three-way interaction framework among human developers, software projects, and coding agents underpinning Vibe Coding. "a dynamic triadic relationship among human developers, software projects, and Coding Agents"
- Unit test feedback: Signals from unit test execution used to improve code generation or RL training. "using unit test feedback"
- Vibe Coding: A development paradigm where developers validate AI-generated implementations via outcome observation rather than line-by-line code review. "we define Vibe Coding as an engineering methodology for software development grounded in LLMs"
- Zero-shot-CoT: Prompting that elicits chain-of-thought reasoning without providing few-shot exemplars. "Zero-shot-CoT"
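For the auto-regressive generation entry, the quoted sequence Y = (y_1, \ldots, y_T) is produced by factorizing its probability token by token. With X denoting the prompt/context and \theta the model parameters, the standard factorization is:

```latex
p_\theta(Y \mid X) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_1, \ldots, y_{t-1},\, X\right)
```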
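For the Low-Rank Adaptation (LoRA) entry, a minimal numerical sketch (assuming NumPy is available; the shapes, zero-initialization of B, and the alpha/r scaling follow the common LoRA formulation, and the values are random):

```python
import numpy as np

# The frozen weight W0 gets a low-rank update (alpha / r) * B @ A; only A and B
# (r * (d_in + d_out) parameters) would be trained.
d_in, d_out, r, alpha = 64, 64, 4, 8
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable, rank r
B = np.zeros((d_out, r))                     # trainable, starts at zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B = 0 the adapted model initially matches the frozen model exactly.
assert np.allclose(lora_forward(x), W0 @ x)
print("LoRA output shape:", lora_forward(x).shape)
```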
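For the spectrum-based fault localization entry, one common suspiciousness metric is the Ochiai score; this minimal sketch ranks statements by it over made-up coverage data:

```python
import math

# Ochiai suspiciousness: failed(e) / sqrt(total_failed * (failed(e) + passed(e))),
# where failed(e)/passed(e) count the failing/passing tests that cover element e.
coverage = {
    "line 10": (2, 5),
    "line 11": (2, 0),   # covered by all failing tests and no passing ones -> most suspicious
    "line 12": (0, 7),
}
total_failed = 2

def ochiai(failed_cov: int, passed_cov: int) -> float:
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

ranking = sorted(coverage, key=lambda stmt: ochiai(*coverage[stmt]), reverse=True)
for stmt in ranking:
    print(stmt, round(ochiai(*coverage[stmt]), 3))
```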