- The paper demonstrates that vibe coding with current LLMs poses significant safety risks, with high rates of hallucination and silent failure that undermine regulatory compliance.
- A rigorous methodology using persona-driven prompts and sandbox evaluations revealed a stark contrast between syntactic execution and semantic, safety-critical reliability.
- Findings underscore the need for robust, deterministic safety frameworks to integrate LLM outputs with traditional, formally verified systems.
Empirical Assessment of Vibe Coding for Construction Safety: Reliability, Risks, and Implications
Introduction
The paper "Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety" (2604.12311) investigates the efficacy and safety of vibe coding—a paradigm where non-technical users employ LLMs to generate executable code using natural language prompts. In the construction sector, this approach empowers on-site personnel to rapidly develop custom safety tools. However, the delegation of code generation to probabilistic LLMs raises profound concerns regarding logic fidelity, regulatory compliance, and the risk of silent failures in safety-critical applications.
Methodology
The study adopts a rigorous, three-phase empirical framework: (1) persona-driven prompt dataset creation, (2) code generation via frontier LLMs, and (3) execution and comprehensive evaluation in an isolated sandbox. Specifically, 150 prompts, designed to emulate three distinct construction personas (Safety Manager, Foreman, Worker), were used to generate 450 Python scripts from three consumer-accessible LLMs: Claude 3.5 Haiku, GPT-4o-Mini, and Gemini 2.5 Flash. Strict non-interactive architectural constraints were enforced, requiring headless backend scripts with hardcoded variables. Performance was evaluated both dynamically (sandbox execution) and statically (LLM-as-a-Judge against OSHA regulations), validated by human raters (Fleiss' Kappa = 0.84).
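The paper does not publish its evaluation harness, but the dynamic phase it describes, executing each generated script in an isolated sandbox and recording whether it ran to completion, can be sketched roughly as follows. All names here (`evaluate_script`, the timeout value) are illustrative assumptions, not the study's actual code:

```python
import subprocess
import sys
from pathlib import Path

def evaluate_script(script_path: Path, timeout_s: int = 10) -> dict:
    """Run one generated script in a separate interpreter process and
    record the execution outcome (a sketch of sandboxed dynamic evaluation)."""
    try:
        result = subprocess.run(
            [sys.executable, "-I", str(script_path)],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return {
            "script": script_path.name,
            "executed": result.returncode == 0,
            "stdout": result.stdout,
            "stderr": result.stderr,
        }
    except subprocess.TimeoutExpired:
        # A hang (e.g., an unexpected input() prompt in a supposedly
        # headless script) is treated as an execution failure.
        return {"script": script_path.name, "executed": False,
                "stdout": "", "stderr": "timeout"}
```

The timeout matters: a script that blocks on `input()` violates the study's headless constraint and would otherwise stall a batch evaluation of 450 scripts.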
Results
Persona-Driven Hallucination Dynamics
The analysis demonstrates a statistically significant influence of user persona on LLM propensity to hallucinate safety-critical variables. Hallucination rates were lowest for the Safety Manager persona (64.70%), but escalated markedly for Foreman (82.00%) and Worker personas (87.30%). Odds ratios confirm that informal, ambiguous prompting substantially increases the likelihood of LLMs inventing missing safety data—a critical risk factor in safety engineering.
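To make the odds-ratio claim concrete, the marginal rates above can be converted to odds and compared across personas. This is a minimal sketch using the reported proportions directly; the paper's own statistical model (and any covariate adjustment it applies) may differ:

```python
def odds(p: float) -> float:
    """Convert a proportion to odds."""
    return p / (1.0 - p)

def odds_ratio(p_persona: float, p_reference: float) -> float:
    """Odds of hallucination for one persona relative to a reference persona."""
    return odds(p_persona) / odds(p_reference)

# Reported hallucination rates per persona (from the study).
rates = {"Safety Manager": 0.647, "Foreman": 0.820, "Worker": 0.873}

# Odds of hallucination under Worker prompts vs. Safety Manager prompts.
or_worker = odds_ratio(rates["Worker"], rates["Safety Manager"])  # ≈ 3.75
```

On these marginals, Worker-style prompts carry roughly 3.75 times the odds of a hallucinated safety variable compared with Safety Manager prompts, which is the quantitative core of the "ambiguous prompting increases hallucination" finding.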
Architectural Robustness: Syntactic Reliability vs Defensive Design
Overall code execution viability was high (~85%), with GPT-4o-Mini and Gemini 2.5 Flash successfully executing over 93% of scripts; Claude 3.5 Haiku lagged considerably at 68%. Notably, adherence to the headless architecture (no input() calls) was nearly perfect. However, defensive programming indicators were universally deficient: error handling mechanisms appeared in only 16% of scripts, and in a mere 3.3% of those from GPT-4o-Mini. Models that excelled at syntactic correctness conspicuously failed to default to robust, defensive architectures.
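Both indicators discussed above, headless compliance and the presence of error handling, can be checked statically without running the script. The following sketch uses Python's standard `ast` module; the function name and the choice of indicators are illustrative, not the study's actual instrumentation:

```python
import ast

def audit_script(source: str) -> dict:
    """Statically flag two design indicators in a generated script:
    headless compliance (no input() calls) and defensive design
    (presence of at least one try/except block)."""
    tree = ast.parse(source)
    uses_input = any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "input"
        for node in ast.walk(tree)
    )
    has_error_handling = any(isinstance(node, ast.Try) for node in ast.walk(tree))
    return {"headless": not uses_input, "defensive": has_error_handling}
```

A check like this captures exactly the asymmetry the study reports: a script can pass the headless test perfectly while failing the defensive one.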
OSHA Logic Fidelity and Silent Failures
OSHA logic fidelity was alarmingly poor; only 46.7% of scripts conformed to regulatory standards. More critically, silent failures were prevalent across executable scripts: overall, 45.3% contained mathematically incorrect safety calculations that silently violated OSHA requirements. GPT-4o-Mini exhibited the highest Silent Failure Rate (SFR) at 56.3%, followed by Gemini 2.5 Flash at 41.4% and Claude 3.5 Haiku at 35.3%. Contextless outputs compounded the risk: 31.56% of scripts emitted only raw numerical results, leaving end-users to interpret them without any actionable directive.
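A silent failure, in the paper's sense, is a script that executes cleanly but computes the wrong safety answer. One way to detect it is differential testing against a deterministic reference. The sketch below is illustrative: the reference encodes the real OSHA construction fall-protection threshold (29 CFR 1926.501, 6 ft), while `llm_generated_check` is a hypothetical faulty script that misapplies the 4 ft general-industry threshold:

```python
def reference_fall_protection_required(height_ft: float) -> bool:
    """Deterministic reference: OSHA 29 CFR 1926.501 requires fall
    protection at 6 ft or more above a lower level in construction."""
    return height_ft >= 6.0

def llm_generated_check(height_ft: float) -> bool:
    # Hypothetical silent failure: runs without error, but applies the
    # general-industry 4 ft threshold with a strict inequality.
    return height_ft > 4.0

def is_silent_failure(candidate, reference, probe_heights) -> bool:
    """A script fails silently if it executes without error yet
    disagrees with the reference on any probe input."""
    try:
        return any(candidate(h) != reference(h) for h in probe_heights)
    except Exception:
        return False  # an overt crash is not a *silent* failure
```

At 5 ft the faulty script demands fall protection where the construction standard does not, and no exception is ever raised; the error is invisible unless a reference oracle exists.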
Theoretical and Practical Implications
This empirical audit exposes a fundamental disconnect between syntactic and semantic code reliability in LLM-driven vibe coding. High execution rates actively mask hazardous mathematical errors, supporting the need to distinguish between overt crashes and silent regulatory violations. The statistical relationship between user context and hallucination underscores that LLM outputs inherit and amplify communication deficits inherent in informal, high-stress environments—prompt engineering alone cannot mitigate this.
From a practical perspective, unconstrained vibe coding is incompatible with the deterministic, zero-tolerance ethos of construction safety. Raw LLM outputs, regardless of apparent functionality, should not be trusted for stand-alone calculation or compliance-critical applications. The lack of actionable guidance in outputs increases cognitive load and error risk for non-technical personnel. Organizational strategies must prioritize layered, deterministic architectures, restricting LLMs to natural language interfaces and precluding them from autonomous logic engineering. Domain-specific regulatory logic must be executed exclusively by pre-audited, formally verified engines.
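The layered architecture described above, LLM as natural-language front end, pre-audited engine as the sole executor of regulatory logic, can be sketched as follows. All names (`FallProtectionQuery`, `audited_engine`, `handle_request`) and the single-rule engine are assumptions for illustration, not a design from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FallProtectionQuery:
    """Structured query: the only thing the LLM is allowed to produce."""
    height_ft: float

def audited_engine(query: FallProtectionQuery) -> str:
    """Pre-audited deterministic logic (OSHA 29 CFR 1926.501, 6 ft rule).
    The LLM can never modify this threshold."""
    if query.height_ft >= 6.0:
        return "Fall protection REQUIRED (>= 6 ft above a lower level)."
    return "Fall protection not required at this height; verify other hazards."

def handle_request(natural_language: str, extract_parameters) -> str:
    """The LLM (extract_parameters) only maps language to a structured
    query; malformed output is rejected rather than executed."""
    query = extract_parameters(natural_language)
    if not isinstance(query, FallProtectionQuery):
        raise ValueError("LLM output rejected: not a valid structured query")
    return audited_engine(query)
```

The design choice is the key point: even a hallucinating extractor can at worst produce a wrong *input*, which is auditable, rather than wrong *logic*, which fails silently.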
Future Directions
The study's findings delineate a clear research agenda: ongoing model assessment under evolving architectures; investigation of advanced prompting, retrieval augmentation, and tool calling to reduce logic errors; human-in-the-loop revision workflows; and systematic evaluation in broader regulatory and industrial domains. Silent Failure Rates should be central to the assessment of any AI system deployed in cyber-physical or safety-critical environments. The integration of deterministic safety wrappers remains imperative for AI deployment in the construction sector and beyond.
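If SFR is to become a standard assessment metric, it helps to pin down its computation. A plausible reading of the paper's usage, incorrect-but-executable scripts over all executable scripts, is sketched below; the exact denominator the authors use is an assumption here:

```python
def silent_failure_rate(results) -> float:
    """SFR (%): share of *executable* scripts whose safety logic is wrong,
    i.e., scripts that run without error yet produce incorrect results.
    Each result dict has boolean 'executed' and 'logic_correct' fields."""
    executable = [r for r in results if r["executed"]]
    if not executable:
        return 0.0
    silent = sum(1 for r in executable if not r["logic_correct"])
    return 100.0 * silent / len(executable)
```

Crashed scripts are excluded from the denominator by design: an overt crash is a visible failure, whereas SFR isolates the dangerous subset that looks functional.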
Conclusion
This assessment reveals critical vulnerabilities in zero-shot vibe coding for construction safety. User persona effects, superficial syntactic reliability, and the absence of defensive programming converge to produce deceptively functional but hazardous code artifacts. Current frontier LLMs lack the deterministic rigor required for unsupervised safety-critical engineering; robust governance and architectural controls are prerequisites for integration into operational workflows. Future research must focus on technical, organizational, and workflow safeguards that reconcile the accessibility of LLMs with the demands of regulated, high-risk domains.