- The paper demonstrates that vibe coding with current LLMs poses significant safety risks, with high rates of hallucination and silent failure that undermine regulatory compliance.
- A rigorous methodology using persona-driven prompts and sandbox evaluations revealed a stark contrast between syntactic execution and semantic, safety-critical reliability.
- Findings underscore the need for robust, deterministic safety frameworks to integrate LLM outputs with traditional, formally verified systems.
Empirical Assessment of Vibe Coding for Construction Safety: Reliability, Risks, and Implications
Introduction
The paper "Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety" (2604.12311) investigates the efficacy and safety of vibe coding—a paradigm where non-technical users employ LLMs to generate executable code using natural language prompts. In the construction sector, this approach empowers on-site personnel to rapidly develop custom safety tools. However, the delegation of code generation to probabilistic LLMs raises profound concerns regarding logic fidelity, regulatory compliance, and the risk of silent failures in safety-critical applications.
Methodology
The study adopts a rigorous, three-phase empirical framework: (1) persona-driven prompt dataset creation, (2) code generation via frontier LLMs, and (3) execution and comprehensive evaluation in an isolated sandbox. Specifically, 150 prompts, designed to emulate three distinct construction personas (Safety Manager, Foreman, Worker), were used to generate 450 Python scripts from three consumer-accessible LLMs: Claude 3.5 Haiku, GPT-4o-Mini, and Gemini 2.5 Flash. Strict non-interactive architectural constraints were enforced, requiring headless backend scripts with hardcoded variables. Performance was evaluated both dynamically (sandbox execution) and statically (LLM-as-a-Judge against OSHA regulations), validated by human raters (Fleiss' Kappa = 0.84).
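The paper does not publish its evaluation harness, but the dynamic phase it describes, executing each generated script in an isolated sandbox and recording whether it ran to completion, can be sketched roughly as follows. All names here (`evaluate_script`, the timeout value) are illustrative assumptions, not the study's actual code:

```python
import subprocess
import sys
from pathlib import Path

def evaluate_script(script_path: Path, timeout_s: int = 10) -> dict:
    """Run one generated script in a separate interpreter process and
    record the execution outcome (a sketch of sandboxed dynamic evaluation)."""
    try:
        result = subprocess.run(
            [sys.executable, "-I", str(script_path)],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return {
            "script": script_path.name,
            "executed": result.returncode == 0,
            "stdout": result.stdout,
            "stderr": result.stderr,
        }
    except subprocess.TimeoutExpired:
        # A hang (e.g., an unexpected input() prompt in a supposedly
        # headless script) is treated as an execution failure.
        return {"script": script_path.name, "executed": False,
                "stdout": "", "stderr": "timeout"}
```

The timeout matters: a script that blocks on `input()` violates the study's headless constraint and would otherwise stall a batch evaluation of 450 scripts.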
Results
Persona-Driven Hallucination Dynamics
The analysis demonstrates a statistically significant influence of user persona on LLM propensity to hallucinate safety-critical variables. Hallucination rates were lowest for the Safety Manager persona (64.70%), but escalated markedly for Foreman (82.00%) and Worker personas (87.30%). Odds ratios confirm that informal, ambiguous prompting substantially increases the likelihood of LLMs inventing missing safety data—a critical risk factor in safety engineering.
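To make the odds-ratio claim concrete, the marginal rates above can be converted to odds and compared across personas. This is a minimal sketch using the reported proportions directly; the paper's own statistical model (and any covariate adjustment it applies) may differ:

```python
def odds(p: float) -> float:
    """Convert a proportion to odds."""
    return p / (1.0 - p)

def odds_ratio(p_persona: float, p_reference: float) -> float:
    """Odds of hallucination for one persona relative to a reference persona."""
    return odds(p_persona) / odds(p_reference)

# Reported hallucination rates per persona (from the study).
rates = {"Safety Manager": 0.647, "Foreman": 0.820, "Worker": 0.873}

# Odds of hallucination under Worker prompts vs. Safety Manager prompts.
or_worker = odds_ratio(rates["Worker"], rates["Safety Manager"])  # ≈ 3.75
```

On these marginals, Worker-style prompts carry roughly 3.75 times the odds of a hallucinated safety variable compared with Safety Manager prompts, which is the quantitative core of the "ambiguous prompting increases hallucination" finding.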
Architectural Robustness: Syntactic Reliability vs Defensive Design
Overall code execution viability was high (~85%), with GPT-4o-Mini and Gemini 2.5 Flash successfully executing over 93% of scripts; Claude 3.5 Haiku lagged considerably at 68%. Notably, adherence to the headless architecture (no input() calls) was nearly perfect. However, defensive programming indicators were universally deficient: error handling mechanisms appeared in only 16% of scripts, and in a mere 3.3% of those from GPT-4o-Mini. Models that excelled at syntactic correctness conspicuously failed to default to robust, defensive architectures.
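Both indicators discussed above, headless compliance and the presence of error handling, can be checked statically without running the script. The following sketch uses Python's standard `ast` module; the function name and the choice of indicators are illustrative, not the study's actual instrumentation:

```python
import ast

def audit_script(source: str) -> dict:
    """Statically flag two design indicators in a generated script:
    headless compliance (no input() calls) and defensive design
    (presence of at least one try/except block)."""
    tree = ast.parse(source)
    uses_input = any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "input"
        for node in ast.walk(tree)
    )
    has_error_handling = any(isinstance(node, ast.Try) for node in ast.walk(tree))
    return {"headless": not uses_input, "defensive": has_error_handling}
```

A check like this captures exactly the asymmetry the study reports: a script can pass the headless test perfectly while failing the defensive one.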
OSHA Logic Fidelity and Silent Failures
OSHA logic fidelity was alarmingly poor; only 46.7% of scripts conformed to regulatory standards. More critically, silent failures were prevalent across executable scripts: overall, 45.3% contained mathematically incorrect safety calculations that silently violated OSHA requirements. GPT-4o-Mini exhibited the highest Silent Failure Rate (SFR) at 56.3%, followed by Gemini 2.5 Flash at 41.4% and Claude 3.5 Haiku at 35.3%. Contextless outputs compounded the risk: 31.56% of scripts emitted only raw numerical results, leaving end-users to interpret them without any actionable directive.
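A silent failure, in the paper's sense, is a script that executes cleanly but computes the wrong safety answer. One way to detect it is differential testing against a deterministic reference. The sketch below is illustrative: the reference encodes the real OSHA construction fall-protection threshold (29 CFR 1926.501, 6 ft), while `llm_generated_check` is a hypothetical faulty script that misapplies the 4 ft general-industry threshold:

```python
def reference_fall_protection_required(height_ft: float) -> bool:
    """Deterministic reference: OSHA 29 CFR 1926.501 requires fall
    protection at 6 ft or more above a lower level in construction."""
    return height_ft >= 6.0

def llm_generated_check(height_ft: float) -> bool:
    # Hypothetical silent failure: runs without error, but applies the
    # general-industry 4 ft threshold with a strict inequality.
    return height_ft > 4.0

def is_silent_failure(candidate, reference, probe_heights) -> bool:
    """A script fails silently if it executes without error yet
    disagrees with the reference on any probe input."""
    try:
        return any(candidate(h) != reference(h) for h in probe_heights)
    except Exception:
        return False  # an overt crash is not a *silent* failure
```

At 5 ft the faulty script demands fall protection where the construction standard does not, and no exception is ever raised; the error is invisible unless a reference oracle exists.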
Theoretical and Practical Implications
This empirical audit exposes a fundamental disconnect between syntactic and semantic code reliability in LLM-driven vibe coding. High execution rates actively mask hazardous mathematical errors, supporting the need to distinguish between overt crashes and silent regulatory violations. The statistical relationship between user context and hallucination underscores that LLM outputs inherit and amplify communication deficits inherent in informal, high-stress environments—prompt engineering alone cannot mitigate this.
From a practical perspective, unconstrained vibe coding is incompatible with the deterministic, zero-tolerance ethos of construction safety. Raw LLM outputs, regardless of apparent functionality, should not be trusted for stand-alone calculation or compliance-critical applications. The lack of actionable guidance in outputs increases cognitive load and error risk for non-technical personnel. Organizational strategies must prioritize layered, deterministic architectures, restricting LLMs to natural language interfaces and precluding them from autonomous logic engineering. Domain-specific regulatory logic must be executed exclusively by pre-audited, formally verified engines.
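The layered architecture described above, LLM as natural-language front end, pre-audited engine as the sole executor of regulatory logic, can be sketched as follows. All names (`FallProtectionQuery`, `audited_engine`, `handle_request`) and the single-rule engine are assumptions for illustration, not a design from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FallProtectionQuery:
    """Structured query: the only thing the LLM is allowed to produce."""
    height_ft: float

def audited_engine(query: FallProtectionQuery) -> str:
    """Pre-audited deterministic logic (OSHA 29 CFR 1926.501, 6 ft rule).
    The LLM can never modify this threshold."""
    if query.height_ft >= 6.0:
        return "Fall protection REQUIRED (>= 6 ft above a lower level)."
    return "Fall protection not required at this height; verify other hazards."

def handle_request(natural_language: str, extract_parameters) -> str:
    """The LLM (extract_parameters) only maps language to a structured
    query; malformed output is rejected rather than executed."""
    query = extract_parameters(natural_language)
    if not isinstance(query, FallProtectionQuery):
        raise ValueError("LLM output rejected: not a valid structured query")
    return audited_engine(query)
```

The design choice is the key point: even a hallucinating extractor can at worst produce a wrong *input*, which is auditable, rather than wrong *logic*, which fails silently.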
Future Directions
The study's findings delineate a clear research agenda: ongoing model assessment under evolving architectures; investigation of advanced prompting, retrieval augmentation, and tool calling to reduce logic errors; human-in-the-loop revision workflows; and systematic evaluation in broader regulatory and industrial domains. Silent Failure Rates should be central to the assessment of any AI system deployed in cyber-physical or safety-critical environments. The integration of deterministic safety wrappers remains imperative for AI deployment in the construction sector and beyond.
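If SFR is to become a standard assessment metric, it helps to pin down its computation. A plausible reading of the paper's usage, incorrect-but-executable scripts over all executable scripts, is sketched below; the exact denominator the authors use is an assumption here:

```python
def silent_failure_rate(results) -> float:
    """SFR (%): share of *executable* scripts whose safety logic is wrong,
    i.e., scripts that run without error yet produce incorrect results.
    Each result dict has boolean 'executed' and 'logic_correct' fields."""
    executable = [r for r in results if r["executed"]]
    if not executable:
        return 0.0
    silent = sum(1 for r in executable if not r["logic_correct"])
    return 100.0 * silent / len(executable)
```

Crashed scripts are excluded from the denominator by design: an overt crash is a visible failure, whereas SFR isolates the dangerous subset that looks functional.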
Conclusion
This assessment reveals critical vulnerabilities in zero-shot vibe coding for construction safety. User persona effects, superficial syntactic reliability, and the absence of defensive programming converge to produce deceptively functional but hazardous code artifacts. Current frontier LLMs lack the deterministic rigor required for unsupervised safety-critical engineering; robust governance and architectural controls are prerequisites for integration into operational workflows. Future research must focus on technical, organizational, and workflow safeguards that reconcile the accessibility of LLMs with the demands of regulated, high-risk domains.