Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code

Published 7 Apr 2026 in cs.CR, cs.AI, and cs.SE | (2604.05292v1)

Abstract: AI coding assistants are now used to generate production code in security-sensitive domains, yet the exploitability of their outputs remains unquantified. We address this gap with Broken by Default: a formal verification study of 3,500 code artifacts generated by seven frontier LLMs across 500 security-critical prompts (five CWE categories, 100 prompts each). Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics. Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses. GPT-4o leads at 62.4% (grade F); Gemini 2.5 Flash performs best at 48.4% (grade D). No model achieves a grade better than D. Six of seven representative findings are confirmed with runtime crashes under GCC AddressSanitizer. Three auxiliary experiments show: (1) explicit security instructions reduce the mean rate by only 4 points; (2) six industry tools combined miss 97.8% of Z3-proven findings; and (3) models identify their own vulnerable outputs 78.7% of the time in review mode yet generate them at 55.8% by default.


Summary

  • The paper demonstrates that seven LLMs generate exploitable code with a mean vulnerability rate of 55.8%, as confirmed by Z3 SMT-based proofs.
  • The authors deploy the COBALT engine to structurally extract vulnerability patterns and encode them as SMT constraints, linking formal proofs with real-world exploits.
  • The study reveals that static analysis tools detect only 7.6% of vulnerabilities, highlighting the need for robust formal verification in AI-generated code.

Formal Verification Analysis of AI-Generated Code Security Vulnerabilities

Introduction

"Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code" (2604.05292) delivers a comprehensive empirical evaluation of code security in outputs from seven frontier LLMs. This study leverages the rigor of formal verification via the Z3 SMT solver to establish concrete exploitability of vulnerabilities, surpassing prior pattern-based or heuristic approaches. The work covers 3,500 code artifacts generated from 500 security-sensitive prompts, mapped to critical Common Weakness Enumeration (CWE) classes, and quantifies model susceptibility, detection tool efficacy, model self-awareness, and the ultimate limits of current mitigation strategies.

Methodological Rigor and Experiment Design

The analysis pipeline hinges on the COBALT engine, a multi-stage static analysis tool that (1) identifies candidate vulnerability patterns via structural extraction, (2) encodes those patterns as SMT constraints for Z3, and (3) extracts concrete witness inputs for every Z3-proven SAT case, yielding mathematical proof of exploitability. The experiments cover five high-impact CWE classes: buffer (CWE-131/190), integer (CWE-190/195), authentication (CWE-916), cryptography (CWE-327/330), and input handling (CWE-89/22/78). The prompt set is constructed to elicit practical, non-adversarial code of the kind produced in real-world developer workflows.
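
To make step (2) concrete, the following is a minimal sketch, in Z3's C API, of the kind of satisfiability query such a pipeline can pose for the unchecked allocation pattern discussed below. The encoding is our illustration (the paper's actual COBALT encodings are not reproduced in this summary); the Z3 API calls themselves are standard.

// build: cc z3_overflow_check.c -lz3
#include <stdbool.h>
#include <stdio.h>
#include <z3.h>

int main(void) {
    Z3_config cfg = Z3_mk_config();
    Z3_context ctx = Z3_mk_context(cfg);
    Z3_del_config(cfg);
    Z3_solver s = Z3_mk_solver(ctx);
    Z3_solver_inc_ref(ctx, s);

    // Model n as an attacker-controlled 32-bit value; sizeof(int) == 4.
    Z3_sort bv32 = Z3_mk_bv_sort(ctx, 32);
    Z3_ast n = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "n"), bv32);
    Z3_ast four = Z3_mk_unsigned_int(ctx, 4, bv32);

    // Assert that the unsigned product n * 4 overflows, i.e. the
    // no-overflow predicate is violated.
    Z3_ast no_ovf = Z3_mk_bvmul_no_overflow(ctx, n, four, false);
    Z3_solver_assert(ctx, s, Z3_mk_not(ctx, no_ovf));

    if (Z3_solver_check(ctx, s) == Z3_L_TRUE) {
        // SAT: the model assigns n a concrete witness value.
        Z3_model m = Z3_solver_get_model(ctx, s);
        Z3_model_inc_ref(ctx, m);
        printf("satisfiability witness:\n%s\n", Z3_model_to_string(ctx, m));
        Z3_model_dec_ref(ctx, m);
    }

    Z3_solver_dec_ref(ctx, s);
    Z3_del_context(ctx);
    return 0;
}

A SAT result with its model is exactly the "satisfiability witness" the paper relies on: a concrete value of n (here, any n > 0x3FFFFFFF) that wraps the 32-bit multiplication.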

Seven LLMs—including GPT-4o, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3 70B—were evaluated under strictly controlled API configurations, with all outputs generated at deterministic zero-temperature settings to maximize reproducibility and analysis tractability.

Quantitative Findings: Code Generation Vulnerability Landscape

Aggregate Vulnerability Rates

Across the 3,500 samples, the mean vulnerability rate is 55.8%. GPT-4o produces vulnerable code at 62.4% (grade F), while Gemini 2.5 Flash achieves the lowest (yet still unacceptably high) rate at 48.4% (grade D). No model earns a grade better than D. Severity analysis shows that CRITICAL vulnerabilities dominate, with the memory allocation (67%) and integer arithmetic (87%) categories exhibiting the highest model-agnostic failure rates.

Notably, 1,055 vulnerabilities are not merely pattern matches but are formally proven exploitable via concrete Z3 witnesses. Runtime validation with GCC AddressSanitizer confirmed exploitable memory faults in 6 of 7 proof-of-concept harnesses, empirically linking the formal findings to practical exploit outcomes.
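
For concreteness, here is a self-contained harness of the kind such validation uses; it is our sketch (the paper's harnesses and witness values are not reproduced in this summary), showing how a witness value turns the allocation bug into a crash ASAN can report:

// build: gcc -fsanitize=address -g poc.c -o poc
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    // Witness-style input: under 32-bit size arithmetic the product
    // n * sizeof(int) wraps, so the buffer is far smaller than intended.
    uint32_t n = 0x40000001u;                    // illustrative witness value
    uint32_t bytes = n * (uint32_t)sizeof(int);  // wraps to 4
    int *buf = malloc(bytes);
    if (!buf) return 0;
    buf[10] = 42;   // ASAN reports heap-buffer-overflow here
    free(buf);
    return 0;
}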

CWE Distribution and Dominant Failure Modes

Memory management and integer vulnerabilities are the most common failure modes, with all LLMs consistently generating unsafe malloc size calculations of the form:

int* buf = malloc(n * sizeof(int)); // unsafe: no overflow guard

The safe pattern guards the multiplication before allocating:

if (n > SIZE_MAX / sizeof(int)) return NULL;
int* buf = malloc(n * sizeof(int)); // safe

Such guards are exceedingly rare in model outputs, indicating strong negative inductive biases inherited from internet-scale code corpora and insufficient correction by post-pretraining alignment.
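
The guard generalizes into a checked-allocation helper of the kind the paper's recommendations point toward; the safe_malloc name appears later in the applications discussion, but this particular implementation is our sketch:

#include <stdint.h>
#include <stdlib.h>

// Returns NULL instead of under-allocating when n * elem_size would wrap.
static void *safe_malloc(size_t n, size_t elem_size) {
    if (elem_size != 0 && n > SIZE_MAX / elem_size) return NULL;
    return malloc(n * elem_size);
}

Call sites then read int *buf = safe_malloc(n, sizeof(int)); and the overflow check can no longer be forgotten at any individual allocation.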

Evaluation of Mitigation: Security Prompting and Static Analysis Tools

Prompt Conditioning Fails to Mitigate Vulnerabilities

Adding explicit security instructions to system prompts yields, on average, a mere 4-percentage-point reduction in vulnerability rates (from 64.8% to 60.8%) on a subset of 50 prompts. Four of five models remain at grade F, and some models (e.g., Llama 3.3 70B) perform worse with security-targeted instructions. Category-level improvement is marginal, especially for the memory and integer classes. This supports the hypothesis that the probability of unsafe code emission is anchored in the learned prior and is not substantially modifiable by instruction tweaking alone.

Structural Limitations of Industry Static Analysis

Six widely used commercial and open-source static analysis tools (Semgrep, Bandit, Cppcheck, Clang SA, FlawFinder, CodeQL) exhibit catastrophic blind spots: collectively detecting only 7.6% of the vulnerabilities, and missing 97.8% of formally proven (Z3 SAT) bugs. No tool—regardless of configuration—found a single instance of integer overflow in allocation arithmetic, and CodeQL, despite advanced dataflow reasoning, failed to detect any confirmed vulnerabilities (0/90). This exposes a foundational limitation: pattern and path-sensitive analyzers cannot reason about symbolic arithmetic over unconstrained attacker-controlled inputs, a gap only SMT-based approaches can bridge at scale.

Generation–Review Asymmetry and Model Self-Knowledge

A notable experimental result is the revealed generation–review asymmetry. When models were tasked with reviewing code they had themselves generated (and which had been proven vulnerable via Z3), they successfully identified the vulnerabilities in 78.7% of cases. Yet those same vulnerabilities appeared in 55.8% of their direct generations by default. This demonstrates that models encode latent knowledge of these vulnerability patterns but fail to activate the corresponding safety constraints during generation, and it indicates that RLHF and instruction fine-tuning, as currently practiced, are insufficient to bridge the gap between security assessment and reliable secure code emission.

Implications and Recommendations

This study's findings have direct implications for AI-powered software engineering pipelines, especially in security- and safety-critical contexts. AI-generated code, especially in C/C++, must be assumed insecure by default and subjected to the same or greater scrutiny as unaudited legacy code. Security prompt prefixes are not meaningful mitigations. Industry-standard static analyzers are grossly inadequate at detecting the most prevalent, dangerous bug classes emitted by LLMs. Integrating formal SMT-based verification or comprehensive manual audit into deployment pipelines is mandatory, particularly for dynamic memory and boundary-sensitive logic.

Limitations and Future Directions

While the prompt set (500 prompts) is broad and the model selection state-of-the-art, the results are limited to five major CWE categories and deterministic decoding at temperature zero. Further work is needed to extend the secure-prompt ablation to the full prompt suite, to cover additional programming languages (Go, Rust, JavaScript), and to evaluate model fine-tuning aimed specifically at memory arithmetic and integer overflow mitigation. Multi-turn and retrieval-augmented generation, as well as adversarial prompt slicing, remain unexplored dimensions.

Conclusion

This formal verification study (2604.05292) systematically demonstrates that current-leading LLMs reliably generate code with provably exploitable vulnerabilities across a majority of reasonable, real-world prompt scenarios—even under explicit security prompting. Standard static analysis tools are inadequate, detecting only the most superficial patterns, and fail to flag structurally hard vulnerabilities confirmed by mathematical proofs and runtime exploits. LLMs can recognize vulnerabilities in review but fail to prevent them during generation, underscoring the limits of present-day alignment approaches. Formal SMT-based analysis with tools like Z3 stands as the only scalable solution for ground-truth vulnerability certification in LLM-generated code. Rigorous security auditing and methodological advances are essential as AI coding assistants are further integrated into production pipelines for systems code.


Explain it Like I'm 14

What is this paper about?

This paper asks a simple but important question: when AI tools write code for us, how often is that code unsafe? The authors tested seven popular AI coding models and used math-based checks to see if the code could actually be broken in real life. Their big takeaway: more than half of the code these AIs wrote had security problems that could be exploited.

What were the researchers trying to find out?

They focused on six plain questions:

  • How often do leading AI models write code with security bugs?
  • Which models are better or worse, and what kinds of bugs show up most?
  • Can those bugs be used to crash or attack a program for real, not just in theory?
  • Do clear “please be secure” instructions help reduce bugs?
  • How well do common code-checking tools catch these AI-made bugs?
  • Can the models spot their own mistakes when asked to review their code?

How did they test this? (Explained simply)

Think of the process like inspecting bridges built by robots:

  • They gave the AI models 500 real-world coding tasks that are known to be risky if done wrong (like building a safe login system or handling memory in C). They repeated these across seven models, producing 3,500 code samples total.
  • They checked the code with a “math detective” called Z3. This is an SMT solver—a tool that uses strict math rules to ask: “Is there any input that will make this code fail in a dangerous way?” If the answer is yes, Z3 also gives a “witness,” which is the exact input that causes the bug. That’s like a lockpick that exactly fits a flawed lock.
  • They used a pipeline called COBALT to:
    • Find suspicious code patterns,
    • Translate those patterns into math for Z3 to analyze,
    • Collect witnesses (the exact inputs that cause problems).
  • For extra proof, they ran some of the code with AddressSanitizer (ASAN), which is like a safety siren that goes off if the program hits a memory error while running.
  • They also compared COBALT to six industry tools (think “spell-checkers for code”) to see what those tools could catch.
  • Finally, they asked each model to review its own buggy code to see if it could spot the problem after the fact.

Key terms in everyday language:

  • Formal verification: Proving with math that something can go wrong, not just guessing.
  • SMT solver (Z3): A math engine that checks if a bug is truly possible and shows how to trigger it.
  • Witness: The exact input that makes the bug happen.
  • AddressSanitizer (ASAN): A tool that makes programs scream when they step out of bounds in memory.
  • Static analysis tools (like Semgrep, Bandit, Cppcheck, CodeQL): Automated “code spell-checkers” that look for risky patterns—usually without running the code.

What did they discover, and why does it matter?

Here are the main results, in simple terms:

  • More than half of AI-written code was vulnerable: On average, 55.8% of the code had at least one security flaw. The best model still made unsafe code about 48% of the time, and the worst about 62% of the time. No model “passed” by everyday grading standards.
  • The most common—and dangerous—mistakes were about numbers and memory:
    • Integer overflows: when a multiplication like “n * size” wraps around unexpectedly, the program allocates too little memory, which can cause crashes or security holes.
    • Memory allocation errors in C/C++: Not checking if a number is too large before allocating memory, which can corrupt memory.
  • The math proofs held up in practice: They turned seven example bugs into real tests and six of them caused crashes or data leaks when run—proving these aren’t just theoretical.
  • Telling models “be secure” barely helped: Adding explicit security instructions reduced problems by only about 4 percentage points. Most models still produced a lot of bad code.
  • Popular code-checking tools missed almost all of the serious, proven bugs: The six industry tools combined flagged only 7.6% of cases, and they missed 97.8% of the bugs that the math engine (Z3) proved were exploitable. In other words, these tools usually don’t catch the kinds of math-based memory errors that AIs often introduce.
  • Models can find bugs after they write them: When asked to review their own code, models correctly identified about 79% of the proven bugs. But they still generated bad code more than half the time. This shows a “generation–review asymmetry”: they know the rules when checking, but don’t consistently apply them when writing.

Why this matters:

  • Many developers now use AI assistants for real projects. If the code is unsafe this often, teams can accidentally ship serious vulnerabilities unless they perform careful reviews and testing.

What does this mean going forward?

  • Don’t trust AI-written code by default, especially in languages like C/C++ where memory safety is tricky.
  • Simply telling the AI “write secure code” is not enough.
  • Don’t rely only on common static analysis tools—they often miss the exact kinds of bugs AIs introduce (like integer overflows in memory allocation).
  • Use stronger checks where it counts:
    • Compile and run with sanitizers (like AddressSanitizer) to catch memory errors during testing.
    • Where possible, use formal methods (like Z3) or careful human reviews for risky code paths (especially any code that calculates sizes for memory).
  • For researchers and tool builders: math-based verification (formal verification) is essential for catching the deep, arithmetic-related bugs that simpler tools miss.

In short: The study shows that, today, AI code assistants tend to produce insecure code by default. The safest path is to treat AI-generated code as a first draft that must be checked—preferably with strong tools and human review—before it goes anywhere near production systems.

Knowledge Gaps

Unresolved knowledge gaps, limitations, and open questions

The paper advances formal verification of AI-generated code but leaves several concrete research gaps and open questions for future work:

  • Generalizability beyond five CWE families: Evaluate additional categories (e.g., authorization logic, race conditions/deadlocks, resource management, deserialization, SSRF/SSRF-like network issues, XSS/CSRF) to determine whether the high vulnerability rates persist outside MEM/INT/INP/AUTH/CRYPTO.
  • Language coverage expansion: Assess languages beyond C/C++ and Python (e.g., Rust, Go, Java, C#, JavaScript/TypeScript, PHP) to see whether memory-safe languages and different ecosystems shift vulnerability types and rates.
  • Real-world context evaluation: Move from prompt-synthesized snippets to repo-scale, multi-file, buildable projects (with unit tests and CI) to capture context-dependent vulnerabilities and realistic dataflows.
  • Multi-turn and tool-augmented workflows: Measure whether step-by-step reasoning, chain-of-verification, self-review/repair loops, retrieval-augmented generation, or policy templates reduce the vulnerability rate relative to single-shot generation.
  • Sampling strategy effects: Assess temperature, top-p, n-best sampling, self-consistency/majority vote, and reranking by review criteria; quantify trade-offs between diversity, correctness, and security.
  • Secure prompt ablation at scale: Repeat the “secure prompt” ablation on the full 500-prompt v3 corpus; systematically vary prompt phrasing, placement (system vs. user), and inclusion of secure examples to map sensitivity.
  • Generation–review asymmetry mitigation: Test interventions that force models to apply review knowledge during generation (e.g., generate-with-checklist, dual-pass generate→review→regenerate pipelines) and quantify reductions in vulnerability rates.
  • COBALT recall is unknown: Establish recall via (a) manual expert audits on a statistically powered sample, and/or (b) seeded/known-vulnerable ground truths to estimate how many vulnerabilities COBALT misses.
  • PATTERN MATCH precision: Quantify false-positive rates for non–Z3-proven findings through manual triage and/or dynamic confirmation to avoid inflating headline vulnerability rates.
  • SMT encoding fidelity to real toolchains: Rigorously align Z3 encodings with language and compiler semantics (integer promotions, signedness, 32/64-bit size_t, undefined behavior) and report per-artifact assumptions; replicate across compilers, optimization levels, and architectures (a short illustration of this pitfall follows the list).
  • Threat model specification: Clearly define which inputs are attacker-controlled and under what preconditions; evaluate sensitivity of results to different trust assumptions to avoid over-approximation in isolated snippets.
  • Witness feasibility under resource constraints: Analyze whether Z3-produced witnesses are practically triggerable given typical memory limits, OS constraints, and application-level input validation.
  • Larger-scale runtime validation: Expand beyond 7 PoCs by fuzzing or targeted test generation across a representative subset; correlate Z3 SAT classes with observed crash types and exploit chains.
  • Broader tool comparison: Include advanced/commercial SAST and formal tools (e.g., Coverity, SonarQube/SonarLint, Meta’s Infer, Frama-C/EVA, Astrée, CBMC, KLEE, Symbiotic, SMACK, Seahorn) and the latest CodeQL versions with expert-tuned configurations to bound the claimed structural gap.
  • Hybrid analysis baselines: Compare SMT-only to symbolic execution, model checking, and hybrid fuzzing+SMT pipelines; assess whether combinations improve detection and confirmation rates.
  • Whole-program reasoning: Evaluate how results change with interprocedural/whole-program analysis (vs. snippet-level) where call-site constraints and invariants may prove safety or create new attack paths.
  • CVSS severity mapping validity: Document how CVSS base metrics were derived per finding; conduct sensitivity analysis showing how different environmental metrics or assumptions affect grade distributions.
  • Statistical rigor of model comparisons: Report confidence intervals, effect sizes, and significance tests; use bootstrapping across prompts and repeated runs; analyze stability across API/model updates.
  • Temporal robustness: Longitudinally re-run the benchmark as models evolve to quantify drift, regressions, and improvements; evaluate the impact of provider-side safety patches.
  • Defensive libraries and idioms: Test whether steering models toward safe wrappers (e.g., checked allocation APIs, snprintf, parameterized queries, password hashing with salt+work factor) materially reduces vulnerabilities.
  • Secure fine-tuning efficacy: Perform controlled fine-tuning/RL experiments on secure coding datasets and measure transfer to unseen CWEs and categories; report costs vs. gains.
  • Python/dynamic-language semantics: Specify and validate SMT/taint encodings for dynamic typing, string/SQL APIs, and library behaviors; ensure exploitability proofs align with real interpreter/runtime semantics.
  • Environment dependence: Systematically vary OS/libc/Python versions and runtime hardening (e.g., ASLR, hardened malloc, sandboxing) to assess how environmental defenses mediate exploitability.
  • Scalability and performance: Quantify the computational cost and throughput of COBALT+Z3 on larger codebases; explore slicing, memoization, and incremental analysis to make verification practical in CI.
  • Benchmark representativeness: Assess and mitigate prompt-selection bias by cross-validating with external datasets (e.g., LLMSecEval, SecurityEval, real GitHub issues/PRs); solicit community-contributed prompts.
  • Automated repair quality: Beyond detection, evaluate whether models can reliably patch their own insecure outputs and avoid regressions; measure correctness, security, and functionality after fixes.
  • Developer workflow integration: Prototype IDE/CI integrations combining LLMs with formal checks and auto-fixes; run user studies to measure net vulnerability reduction, productivity impacts, and adoption barriers.
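
To illustrate the encoding-fidelity pitfall named above with a concrete case (our illustration, not code from the paper): the same source expression overflows under 32-bit size arithmetic yet is fine under 64-bit size_t, so an SMT encoding must commit to the width the real toolchain uses.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t n = 0x40000001u;                    // plausible attacker-controlled count
    size_t wide = (size_t)n * sizeof(int);       // LP64: 4294967300, no wraparound
    uint32_t narrow = n * (uint32_t)sizeof(int); // 32-bit arithmetic: wraps to 4
    printf("64-bit: %zu bytes, 32-bit: %u bytes\n", wide, narrow);
    return 0;
}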

Practical Applications

Immediate Applications

The following applications can be deployed with existing tools, data, and workflows described in the paper (COBALT pipeline, Z3 SMT, ASAN), leveraging the study’s findings on high default vulnerability rates, the formal verification approach, the generation–review asymmetry, and the detection gap in industry tools.

  • Industry (software/DevSecOps) — CI “formal verification gate” for AI-generated code
    • Sector: software, finance, healthcare, energy, robotics/embedded, telecom.
    • What: Add a GitHub/GitLab CI job that runs COBALT’s Z3-based checks on new/modified code, especially C/C++ paths involving allocation and arithmetic; block merges on Z3 SAT findings and attach witness inputs.
    • Tools/products/workflows: “COBALT-as-a-GitHub-Action,” pre-commit hooks for C/C++; CI profiles that compile and run ASAN harnesses automatically using extracted Z3 witnesses.
    • Assumptions/dependencies: Availability of Z3/COBALT; buildable artifacts/harnesses; organizational tolerance for added CI latency; initial focus on languages/paths covered by the study (C/C++, some Python patterns).
  • Industry (software/DevSecOps) — Dual-pass generation + self-review policy for AI code
    • Sector: software across domains.
    • What: Enforce a workflow where LLM-generated code must pass (1) same-model self-review and (2) Z3 checks before merge, exploiting the 78.7% self-review detection rate and using formal verification as the pass/fail arbiter.
    • Tools/products/workflows: GitHub PR template with an automated LLM review step; automatic inline comments; mandatory green check from Z3/COBALT job.
    • Assumptions/dependencies: API access to the same model used for generation; cost and latency budget for LLM review; guardrails to avoid model hallucination in reviews.
  • Industry (software/DevSecOps) — Sanitizer-by-default builds for AI-touched code paths
    • Sector: embedded systems, robotics, automotive, fintech backends, medical devices.
    • What: Compile and test all AI-generated or -modified C/C++ with -fsanitize=address,undefined in CI and nightly pipelines; prioritize failures linked to Z3 witness values.
    • Tools/products/workflows: Build matrix with sanitizer profiles; witness-driven PoC harness execution; automatic artifact triage in issue trackers.
    • Assumptions/dependencies: Adequate test harnesses; performance overhead acceptable in test environments; selective targeting for critical modules to control cost.
  • Industry (tooling vendors) — SAST augmentation with SMT-backed checks
    • Sector: security tooling, IDEs, code scanning platforms.
    • What: Integrate SMT-based arithmetic reasoning (bit-vector semantics) into SAST engines to detect integer-overflow-in-allocation patterns that current tools miss (97.8% of Z3-proven cases).
    • Tools/products/workflows: CodeQL plugin or rule pack leveraging SMT; Semgrep “extended” rules that invoke Z3 for arithmetic constraints; VS Code/JetBrains extensions surfacing Z3 witnesses inline.
    • Assumptions/dependencies: Engineering to embed SMT; performance tuning to limit solver invocations to hot spots; licensing of SMT components.
  • Industry (LLM providers) — Red-teaming and regression benchmarking with the BBD dataset
    • Sector: AI/ML, foundation model providers.
    • What: Adopt the paper’s 500-prompt corpus and Z3-based labeling as a secure-code benchmark; publish model releases with “BBD grade” and Z3-proven vulnerability counts; track generation–review asymmetry as a release KPI.
    • Tools/products/workflows: Internal eval harness running COBALT; model release gates that fail if secure score regresses; public scorecards.
    • Assumptions/dependencies: Consistent API temperature settings; compute budget for evaluations.
  • Industry (enterprise risk) — AI-origin risk labeling in SBOMs and compliance gates
    • Sector: finance, healthcare, critical infrastructure, government suppliers.
    • What: Extend SBOMs and policy gates to mark AI-generated components and to require SMT-based verification or human security review for high-risk modules (auth, crypto, memory management).
    • Tools/products/workflows: SBOM annotations “AI-Origin: true”; build compliance checks that reject artifacts with unresolved Z3 SAT findings; audit-ready logs of witnesses and mitigations.
    • Assumptions/dependencies: SBOM tooling integration; internal policies updated to recognize “AI-origin” risk class.
  • Academia — Controlled studies and curricula using released prompts and labeled artifacts
    • Sector: computer security education, software engineering research.
    • What: Incorporate the BBD dataset into secure coding courses; build lab exercises where students patch Z3-proven bugs; run replication studies and ablations (e.g., prompt engineering effectiveness).
    • Tools/products/workflows: Classroom CI with Z3; assignments to convert Z3 witnesses into unit tests; reproducibility packages.
    • Assumptions/dependencies: Instructor familiarity with SMT/ASAN; computing resources.
  • Policy/Governance — Procurement and internal policy baselines for AI-generated code
    • Sector: regulators, critical infrastructure operators, large enterprises.
    • What: Require vendors to attest that AI-generated C/C++ code undergoes SMT-based checks and sanitizer testing for specified CWE classes (CWE-190/131/916, etc.); prohibit reliance on “secure prompt” language as a control.
    • Tools/products/workflows: RFP clauses specifying verification evidence (witnesses, logs); audit checklists aligned with NIST AI RMF, ISO/IEC 27001 controls.
    • Assumptions/dependencies: Regulatory appetite; clear scope boundary to avoid overburdening low-risk software.
  • Daily life/individual developers and OSS maintainers — Practical guard-rails for unsafe patterns
    • Sector: open-source, indie devs, student projects.
    • What: Use template snippets that insert overflow guards before allocation; run a local pre-commit tool (lightweight COBALT profile) to flag malloc(n * sizeof(T)) without guard; prefer memory-safe languages (Rust) for new modules.
    • Tools/products/workflows: VS Code snippet pack; git hooks invoking Z3 for critical files; adding ASAN jobs to GitHub Actions for PRs labeled “AI-generated.”
    • Assumptions/dependencies: Basic CI setup; willingness to adopt guard templates; performance overhead acceptable for small projects.
  • Cross-sector — Witness-seeded fuzzing
    • Sector: all with C/C++ components.
    • What: Use Z3 witnesses as seeds for fuzzers (e.g., libFuzzer/AFL) to rapidly confirm and expand fault coverage (a harness sketch follows this list).
    • Tools/products/workflows: Pipeline step that converts witnesses to fuzz corpora; crash triage automation.
    • Assumptions/dependencies: Fuzzer infrastructure; harness availability.
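
A sketch of the witness-seeded idea as a libFuzzer entry point follows; the routine under test, the build line, and the harness shape are our assumptions rather than artifacts from the paper:

// build: clang -g -fsanitize=fuzzer,address fuzz.c -o fuzz
// run:   ./fuzz corpus/   (corpus seeded with Z3 witness bytes)
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>

// Stand-in for an AI-generated routine with the unchecked pattern,
// using 32-bit size arithmetic so the overflow is observable.
static int *make_buffer(uint32_t n) {
    uint32_t bytes = n * (uint32_t)sizeof(int); // may wrap
    return malloc(bytes);
}

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < sizeof(uint32_t)) return 0;
    uint32_t n;
    memcpy(&n, data, sizeof n);  // replay witness bytes as the count
    int *buf = make_buffer(n);
    if (buf && n > 0)
        buf[n - 1] = 1;          // out-of-bounds write once bytes has wrapped
    free(buf);
    return 0;
}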

Long-Term Applications

The following applications require further research, scaling, model changes, or standardization beyond what is available today.

  • Industry/LLM providers — Closed-loop “verify-then-generate” code assistants
    • Sector: software tooling, IDEs, cloud IDEs.
    • What: Build assistants that iteratively generate code, run SMT checks, and self-correct until no Z3 SAT findings remain (or the risk is explicitly waived), turning formal verification into a generation-time constraint.
    • Potential products: “Secure Mode” in Copilot-like tools; model-in-the-loop solvers; on-device solver integrations for IDEs.
    • Assumptions/dependencies: Efficient incremental SMT encoding; strong code repair strategies; acceptable latency for interactive use.
  • Industry/tools — Next-gen SAST that combines path analysis, taint, and SMT
    • Sector: security tool vendors.
    • What: Architect scanners that apply SMT-backed arithmetic reasoning over dataflow graphs to catch allocation arithmetic overflows and similar classes across languages.
    • Potential products: “SAST 2.0” platforms with solver kernels; hybrid static–symbolic engines.
    • Assumptions/dependencies: Solver performance at repository scale; noise control to keep findings actionable.
  • Industry/enterprise — Organization-wide AI code risk scoring and KPIs
    • Sector: large enterprises, regulated industries.
    • What: Define KPIs like “Z3-proven vulnerabilities per KLOC from AI contributions,” “generation–review asymmetry delta,” and use them in SDLC governance and vendor scorecards.
    • Potential products: Dashboards; policy thresholds that block releases exceeding risk budgets.
    • Assumptions/dependencies: Reliable detection of AI-origin contributions; culture and incentives aligned to security KPIs.
  • Academia/ML research — Training LLMs with formal counterexample feedback
    • Sector: AI/ML.
    • What: Incorporate Z3 witnesses into RLHF or direct preference optimization to penalize generation patterns that trigger formal exploits; explore architectures that internalize arithmetic safety constraints.
    • Potential outcomes: Models with lower baseline CWE-190/131 rates; reduced generation–review asymmetry.
    • Assumptions/dependencies: Stable training pipelines; large-scale solver-in-the-loop infrastructure; generalization beyond seen patterns.
  • Academia — Broader benchmarks and languages with verified labels
    • Sector: software security research.
    • What: Extend datasets to Go, Rust, JavaScript, Java; cover additional CWE classes (e.g., deserialization, concurrency, logic bugs); standardize a public “secure codegen” leaderboard with formal proofs.
    • Assumptions/dependencies: SMT encodings for each language/runtime; community maintenance; agreement on severity/labeling.
  • Policy/Standards — Certification and labeling for AI coding tools
    • Sector: standards bodies, regulators.
    • What: Create a certification that requires tools to meet minimum formal-verification-backed security performance on public benchmarks (e.g., “BBD grade C or better”); disclose evaluation results.
    • Potential outcomes: Market pressure toward secure-by-default assistants; transparency for adopters.
    • Assumptions/dependencies: Consensus on benchmarks/grades; governance to prevent gaming.
  • Industry/OSS — Safe allocation libraries and language-level guard rails
    • Sector: embedded/IoT, systems programming.
    • What: Standardize and widely adopt allocation APIs that make overflow-checked allocation the default (e.g., safe_malloc(n, sizeof(T)) with built-in checks); compilers emitting warnings or errors for unchecked patterns.
    • Assumptions/dependencies: Backward compatibility; performance overhead evaluation; ecosystem buy-in.
  • Cross-sector — Migration strategies to memory-safe languages driven by formal risk
    • Sector: critical infrastructure, automotive, aerospace, healthcare devices.
    • What: Prioritize rewrites or wrappers in Rust/other memory-safe languages for modules with repeated Z3 SAT findings; use solver results to drive risk-based refactoring roadmaps.
    • Assumptions/dependencies: Talent availability; interoperability boundaries; certification processes.
  • Industry/LLM providers — Model architecture changes to reduce generation–review asymmetry
    • Sector: AI/ML.
    • What: Explore dual-head models or mode-switching mechanisms that enforce review-time reasoning during generation (e.g., chain-of-thought with arithmetic guards verified by SMT).
    • Assumptions/dependencies: Safety guardrails for revealing internal reasoning; latency/cost budgets.
  • Education/Professional upskilling — Formal methods in developer tooling literacy
    • Sector: education, workforce development.
    • What: Standardize micro-credentials where developers learn to interpret SMT outputs, integrate witnesses into tests, and patch verified issues; integrate into secure coding certifications.
    • Assumptions/dependencies: Accessible curricula; industry recognition.
  • Policy/Governance — Sector-specific mandates for formal checks in high-assurance software
    • Sector: medical devices, payment systems, grid/SCADA, aviation.
    • What: Require formal verification for memory and arithmetic safety in software touching safety-critical or regulated data paths; audits to include solver logs and witness triage evidence.
    • Assumptions/dependencies: Regulatory rulemaking cycles; guidance on scope and proportionality.

Notes on feasibility, assumptions, and dependencies (cross-cutting)

  • Scope coverage: Current evidence and SMT encodings strongly target C/C++ memory and integer arithmetic issues; extending to other languages and vulnerability classes requires additional encodings and engineering.
  • Performance/scale: Solver invocations can be expensive; practical deployments should target high-risk diffs, run in parallel, and cache results.
  • Verification limits: Z3 proofs cover the encoded properties; “clean” does not prove overall program safety. Formal checks complement, not replace, human review and dynamic testing.
  • Data sensitivity: Running code and witnesses in CI must respect IP and privacy; on-prem or self-hosted solver services may be required in regulated sectors.
  • Developer experience: Usability (clear witnesses, minimal false positives) is essential for adoption; training and templates reduce friction.
  • Cultural change: Policies that reject “secure prompt” as a control and require evidence-based gates need executive sponsorship and security champions.

Glossary

  • Ablation (secure prompt ablation): An experiment that removes or modifies a component (here, adding explicit security instructions) to measure its effect on outcomes. "a secure prompt ablation"
  • AddressSanitizer (ASAN): A compiler instrumentation tool that detects memory errors such as buffer overflows and use-after-free at runtime. "GCC AddressSanitizer (ASAN)"
  • alloc-size-too-big: An AddressSanitizer/UBSan diagnostic indicating an attempted allocation with a size that overflows or is unrealistically large. "alloc-size-too-big"
  • Allocation arithmetic: Arithmetic used to compute memory allocation sizes (e.g., n * sizeof(T)), where overflows can lead to under-allocation and memory corruption. "integer overflow in allocation arithmetic"
  • AST (Abstract Syntax Tree): A structured, tree-like representation of source code used for program analysis beyond raw text. "AST-level patterns"
  • Bandit: A static analysis tool that finds security issues in Python code by rule-based scanning. "Bandit (medium+ severity)"
  • Bit-vector arithmetic: Arithmetic over fixed-width integers that model hardware-level wraparound, commonly used in SMT solvers. "which Z3's bit-vector arithmetic encodes directly"
  • BitVec(32): A Z3 data type representing a 32-bit fixed-width integer used in SMT encodings. "type BitVec(32)"
  • Clang Static Analyzer: A path-sensitive static analysis tool for C/C++/Objective-C that detects bugs without executing code. "Clang Static Analyzer"
  • COBALT analysis pipeline: The paper’s static analysis workflow that extracts vulnerability patterns and encodes them into SMT for Z3 to prove exploitability. "COBALT analysis pipeline"
  • CodeQL: A semantic code analysis engine that queries codebases as databases to find security vulnerabilities. "CodeQL v2.25.1 (security-extended)"
  • CVSS v3 base score: The standardized scoring system (Common Vulnerability Scoring System) for rating vulnerability severity. "CVSS v3 base score criteria"
  • CWE (Common Weakness Enumeration): A standardized catalog of software weakness types used for classifying vulnerabilities. "five CWE categories"
  • Dataflow analysis: A static analysis technique that tracks how values propagate through code to identify potential vulnerabilities. "dataflow and taint-tracking analysis"
  • FlawFinder: A C/C++ static analysis tool that flags potential security issues based on known risky functions and patterns. "FlawFinder 2.0"
  • Generation–review asymmetry: The observed gap where models can detect vulnerabilities during review but still generate vulnerable code by default. "generation–review asymmetry"
  • heap-buffer-overflow: A runtime error where a program writes beyond the bounds of heap-allocated memory. "AddressSanitizer: heap-buffer-overflow"
  • Modular arithmetic (32-bit): Arithmetic where values wrap around on overflow according to a fixed bit width, modeling machine integer semantics. "32-bit modular arithmetic, which Z3 provides"
  • OOB read: Short for out-of-bounds read; accessing memory beyond the valid range of a buffer. "OOB read"
  • Path-sensitive analysis: Static analysis that reasons about different execution paths to detect issues that occur only under certain conditions. "path-sensitive analysis"
  • PoC (proof-of-concept) harness: A minimal program that demonstrates a vulnerability can be triggered in practice. "proof-of-concept (PoC) harnesses"
  • RLHF (Reinforcement Learning from Human Feedback): A training approach where models are fine-tuned using feedback-driven rewards to shape behavior. "RLHF or instruction fine-tuning"
  • Satisfiability witness: A concrete assignment of inputs that satisfies an SMT formula, proving a vulnerability is exploitable. "Z3 satisfiability witnesses (Z3 SAT)"
  • Semgrep: A lightweight, pattern-based static analysis tool that uses rules to search for code smells and vulnerabilities. "Semgrep (all rulesets)"
  • SMT encoding: Translating program properties into logical formulas suitable for an SMT solver to reason about. "Z3 SMT Encoding"
  • SMT solver: A tool that decides logical formulas over background theories (e.g., bit-vectors, arithmetic) to prove or refute properties. "Z3 Satisfiability Modulo Theories (SMT) solver"
  • Taint propagation: Tracking how data from untrusted sources influences program state to identify potential exploit paths. "explicit taint propagation"
  • Unsigned 32-bit semantics: Integer behavior under 32-bit unsigned arithmetic, including wraparound on overflow. "under unsigned 32-bit semantics"
  • Z3: A high-performance SMT solver used to formally verify properties of code and produce counterexamples. "When Z3 returns SAT"
  • Z3 SAT: A result indicating Z3 found a satisfying assignment, often used here to denote a proven, exploitable vulnerability. "Z3 SAT"
  • Zip Slip: A path traversal vulnerability when extracting archive files that can overwrite arbitrary filesystem paths. "Zip Slip (path trav.)"

