LLMs versus the Halting Problem: Revisiting Program Termination Prediction
Abstract: Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in LLMs raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
Explain it Like I'm 14
What is this paper about?
This paper asks a big question in computer science: Can LLMs, like GPT-5 and Claude, tell whether a computer program will stop or run forever? This is called “program termination.” The authors test several LLMs on a public set of C programs and compare them to top traditional verification tools. They find that the best LLMs can predict termination surprisingly well—almost as well as the best classic tools—but they struggle to produce formal proofs, and they get worse as programs get longer.
What problem are they trying to solve?
The main goal is to see if LLMs can reliably predict whether a program:
- Always stops for any input (Terminating), or
- Can get stuck in an endless loop for some input (Non-terminating)
This connects to the famous Halting Problem, which says there’s no single perfect algorithm that can always decide this for every possible program and input. Even so, in practice, tools and techniques try to approximate answers. The question is whether LLMs can do this well enough to be useful.
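To make the two labels concrete, here is a minimal sketch built around a hypothetical loop, `while (x > 0) x = x - y;`, which is not taken from the benchmark. The loop halts for some inputs and runs forever for others (for example `x = 10, y = 0`), so the program as a whole counts as Non-terminating. The helper `run_loop` and its `max_steps` budget are illustrative assumptions. Running out of budget is only a hint, not a proof, which is why real tools need a witness.

```python
def run_loop(x: int, y: int, max_steps: int = 1000) -> str:
    """Simulate the hypothetical loop `while (x > 0) x = x - y;` under a step budget."""
    steps = 0
    while x > 0:
        if steps >= max_steps:
            # Exhausting the budget suggests, but does not prove, non-termination.
            return "budget exhausted"
        x -= y
        steps += 1
    return "halted"


print(run_loop(10, 3))   # halted
print(run_loop(10, 0))   # budget exhausted
print(run_loop(-5, 0))   # halted (the loop body never runs)
```

A single diverging input is enough to make the program Non-terminating, while a Terminating verdict has to hold for every input.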
How did they test this? (Methods explained simply)
The researchers used a large, public benchmark called SV-Comp (International Competition on Software Verification), focusing on the “Termination” category for C programs. Each task is a short program with a known answer: does it terminate or not?
Here’s how they evaluated the LLMs:
- The model reads the C program.
- It predicts one of three outcomes:
- Terminating (T)
- Non-terminating (NT)
- Unknown (UNK) if it isn’t sure
- If the model says “Non-terminating,” it must also produce a “witness”—a proof. In SV-Comp, this proof is a special graph (think of it like a map of program states with arrows showing how the program moves). If there’s a loop in the graph that can run forever, that’s evidence of non-termination. A checker (UAutomizer) verifies the witness automatically.
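The "map with a loop" idea behind a witness can be sketched as a reachability check on a small graph. The dictionary layout and node names below are hypothetical and are not the SV-Comp witness format. The check is also purely structural: a real validator such as UAutomizer must additionally confirm that the cycle can actually be executed forever.

```python
from typing import Dict, List


def has_reachable_cycle(edges: Dict[str, List[str]], entry: str) -> bool:
    """Return True if some cycle is reachable from `entry` (iterative DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color: Dict[str, int] = {entry: GRAY}
    stack = [(entry, iter(edges.get(entry, [])))]
    while stack:
        node, successors = stack[-1]
        for nxt in successors:
            state = color.get(nxt, WHITE)
            if state == GRAY:             # back edge: we returned to the current path
                return True
            if state == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(edges.get(nxt, []))))
                break
        else:
            # All successors explored without finding a cycle through this node.
            color[node] = BLACK
            stack.pop()
    return False


# Hypothetical witness: the program can circle between loop_head and loop_body.
witness = {
    "entry": ["loop_head"],
    "loop_head": ["loop_body", "exit"],
    "loop_body": ["loop_head"],
    "exit": [],
}
print(has_reachable_cycle(witness, "entry"))  # True
```

A graph without such a cycle cannot describe an infinite run, so it could never be accepted as a non-termination witness.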
To make the predictions more reliable, they used Test-Time Scaling with Consensus Voting:
- Generate multiple answers for the same program (like asking the model 20 times).
- Randomly pick 10 of those answers.
- If all 10 agree on T or NT, accept the decision.
- If they disagree, output “Unknown.” This reduces risky wrong answers.
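The voting steps above can be sketched in a few lines. The function name `consensus_vote` and the label strings are illustrative assumptions, and the list of answers stands in for real model samples.

```python
import random
from typing import List, Optional


def consensus_vote(answers: List[str], k: int = 10,
                   rng: Optional[random.Random] = None) -> str:
    """Sample k answers without replacement and accept only a unanimous T or NT."""
    rng = rng or random.Random()
    sample = rng.sample(answers, k)       # raises ValueError if k > len(answers)
    first = sample[0]
    if first in ("T", "NT") and all(a == first for a in sample):
        return first
    return "UNK"                          # any disagreement means abstain


print(consensus_vote(["T"] * 20))               # T
print(consensus_vote(["T"] * 5 + ["NT"] * 5))   # UNK
```

Requiring unanimity trades coverage for safety: a single dissenting sample is enough to turn a confident answer into an abstention.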
They also tried a simpler kind of proof for non-termination: a logical condition describing inputs that cause infinite loops (for example, “x < 0 and y = 0”). They used a math tool (Z3) to check if the model’s condition was correct.
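As a rough illustration of checking such a condition, the sketch below compares a predicted condition against a reference one. It deliberately swaps Z3 for brute-force enumeration over a small integer range, so it can find disagreements but cannot prove equivalence the way an SMT solver can. The function `find_disagreement`, the bound, and the candidate conditions are all hypothetical.

```python
from itertools import product
from typing import Callable, Optional, Tuple

Condition = Callable[[int, int], bool]


def find_disagreement(predicted: Condition, truth: Condition,
                      bound: int = 8) -> Optional[Tuple[int, int]]:
    """Return the first (x, y) in [-bound, bound]^2 where the conditions differ."""
    for x, y in product(range(-bound, bound + 1), repeat=2):
        if predicted(x, y) != truth(x, y):
            return (x, y)
    return None                            # no counterexample within the bound


truth = lambda x, y: x < 0 and y == 0           # reference: "x < 0 and y = 0"
same = lambda x, y: y == 0 and not x >= 0       # same condition, written differently
too_weak = lambda x, y: x < 0                   # forgets the constraint on y

print(find_disagreement(same, truth))      # None
print(find_disagreement(too_weak, truth))  # (-8, -8)
```

An SMT solver would answer the same question for all integers at once by asking whether "predicted differs from truth" is satisfiable.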
What did they find, and why does it matter?
The key results:
- Top LLMs did very well at predicting termination:
- GPT-5 and Claude Sonnet-4.5 ranked just behind the best traditional tool (PROTON) when using consensus voting. CWM ranked just behind UAutomizer.
- A non-reasoning baseline (GPT-4o) did much worse.
- LLMs often struggled to produce valid formal witnesses (the graph proofs):
- Even when they correctly said “Non-terminating,” the witness graph was frequently invalid or rejected by the checker.
- Formatting mistakes were rare for some models, but most errors were about content (the graph didn’t really prove an infinite path).
- Longer programs were harder:
- Scores dropped as program length increased.
- Wrong answers and “Unknown” decisions were more common for longer code.
- Consensus Voting helped:
- It led to fewer risky mistakes by outputting “Unknown” when answers disagreed.
- This improved the overall competition score, even though it lowered F1 (because Unknowns count as mistakes in F1).
- The simpler “logical condition” witness was promising:
- For shorter examples, models were more accurate at writing a condition like “x < 0 and y = 0” than at producing a full graph.
- These conditions are easier to understand and require fewer tokens.
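The trade-off between competition score and F1 noted above can be reproduced on made-up data. The point values (+2 for a correct Terminating answer, +1 for a correct Non-terminating answer, -32 and -16 for the corresponding wrong answers, 0 for Unknown) follow the commonly published SV-Comp scoring schema and are not stated in this summary, so treat them as assumptions. The same goes for the F1 definition used here (F1 of the Terminating class, with Unknown counted as a miss) and for the ten invented outcomes.

```python
from typing import List, Tuple

Outcome = Tuple[str, str]   # (ground truth, prediction)

# Assumed SV-Comp-style points; anything not listed (e.g. "UNK") scores 0.
POINTS = {("T", "T"): 2, ("NT", "NT"): 1, ("NT", "T"): -32, ("T", "NT"): -16}


def competition_score(outcomes: List[Outcome]) -> int:
    return sum(POINTS.get(o, 0) for o in outcomes)


def f1_terminating(outcomes: List[Outcome]) -> float:
    """F1 for class T; an UNK on a terminating program counts as a false negative."""
    tp = sum(1 for t, p in outcomes if t == "T" and p == "T")
    fp = sum(1 for t, p in outcomes if t == "NT" and p == "T")
    fn = sum(1 for t, p in outcomes if t == "T" and p != "T")
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Invented data: one confident wrong "T" without voting, two abstentions with voting.
single = [("T", "T")] * 6 + [("NT", "NT")] * 3 + [("NT", "T")]
voted = [("T", "T")] * 5 + [("T", "UNK")] + [("NT", "NT")] * 3 + [("NT", "UNK")]

print(competition_score(single), round(f1_terminating(single), 3))  # -17 0.923
print(competition_score(voted), round(f1_terminating(voted), 3))    # 13 0.909
```

One avoided wrong answer outweighs many correct ones under this scoring, which is why abstaining can raise the score while slightly lowering F1.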
Why this matters:
- Program termination touches safety and reliability. Non-terminating programs can freeze systems, waste resources, and cause failures.
- Even though the Halting Problem is undecidable in general, this study shows LLMs can be practically useful on real verification benchmarks.
- LLMs could become valuable teammates to traditional tools, especially if we combine them with formal checkers (neuro-symbolic systems).
What does this mean for the future?
This research suggests:
- LLMs have strong potential to help with hard reasoning tasks in software, especially undecidable properties (like “will this ever crash?” or “can this state be reached?”).
- To be truly useful, LLMs should be combined with symbolic checkers (tools that verify proofs rigorously), so the model’s ideas are always checked before use.
- Real-world testing on large codebases is the next step. Benchmarks are great, but everyday software is bigger and messier.
- Simpler witness formats (like logical conditions) could make LLM outputs more interpretable and easier to validate.
- As programs get longer, models need better scaling strategies to maintain accuracy.
Overall, while LLMs aren’t perfect at formal proofs yet, they’re surprisingly good at predicting program behavior—and with the right safety checks, they could make software verification faster and more accessible.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps and open questions that remain unresolved and could guide future research:
- Cross-language generalization: results are limited to C; it is unknown how well the findings transfer to other languages (e.g., Rust, Java, Python), especially with different memory models, type systems, and idioms.
- Multi-file and repository-scale settings: evaluation is confined to single-file SV-Comp tasks; effectiveness on real repositories (multi-module builds, headers, libraries, build systems) remains untested.
- Environment and I/O modeling: SV-Comp encodes nondeterminism via explicit variables; it is unclear how to handle real-world sources of nondeterminism (I/O, system calls, concurrency, interrupts).
- Concurrency and parallelism: termination under threads, atomics, and synchronization primitives is not evaluated.
- Robustness to code perturbations: sensitivity to refactorings, identifier renaming, comment changes, dead-code insertion, and obfuscation is unmeasured.
- Long-context scalability: performance degrades with code length; no techniques are proposed to handle very long programs (chunking, retrieval, hierarchical reasoning).
- Data leakage risk: no audit for potential training contamination (e.g., SV-Comp code in model pretraining corpora), leaving generalization claims uncertain.
- Witness validator dependence: only UAutomizer is used; cross-validator agreement (e.g., Ultimate vs. CPAchecker witnesses) and validator-specific biases are unknown.
- Official competition parity: comparisons to SV-Comp tools are not under official competition constraints (timeouts, validator sets, hardware); fairness and replicability of ranking remain open.
- Runtime, cost, and compute budgets: inference latency, token usage, and TTS overhead are not reported, obscuring the practicality of deployment at scale.
- Calibration and uncertainty: unanimity-based TTS is a coarse mechanism; quantitative calibration (e.g., reliability diagrams, proper scoring rules) and alternative aggregation thresholds remain unexplored.
- Safety thresholds for deployment: how to set acceptance/abstention policies (e.g., Unknown) to meet operational risk tolerances is not addressed.
- Class-imbalance and incentive effects: because NT predictions require a valid witness while T does not, possible bias toward T (to avoid penalties) is not analyzed or mitigated.
- Error taxonomy by program features: failures are not decomposed by control-flow patterns (nested loops, recursion), data structures (arrays, pointers), arithmetic (bit-precise, non-linear), or heap usage; per-subcategory breakdowns are missing from the main analysis.
- Structured witness generation: high witness-invalid rates suggest a need for constrained decoding (function-calling, schema-enforced JSON, graph grammars); such methods are not tested.
- Neuro-symbolic iterative refinement: there is no exploration of interactive loops where the LLM proposes a witness, a validator returns counter-constraints, and the LLM repairs the witness.
- Proof-carrying outputs: for T predictions, there is no requirement or mechanism to produce constructive termination arguments (e.g., ranking functions, variants); investigating LLM synthesis of such proofs is open.
- Alternative witness formats at scale: the proposed “domain witness” (logical precondition) is only tested on 40 simple NT cases; scalability to complex programs (arrays, heap, quantifiers, nonlinear arithmetic) is unassessed.
- Integration of domain witnesses with checkers: there is no pipeline that consumes a logical NT precondition and checks it with a verifier (e.g., via under-approximate symbolic execution) on large benchmarks.
- Generalization beyond termination: applicability to other undecidable semantic properties (memory safety, unreachability, equivalence) is hypothesized but not evaluated.
- Prompt and decoding ablations: sensitivity to prompt variants, temperatures, nucleus sampling, and reasoning styles (CoT vs. non-CoT) is not systematically studied.
- Fine-tuning and task-adaptive training: no experiments on finetuning open-weight models for termination tasks (including witness synthesis), leaving potential gains unexplored.
- Per-case qualitative diagnostics: limited qualitative analysis of failure modes (e.g., incorrect loop invariants, off-by-one constraints, misinterpreted bit-operations) restricts actionable insights.
- Formatting vs. content errors: while CWM shows frequent formatting issues, controlled studies on schema enforcement and repair (self-checking, structured tool use) are missing.
- Metric alignment with practice: SV-Comp scoring is used, but alignment with production risk metrics (e.g., minimizing false “terminates” at fixed abstention rate) is not demonstrated.
- Fair resource comparison: parity in resource limits (timeouts, memory) between LLM pipelines and classic tools is not established; comparative efficiency remains unclear.
- OOD generalization: robustness to distribution shifts (new libraries, coding styles, compiler intrinsics) and across SV-Comp years or other benchmark suites is untested.
- Effect of code augmentation: line-number insertion aids witness line attribution; the impact of such augmentations on performance and failure modes is not ablated.
- Handling of arrays, pointers, and aliasing: specific difficulties in pointer-heavy or aliasing-intensive code (common in C) are not characterized.
- Complex arithmetic reasoning: performance on non-linear, bit-level, or modular arithmetic reasoning (beyond the BitVectors subcategory headline) lacks targeted analysis.
- Human-in-the-loop efficiency: the utility of expert post-editing (time to fix a near-miss witness; UI/tooling needs) is not measured.
- Theoretical framing: no account of why LLMs may excel on practical instances of undecidable problems (e.g., distributional assumptions, learned heuristics) or how to bound risk.
- Open-source reproducibility: strongest results rely on proprietary GPT-5; full reproducibility and open baselines that match those scores are lacking.
- Multi-objective optimization of TTS: the trade-off between abstention rate, SV-Comp score, and F1 is not explored as a tunable objective (e.g., expected utility under deployment constraints).
- Security and adversarial robustness: vulnerability to adversarial prompts or code snippets crafted to elicit wrong termination predictions is not assessed.
- Benchmark construction: SV-Comp may not contain adversarially hard instances for LLMs; creating stress tests (synthetic diagonalization-like cases, obfuscated semantics) remains an open need.
- Release of the annotated NT-precondition set: availability, licensing, and plans to expand beyond 40 examples are unspecified, limiting community follow-up.
- Comparative proofs for T vs. NT: asymmetry remains (proof required for NT but not for T); designing symmetric evidence expectations and studying their impact is an open question.
Practical Applications
Practical Applications of “LLMs versus the Halting Problem: Revisiting Program Termination Prediction”
Below are actionable, real-world applications derived from the paper’s findings and methods. They are grouped into near-term deployments that can be adopted today and longer-term opportunities that need further research, scaling, or standardization.
Immediate Applications
- LLM-assisted termination risk triage in CI/CD
- What: Use LLMs (e.g., GPT-5, Claude Sonnet 4.5, or on-prem CWM) with test-time scaling (consensus voting) to label code changes as “terminating,” “non-terminating,” or “unknown,” routing high-risk or uncertain cases to formal tools (e.g., UAutomizer/CBMC or PROTON) and human reviewers.
- Sectors: Software/DevOps, Safety-critical (automotive, aerospace, medical devices), Finance (trading systems), Telecom (network daemons), Cloud/Infra.
- Tools/Workflows: GitHub/GitLab app; pre-merge quality gate; integration with UAutomizer as a validator for NT witnesses; automatic “unknown” escalation to formal checks.
- Assumptions/Dependencies: Availability of large-context LLMs; willingness to adopt a conservative “unknown” policy; code privacy constraints (use open-weight CWM for on-prem); primarily validated on C; witness generation still brittle—pair with symbolic validators.
- IDE “termination linting” and code review bots
- What: Inline warnings for potential infinite loops; quick explanation via the “divergence precondition” format (logical conditions for non-termination) for short routines and diffs. Suggests refactorings or guards.
- Sectors: Software development, Education (teaching program reasoning), Open-source maintenance.
- Tools/Workflows: VS Code/JetBrains extension; PR commenting bot; optional auto-insert of assertions/guards.
- Assumptions/Dependencies: Stronger for small/medium functions; provide “unknown” rather than overconfident answers on longer code; ensure prompt templates and line-numbering for witness alignment.
- Production monitoring and incident triage aid
- What: When a service hangs, analyze the implicated code to rapidly assess likely non-termination regimes (via LLM precondition-style explanations) and prioritize remediation.
- Sectors: Cloud/SRE, Enterprise IT, Telecom operations.
- Tools/Workflows: Incident runbooks that call the LLM analyzer; automated log-to-code localization; “unknown” falls back to engineers or symbolic tools.
- Assumptions/Dependencies: Requires mapping stack traces to relevant code slices; longer modules may need manual scoping for reliable analysis.
- Security pre-screening for DoS-style infinite loop risks
- What: Lightweight LLM screening to flag potential denial-of-service vectors arising from non-terminating paths before release.
- Sectors: Security, Web services, Embedded/IoT gateways.
- Tools/Workflows: Security CI stage; escalation to formal validators for flagged cases; artifacts stored as compliance evidence.
- Assumptions/Dependencies: False negatives/positives are possible—pair with symbolic checks or fuzzing; cost control for repeated TTS queries.
- Test generation and fuzzing prioritization
- What: Use LLM termination predictions to steer fuzzers toward loop-heavy regions; for NT-suspect paths, generate harnesses or targeted inputs informed by preconditions.
- Sectors: Software testing, Safety-critical verification.
- Tools/Workflows: Fuzzing orchestrators (e.g., integration with VeriFuzz-like flows); prioritize seeds and time budgets per risk score.
- Assumptions/Dependencies: Precondition witnesses are more mature for short functions; effectiveness depends on fuzzer integration and symbolic solvers.
- Curriculum and training for program reasoning
- What: Classroom exercises and automated tutors that explain termination/non-termination using interpretable preconditions; compare against SMT-checked solutions (e.g., Z3).
- Sectors: Education, Professional upskilling.
- Tools/Workflows: LMS plug-ins; code challenge platforms; stepwise hints and validator-backed feedback.
- Assumptions/Dependencies: Scope to small/medium programs to ensure fidelity; maintain a library of validated examples.
- On-prem LLM termination screening for IP-sensitive code
- What: Deploy open-weight models like CWM to run screening within secured environments, enabling privacy-preserving triage.
- Sectors: Defense, Healthcare, Finance, Proprietary software.
- Tools/Workflows: Containerized CWM services; internal prompt libraries; UAutomizer/CBMC integration.
- Assumptions/Dependencies: Hardware availability; model quality lower than top proprietary LLMs; witness formatting requires care.
- Benchmarking and research baselines
- What: Use the paper’s pipeline (SV-Comp prompts, TTS, witness validation, precondition witnesses validated by Z3) to create reproducible baselines for academic and industrial labs.
- Sectors: Academia, Industrial research, Tool vendors.
- Tools/Workflows: Public SV-Comp Termination tasks; standardized prompts; open-source scripts for GraphML conversion and validator calls.
- Assumptions/Dependencies: Current focus on C; replication requires consistent tokenization, line numbering, and validator setup.
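As one possible shape for the CI/CD triage item in the list above, the sketch below maps a model verdict to a pipeline action. The action names, the `Triage` type, and the policy itself are hypothetical. In particular, letting a "T" verdict pass without a formal check is an assumption that a safety-critical team may not accept.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    action: str   # "pass", "block", or "formal_check" (hypothetical action names)
    reason: str


def route(verdict: str, witness_valid: bool = False) -> Triage:
    """Map a consensus verdict (T, NT, or UNK) to a conservative pipeline action."""
    if verdict == "T":
        return Triage("pass", "unanimous terminating verdict")
    if verdict == "NT" and witness_valid:
        return Triage("block", "non-termination witness accepted by the validator")
    if verdict == "NT":
        return Triage("formal_check", "non-termination claimed but witness not validated")
    # UNK or any unexpected label is escalated rather than trusted.
    return Triage("formal_check", "model abstained or returned an unexpected label")


print(route("NT", witness_valid=True).action)  # block
print(route("UNK").action)                     # formal_check
```

The fall-through default matters most: anything the policy does not recognize is escalated instead of approved.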
Long-Term Applications
- Neuro-symbolic termination verification at scale
- What: A production-grade verifier that combines LLMs for candidate reasoning (NT witnesses, ranking functions, preconditions) with SMT-based validators and automata refinement, targeting large, multi-language codebases.
- Sectors: Aerospace/Automotive/Medical (certifiable software), OS/Drivers, Large-scale enterprise systems.
- Tools/Workflows: LLM+UAutomizer/CBMC co-pilots; ranking function synthesis; counterexample-guided refinement loops.
- Assumptions/Dependencies: Needs reliable witness generation and better handling of long/complex code; regulatory acceptance and auditing pipelines.
- Continuous termination governance for safety-critical and regulated domains
- What: Policy frameworks and standards that require uncertainty-aware termination checks (LLM+validator) as part of certification or change control.
- Sectors: Healthcare (FDA/CE), Automotive (ISO 26262), Aerospace (DO-178C), Finance (operational resilience).
- Tools/Workflows: Evidence artifacts (validated GraphML, SMT logs, TTS consensus reports); procurement and audit processes recognizing hybrid verification.
- Assumptions/Dependencies: Standardization of witness formats and validators; third-party certification of tools; reproducibility and traceability.
- Repository-scale non-termination proving and code health dashboards
- What: Continuous, org-wide scanning to surface hotspots (modules with recurring NT risks), trends, and actionable ownership queues.
- Sectors: Large tech, Embedded/IoT fleets, Telecom vendors.
- Tools/Workflows: Incremental analysis; code slicing to mitigate context length constraints; dashboards integrated with issue trackers.
- Assumptions/Dependencies: Efficient scaling and cost control; robust partitioning for long code; model self-calibration across evolving repos.
- Cross-language termination analysis and language-agnostic workflows
- What: Extend beyond C to C++, Rust, Java, Python, and domain-specific languages; provide uniform termination gating in polyglot monorepos.
- Sectors: Enterprise platforms, Compiler tooling, Multi-language services.
- Tools/Workflows: Language adapters for AST/control-flow extraction; standardized intermediate representations; multi-language validators.
- Assumptions/Dependencies: Training/evaluation on non-C datasets; ensuring sound mappings to validators; language-specific idiosyncrasies.
- Runtime safeguards and synthesis of defensive code
- What: Auto-synthesis of guards (timeouts, watchdogs, loop invariants, ranking functions) proposed by LLMs and validated symbolically before merge.
- Sectors: Robotics/Autonomy, Real-time systems, Edge/IoT.
- Tools/Workflows: Refactoring suggestions in PRs; code transformations with automated proof obligations; runtime watchdog templates matched to preconditions.
- Assumptions/Dependencies: Proven soundness of synthesized artifacts; performance overhead of guards in real-time contexts.
- Advanced security analysis for algorithmic liveness and deadlock detection
- What: Generalize beyond loops to liveness properties and deadlocks; LLM generates candidate non-progress scenarios, verified by model checking.
- Sectors: Distributed systems, Blockchain/Smart contracts, OS kernels.
- Tools/Workflows: LLM-guided abstraction; integration with concurrency model checkers; compositional analyses.
- Assumptions/Dependencies: Significant research to extend beyond termination; scalable state-space validation.
- Broader undecidable property checks powered by LLMs
- What: Apply the same uncertainty-aware, validator-backed pattern to memory safety, unreachability, and equivalence checking (as suggested by Rice’s theorem context).
- Sectors: Compilers, Static analysis vendors, High-assurance software.
- Tools/Workflows: Property-specific prompts; witness types per property; SMT/abstract interpretation co-design.
- Assumptions/Dependencies: Property-specific validators and proofs-of-concept; adapting scoring/penalty schemes for deployment safety.
- Standardization of interpretable “domain witness” formats
- What: Establish a human-readable, SMT-checkable logical witness standard (e.g., divergence preconditions) that complements automaton witnesses for developer comprehension and audits.
- Sectors: Standards bodies, Tooling ecosystems, Education.
- Tools/Workflows: Open schemas; canonical SMT encodings; equivalence checking pipelines (e.g., Z3).
- Assumptions/Dependencies: Community agreement and benchmark evolution; demonstrable interoperability with validators.
- Agentic coding systems with termination-aware guardrails
- What: Autonomous code agents that include termination checks as a default safety gate before proposing or merging changes, escalating “unknowns.”
- Sectors: AI-assisted software engineering, Low-code platforms.
- Tools/Workflows: Agent pipelines with TTS consensus; gated deployment; continuous learning from validator feedback.
- Assumptions/Dependencies: Reliability at scale; avoidance of overconfidence; robust human-in-the-loop patterns.
Key Cross-cutting Assumptions and Dependencies
- Model capabilities and access: Best results reported with GPT-5/Claude; open-weight CWM is promising for on-prem but lower-performing. Costs rise with test-time scaling.
- Language and scope: Results validated on C; generalization to other languages and large, multi-module codebases is non-trivial.
- Proof obligations: LLMs currently struggle to produce valid automaton witnesses consistently; practical systems should pair LLM predictions with symbolic validators (e.g., UAutomizer/CBMC/SMT solvers).
- Reliability and calibration: Performance degrades with code length/complexity; adopt strict “unknown” policies to prevent unsafe approvals.
- Privacy/compliance: Cloud LLMs may be unsuitable for sensitive code; open/local deployments mitigate leakage risks.
- Operationalization: Requires line-number-stable inputs and standardized intermediate formats (e.g., GraphML) for witness validation; prompt sensitivity necessitates hardened prompt templates and regression testing.
Glossary
- AProVE: A termination analysis tool evaluated in SV-Comp that attempts to prove or refute program termination automatically. "we report results for the top three SV-Comp 2025 verification tools: PROTON~\citep{metta2024proton}, UAutomizer~\citep{heizmann2023ultimate}, and AProVE~\citep{emrich2023aprove}."
- Automata over program statements: A representation of program behavior as automata whose states/transitions correspond to program constructs, used for verification. "State-of-the-art tools, such as UAutomizer~\cite{heizmann2023ultimate} and PROTON~\cite{metta2024proton}, use automata over program statements or leverage symbolic execution via CBMC~\cite{kroening2014cbmc}."
- Bi-abduction: A separation-logic technique that infers both missing preconditions and frame conditions, aiding scalable program analysis. "leverages bi-abduction \cite{Calcagno11} and under-approximation \cite{OHearn20} to achieve soundness at scale."
- Bootstrap sampling: A resampling method (with replacement) used to estimate variability and robustness in stochastic evaluations. "we employ bootstrap sampling."
- CBMC: A bounded model checker for C that supports symbolic execution and SAT/SMT-based verification. "PROTON uses CBMC \cite{kroening2014cbmc} to soundly detect non-termination via recurrent loop states and generate witnesses, falling back on high-confidence termination checks from VeriFuzz \cite{metta2023verifuzz}."
- Code World Model (CWM): An open-weights code-focused LLM trained to model program execution (“world models”) for reasoning tasks. "and Code World Model (CWM) would place just behind the second-ranked tool."
- Consensus Voting: An inference-time aggregation scheme where multiple sampled predictions must agree unanimously to yield a decision. "we apply Test-Time Scaling (TTS) with Consensus Voting."
- Divergent Hoare triple: A Hoare-logic formulation capturing non-termination (divergence) properties of programs. "and can be viewed as the weakest precondition of the divergent Hoare triple \cite{RaadVO24}."
- GraphML: An XML-based format for representing graphs, used to exchange and validate witness graphs. "The predicted JSON is converted to GraphML and validated by a witness validator (e.g., UAutomizer)."
- Non-deterministic variables: Variables whose values are chosen arbitrarily (not fixed by the program), modeling unknown inputs. "The program may include multiple non-deterministic variables, which can be assigned random values during execution."
- PROTON: A state-of-the-art verification tool that detects non-termination (e.g., via CBMC) and generates witnesses. "PROTON uses CBMC \cite{kroening2014cbmc} to soundly detect non-termination via recurrent loop states and generate witnesses, falling back on high-confidence termination checks from VeriFuzz \cite{metta2023verifuzz}."
- Ranking functions: Functions that map program states to a well-founded order and strictly decrease on each loop iteration to prove termination. "synthesize termination arguments, such as ranking functions \cite{MSFTProgVer}"
- Recurrent loop states: Loop states that reappear during execution and can witness potential infinite looping (non-termination). "to soundly detect non-termination via recurrent loop states"
- Recurrent sets: Sets of reachable states in which every state has at least one successor that is also in the set, so an execution can stay inside the set forever; they serve as witnesses for non-termination. "identifies recurrent sets \cite{Gupta08} as non-termination witnesses"
- Rice’s Theorem: A fundamental result stating that any non-trivial semantic property of programs is undecidable. "By Rice’s Theorem~\cite{rice1953classes}, every non-trivial semantic property of programs is undecidable"
- SMT-based refinement: An iterative verification technique that uses SMT solving to eliminate infeasible paths and refine program abstractions. "UAutomizer instead encodes programs as automata and applies SMT-based refinement to eliminate infeasible paths."
- SMT solver: A Satisfiability Modulo Theories engine for checking satisfiability of logical constraints over theories like arithmetic and arrays. "using an SMT solver to validate the feasibility of an infinite execution."
- Symbolic execution: Program analysis that executes code with symbolic inputs to explore many paths at once via constraints. "or leverage symbolic execution via CBMC~\cite{kroening2014cbmc}."
- SV-Comp: The International Competition on Software Verification, a benchmark suite and scoring framework for verification tools. "the Termination category of the International Competition on Software Verification (SV-Comp) 2025."
- Test-Time Scaling (TTS): A technique that samples multiple model outputs at inference and aggregates them (e.g., via consensus) to improve reliability. "we apply Test-Time Scaling (TTS) with Consensus Voting."
- UAutomizer: A verification and witness-validation tool that analyzes automata-encoded programs using SMT-based techniques. "we use the UAutomizer tool~\cite{heizmann2023ultimate}, which symbolically analyzes the candidate infinite path by accumulating constraints and using an SMT solver to validate the feasibility of an infinite execution."
- Under-approximation: An analysis approach that considers a subset of behaviors, ensuring soundness for detected bugs or non-termination. "leverages bi-abduction \cite{Calcagno11} and under-approximation \cite{OHearn20} to achieve soundness at scale."
- VeriFuzz: A system providing high-confidence termination checks that PROTON can fall back on. "falling back on high-confidence termination checks from VeriFuzz \cite{metta2023verifuzz}"
- Weakest precondition: The least restrictive input condition that guarantees a specified postcondition; used here for characterizing divergence conditions. "This is related to the fundamental concept of weakest precondition from program verification \cite{Dijkstra1976}."
- Witness automaton: A graph encoding an infinite execution path that serves as a proof of non-termination. "must additionally output a witness automaton as a proof in JSON format"
- Witness validator: A tool that checks whether a provided witness (e.g., automaton) is valid evidence of non-termination. "validated by a witness validator (e.g., UAutomizer)."
- Z3 Theorem Prover: A widely used SMT solver for checking logical formulas and proving equivalences. "we use the Z$3$ Theorem Prover~\cite{de2008z3} to automatically verify equivalence between predicted and ground-truth expressions."
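To illustrate the "Ranking functions" entry in the glossary above, here is a minimal sketch that tests a candidate ranking function for the hypothetical loop `while (x < n) x = x + 1;`. It checks, over a small range of states only, that the rank is non-negative whenever the loop guard holds and strictly decreases after one iteration. This bounded enumeration is an assumption-laden stand-in for the SMT-based proof a real tool would produce, and all names are illustrative.

```python
from itertools import product
from typing import Callable, Optional, Tuple

State = Tuple[int, int]   # (x, n)


def check_ranking(guard: Callable[[int, int], bool],
                  step: Callable[[int, int], State],
                  rank: Callable[[int, int], int],
                  bound: int = 10) -> Optional[State]:
    """Return a state that violates the ranking conditions, or None if none is found."""
    for x, n in product(range(-bound, bound + 1), repeat=2):
        if guard(x, n):
            nx, nn = step(x, n)
            # Rank must be bounded below and strictly decrease on every iteration.
            if rank(x, n) < 0 or rank(nx, nn) >= rank(x, n):
                return (x, n)
    return None


guard = lambda x, n: x < n            # loop condition
step = lambda x, n: (x + 1, n)        # loop body
good_rank = lambda x, n: n - x        # remaining distance to n
bad_rank = lambda x, n: n + x         # grows as x grows, so it cannot work

print(check_ranking(guard, step, good_rank))              # None
print(check_ranking(guard, step, bad_rank) is not None)   # True
```

A candidate that survives this check is still only a candidate until a solver confirms the two conditions for all states.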