Execution Likelihood in Machine Learning
- Execution Likelihood is the conditional probability that a machine learning model’s output will be executed, incorporating execution cost and user operational barriers.
- It uses a discrete cost scale (1–5) to estimate how factors like domain expertise and legal constraints reduce the probability of execution.
- In sequential tasks, even small gains in per-step accuracy lead to exponential improvements in overall task success, highlighting the need to mitigate error propagation.
Execution Likelihood is a formal construct describing the conditional probability that a given output or instruction by a machine learning system—typically an LLM—will be operationalized or carried through to completion by an external actor, such as a user. Distinct from severity-focused risk metrics, execution likelihood explicitly quantifies the real-world plausibility that a harmful (or beneficial) outcome materializes, given the model's response. This conditional perspective is crucial for robust safety evaluation, for understanding long-horizon task performance, and for modeling how single-step errors accumulate into emergent system-level behaviors.
1. Formal Definitions and Metrics
Execution Likelihood is defined as the probability that a given model output is executed by a user:

$$L = P(\text{executed} \mid \text{model output})$$
This probability is typically modeled as a monotonically decreasing function of a discrete variable, Execution Cost $C \in \{1, \dots, 5\}$, where higher $C$ reflects greater barriers to operationalization. No closed form for $L(C)$ is universally assumed, but the qualitative assumption is that $L(C)$ decreases as $C$ increases, acknowledging that higher expertise, rare equipment, prohibitive legality, or complex infrastructure lower the probability of execution (Chen et al., 2 Feb 2026).
Execution Likelihood is foundational to the "Expected Harm" (EH) metric, where realized risk is decomposed as:

$$\text{EH} = \text{Severity} \times L(C)$$
Here, "Severity" is a discrete harm score (1–5) reflecting downstream consequences if execution occurs (Chen et al., 2 Feb 2026). This product formalism captures the intuition that even highly severe outputs pose limited real-world risk if practically impossible to execute, whereas low-severity but highly executable outputs may still warrant scrutiny and mitigation.
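This product intuition can be sketched in a few lines. The cost-to-likelihood table below is a hypothetical monotone mapping chosen purely for illustration (the source imposes no closed form for $L(C)$):

```python
# Sketch of the Expected Harm decomposition: EH = Severity * L(Cost).
# COST_TO_LIKELIHOOD is a hypothetical monotone mapping, not a calibrated
# estimate from the paper; only its decreasing shape matters here.
COST_TO_LIKELIHOOD = {1: 0.9, 2: 0.7, 3: 0.4, 4: 0.2, 5: 0.05}

def expected_harm(severity: int, cost: int) -> float:
    """Severity and cost are discrete 1-5 labels."""
    if not (1 <= severity <= 5 and 1 <= cost <= 5):
        raise ValueError("severity and cost must be in 1..5")
    return severity * COST_TO_LIKELIHOOD[cost]

# A maximally severe but hard-to-execute output can carry less expected
# harm than a mild but trivially executable one.
assert expected_harm(5, 5) < expected_harm(2, 1)
```

Under this toy mapping, a severity-5 / cost-5 output scores 0.25 while a severity-2 / cost-1 output scores 1.8, illustrating why low-severity but highly executable outputs can still dominate realized risk.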
In multi-step task execution, as studied in long-horizon LLM evaluation, execution likelihood is often expressed in terms of stepwise accuracy $p$ and the probability of completing a chain of $n$ steps without error:

$$P(\text{success over } n \text{ steps}) = p^n$$
The horizon length $H_s$, the largest number of steps for which the task succeeds with probability at least $s$, is therefore governed by the exponential relationship between per-step accuracy and overall execution likelihood (Sinha et al., 11 Sep 2025).
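The chain relationship and the induced horizon length can be checked numerically; this is a minimal sketch (function names are ours) of the exponential dependence on per-step accuracy:

```python
import math

def chain_success(p: float, n: int) -> float:
    """Probability of completing n steps without error, given per-step accuracy p."""
    return p ** n

def horizon(p: float, s: float = 0.5) -> float:
    """Horizon length H_s = ln(s) / ln(p): the number of steps for which
    chain success probability stays at least s."""
    return math.log(s) / math.log(p)

# Small per-step gains buy disproportionately longer horizons:
# p = 0.99 sustains ~69 steps at 50% success; p = 0.999 sustains ~693.
```

For example, moving per-step accuracy from 0.99 to 0.999 (a 0.9-point gain) lengthens the 50%-success horizon roughly tenfold.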
2. Estimation of Execution Cost and Empirical Proxying
Execution Cost ($C$) operationalizes the friction or barriers to a user successfully carrying out a model's response. In safety-oriented evaluations, $C$ is commonly scored on a five-point scale:
- 1 = Very easy (minimal barriers)
- 5 = Very difficult (extreme barriers)
Key factors include required domain expertise, specialized equipment, legality of the action, and requisite infrastructure. Automated labeling (e.g., via a specialized prompt to gpt-oss-120b) is commonly used, with labels validated against human-annotated subsets. In practice, cost-prediction accuracy reaches approximately 68% exact match and 96% within one point, with a mean absolute error (MAE) near 0.36 (Chen et al., 2 Feb 2026).
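The three reported validation statistics (exact match, off-by-one agreement, MAE) can be computed from paired label lists as follows; this is a generic sketch, with function and variable names of our choosing:

```python
def label_agreement(pred: list[int], gold: list[int]) -> dict[str, float]:
    """Exact-match rate, off-by-one rate, and MAE for discrete 1-5 cost labels.

    `pred` are automatically assigned labels, `gold` are human annotations.
    """
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and equal length")
    diffs = [abs(p - g) for p, g in zip(pred, gold)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,       # exact agreement
        "off_by_one": sum(d <= 1 for d in diffs) / n,  # within one point
        "mae": sum(diffs) / n,                         # mean absolute error
    }
```

Note that on a five-point ordinal scale, off-by-one agreement is the more forgiving (and often more meaningful) measure, since adjacent cost labels reflect genuinely similar execution barriers.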
Within the Expected Harm framework, the discrete cost label $C$ is used directly as a proxy for execution likelihood $L$; no further regression or calibration is imposed, reflecting both practical constraints and calibration to human judgments.
3. Execution Likelihood in LLM Safety and Risk Calibration
The integration of execution likelihood into LLM safety analysis highlights critical misalignments in refusal strategies. Empirical evaluation reveals that real-world toxic prompts cluster at low cost (mean ≈ 1.2), while synthetic benchmarks are 1.47× higher in cost on average (Figure 1 in (Chen et al., 2 Feb 2026)). Models demonstrate "Inverse Risk Calibration," in which refusal behaviors are disproportionately strong for high-cost, low-execution-likelihood threats, and unduly weak for low-cost, high-likelihood queries—the very region where real user behavior concentrates.
Attack success rate (ASR) thus exhibits a characteristic heatmap pattern: high vulnerability (high ASR) at low cost, robust refusal at high cost (Chen et al., 2 Feb 2026). This mismatch exposes a structural vulnerability. Empirically, strategic exploitation of this calibration error can double the success rate of jailbreak attacks.
4. Execution Capability and Sequential Task Performance
Execution likelihood also governs how reliably LLMs can perform long-horizon, multi-step tasks, even when supplied with explicit knowledge and plans. Here, stepwise correctness (step accuracy $p$) is critical, as even small deviations compound rapidly. The probability of a flawless $n$-turn execution (assuming no self-correction) is $p^n$. The number of steps $H_{0.5}$ a model can execute reliably with a 50% success rate is:
$$H_{0.5} = \frac{\ln 0.5}{\ln p}$$
Even marginal improvements in single-step accuracy yield super-exponential gains in horizon length; for near-perfect $p$, the sensitivity $\frac{dH_{0.5}}{dp}$ grows as $\frac{1}{(1-p)^2}$. This magnifies the economic and practical value of even small scaling gains (Sinha et al., 11 Sep 2025).
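This $1/(1-p)^2$ scaling can be verified with a finite difference on $H_{0.5} = \ln 0.5 / \ln p$; the sketch below (names ours) shows that halving the per-step error rate roughly quadruples the sensitivity:

```python
import math

def horizon(p: float) -> float:
    """50%-success horizon length, H = ln(0.5) / ln(p)."""
    return math.log(0.5) / math.log(p)

def horizon_sensitivity(p: float, eps: float = 1e-6) -> float:
    """Finite-difference estimate of dH/dp at step accuracy p."""
    return (horizon(p + eps) - horizon(p)) / eps

# Error rate 1% vs 0.5%: halving (1-p) should ~quadruple dH/dp,
# consistent with dH/dp growing as 1/(1-p)^2 near p = 1.
ratio = horizon_sensitivity(0.995) / horizon_sensitivity(0.99)
```

Analytically, $dH_{0.5}/dp = \ln 2 / (p \ln^2 p)$, and since $\ln p \approx -(1-p)$ near $p = 1$, the $1/(1-p)^2$ growth follows directly.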
Empirically, per-turn accuracy in multi-step settings degrades with depth: models that start at 100% at turn 1 often fall to ≈ 80% by turn 20 and below 50% by turn 50. This trend is found across medium- and large-scale systems (Qwen3, Gemma3 8B–32B), with the largest variants failing after more turns but still subject to degradation (Sinha et al., 11 Sep 2025).
5. Compounding Errors, Self-Conditioning, and Robustness Strategies
A critical dynamic affecting execution likelihood involves the interaction of past errors with future performance, a phenomenon termed "self-conditioning." When LLMs are conditioned on their own erroneous outputs (even with constant context length), per-step accuracy at later turns drops further than baseline long-context effects would predict (Sinha et al., 11 Sep 2025). Scale mitigates only long-context drift, not error compounding: very large models (200B+) retain high accuracy in error-free contexts but remain vulnerable to self-conditioned declines.
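The self-conditioning dynamic can be illustrated with a toy Monte Carlo in which each past error lowers subsequent per-step accuracy. The base accuracy and penalty parameters below are hypothetical illustrations, not estimates from the paper:

```python
import random

def simulate_errors(n_turns: int, base_acc: float, penalty: float,
                    rng: random.Random) -> int:
    """Count errors over n_turns when each past error lowers accuracy.

    Toy stand-in for self-conditioning: conditioning on its own mistakes
    makes further mistakes more likely. penalty=0 recovers the baseline
    i.i.d. per-step process.
    """
    errors = 0
    for _ in range(n_turns):
        acc = max(base_acc - penalty * errors, 0.0)
        if rng.random() > acc:
            errors += 1
    return errors

def mean_errors(penalty: float, trials: int = 2000, seed: int = 0) -> float:
    """Average error count over many 50-turn episodes at base accuracy 0.95."""
    rng = random.Random(seed)
    return sum(simulate_errors(50, 0.95, penalty, rng)
               for _ in range(trials)) / trials
```

With penalty 0 the mean error count sits near the i.i.d. expectation (2.5 errors over 50 turns at 95% accuracy); any positive penalty inflates it well beyond that, mirroring how self-conditioned declines outpace what baseline per-step error rates would predict.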
Distinctly, "thinking models"—LLMs trained via RL to generate and then discard chain-of-thought traces—maintain stable performance regardless of the error rate in conditioning history. Sequential test-time reasoning, by regenerating CoTs afresh each turn, interrupts the feedback loop and prevents error propagation (Sinha et al., 11 Sep 2025). This suggests that sequence-level execution robustness requires not merely scale but architectural or procedural interventions to break cycles of error amplification.
6. Implications for Model Evaluation and Safety Mitigation
Systematic "cost blindness," i.e., models' inability to represent or utilize execution cost in refusal decisions, drives suboptimal safety outcomes. Linear probing reveals that latent states in current models robustly encode severity but lack any monotonic or interpretable representation of execution cost—activation in hidden layers correlates strongly with severity but only bimodally with cost (high at extremes, baseline at midrange), confirming the absence of an internal "execution likelihood" dimension (Chen et al., 2 Feb 2026).
Key recommendations include:
- Integrating execution likelihood into safety taxonomies and evaluation metrics—adopting Expected Harm (EH) over Attack Success Rate (ASR) alone.
- Augmenting safety training with low-cost, high-likelihood harmful examples to recalibrate model refusal behavior.
- Developing compositional or sequential guardrails capable of recognizing and neutralizing benign subtasks that may compose into harmful wholes.
- Auditing real-world prompt distributions to align defense calibrations with the true landscape of executable threats (Chen et al., 2 Feb 2026).
A plausible implication is that further advances in execution likelihood modeling and error-controlled long-horizon reasoning are prerequisites for both safer and more performant LLM deployments.
7. Summary Table: Metrics and Failure Modes
| Term/Metric | Definition / Measurement | Source |
|---|---|---|
| Execution Likelihood $L$ | $P(\text{executed} \mid \text{model output})$ | (Chen et al., 2 Feb 2026) |
| Execution Cost $C$ | Discrete (1–5), difficulty of operationalization | (Chen et al., 2 Feb 2026) |
| Expected Harm (EH) | Severity $\times$ $L(C)$ | (Chen et al., 2 Feb 2026) |
| Step Accuracy $p$ | Probability of a correct step update | (Sinha et al., 11 Sep 2025) |
| Horizon Length $H_{0.5}$ | Steps to 50% task success, $H_{0.5} = \ln 0.5 / \ln p$ | (Sinha et al., 11 Sep 2025) |
| Inverse Risk Calibration | Strongest refusal at low $L$ (high cost), weak at high $L$ (low cost) | (Chen et al., 2 Feb 2026) |
| Self-conditioning | Error propagation via autoregressive context | (Sinha et al., 11 Sep 2025) |
The recognition, measurement, and mitigation of execution likelihood is central to both advancing LLM safety and extending practical, reliable automation to long-horizon, real-world workflows. Recent benchmarks and theoretical analysis underscore that even incremental improvements in per-step reliability can yield exponential benefits in sequential task capacity, but also that robust execution safety demands explicit attention to the real-world feasibility of model outputs.