
Execution-Aware MBR Decoding

Updated 17 April 2026
  • Execution-aware MBR decoding is an inference paradigm that redefines utility using runtime execution signals for more accurate candidate selection.
  • It leverages domain-specific tests like program execution outcomes, test passes, and trajectory similarities to determine expected risk.
  • Approximate methods such as Correlated Sequential Halving balance computation and accuracy, demonstrating significant gains in diverse applications.

Execution-aware Minimum Bayes Risk (MBR) decoding is an inference-time decoding and reranking paradigm in which candidate outputs are evaluated and selected according to their expected loss—or "risk"—as measured by signals derived from executing the candidate outputs. Unlike standard likelihood-based decoding, execution-aware MBR leverages the semantics observable through actual candidate execution, such as program outputs, test outcomes, constraint violations, or action-replay consistency, to redefine the utility or loss function used in the MBR framework. This approach has shown significant empirical gains and reliability in domains where syntactic equivalence is insufficient, such as code generation, mathematical optimization, and robotic planning.

1. Theoretical Foundation of Execution-Aware MBR

Minimum Bayes Risk decoding selects, from a set of candidate outputs, the hypothesis with the lowest expected loss under the model’s posterior distribution. For a set of candidates $\mathcal{Y} = \{y_1, \ldots, y_N\}$ sampled from $P(y \mid x)$ for input $x$, the expected risk of a candidate $y'$ is:

$$R(y') = \mathbb{E}_{y \sim P(\cdot \mid x)}[\ell(y', y)] \approx \frac{1}{N} \sum_{y \in \mathcal{Y}} \ell(y', y)$$

where $\ell(y', y)$ is a task-specific loss function. Execution-aware MBR redefines $\ell$ in terms of output agreement measured via execution or other concrete task semantics, instead of surface-level or model-internal metrics.
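
The Monte Carlo estimate above translates into a few lines of code. The following is a minimal sketch, not any cited paper's implementation: the function name `mbr_select` and the toy absolute-difference loss are assumptions chosen only to make the selection rule concrete.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def mbr_select(candidates: Sequence[T], loss: Callable[[T, T], float]) -> T:
    """Return the candidate with the lowest Monte Carlo risk estimate.

    The sampled candidates double as the reference set, so the risk of a
    hypothesis is its average loss against every sample (itself included).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    n = len(candidates)

    def risk(hyp: T) -> float:
        return sum(loss(hyp, ref) for ref in candidates) / n

    return min(candidates, key=risk)


# Toy loss (an assumption for illustration): absolute difference of numbers.
samples = [1.0, 2.0, 2.0, 3.0, 10.0]
best = mbr_select(samples, lambda a, b: abs(a - b))
```

With this toy loss the outlier 10.0 has the highest risk and the repeated value 2.0 has the lowest, so `best` is 2.0. Every call evaluates the loss for all pairs of samples, which is the cost the approximate methods in Section 3 try to reduce.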

In classical sequence tasks, MBR is computationally expensive because all pairwise losses (or utilities) between the $N$ samples must be computed, a quadratic cost that is especially burdensome when execution-based or neural loss functions are used (Jinnai et al., 2024).

2. Execution-Aware Loss Functions and Task Semantics

Execution-aware MBR instantiates the loss function with execution-derived signals tailored to each problem domain:

  • Code Generation: Loss is determined by agreement of execution outputs on held-out inputs. For candidates $y, y'$, the comparison is either a binary utility $U(y, y') = 1_{\mathrm{outputs}(y) = \mathrm{outputs}(y')}$ (Li et al., 2024) or a "hard" or "soft" 0–1 loss on per-test-input outputs (Shi et al., 2022). This semantically clusters programs by functional equivalence detected via execution traces.
  • Optimization Modeling: Risk incorporates composite terms reflecting compilation success, unit test pass rates, objective function agreement (between independent optimizer and simulator implementations), and counts of constraint violations (Song et al., 29 Jan 2026).
  • Robotics / VLA Models: Loss is a function of distances in trajectory feature space (e.g., a norm over end-effector poses) between sampled action chunks, capturing behavioral consensus rather than syntactic or probability-based similarity (Ma et al., 5 Jan 2026).

Risk is minimized when a candidate both aligns with the consensus behavior (as seen via execution) and robustly passes concrete functional checks.
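
To make the code-generation case concrete, here is one plausible reading of the "hard" and "soft" 0–1 losses on per-test-input outputs. It is a sketch under assumptions: each program is represented only by the tuple of outputs it produced on a shared list of test inputs, and the names `hard_loss` and `soft_loss` are illustrative rather than taken from the cited papers.

```python
from typing import Hashable, Sequence

# One execution result per test input, in a fixed input order (assumed).
Outputs = Sequence[Hashable]


def _check_same_length(a: Outputs, b: Outputs) -> None:
    if len(a) != len(b):
        raise ValueError("both programs must be run on the same test inputs")


def hard_loss(a: Outputs, b: Outputs) -> float:
    """0.0 if the programs agree on every test input, otherwise 1.0."""
    _check_same_length(a, b)
    return 0.0 if all(x == y for x, y in zip(a, b)) else 1.0


def soft_loss(a: Outputs, b: Outputs) -> float:
    """Fraction of test inputs on which the two programs disagree."""
    _check_same_length(a, b)
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


outs_a = (1, 4, 9)   # e.g. outputs of a squaring program on inputs 1, 2, 3
outs_b = (1, 4, 10)  # a program that differs only on the last input
```

Here `hard_loss(outs_a, outs_b)` is 1.0 because a single disagreement counts as total disagreement, while `soft_loss(outs_a, outs_b)` is 1/3 and gives partial credit for agreeing on two of three inputs.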

3. Practical Algorithms and Approximate Execution-Aware MBR

Exact MBR decoding with expensive execution-based metrics is computationally prohibitive for large $N$ because of the $O(N^2)$ pairwise loss computations. Several approximate and efficient algorithms have emerged:

  • Correlated Sequential Halving (CSH): For text generation and neural metrics, CSH adaptively prunes the set of hypotheses by allocating a fixed evaluation budget across rounds, each round halving the candidate pool based on Monte Carlo utility estimates computed against a growing reference set (Jinnai et al., 2024). The resulting Approximate MBR (AMBR) decoder closely matches exact MBR in COMET score on WMT21 De→En while using only a fraction of the total evaluations.
  • Early Filtering and Selective Execution: In code generation, cheap trial-unit-test filtering is used to eliminate obviously invalid candidates prior to expensive full-suite execution and pairwise risk estimation (Li et al., 2024).

Additionally, in NEMO’s optimization setting, candidate generation and execution are paired with validator feedback and self-consistency ensembles, further stabilizing outcomes and reducing the search space for MBR (Song et al., 29 Jan 2026).
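
The halving idea can be illustrated with a simplified sketch. This is not the published CSH or AMBR algorithm: the budget schedule, the function name `halving_mbr`, and the choice to take references in sample order are all assumptions made to keep the example short and deterministic. What it does preserve is the shape described above: every surviving hypothesis is scored against the same references in each round, scores accumulate over a growing reference set, and the pool is halved each round.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def halving_mbr(
    candidates: Sequence[T],
    utility: Callable[[T, T], float],
    refs_per_round: int = 2,
) -> Tuple[T, int]:
    """Pick a high-utility candidate by repeated halving.

    Returns the winner and the number of utility evaluations spent.
    """
    n = len(candidates)
    if n == 0:
        raise ValueError("need at least one candidate")
    alive: List[int] = list(range(n))
    totals = [0.0] * n
    next_ref = 0
    evaluations = 0
    while len(alive) > 1:
        # The same references are shared by all surviving hypotheses.
        refs = [(next_ref + k) % n for k in range(refs_per_round)]
        next_ref += refs_per_round
        for i in alive:
            for r in refs:
                totals[i] += utility(candidates[i], candidates[r])
                evaluations += 1
        # Keep the better half (at least one) by accumulated utility.
        alive.sort(key=lambda i: totals[i], reverse=True)
        alive = alive[: max(1, len(alive) // 2)]
    return candidates[alive[0]], evaluations


samples = ["cat", "cat", "dog", "cat", "bird", "cat", "cat", "dog"]
winner, used = halving_mbr(samples, lambda a, b: 1.0 if a == b else 0.0)
```

On this toy input the sketch returns "cat" after 28 utility evaluations, compared with the 64 that scoring all 8 × 8 pairs would take. Because low-scoring hypotheses are dropped early, most of the budget goes to distinguishing the strongest candidates.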

4. Execution-Aware MBR in Code Generation

Execution-aware MBR is highly effective in program synthesis and code translation tasks:

  • The approach involves generating $N$ samples from an LLM, filtering candidates via trial tests, computing output traces on evaluation inputs, and ranking via majority-vote or expected binary agreement (Li et al., 2024, Shi et al., 2022).
  • Empirically, trial test filtering boosts pass@1 performance significantly (e.g., from ~39% to ~60%), while subsequent MBR reranking after filtering further raises accuracy close to the theoretical oracle, e.g., ~70% pass@1 on HumanEval with CodeLlama-7B (Li et al., 2024).
  • Execution-aware MBR consistently outperforms both log-likelihood and neural-metric-based rerankers and exhibits robustness across model scales and problem classes. With only a few (5–20) high-quality unit tests, nearly all diversity-induced gains can be realized.
  • Similar principles support SQL and shell translation via execution-MBR, achieving improvements of 10–15 points in execution accuracy over sampling (Shi et al., 2022).
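
The filter-then-vote pipeline described in this section can be sketched as follows. The example makes several assumptions: candidates are plain Python callables standing in for generated programs, `run_safely` and `select_program` are hypothetical helper names, and execution happens in-process, whereas a real system would run untrusted code in a sandbox with timeouts. Under the binary agreement utility, the candidate with the highest expected agreement belongs to the largest cluster of identical output signatures, so a majority vote over signatures gives the same choice as pairwise scoring.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Program = Callable[[int], object]  # stand-in for an executable candidate


def run_safely(program: Program, arg: int) -> object:
    """Execute a candidate; turn crashes into a comparable error marker."""
    try:
        return program(arg)
    except Exception as exc:  # real systems would also sandbox and time out
        return f"error:{type(exc).__name__}"


def select_program(
    programs: Sequence[Program],
    trial_tests: Sequence[Tuple[int, object]],
    eval_inputs: Sequence[int],
) -> Program:
    if not programs:
        raise ValueError("need at least one candidate program")
    # Step 1: cheap filtering on trial tests given as (input, expected output).
    survivors: List[Program] = [
        p for p in programs
        if all(run_safely(p, x) == want for x, want in trial_tests)
    ]
    if not survivors:
        survivors = list(programs)  # nothing passed; fall back to all samples
    # Step 2: execution signature of each survivor on the evaluation inputs.
    signatures = [tuple(run_safely(p, x) for x in eval_inputs) for p in survivors]
    # Step 3: majority vote over signatures.
    top_signature, _ = Counter(signatures).most_common(1)[0]
    return survivors[signatures.index(top_signature)]


candidates: List[Program] = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * 2,   # passes the trial test below only by coincidence
    lambda x: x + 2,   # likewise
    lambda x: 1 // 0,  # always crashes
]
chosen = select_program(candidates, trial_tests=[(2, 4)], eval_inputs=[0, 1, 3])
```

The single trial test removes only the crashing candidate; the two coincidentally correct programs survive but end up in singleton clusters, so the vote selects a squaring implementation.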

5. Execution-Grounded MBR in Optimization and Planning

In automated mathematical modeling (NEMO), execution-aware MBR:

  • Samples candidate optimizer implementations (e.g., Python + solver code), executes and validates each in a sandbox, and computes risk as a weighted function of compile/test failures, constraint violations, and objective discrepancies (Song et al., 29 Jan 2026).
  • Candidates failing simulation validation or exhibiting objective mismatches accumulate higher expected risk and are deprioritized.
  • Robustness is further achieved by using an asymmetric validator–optimizer loop—injecting simulation error traces as debugging prompts—and by majority-voting over ensembles of independent runs to filter outlier candidates.
  • In a case study, this approach reliably selects implementations that compile, pass all tests, and meet specification objectives, contributing substantially to NEMO’s state-of-the-art benchmark performance.
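
A composite risk of the kind described above might look like the following sketch. The field names, the weights, and the additive form are all assumptions for illustration; the source states only that risk is a weighted function of compile and test failures, constraint violations, and objective discrepancies, and does not give NEMO's actual formula.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExecutionReport:
    """Signals collected from one sandboxed run (hypothetical schema)."""
    name: str
    compiled: bool
    test_pass_rate: float       # fraction of unit tests passed, in [0, 1]
    constraint_violations: int  # number of violated constraints
    objective_gap: float        # |optimizer objective - simulator objective|


def composite_risk(
    report: ExecutionReport,
    w_compile: float = 10.0,  # illustrative weights, not values from NEMO
    w_tests: float = 5.0,
    w_violation: float = 1.0,
    w_gap: float = 2.0,
) -> float:
    """Weighted sum of execution-derived penalties; lower is better."""
    return (
        w_compile * (0.0 if report.compiled else 1.0)
        + w_tests * (1.0 - report.test_pass_rate)
        + w_violation * report.constraint_violations
        + w_gap * report.objective_gap
    )


def pick_lowest_risk(reports: Sequence[ExecutionReport]) -> ExecutionReport:
    if not reports:
        raise ValueError("need at least one execution report")
    return min(reports, key=composite_risk)


reports = [
    ExecutionReport("A", True, 1.0, 0, 0.0),
    ExecutionReport("B", True, 0.8, 1, 0.5),
    ExecutionReport("C", False, 0.0, 0, 0.0),
]
best = pick_lowest_risk(reports)
```

Candidate A, which compiles, passes every test, and shows no objective mismatch, has zero risk and is selected; B is penalized for a failed test, a violation, and an objective gap, and C is penalized most heavily for failing to compile.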

6. Execution-Aware MBR in Vision-Language-Action (VLA) Systems

For vision-language-conditioned action generation in robotics (CycleVLA), execution-aware MBR is utilized as a zero-shot, test-time scaling mechanism:

  • At predicted subtask boundaries (e.g., after backtracking), multiple action-chunk candidates are sampled via a stochastic policy.
  • Pairwise distances over trajectory features are computed, and the medoid (i.e., the candidate with minimal total distance to all others) is selected for execution (Ma et al., 5 Jan 2026).
  • This procedure consistently raises the per-chunk probability of success (e.g., from 71.7% to 78.5% in under-trained models, and from 90.2% to 95.5% in fully trained models), corresponding to ∼6 percentage point gains in end-to-end task success.
  • MBR is only applied at specific execution points rather than continuously, exploiting execution context for computational efficiency.
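
Medoid selection over sampled action chunks can be sketched in a few lines. The representation and metric here are assumptions: each chunk is a list of equal-length feature vectors (for example, end-effector positions per step), and the distance is a summed per-step Euclidean distance, which may differ from the metric used in CycleVLA. The names `chunk_distance` and `select_medoid` are illustrative.

```python
import math
from typing import Sequence, Tuple

Pose = Tuple[float, ...]
Chunk = Sequence[Pose]  # one feature vector per time step (assumed layout)


def chunk_distance(a: Chunk, b: Chunk) -> float:
    """Sum of per-step Euclidean distances between two action chunks."""
    if len(a) != len(b):
        raise ValueError("chunks must have the same number of steps")
    return sum(math.dist(p, q) for p, q in zip(a, b))


def select_medoid(chunks: Sequence[Chunk]) -> int:
    """Index of the chunk with minimal total distance to all the others."""
    if not chunks:
        raise ValueError("need at least one sampled chunk")
    totals = [
        sum(chunk_distance(c, other) for other in chunks) for c in chunks
    ]
    return min(range(len(chunks)), key=totals.__getitem__)


sampled = [
    [(0.0, 0.0), (1.0, 0.1)],
    [(0.0, 0.0), (3.0, 0.0)],   # outlier: overshoots the target
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, -0.1)],
]
medoid_index = select_medoid(sampled)
```

The outlier is far from every other sample and therefore has the largest total distance, while the chunk at index 2 sits between its two near neighbors and is selected. Unlike averaging, the medoid is always one of the sampled chunks, so the executed action is one the policy actually proposed.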

7. Extensions, Limitations, and Practical Considerations

Execution-aware MBR enables robust, consensus-driven output selection but also faces practical and methodological considerations:

  • Computational Budget: Execution-based utility functions are expensive. Approximate MBR algorithms and pre-filtering are crucial for practical deployment (Jinnai et al., 2024, Li et al., 2024).
  • Number and Quality of Tests: In code domains, only a small number of high-quality test cases are required for near-oracle performance, but the discriminative power of test inputs is critical (Shi et al., 2022).
  • Hyperparameter Elimination: Methods such as AMBR are designed to avoid domain-specific hyperparameter tuning, enabling budget-constrained, consistent decoding (Jinnai et al., 2024).
  • Risk Function Design: Composite risk functions, as exemplified by NEMO, accommodate multiple execution-based error signals, offering fine-grained control over robustness (Song et al., 29 Jan 2026).
  • Future Directions: Areas for further development include automated selection of maximally informative test inputs, scalable clustering for very large candidate sets, and end-to-end training with execution-based losses.

Execution-aware MBR decoding thus brings executable semantics into inference-time selection, with reported improvements in accuracy and reliability across text generation, code synthesis, optimization modeling, and robotics (Jinnai et al., 2024, Li et al., 2024, Song et al., 29 Jan 2026, Ma et al., 5 Jan 2026, Shi et al., 2022).
