VLM-Based Failure Predictor & Planner
- A VLM-Based Failure Predictor and Planner is a system that fuses visual perception with symbolic reasoning to detect anomalies and generate recovery strategies in robotics.
- It leverages multimodal deep learning with visual encoders and cross-modal transformers to quantify uncertainty and update behavior policies in real time.
- Experimental studies show significant improvements in recovery success and task robustness, highlighting its potential for adaptive, failure-aware robotic planning.
A Vision-Language Model (VLM)-based Failure Predictor and Planner is a robotics system component that leverages multimodal neural architectures to detect, diagnose, and recover from failures during task execution by fusing visual perception with symbolic and linguistic reasoning. This paradigm seeks to transcend the limitations of preprogrammed failure dictionaries by enabling open-set anomaly detection, root-cause analysis, failure-aware planning, and online expansion of reactive behavior policies, yielding robust performance across complex manipulation and planning domains.
1. System Architecture and Core Workflow
VLM-based failure prediction and planning frameworks typically integrate several tightly coupled modules, constituting a perception-to-action loop that can adapt online to unforeseen anomalies (Ahmad et al., 2024, Ahmad et al., 19 Mar 2025, Zeng et al., 2 Dec 2025). The canonical architecture consists of:
- Sensing and Perception: Cameras and auxiliary sensors provide streaming RGB(D) imagery and optional scene state information (e.g., scene graphs).
- VLM-Based Predictor/Generator: Visual encoders (ViT/CNN) process images into feature embeddings, which are fused via cross-modal transformers with:
- Symbolic representations of current plans or behavior tree structures,
- Skill library encodings, and
- (Optionally) structured scene graphs or execution histories. The VLM outputs:
- A soft failure likelihood score,
- Natural language root-cause explanations,
- Structured templates for missing conditions (preconditions/postconditions), and/or recovery skill descriptors.
- Reactive/Proactive Planner: Upon failure detection, modifies the underlying execution policy (e.g., Behavior Tree [BT] or symbolic plan) by injecting new condition nodes or skill branches derived from VLM suggestions. The planner may operate in both pre-execution (predictive) and runtime-correction (reactive) modes.
- Behavior Policy and Executor: The updated plan or BT drives the robot’s actuation interface, closing the loop by providing fresh data to the sensing layer.
The forward cascade is visualized as:
[Camera/Image Preprocessor] → [VLM Predictor & Generator] → [Planner/BT Updater] → [Policy Executor] → Robot (Ahmad et al., 2024).
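The cascade above can be sketched as a minimal sense-predict-plan-act loop. This is an illustrative skeleton, not an API from the cited systems: the `Planner` class, the `vlm_predict` callback, and the output keys (`failure_score`, `missing_conditions`, `recovery_skills`) are all assumed names.

```python
from dataclasses import dataclass


@dataclass
class Planner:
    """Toy stand-in for a BT/symbolic planner holding an ordered plan."""
    plan: list

    def current_plan(self):
        return list(self.plan)

    def update(self, conditions, skills):
        # Inject VLM-suggested condition checks and recovery skills
        # ahead of the (potentially failing) remaining steps.
        self.plan = list(conditions) + list(skills) + self.plan

    def next_action(self):
        return self.plan.pop(0) if self.plan else "idle"


def control_loop(capture, vlm_predict, planner, execute, tau=0.5, max_steps=10):
    """Sense -> VLM predict -> (re)plan -> act, closing the loop each step."""
    trace = []
    for _ in range(max_steps):
        image = capture()                                   # Sensing/Perception
        out = vlm_predict(image, planner.current_plan())    # VLM Predictor
        if out["failure_score"] > tau:                      # soft score vs. preset tau
            planner.update(out["missing_conditions"], out["recovery_skills"])
        action = planner.next_action()
        execute(action)                                     # Policy Executor
        trace.append(action)
    return trace
```

The loop makes the threshold-triggered correction explicit: only when the soft failure score exceeds `tau` does the planner mutate the running plan.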
2. Failure Detection, Uncertainty Quantification, and Anomaly Diagnosis
At the core of the approach is the rigorous detection of anomalous states at perception and execution time:
- Failure Prediction: The VLM encodes visual input together with plan/skill structure and outputs a calibrated soft failure score or failure class via an MLP or transformer head. This score is thresholded against a preset value τ to trigger the correction pipeline (Ahmad et al., 2024, Ahmad et al., 19 Mar 2025).
- Uncertainty Quantification (UQ): Some frameworks, such as ViLU, explicitly train a post-hoc binary classifier on multi-modal representations to predict the probability of prediction error (misclassification), providing fine-grained failure-probability estimates for each sample. The UQ module integrates the visual embedding, the text embedding, and image-conditioned attention summaries, with a classifier trained via a weighted binary cross-entropy objective (Lafon et al., 10 Jul 2025).
- Anomaly Diagnosis: Upon failure detection, the VLM generates free-form or structured natural language explanations describing the missing enabling conditions, skills, or root causes, thereby supporting both autonomous and human-in-the-loop debugging (Ahmad et al., 2024, Duan et al., 2024, Zeng et al., 2 Dec 2025).
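A post-hoc UQ head of the kind described above can be sketched as a small binary classifier on concatenated embeddings, trained with a class-weighted binary cross-entropy. This is a minimal NumPy sketch under stated assumptions, not the exact ViLU architecture or feature construction; the simple concatenation of `z_img` and `z_txt` and the logistic head are illustrative.

```python
import numpy as np


def weighted_bce(p, y, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Class-weighted binary cross-entropy; y=1 marks a prediction error."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))


def train_uq_head(z_img, z_txt, y, lr=0.1, steps=500, w_pos=1.0, w_neg=1.0):
    """Fit a logistic head p(error | [z_img; z_txt]) by gradient descent."""
    X = np.concatenate([z_img, z_txt], axis=1)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient of the weighted BCE with respect to the logits
        g = (w_pos * y * (p - 1) + w_neg * (1 - y) * p) / len(y)
        w -= lr * (X.T @ g)
        b -= lr * g.sum()
    return w, b


def failure_prob(z_img, z_txt, w, b):
    X = np.concatenate([z_img, z_txt], axis=1)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))
```

The `w_pos`/`w_neg` weights let the objective compensate for the usual imbalance between correct and failed predictions in the training data.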
3. Automated Generation and Integration of Recovery Policies
A key innovation is the automated synthesis and injection of recovery strategies into the robot’s task execution logic:
- Template Generation: Upon anomaly identification, VLMs synthesize:
- Missing Condition Triplets, e.g., (hole_free, insert_skill, precondition)
- Skill Templates: Structured descriptors containing skill names, parameters, pre/postconditions encapsulated in JSON-like schemata.
Template search amounts to matching the diagnosed missing condition against the pre- and postconditions of skill templates in the library.
- Policy Update: The reactive planner traverses the current BT or symbolic plan to:
- Insert new precondition nodes (e.g., sequencing Condition-Node in front of a failed skill node),
- Attach new recovery skill branches at fallback/selector nodes,
- Replan at the symbolic/state level if majority preconditions/effects are unsatisfied (Ahmad et al., 2024, Ahmad et al., 19 Mar 2025, Zhang et al., 2023).
- Actionable Guidance: Advanced systems (e.g., ViFailback-8B) augment textual correction with visual symbol generation (arrows, icons) overlaid onto camera views and consumed by downstream low-level policies, achieving real-time recoveries in manipulation tasks (Zeng et al., 2 Dec 2025).
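The template generation and BT-injection steps above can be sketched together: a JSON-like skill template plus a rewrite that wraps a failed skill node so the missing precondition is checked first and the recovery skill runs as a fallback. The schema fields and the `(kind, children)` tuple encoding of `Sequence`/`Fallback` nodes are illustrative assumptions, not the cited systems' data structures.

```python
# Illustrative recovery-skill template in a JSON-like schema (assumed fields)
recovery_template = {
    "skill": "clear_hole",
    "parameters": {"target": "hole_1"},
    "preconditions": ["hole_visible"],
    "postconditions": ["hole_free"],
}


def inject_precondition(tree, failed_skill, condition, recovery):
    """Wrap the failed skill in a Fallback: try (condition -> skill);
    otherwise run the recovery skill first, then retry the skill."""
    def rewrite(node):
        if node == failed_skill:
            return ("Fallback", [
                ("Sequence", [("Condition", condition), failed_skill]),
                ("Sequence", [recovery["skill"], failed_skill]),
            ])
        if isinstance(node, tuple):
            kind, children = node
            return (kind, [rewrite(c) for c in children])
        return node
    return rewrite(tree)
```

The rewrite preserves the rest of the tree untouched, mirroring the paper's description of inserting condition nodes ahead of a failed skill and attaching recovery branches at fallback nodes.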
4. Planning Integration: Symbolic, Reactive, and Risk-Aware Paradigms
Several system instantiations demonstrate the breadth of planning frameworks that can be augmented with VLM-based failure prediction:
- Behavior Tree (BT) Approaches: VLM output drives direct modification (node insertion/restructuring) of the BT running policy. Both pre-execution verification and reactive runtime repair are supported, with scene graphs and execution histories enhancing context-awareness (Ahmad et al., 2024, Ahmad et al., 19 Mar 2025).
- Classical Symbolic Planning: In TPVQA, VLMs map symbolic preconditions/effects to natural-language queries and VQA answers (“yes/no”). Precondition failures trigger online re-planning; effect failures trigger action retry (Zhang et al., 2023). This formally closes the perception-action loop that classical planners traditionally lack.
- Dual-VLM for Visual Planning: In VLMFP, a SimVLM predicts step-by-step success/failure given images/rules/actions, and a GenVLM generates/iteratively revises PDDL files, automating both visual-state simulation and formal plan rule synthesis via model-to-model feedback (Hao et al., 3 Oct 2025).
- Risk-Aware Planning: UQ outputs from ViLU or similar can be injected as risk-penalties or as observation-noise models in POMDP or beam search planners, enabling failure-averse decision-making (Lafon et al., 10 Jul 2025).
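The TPVQA-style check can be sketched as a loop that phrases each symbolic precondition as a yes/no visual question and escalates accordingly. The question template and the `ask_vqa`, `run`, and `replan` callbacks are assumed placeholders; the majority rule for triggering replanning follows the table below.

```python
def check_preconditions(image, preconditions, ask_vqa):
    """Phrase each precondition as a yes/no VQA query; return unsatisfied ones."""
    failed = []
    for pred in preconditions:
        question = f"Is it true that {pred.replace('_', ' ')}?"
        if ask_vqa(image, question).strip().lower() != "yes":
            failed.append(pred)
    return failed


def execute_with_vqa(image, action, preconds, effects, ask_vqa, run, replan):
    """Precondition failure (majority) -> online replan; effect failure -> retry."""
    failed = check_preconditions(image, preconds, ask_vqa)
    if len(failed) * 2 > len(preconds):   # majority of preconditions unsatisfied
        return replan(failed)
    run(action)
    if check_preconditions(image, effects, ask_vqa):  # expected effects missing
        run(action)                                    # retry the action once
    return "done"
```

Grounding each symbolic predicate in a VQA answer is what closes the perception-action loop that a purely symbolic planner would leave open.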
| Framework | Failure Trigger | Policy Update Mechanism |
|---|---|---|
| BT+VLM | Failure score > τ | Node/subtree injection |
| TPVQA (PDDL+VQA) | Majority precond fail | Online replan, action retry |
| ViFailback (VLM+VLA) | VQA/symbol diagnosis | Visual/textual corrective cues |
| VLMFP (SimVLM/GenVLM) | SimVLM predicts fail | Refine PDDL via bi-model loop |
| ViLU (UQ) | High failure prob | Risk-modulated action selection |
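Risk-modulated action selection, as in the last row of the table, can be sketched as penalizing each candidate action's value by its predicted failure probability. The linear penalty with weight `lam` is a common, illustrative choice; the cited systems may use POMDP observation models or beam-search penalties instead.

```python
def select_action(candidates, value, fail_prob, lam=1.0):
    """Pick argmax of value(a) - lam * fail_prob(a): a risk-penalized score."""
    return max(candidates, key=lambda a: value(a) - lam * fail_prob(a))
```

Tuning `lam` trades task value against failure risk: `lam = 0` ignores the UQ signal entirely, while large `lam` yields strongly failure-averse behavior.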
5. Experimental Results, Quantitative Gains, and Comparative Analysis
Multiple studies have empirically validated the efficacy of VLM-based failure prediction and planning:
- Closed-Loop Recovery in BT Frameworks: 100% recovery success in previously unseen failure cases across multiple manipulation tasks, outperforming LLM-only baselines that lack visual context (e.g., LLM-only: 30–60% in hard obstacle scenarios; VLM: 100%) (Ahmad et al., 2024).
- Fine-Grained Failure Reasoning Benchmarks: ViFailback-8B achieves 93.70% closed-ended and 72.64% open-ended benchmark scores, with symbol-guided recovery pipelines boosting real-world robot success rates by 21–24% over base policies (Zeng et al., 2 Dec 2025).
- Generalization Across Domains: Systems such as AHA and Dual-VLM frameworks exhibit generalization to real robots, unseen visual appearances, and modified domain rules, with absolute improvements versus strong VLM baselines (AHA-13B: 78.2% average, +6.4% over GPT-4o-ICL; VLMFP: 70.0% valid plans for unseen instances) (Duan et al., 2024, Hao et al., 3 Oct 2025).
- Ablation Studies: Ablations highlight the criticality of VLM-driven reasoning: removing scene-graph or execution-history context reduces task success rates by 5–8 percentage points, and omitting structured pre-execution or reactive VLM checks drops overall recovery rates by over 10 points (Ahmad et al., 19 Mar 2025).
6. Model Design, Training, and Practical Integration
State-of-the-art VLM-based failure predictors share several architectural and training design patterns:
- Multimodal Backbones: Transformers with vision (ViT/CNN), text encoders, and fusion layers, with optional heads for task-specific outputs (classification, generation, symbol-code).
- Joint Objectives: Training mixes failure detection (classification), recovery action generation (regression or token-wise generation), and auxiliary tasks (e.g., visual symbol alignment). Weighted loss functions allow for multi-task fine-tuning (Zeng et al., 2 Dec 2025).
- Failure Data Generation: Data for supervised learning is constructed via systematic perturbation of expert demonstrations (FailGen, procedural simulation in RLBench/Maniskill/gridworlds), producing labeled failure cases and associated recovery actions at sub-task or keyframe granularity (Duan et al., 2024, Lin et al., 2 Oct 2025, Hao et al., 3 Oct 2025).
- Policy and Planner Coupling: In real-time settings, policies are designed to accept mid-execution BT/plan modifications, and may interface with symbol-generation outputs for vision-guided actuation adjustment (Zeng et al., 2 Dec 2025, Mei et al., 2024).
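The perturbation-based data generation described above can be sketched as follows. The specific perturbation types (dropping a keyframe, reordering steps, corrupting target parameters) and the labeling scheme are illustrative assumptions, not the exact FailGen procedure.

```python
import random

# Assumed perturbation library: each entry corrupts an expert trajectory
PERTURBATIONS = {
    "drop_keyframe": lambda traj: traj[:-1],                      # truncate final step
    "swap_steps":    lambda traj: traj[1:2] + traj[0:1] + traj[2:],  # reorder start
    "offset_target": lambda traj: [s + "_offset" for s in traj],  # corrupt parameters
}


def generate_failure_cases(demos, seed=0):
    """Perturb each expert demo into a labeled failure case with a recovery hint."""
    rng = random.Random(seed)
    cases = []
    for demo in demos:
        name = rng.choice(sorted(PERTURBATIONS))
        cases.append({
            "trajectory": PERTURBATIONS[name](demo),
            "label": "failure",
            "failure_type": name,
            "recovery": demo,   # the expert demo serves as the recovery target
        })
    return cases
```

Pairing each perturbed trajectory with its source demonstration gives the supervised signal for both failure classification and recovery-action generation.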
Design constraints and tradeoffs—such as VLM inference latency (typically 150 ms–2 s/query), risk-vs-speed threshold tuning, and the modularity of scene graph/history context—govern deployment in real-world or closed-loop scenarios (Ahmad et al., 19 Mar 2025, Mei et al., 2024).
7. Scope, Limitations, and Future Directions
VLM-based failure prediction and planning frameworks have demonstrated the ability to:
- Generalize recovery strategies to complex, visually novel, and unenumerated failure modes,
- Integrate with both BT/reactive and symbolic/classical planners,
- Drive improvements in sample efficiency, recovery speed, and task success across physical and simulated robots.
Notable current limitations include reduced performance under severe occlusion or high-dimensional dynamic relations, reliance on visual perception quality, and (in some cases) constraints on long-horizon symbolic reasoning (mitigated via VLM–PDDL hybrid methods). Real-world transferability from simulation-grown failure datasets remains an active research area, as does the reduction of inference latency and the integration of tactile/multimodal sensors for richer anomaly detection (Ahmad et al., 2024, Ahmad et al., 19 Mar 2025, Lin et al., 2 Oct 2025, Hao et al., 3 Oct 2025).
Emerging directions include proactive failure anticipation before error manifestation (Ma et al., 5 Jan 2026), closed-loop active uncertainty reduction (Lafon et al., 10 Jul 2025), self-improving memory integration (Wu et al., 27 May 2025), and automated synthesis of formal domain rules for planning from visual input (Hao et al., 3 Oct 2025). These trends collectively move toward robust, adaptive robotic autonomy enabled by deeply coupled vision-language-reasoning systems.