Control-Grounded Visual Evidence Budgeting

Updated 4 June 2026

The paper introduces control-grounded visual evidence budgeting as a framework that explicitly allocates visual cues via token-level, frame-level, or object-level gating to reduce spurious correlations and hallucinations.
It employs strategies like dual-stream fusion, attention routing, and reinforcement learning to balance model accuracy with stringent resource constraints.
Reported results include retaining over 98% of task accuracy under sparse frame budgets, less than half the memory footprint in streaming 3D perception, and up to 85% reduction in hallucination rates.

Control-grounded visual evidence budgeting is a family of mechanisms and architectural strategies that explicitly regulate the amount and type of visual evidence accessible to a vision-language or vision-language-action model, conditioning model outputs on budgeted, high-fidelity visual cues rather than unbounded or undifferentiated visual context. This paradigm targets two intertwined objectives: maximizing reasoning faithfulness (i.e., grounding outputs strictly in verifiable visual evidence) and optimizing computational efficiency (i.e., controlling the volume, diversity, or saliency of processed visual inputs). Solutions implement this budgeting at various granularities—per-token, per-frame, per-object, or per-modal-source—using explicit gates, budgeted selectors, attention routing, or dual-stream fusions. These techniques are prominent in contemporary vision-language reasoning, spatio-temporal understanding, video-based question answering, streaming geometry models, and visual policy learning for embodied agents.

1. Theoretical Foundations and Motivation

Control-grounded visual evidence budgeting arises from the observation that unrestricted access to all visual context in complex settings (images, videos, video streams) often induces spurious correlations, hallucination, and inefficiency in language generation and multimodal reasoning. Foundational work establishes that the key challenge is to algorithmically and adaptively allocate a limited “budget” of visual evidence—where “budget” refers either to memory footprint (e.g., token count), representational support (e.g., geometric prototypes), or routing bandwidth (e.g., selected RoIs)—such that the model preserves both informativeness and faithfulness (Bangde et al., 28 Apr 2026, Xu et al., 8 Mar 2026).

Formally, budgeting is cast as a constrained optimization that maximizes downstream performance (e.g., accuracy or action success) subject to a tight resource cap, for instance:

$\max_{S}\ \text{Task-Utility}(S)\quad\text{s.t.}\quad |S| \leq B$

where $S$ is the selected evidence set and $B$ is the prescribed budget. In tasks such as streaming 3D understanding, budgeting is necessary to maintain stability and coverage of spatial representations as the input sequence grows without bound (Xu et al., 8 Mar 2026). In generative language grounding, budgeting mitigates language-prior dominance and reduces hallucination rates, especially under ambiguous or visually impoverished scenarios (Bangde et al., 28 Apr 2026). Reinforcing budget adherence through explicit loss terms, cross-stream gates, or reward shaping further grounds model outputs in physically or perceptually available evidence.
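The constrained selection above can be illustrated with a minimal sketch. It assumes each evidence item carries an independent, additive utility score, in which case keeping the `budget` highest-scoring items is optimal. The function name `select_under_budget` and the example scores are hypothetical and are not taken from any of the cited works.

```python
def select_under_budget(utilities, budget):
    """Return indices of the `budget` highest-utility evidence items.

    Assumption: utilities are additive and independent, so a plain top-B
    selection solves max Task-Utility(S) subject to |S| <= B.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank item indices from highest to lowest utility.
    ranked = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    # Keep at most `budget` items and report them in their original order.
    return sorted(ranked[:budget])


utilities = [0.1, 0.9, 0.4, 0.7, 0.2]  # hypothetical per-item scores
selected = select_under_budget(utilities, budget=2)
```

When utilities interact (for example, two near-duplicate frames), plain top-B selection is no longer optimal, which is one reason the methods surveyed here add diversity or coverage terms.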

2. Decoding and Fusion Mechanisms

A principal instantiation of control-grounded budgeting in vision-language models is the dual-distribution stream and contrastive fusion approach exemplified by Instruction–Evidence Contrastive Dual-Stream Decoding (IECD²) (Bangde et al., 28 Apr 2026). At each generation step $t$, IECD² constructs:

An instruction-driven token distribution, $p_t^{(I)}(v)$, capturing the output distribution conditioned on an open-ended instruction prompt.
An evidence-driven token distribution, $p_t^{(E)}(v)$, conditioned on a visual-evidence-only prompt.

The token-level fusion is

$p_t^{(\mathrm{IECD}^2)}(v)\propto \bigl(p_t^{(I)}(v)\bigr)^{g_t} \bigl(p_t^{(E)}(v)\bigr)^{1-g_t}$

where the gating scalar $g_t$ is computed adaptively from the symmetric (bidirectional) KL divergence between the two distributions. As the distributions diverge (i.e., when tokens are not mutually supported), $g_t$ shrinks, which upweights the evidence stream and effectively “budgets out” hallucinated or unsupported content.

This dynamic fusion operationalizes a soft, content-aware visual evidence budget, ensuring that highly fluent but weakly grounded tokens are suppressed in favor of those underpinned by the image. The method operates purely at the prompt and decoding level and requires no model retraining.
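The token-level fusion can be sketched as follows. The geometric-mixture form follows the formula given in this section, but the mapping from symmetric KL divergence to the gate (`exp(-sharpness * sym_kl)`) is an assumption made for illustration, since the exact gate function is not specified here. The names `fuse` and `sharpness` are hypothetical.

```python
import math

EPS = 1e-12  # numerical floor to avoid log(0) and 0 ** 0 edge cases


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q) if pi > 0)


def fuse(p_instr, p_evid, sharpness=1.0):
    """Fuse instruction and evidence distributions with an adaptive gate.

    Assumption: gate = exp(-sharpness * symmetric KL), so the gate is 1 when
    the streams agree and decays toward 0 as they diverge.
    """
    sym_kl = kl(p_instr, p_evid) + kl(p_evid, p_instr)
    gate = math.exp(-sharpness * sym_kl)
    # Geometric mixture: p_I ** g * p_E ** (1 - g), then renormalize.
    unnorm = [(pi + EPS) ** gate * (pe + EPS) ** (1.0 - gate)
              for pi, pe in zip(p_instr, p_evid)]
    total = sum(unnorm)
    return [u / total for u in unnorm], gate


# Hypothetical three-token vocabulary: the streams disagree on tokens 0 and 2.
fused, gate = fuse([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

In this example the streams disagree strongly, so the gate is small and the fused distribution favors the token preferred by the evidence stream.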

3. Budgeting Architectures in Streaming and Video Reasoning

Budgeting methodologies in video and streaming domains diverge structurally from dual-stream fusion, leveraging explicit evidence selection modules, resource-allocators, and memory banks.

3.1 Evidence Grounding Modules and Anchoring

The Chain of Evidence (CoE) framework (Huang et al., 12 Jan 2026) introduces a lightweight Evidence Grounding Module (EGM) that, given a sequence of $N$ video frame features and a query, projects the query into $K\ll N$ "evidence queries." These are used in a cross-attention block to select $K$ high-fidelity frames, forming an explicit temporal evidence budget. The EGM is supervised both for anchor fidelity (via a BCE loss on frame selection against ground truth) and for downstream output correctness. Evidence-budget enforcement is either explicit (fixed $K$) or softly penalized (additional Lagrangian terms).

A composite reward, integrating correctness, anchor alignment, and reasoning-process fidelity, is used in reinforcement learning for process-aligned policy shaping. Empirical ablation demonstrates that even extreme sparsification (a small evidence budget vs. all frames) maintains over 98% of task accuracy at drastically reduced inference costs.
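A simplified sketch of frame selection with $K$ evidence queries and the anchor-fidelity loss is shown below. It replaces the learned cross-attention block with plain dot-product scoring, which is an assumption for illustration only. The function names `select_evidence_frames` and `anchor_bce` and all feature values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_evidence_frames(frame_feats, evidence_queries):
    """Pick K frames, where K is the number of evidence queries.

    Assumption: a frame's score is its best dot-product match against any
    evidence query (a stand-in for learned cross-attention weights).
    """
    if not evidence_queries:
        raise ValueError("at least one evidence query is required")
    k = min(len(evidence_queries), len(frame_feats))
    scores = [max(dot(q, f) for q in evidence_queries) for f in frame_feats]
    ranked = sorted(range(len(frame_feats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


def anchor_bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between selection probabilities and ground-truth anchors."""
    terms = [-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
             for p, y in zip(probs, labels)]
    return sum(terms) / len(terms)


# Five hypothetical 2-D frame features and K = 2 evidence queries.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1], [0.0, 0.2]]
queries = [[1.0, 0.0], [0.0, 1.0]]
selected, scores = select_evidence_frames(frames, queries)
loss = anchor_bce([0.9, 0.1], [1, 0])
```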

3.2 Hierarchical and Token-Level Budgeting

Techniques such as Triage (Wang et al., 30 Jan 2026) employ a two-stage, greedy resource allocator:

Frame-Level Budgeting: Frames are scored based on scene change, motion intensity, and query relevance. Temporal bucketing guarantees diversity. The allocated keyframes’ scores serve as priors for token-level selection.
Token-Level Budgeting: Tokens within selected frames are divided between core (most relevant) and context (diverse background or relay tokens), employing Maximal Marginal Relevance (MMR) for diversity.

This framework provides fine-grained control over both temporal and spatial evidence allocation. The adaptive prior ensures that the model does not concentrate on temporally narrow events, and the MMR-based context phase ensures local visual diversity while respecting per-frame budgets.
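The two stages can be sketched in simplified form: temporal bucketing for keyframes, followed by Maximal Marginal Relevance for tokens. This is a generic illustration under stated assumptions and not the Triage implementation. The bucket layout, the trade-off weight `lam`, and the function names are hypothetical.

```python
import math


def bucketed_keyframes(frame_scores, num_buckets):
    """Keep the best-scoring frame from each contiguous temporal bucket."""
    n = len(frame_scores)
    picks = []
    for b in range(num_buckets):
        start, end = b * n // num_buckets, (b + 1) * n // num_buckets
        if start < end:  # skip empty buckets when num_buckets > n
            picks.append(max(range(start, end), key=lambda i: frame_scores[i]))
    return picks


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mmr_select(features, relevance, budget, lam=0.7):
    """Greedy MMR: trade relevance against similarity to already selected tokens."""
    selected, candidates = [], list(range(len(features)))
    while candidates and len(selected) < budget:
        def score(i):
            redundancy = max((cosine(features[i], features[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


keyframes = bucketed_keyframes([0.1, 0.9, 0.8, 0.2, 0.3, 0.7], num_buckets=2)
tokens = mmr_select([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]],
                    relevance=[0.9, 0.85, 0.5], budget=2, lam=0.5)
```

In the example, bucketing keeps one frame from each half of the clip even though the two highest-scoring frames are adjacent, and MMR skips the near-duplicate second token in favor of a less relevant but distinct one.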

3.3 Streaming Memory and Geometric Support

FrameVGGT (Xu et al., 8 Mar 2026) addresses streaming 3D perception where unlimited KV-cache growth is infeasible. Frames’ KV tokens are grouped as evidence blocks, summarized into prototypes, and retained in a rolling fixed-capacity memory via a k-center covering strategy over key-space cosine dissimilarity. This directly budgets both the number of frame representations and their geometric diversity, anchoring retention in local geometry coherence.

Empirical results verify strong trade-offs: FrameVGGT achieves superior 3D accuracy with less than half the memory footprint of full-token-streaming baselines.
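A k-center covering step over cosine dissimilarity can be approximated with the classic farthest-first greedy heuristic, sketched below. This is a generic illustration and not FrameVGGT's exact retention rule. Seeding with the first prototype and the name `k_center_greedy` are assumptions.

```python
import math


def cosine_dissim(a, b):
    """1 - cosine similarity; zero vectors are treated as maximally dissimilar."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)


def k_center_greedy(keys, capacity):
    """Select up to `capacity` prototype indices that cover key space.

    Farthest-first traversal: repeatedly add the prototype that is farthest
    from its nearest already retained prototype.
    """
    if not keys or capacity <= 0:
        return []
    centers = [0]  # assumption: seed with the first (oldest) prototype
    min_dist = [cosine_dissim(k, keys[0]) for k in keys]
    while len(centers) < min(capacity, len(keys)):
        remaining = [i for i in range(len(keys)) if i not in centers]
        nxt = max(remaining, key=lambda i: min_dist[i])
        centers.append(nxt)
        # Update each prototype's distance to its nearest retained center.
        for i, k in enumerate(keys):
            min_dist[i] = min(min_dist[i], cosine_dissim(k, keys[nxt]))
    return centers


# Four hypothetical 2-D key prototypes; the second nearly duplicates the first.
prototypes = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [-1.0, 0.0]]
kept = k_center_greedy(prototypes, capacity=3)
```

With a capacity of three, the near-duplicate prototype is the one dropped, which illustrates how a covering objective budgets for diversity as well as count.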

4. Role-Aware and Adaptive Token Selection

Recent approaches introduce role-aware and token-adaptive evidence budgeting at a finer scale.

SemVID (Li et al., 5 Mar 2026): Designed for video temporal grounding, SemVID defines “Evidence Retention” and “Connectivity Strength” as metrics for budgeting. Per-frame token allocations blend query relevance and inter-frame transitions through a weighted mixing coefficient. Tokens are classified into roles (object, motion, and context) using scoring and MMR, guaranteeing that critical cues (e.g., event boundaries) and cross-frame attention chains persist despite aggressive pruning. This role-aware evidence chain enables the model to preserve semantic and temporal information flow, yielding near-oracle performance with <15% token retention.
GeoWeaver (Miao et al., 21 May 2026): GeoWeaver budgets geometric evidence per visual token using a learned router over a multi-level geometry bank. Each token selects the top-$k$ geometry layers as most pertinent, forming a token-adaptive, budgeted geometric representation that is injected via residual fusion. This contrasts with uniform fusion (all layers), which dilutes structural cues and inflates computation. Empirically, GeoWeaver demonstrates large spatial reasoning gains with minimal overhead, and ablations confirm that the default $k$ is optimal.
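Token-adaptive top-$k$ routing with residual fusion, as described for GeoWeaver, can be sketched as follows. The softmax renormalization over the selected layers, the function names, and all numeric values are assumptions for illustration. In a real model the router logits would come from a learned module.

```python
import math


def route_top_k(router_logits, k):
    """Return {layer_index: weight} for the k highest-scoring geometry layers."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected layers (assumed normalization).
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def fuse_token(token, geometry_bank, router_logits, k):
    """Residual fusion: token + weighted sum of its top-k geometry features."""
    weights = route_top_k(router_logits, k)
    fused = list(token)
    for layer, w in weights.items():
        for d, g in enumerate(geometry_bank[layer]):
            fused[d] += w * g
    return fused


# One hypothetical 2-D token and a four-layer geometry bank.
token = [1.0, 1.0]
bank = [[10.0, 10.0], [2.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
fused = fuse_token(token, bank, router_logits=[0.0, 2.0, 2.0, -1.0], k=2)
```

Only layers 1 and 2 contribute here. The other layers are never read for this token, which is the source of the compute savings relative to uniform fusion.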

5. Budgeted Visual Grounding for Control and Policy Learning

Explicit evidence budgeting mechanisms are now integrated into vision-language-action (VLA) models and policy learning pipelines.

S2 (See Less, Specify More) (Wu et al., 1 Jun 2026): S2 imposes a learned, differentiable “soft” visual evidence budget per camera view and timepoint. Each image patch is gated by a temperature-scaled sigmoid MLP, with a global budget penalty that pulls the average keep rate toward a fixed target. Both ungated and gated paths are jointly supervised to preserve control behavior under the compressed input. This budget bottleneck, enforced during both training and inference, forces the policy to extract a minimal sufficient statistic from the visual field, reducing distraction and supervision aliasing. Empirical results on both simulated and real-robot tasks confirm that evidence-budgeted policies substantially outperform dense-evidence baselines, especially under domain perturbations.
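The soft per-patch gate and the budget penalty can be sketched as below. The squared-error form of the penalty is an assumption, since only its purpose (matching the average keep rate to a target) is described here. The patch logits stand in for the output of the gating MLP, and all names and values are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_patches(patch_logits, temperature=1.0):
    """Soft keep-probability per patch from temperature-scaled logits."""
    return [sigmoid(z / temperature) for z in patch_logits]


def budget_penalty(gates, target_keep_rate):
    """Assumed penalty: squared gap between mean keep rate and the target."""
    mean_keep = sum(gates) / len(gates)
    return (mean_keep - target_keep_rate) ** 2


def apply_gates(patches, gates):
    """Scale each patch feature vector by its gate value."""
    return [[g * x for x in patch] for patch, g in zip(patches, gates)]


# Three hypothetical patches; logits would come from the gating MLP.
logits = [4.0, 0.0, -4.0]
gates = gate_patches(logits, temperature=1.0)
penalty = budget_penalty(gates, target_keep_rate=0.25)
gated = apply_gates([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], gates)
```

Lowering the temperature pushes the gates toward 0 or 1, so the soft budget approaches a hard keep-or-drop decision while remaining differentiable during training.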

6. Object-Centric and Routing-Based Evidence Budgeting

Object-centric routing frameworks have emerged for multi-image grounded reasoning.

ROVER (Lv et al., 27 May 2026): ROVER attaches an explicit, constant-length (three-token) visual evidence budget to each object grounding step. When the base LLM issues a grounding pattern, ROVER generates an RoI summary, distills intra-image cues via object-centric differential attention, and integrates historical context through a cross-attention “weave”. Evidence tokens are appended to the language stream and budgeted independently of box or image size, tightly capping decoding overhead. The entire selection and routing process is end-to-end differentiable; empirically, ROVER achieves both accuracy and substantial computational gains over RoI-feature injection alternatives by ensuring evidence use remains within a hard rolling budget.

7. Generalization, Efficiency, and Future Directions

Control-grounded visual evidence budgeting methods deliver concurrent improvements in reasoning faithfulness, robustness to distractors, and efficiency. They reduce hallucination rates (e.g., an 85% reduction in MS-COCO CHAIR under IECD² (Bangde et al., 28 Apr 2026)) and sustain accuracy at inference budgets 4–8x smaller than unbudgeted processing (Huang et al., 12 Jan 2026, Li et al., 5 Mar 2026). Hard-budgeted and soft-budgeted approaches are applicable to streaming, video, 3D, control, and object-centric settings, and most admit plug-and-play integration with existing models.

Open research directions include learning dynamic, per-task or per-instance budgets; role-aware allocation beyond semantic class (e.g., saliency, uncertainty, or control-theoretic sufficiency); extension to multi-modal streams (audio, point cloud); and tight coupling between selector modules and downstream reasoning loss. Robust optimization under high-level constraints, especially in RL settings, suggests ongoing value in explicit, process-aligned evidence anchoring protocols.

The corpus indicates broad consensus that allocating, gating, or routing visual evidence according to adaptive, query-aware, or geometry-driven control signals is essential for scalable, robust, and faithful vision-language reasoning.