Coarse-to-Fine Active Perception Pipeline
- A coarse-to-fine active perception pipeline is a hierarchical approach that first captures broad, low-cost cues and then refines them with domain-aware modules.
- It employs iterative refinement, dynamic planning, and modular tool invocation to optimize resource usage, accuracy, and explainability across modalities.
- Empirical evaluations in audio, vision, and robotics show enhanced performance over static methods in accuracy, efficiency, and transparent reasoning.
A coarse-to-fine active perception pipeline is a paradigm for intelligent information acquisition, where perception and reasoning are performed in a progressive hierarchy: initial broad, low-cost cues are acquired, then incrementally refined through domain-aware modules and iterative decision-making, until sufficient fine-grained evidence is gathered for task completion or semantic inference. This approach explicitly decouples "coarse" global perception from "fine" task-specific inspection, and leverages dynamic planning to optimize resource usage, accuracy, and explainability. It is instantiated in recent works for audio reasoning, image manipulation localization, multi-modal understanding, robot manipulation, and navigation across vision-language and embodied AI domains.
1. Architectural Foundations and General Principles
The coarse-to-fine active perception pipeline organizes the perception process in hierarchical stages, typically with a multi-agent or modular architecture. Across domains, key principles include:
- Decoupled Perception and Cognition: Perceptual input (audio, image, 3D, etc.) is first lifted into a coarse representation, often in the language domain, before higher-level reasoning agents act (Rong et al., 21 Sep 2025).
- Iterative Refinement Loop: An active planning agent orchestrates iterative cycles of gap diagnosis, tool invocation, and evidence integration. Stages terminate either by explicit sufficiency or fixed budget (Rong et al., 21 Sep 2025, Tao et al., 29 Dec 2025).
- Modular Tool Invocation: Specialized agents or planners dynamically choose among tool-augmented routes (e.g., QA, ASR, segmentation, event localization) to extract evidence tailored to the current reasoning gap (Rong et al., 21 Sep 2025, Tao et al., 29 Dec 2025).
- Memory or Contextual Fusion: Feature fusion modules actively recall long-term priors (learned patterns) and fuse them with real-time cues for context-sensitive fine-grained localization or reasoning (Guo et al., 25 Nov 2025).
This hierarchical and proactive organization sharply distinguishes active pipelines from static “single-pass” perception strategies.
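The decoupling of perception and cognition can be sketched as a perception stage that lifts raw input into a coarse, language-domain representation, over which reasoning agents then operate. All names below are hypothetical, not from any cited system:

```python
# Illustrative sketch of decoupled perception and cognition.
# All function names are hypothetical, not from any cited system.

def coarse_perceive(raw_input: bytes) -> str:
    """Lift raw sensory input into a coarse, language-domain summary."""
    # A real system would call a captioner (audio or image); here we stub it.
    return f"coarse caption of {len(raw_input)} bytes of sensory input"

def reason(coarse_text: str) -> dict:
    """Cognitive agents operate purely on the lifted representation."""
    return {"summary": coarse_text, "needs_refinement": "coarse" in coarse_text}

observation = coarse_perceive(b"\x00" * 16)
decision = reason(observation)
```

Because cognition sees only text, the same reasoning agents can be reused across modalities by swapping the perception front-end.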
2. Instantiations in Audio, Vision, and Multimodal Domains
Recent systems exemplify this paradigm in diverse settings:
Audio Reasoning: AudioGenie-Reasoner (AGR) (Rong et al., 21 Sep 2025) employs five agents (captioner, planning, interaction, augmentation, answerer) to iteratively refine an initial coarse caption of audio into a detailed evidence chain using plug-and-play tools (Audio-QA, ASR, guided recaptioning).
Image Manipulation Localization: BoxPromptIML (Guo et al., 25 Nov 2025) acquires weak box prompts as coarse annotations, generates high-fidelity pseudo-masks via SAM, and distills knowledge into a student model. Its Memory-Guided Gated Fusion Module adaptively integrates multi-scale features and long-term memory for fine segmentation, maintaining efficiency and generalization with minimal annotation cost.
Audio-Visual Understanding: OmniAgent (Tao et al., 29 Dec 2025) exploits audio cues for “coarse” event localization—temporal windows—then selectively allocates higher-resolution QA tools to corresponding video segments for “fine” reasoning, managed via a recursive Think–Act–Observe–Reflect planning loop.
Object Detection: CF-DETR (Shin et al., 29 May 2025) divides detection into coarse-pass processing of large, critical objects with low latency and fine-pass refinement of ambiguous or small regions. A real-time NPFP scheduling framework partitions and batches subtasks to meet strict deadlines for safety-critical operations, boosting accuracy for both critical and overall detection metrics.
3D Robotic Manipulation: ActiveVLA (Liu et al., 13 Jan 2026) conducts coarse critical region localization via multi-view projections and VLM-produced heatmaps, followed by active pose sampling and 3D zoom-in for resolution enhancement in key regions. Integrating these refined views enables superior manipulation precision in complex environments.
Vision-Language Navigation: SLAM-free pipelines (Zhao et al., 25 Sep 2025) fuse hierarchical scene and object-level cues to build a semantic-probabilistic topological map for task-driven navigation. Coarse subgoal selection is conducted via LLM-based reasoning over the topological graph, with fine local planning via vision-based obstacle avoidance.
Large-Scale 3D Scene Reasoning: SpatialReasoner (Zheng et al., 2 Dec 2025) employs supervised cold start to learn spatial tool syntax (“zoom_in”, “render_view”), then reinforcement learning with adaptive exploration reward to efficiently traverse a hierarchical BEV pyramid (floor, room, close-up) for large-scale 3D VQA with minimal tool usage.
3. Core Algorithmic Components and Mathematical Formulation
The fundamental algorithmic structure is a loop alternating coarse observation with fine action:
- At $t = 0$: acquire a global summary or prompt-driven attention map $o_0$ (caption, grid, event list, annotation).
- At iteration $t$, the decision agent assesses a sufficiency score $s_t$ against a threshold $\tau$:
  - If $s_t < \tau$: formulate a plan $p_t$ for tool invocation.
  - The augmentation agent executes $p_t$, and the resulting evidence is appended to the chain.
- When $s_t \geq \tau$ or $t$ reaches the maximum budget $T_{\max}$: output the triple $(a, c, r)$,

where $a$ is the answer, $c$ the confidence margin, and $r$ the explicit rationale (Rong et al., 21 Sep 2025).
In vision pipelines, adaptive fusion employs channel-wise gating over long-term memory banks together with weighted residual attention, blending recalled priors with real-time features on a per-channel basis (Guo et al., 25 Nov 2025).
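A minimal numeric sketch of channel-wise gated fusion follows. The shapes and formulation are illustrative of the general mechanism, not the exact BoxPromptIML module: a sigmoid gate $g_c$ blends a recalled memory feature with the current feature per channel, $\mathrm{fused}_c = g_c \cdot \mathrm{mem}_c + (1 - g_c) \cdot \mathrm{cur}_c$.

```python
import math

def gated_fusion(current, memory, gate_logits):
    """Channel-wise gated fusion: fused_c = g_c * memory_c + (1 - g_c) * current_c,
    with g_c = sigmoid(gate_logits_c). Purely illustrative of the mechanism."""
    fused = []
    for c, m, z in zip(current, memory, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))   # per-channel gate in (0, 1)
        fused.append(g * m + (1.0 - g) * c)
    return fused

# A large positive logit trusts memory; a large negative logit trusts the
# current observation; a zero logit mixes them equally.
out = gated_fusion(current=[1.0, 1.0, 1.0], memory=[0.0, 0.0, 0.0],
                   gate_logits=[10.0, -10.0, 0.0])
```

In a trained model the gate logits are themselves predicted from the features, so the network learns per-channel when to trust long-term priors versus real-time cues.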
CF-DETR’s scheduler optimizes critical recall and accuracy under hard latency constraints by partitioning detection into coarse and fine subtasks and batching them so that every frame meets its deadline (Shin et al., 29 May 2025).
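The budget logic can be illustrated with a simplified sketch: a coarse pass always runs, and fine refinement is granted only to the most ambiguous regions while the latency budget holds. The cost model below is assumed for illustration and is not CF-DETR's actual NPFP scheduler:

```python
# Illustrative deadline-aware scheduling of a coarse pass plus selective
# fine refinement. Costs are an assumed model, not CF-DETR's NPFP scheduler.

COARSE_COST, FINE_COST = 5.0, 3.0  # ms per coarse pass / per refined region

def schedule(regions, deadline_ms):
    """Always run the coarse pass; refine ambiguous regions, most ambiguous
    first, only while the remaining latency budget allows it."""
    spent = COARSE_COST
    refined = []
    # Sort candidates by descending ambiguity so the budget goes furthest.
    for region in sorted(regions, key=lambda r: -r["ambiguity"]):
        if region["ambiguity"] < 0.5:
            continue                      # confident enough after coarse pass
        if spent + FINE_COST > deadline_ms:
            break                         # refinement would miss the deadline
        spent += FINE_COST
        refined.append(region["id"])
    return refined, spent

regions = [{"id": "a", "ambiguity": 0.9}, {"id": "b", "ambiguity": 0.2},
           {"id": "c", "ambiguity": 0.7}, {"id": "d", "ambiguity": 0.6}]
refined, spent = schedule(regions, deadline_ms=12.0)
```

Under a 12 ms deadline, regions "a" and "c" are refined and "d" is dropped, showing how the fine pass degrades gracefully instead of violating timing guarantees.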
SpatialReasoner (Zheng et al., 2 Dec 2025) uses adaptive exploration reward to balance exploration vs. redundancy as a multi-term RL objective.
4. Tool-Augmented Routes and Evidence Integration
Coarse-to-fine pipelines depend on modular, callable routes:
- Audio: Guided recaptioning, question answering, ASR (Rong et al., 21 Sep 2025, Tao et al., 29 Dec 2025).
- Vision: SAM-based masks, Tiny-ViT feature fusion, multi-level grid overlays, pose selection (Guo et al., 25 Nov 2025, Liu et al., 13 Jan 2026, Sripada et al., 2024).
- Topological and Spatial Tools: “zoom_in” and “render_view” commands for hierarchical exploration (Zheng et al., 2 Dec 2025, Zhao et al., 25 Sep 2025).
- Scheduling and Batch Processing: Selective triggering, region partitioning, optimal batching (Shin et al., 29 May 2025, Ginargiros et al., 2023).
Evidence is incrementally woven into a shared textual, topological, or embedding-based representation, ensuring all agents operate and reason over compatible formats.
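The modular-route idea amounts to a registry that maps diagnosed gap types to callable tools, with each result integrated into the evidence chain. The route names and API below are illustrative, not those of any cited system:

```python
# Hypothetical tool-route registry: the planner names a gap type, the
# dispatcher selects the matching route, and its output is appended to
# the evidence chain. Route names are illustrative, not a real API.

TOOL_ROUTES = {
    "speech":   lambda q: f"[ASR] transcript for: {q}",
    "event":    lambda q: f"[event-localization] windows for: {q}",
    "region":   lambda q: f"[segmentation] mask for: {q}",
    "question": lambda q: f"[QA] answer for: {q}",
}

def dispatch(gap_type, query, evidence_chain):
    """Invoke the route registered for gap_type and integrate its evidence."""
    route = TOOL_ROUTES.get(gap_type)
    if route is None:
        raise KeyError(f"no tool route registered for gap type {gap_type!r}")
    evidence_chain.append(route(query))
    return evidence_chain

chain = dispatch("speech", "what is said at 0:42?", ["coarse caption"])
```

Because every route returns evidence in the same (here, textual) format, new tools can be plugged in without changing the planner or the integration logic.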
5. Empirical Evaluation and Benchmarking
Systematic quantitative evaluations demonstrate the benefits:
- AudioGenie-Reasoner (Rong et al., 21 Sep 2025): 72.60% accuracy on MMAU-mini; 58.85% on MMAR; ablations show 9–17 point drops without iterative refinement; state-of-the-art over the best open-source and proprietary audio reasoning models.
- BoxPromptIML (Guo et al., 25 Nov 2025): 0.619 (IND F1) with weak annotation vs 0.648 (fully-supervised); OOD F1 0.285; memory/gating ablations show nontrivial accuracy losses.
- CF-DETR (Shin et al., 29 May 2025): near-maximal critical mAP (≥96%) and mAP gains of +6–20% over DNN-SAM; firm real-time guarantees on autonomous-vehicle workloads; batch-level speedups.
- ActiveVLA (Liu et al., 13 Jan 2026): 91.8% RLBench success; 65.9% COLOSSEUM; 51.3% GemBench; ablations show active view selection and zoom-in essential for optimal performance.
- OmniAgent (Tao et al., 29 Dec 2025): 82.71% Daily-Omni QA accuracy, 10–20 points above baselines, with roughly 2× lower token usage and runtime.
- SpatialReasoner (Zheng et al., 2 Dec 2025): 0.682 overall accuracy using 4.12 tool calls on average, versus 0.588–0.652 accuracy and 16+ rendered images for passive baselines.
- AP-VLM (Sripada et al., 2024): 5/5 semantic query success in complex scenes vs 0/5 for passive/fixed-camera baselines.
6. Design Principles, Limitations, and Extensions
Robust coarse-to-fine pipelines adhere to several recurring design patterns:
- Explicit separation of sensing (coarse bias) and reasoning (cognitive refinement), enabling plug-and-play across diverse tasks (Rong et al., 21 Sep 2025).
- Diagnose → Plan → Act cycle with agent modularity for gap detection, route selection, and evidence collection (Tao et al., 29 Dec 2025).
- Active querying (audio localization, grid overlays, spatial tool invocation) rather than passive feedforward pipelines (Zheng et al., 2 Dec 2025, Liu et al., 13 Jan 2026).
- Textual or feature-space evidence integration for unified reasoning (Guo et al., 25 Nov 2025, Rong et al., 21 Sep 2025).
- Termination by sufficiency or bounded iteration budget, balancing thoroughness and computational cost (Rong et al., 21 Sep 2025, Zheng et al., 2 Dec 2025).
Identified limitations include discretization of orientations or region proposals, restriction to axis-aligned rectangles, sim-to-real generalization gaps, the compute intensity of RL training, and the need for richer uncertainty estimation or continuous tool parameterization (Guo et al., 25 Nov 2025, Zheng et al., 2 Dec 2025, Zhu et al., 27 May 2025).
Articulated extension avenues include adaptive grid resolutions, multi-view fusion, continuous orientation or spatial search (e.g., in quaternion space), few-shot RL, information-theoretic planning, and extension to medical, video, or high-density domains.
7. Impact, Future Directions, and Related Methodologies
The coarse-to-fine active perception paradigm has driven substantial advances in state-of-the-art accuracy, sample efficiency, and explainability across audio, vision, and embodied AI tasks. Empirical results confirm its superiority to passive or static single-pass models, especially in cases of weak annotation, resource constraints, real-time deadlines, or large-scale exploration.
Continued research is focused on scaling these paradigms to broader domains, improving RL sample efficiency, enriching agent modularity, and generalizing to multi-agent, interactive task settings. The explicit demonstration of coarse-to-fine reasoning and evidence chain construction is recognized as instrumental in both practical deployment and in building transparent, interpretable AI systems in complex environments.