Coarse-to-Fine Reasoning in MLLMs
- Coarse-to-fine reasoning is a hierarchical process that refines broad, global outputs into detailed, context-rich representations in multi-modal models.
- Techniques involve multi-stage visual grounding, attention reweighting, chain-of-thought decomposition, and precise region segmentation.
- Empirical studies show enhanced accuracy and efficiency in visual reasoning, segmentation, and video reasoning tasks across diverse benchmarks.
Coarse-to-fine reasoning in multi-modal LLMs (MLLMs) denotes a hierarchical processing paradigm where broad, contextually coarse representations or responses are initially generated and then iteratively refined into fine-grained, detail-rich outputs. This approach has emerged as a core strategy across vision-language and sequential reasoning tasks, directly motivated by the structural and perceptual challenges unique to high-dimensional multi-modal data and the non-trivial alignment of language with perceptual evidence.
1. Principles and Motivation
Coarse-to-fine reasoning addresses several core limitations in MLLMs: ambiguous perceptual grounding, hallucinations in logical inference, and insufficient focus on task-relevant fine details—especially where crucial information may be highly localized or temporally subtle. In computer vision contexts, model failures often stem from restricted spatial resolution or perceptual acuity, which limit attention to small, informative regions and lead to distraction by irrelevant or salient (but non-causal) features (Wang et al., 22 Dec 2024, Qiang et al., 6 Aug 2025). In sequential and logical reasoning, a monolithic reasoning trace can produce redundancy and mask errors until their accumulation yields incorrect outcomes (Hu et al., 23 Jan 2025).
The coarse-to-fine paradigm thus proceeds in two or more explicit stages: first, a global or high-level representation is computed—such as a broad scene embedding, a preliminary region of interest, a summary reasoning outline, or a set of answer candidates; subsequent modules or refinement steps then operate conditionally, narrowing focus via feature or attention reweighting, explicit segmentation, or chain-of-thought decomposition to deliver precision and fidelity at the desired granularity (Wang et al., 22 Dec 2024, Wang et al., 18 May 2025, Zhang et al., 24 Oct 2025).
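The two-stage structure described above can be sketched as a minimal inference skeleton. The functions below are hypothetical placeholders standing in for model-specific components (they are not APIs from the cited works); the point is only the control flow: a coarse global pass proposes a focus, and a fine pass conditions on it.

```python
# Minimal sketch of a generic coarse-to-fine inference loop.
# `propose_region` and `answer` are hypothetical stand-ins for
# model-specific components, not APIs from the cited papers.

def propose_region(image, question):
    # Coarse stage: return a rough bounding box (x0, y0, x1, y1)
    # around the evidence the question is likely about.
    return (0, 0, image["w"], image["h"])  # placeholder: whole image

def answer(image, question, focus=None):
    # Fine stage: produce an answer, optionally conditioned on a focus region.
    return {"text": "stub answer", "focus": focus}

def coarse_to_fine(image, question):
    region = propose_region(image, question)      # stage 1: global pass
    return answer(image, question, focus=region)  # stage 2: focused pass

result = coarse_to_fine({"w": 640, "h": 480}, "What is written on the sign?")
```

Concrete systems differ mainly in what the "focus" is (a box, a soft attention mask, a reasoning outline) and in how the fine stage consumes it.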
2. Computational Frameworks
A broad taxonomy of coarse-to-fine MLLM implementations can be summarized as follows:
Stage Structure
- Coarse Stage: Initial phase generates a global context or a weakly localized focus region. This may involve producing bounding boxes in images (Wang et al., 22 Dec 2024), context embeddings (Qiang et al., 6 Aug 2025), or high-level abstract meaning representations (AMR) for text (Nguyen et al., 2023).
- Fine Stage(s): Subsequent modules or inference routines apply fine-grained attention mechanisms, segmentation masks, chain-of-thought stepwise refinement (Wang et al., 18 May 2025), multi-agent feedback (Chen et al., 18 Sep 2024), or pixel/object/path-detailed localization (Han et al., 22 Oct 2025, Zhang et al., 24 Oct 2025).
Optimization Regimes
- Supervised Fine-Tuning (SFT): Models are first trained to replicate detailed chain-of-thought traces or evidence-focused region proposals from high-quality data (Wang et al., 18 May 2025).
- Reinforcement Learning (RL): Policies are further optimized using reward structures designed to penalize hallucination and reward visual or logical consistency at both coarse and fine levels (Wang et al., 18 May 2025, Zhang et al., 24 Oct 2025).
- Inference-Time Strategies: Some approaches (e.g., CoF) modify attention maps or region weighting at inference without any weight updates, offering model-agnostic compatibility (Wang et al., 22 Dec 2024).
Representative Algorithmic Patterns
- Two-stage visual grounding and attention reweighting (Wang et al., 22 Dec 2024).
- Chain-of-thought prompting with explicit coarse-to-fine step decomposition (Nguyen et al., 2023).
- Iterative refinement for “hard” cases, triggered by reward model heuristics, in multi-agent and reward-feedback architectures (Chen et al., 18 Sep 2024).
- Attention map fusion and multi-modal region segmentation via decomposed analysis (Han et al., 22 Oct 2025).
- Hierarchical reward modeling at multiple step granularities (Hu et al., 23 Jan 2025).
3. Applications and Empirical Effects
Coarse-to-fine techniques have demonstrated significant empirical gains and increased sample-efficiency across a range of multimodal and sequential reasoning tasks:
- Visual Reasoning: On visual benchmarks emphasizing subtle clues (mean clue area ≈0.25% of the image), coarse-to-fine architectures yield higher clue-coverage, answer accuracy, and reasoning consistency than single-resolution or monolithic methods (Qiang et al., 6 Aug 2025).
- Segmentation: In high-resolution images with tiny objects, multi-stage reasoning-segmentation pipelines substantially outperform baseline MLLMs, especially where attention must be focused post hoc through region-proposals or attention-guided SAM-style mask refinement (Zhang et al., 24 Oct 2025, Han et al., 22 Oct 2025).
- Video Reasoning: For complex, temporally-extended video QA, training pipelines combine cognition-inspired, “blind” preliminary logic with visual refinement stages followed by SFT and RL with semantic-consistency rewards, delivering state-of-the-art results on six video reasoning datasets (Wang et al., 18 May 2025).
- Mathematical and Logical Process Rewards: Hierarchically merging and then refining reasoning steps improves process reward models, reducing redundancy and enhancing both stepwise and global accuracy metrics (Hu et al., 23 Jan 2025).
- Language Understanding: In multi-domain natural language understanding (NLU), decomposing reasoning into coarse intent identification followed by fine slot/type labeling and logical form assembly improves exact match rates and robustness in zero/few-shot settings (Nguyen et al., 2023).
- Efficiency and Robustness: Selective coarse-to-fine application enables dynamic allocation of computational resources—easy instances are solved quickly, while only hard instances receive expensive, iterative attention (Chen et al., 18 Sep 2024).
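The NLU decomposition above (coarse intent, then fine slots) can be illustrated with a toy two-pass parser. The keyword-matching "classifiers" here are stubs standing in for LLM calls; intent names and slot labels are invented for the example.

```python
# Hypothetical two-pass NLU sketch: a coarse pass picks the intent,
# and a fine pass fills only the slots that intent defines.
# Keyword stubs stand in for actual LLM calls.

def coarse_intent(utterance: str) -> str:
    # Coarse stage: high-level intent classification.
    return "book_flight" if "fly" in utterance else "play_music"

def fine_slots(utterance: str, intent: str) -> dict:
    # Fine stage: slot extraction conditioned on the coarse intent.
    words = utterance.split()
    slots = {}
    if intent == "book_flight":
        if "from" in words:
            slots["origin"] = words[words.index("from") + 1]
        if "to" in words:
            slots["destination"] = words[words.index("to") + 1]
    return slots

def parse(utterance: str) -> dict:
    intent = coarse_intent(utterance)
    return {"intent": intent, "slots": fine_slots(utterance, intent)}

parsed = parse("fly from Boston to Denver")
# -> {"intent": "book_flight", "slots": {"origin": "Boston", "destination": "Denver"}}
```

Conditioning the fine pass on the coarse label restricts the label space per step, which is the mechanism credited with the robustness gains.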
4. Technical Realizations
Attention Reweighting: Mechanisms such as cross-attention logit boosting inside decoder layers (adding log(λ) to tokens within the coarse region) increase the model’s sensitivity to relevant spatial features (Wang et al., 22 Dec 2024).
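The logit-boosting step can be illustrated with a toy example. Adding log(λ) to the logits of tokens inside the coarse region multiplies their (pre-normalization) attention weights by λ; the mask and λ below are illustrative, and real implementations apply this per head inside decoder cross-attention.

```python
import numpy as np

def boost_region_logits(attn_logits, region_mask, lam=2.0):
    """Add log(lam) to the attention logits of tokens inside the coarse
    region before softmax, scaling their attention weight by ~lam."""
    boosted = attn_logits + np.log(lam) * region_mask
    e = np.exp(boosted - boosted.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

logits = np.zeros(4)           # uniform attention over 4 image tokens
mask = np.array([0, 1, 1, 0])  # tokens 1-2 fall inside the coarse box
w = boost_region_logits(logits, mask, lam=2.0)
# in-region tokens get weight 1/3 each, out-of-region tokens 1/6 each
```

Because the boost is purely additive on logits, it requires no weight updates, which is what makes inference-time variants model-agnostic.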
Region Proposal to Segmentation: Coarse masks derived from attention rollout or bounding box generation are refined using segmentation modules (e.g., attention-guided SAM2) for boundary-accurate, pixel-level object masks (Han et al., 22 Oct 2025, Zhang et al., 24 Oct 2025).
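The proposal step can be sketched as thresholding a normalized attention map into a coarse mask and extracting its bounding box, which would then serve as a box prompt to a SAM-style segmenter (the segmenter call itself is omitted; the threshold value is illustrative).

```python
import numpy as np

def attention_to_box(attn_map, thresh=0.5):
    """Threshold a 2D attention map into a coarse binary mask and return
    the tight bounding box (x0, y0, x1, y1) around it, suitable as a box
    prompt for a SAM-style mask-refinement module."""
    norm = (attn_map - attn_map.min()) / (np.ptp(attn_map) + 1e-8)
    ys, xs = np.where(norm >= thresh)  # row/column indices above threshold
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

attn = np.zeros((8, 8))
attn[2:5, 3:6] = 1.0          # attention concentrated on a small object
box = attention_to_box(attn)  # -> (3, 2, 6, 5)
```

The downstream segmenter then refines this coarse box into a pixel-accurate mask, which is where the boundary quality comes from.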
Reward Design: Hierarchical reward schedules allow curriculum learning, starting from macro-outcomes and drilling to detailed step/process supervision (e.g., combining format, accuracy, and visual-semantic consistency rewards) (Wang et al., 18 May 2025, Hu et al., 23 Jan 2025).
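A composite reward of this kind can be sketched as a weighted sum of format, accuracy, and consistency terms. The weights and the evidence-overlap consistency measure below are illustrative assumptions, not the exact reward functions of the cited papers.

```python
# Hedged sketch of a composite reward combining format, answer accuracy,
# and visual-semantic consistency terms. Weights are illustrative.

def composite_reward(output, target, weights=(0.2, 0.5, 0.3)):
    w_fmt, w_acc, w_con = weights
    # Format: did the model emit both a reasoning trace and an answer?
    r_fmt = 1.0 if output.get("thoughts") and output.get("answer") else 0.0
    # Accuracy: exact match against the reference answer.
    r_acc = 1.0 if output.get("answer") == target["answer"] else 0.0
    # Consistency: fraction of required visual evidence cited in the trace.
    cited = set(output.get("evidence", []))
    required = set(target.get("evidence", []))
    r_con = len(cited & required) / max(len(required), 1)
    return w_fmt * r_fmt + w_acc * r_acc + w_con * r_con

out = {"thoughts": "the sign reads ...", "answer": "cat", "evidence": ["box_1"]}
tgt = {"answer": "cat", "evidence": ["box_1", "box_2"]}
r = composite_reward(out, tgt)  # 0.2*1 + 0.5*1 + 0.3*0.5 = 0.85
```

Curriculum schedules then shift weight from the coarse (outcome) terms toward the fine (process/consistency) terms over training.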
Chain-of-Thought Decomposition: Reasoning steps are ordered from semantic abstraction (e.g., intent, AMR) to fine-grained slot and type extraction, aligning model inference with the data’s inherent granularity and reducing hallucinations (Nguyen et al., 2023).
Multi-Agent Feedback: Selective iterative refinement, guided by external reward models that distinguish outcome versus process quality, ensures that only ambiguous or erroneous outputs are subjected to reviewer-refiner loops (Chen et al., 18 Sep 2024).
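The selective gating described above can be sketched as a reward-thresholded refinement loop. The scorer, refiner, and threshold below are toy placeholders; in practice they would be an external reward model and a reviewer-refiner agent pair.

```python
# Sketch of selective iterative refinement: only outputs scored below a
# threshold by a reward model enter the reviewer-refiner loop.
# All functions and the threshold value are hypothetical placeholders.

def refine_if_needed(draft, score_fn, refine_fn, threshold=0.7, max_rounds=3):
    out, score = draft, score_fn(draft)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        out = refine_fn(out)      # reviewer feedback + regeneration
        score = score_fn(out)
        rounds += 1
    return out, score, rounds

# Toy scorer/refiner: each refinement round adds 0.2 to the quality score.
score = lambda s: s["q"]
refine = lambda s: {"q": s["q"] + 0.2}
final, q, n = refine_if_needed({"q": 0.4}, score, refine)  # stops after 2 rounds
```

Easy instances (already above threshold) exit immediately, which is what yields the dynamic compute allocation noted in Section 3.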
5. Summary of Benchmark Outcomes
A selection of coarse-to-fine MLLM results, drawn directly from recent studies:
| Task/Benchmark | Method | Reported Result |
|---|---|---|
| Video Reasoning | VideoRFT (Wang et al., 18 May 2025) | +5.0 → +7.2 pts over base |
| Fine-Grained Image | CoF (Wang et al., 22 Dec 2024) | Sum metric +56.6 (LLaVA-7B+CoF) |
| High-Resolution Seg | FineRS (Zhang et al., 24 Oct 2025) | gIoU/cIoU = 55.1/46.5 on small objects |
| Evidence Reasoning | VER-Bench (Qiang et al., 6 Aug 2025) | Closed-source peak 76.8%, open-source peak 60.8% |
| Process RM (math) | CFPRM (Hu et al., 23 Jan 2025) | +0.5–3.4 absolute pts BoN@64 |
Ablation and component studies systematically confirm that joint application of coarse and fine stages surpasses either alone across domains (Wang et al., 18 May 2025, Wang et al., 22 Dec 2024, Nguyen et al., 2023). Semantic alignment, reasoning-consistency, and clue-coverage losses are identified as driving further gains (Qiang et al., 6 Aug 2025, Wang et al., 18 May 2025).
6. Open Issues and Prospective Directions
While coarse-to-fine reasoning delivers measurable improvements in task accuracy, sample efficiency, and interpretability, challenges and opportunities persist:
- Grounding Failures: Coarse stages are still limited by the underlying model’s grounding; failures to localize in cluttered scenes or with low-contrast clues remain bottlenecks (Wang et al., 22 Dec 2024, Qiang et al., 6 Aug 2025).
- Binary vs. Soft Region Masks: Most current implementations use binary rather than soft or probabilistic foci; future enhancements may leverage soft attention masks or segmentations learned end-to-end (Wang et al., 22 Dec 2024).
- Automated Granularity Selection: Merging and refinement schedules are often heuristically set; joint learning or meta-learning the optimal granularity remains largely unexplored (Hu et al., 23 Jan 2025).
- Curriculum Learning and Loss Design: There is increasing interest in curriculum regimes that sequence from global (coarse) to local (fine-coverage/consistency) objectives, which may yield further performance lifts (Qiang et al., 6 Aug 2025).
- Modality Generalization: The coarse-to-fine paradigm appears broadly extensible to code, text, and other sequential domains; curriculum or process-reward methods may generalize beyond visual modalities (Hu et al., 23 Jan 2025, Nguyen et al., 2023).
7. Conclusion
Coarse-to-fine reasoning in MLLMs unifies architectural, algorithmic, and training curricula for stepwise enhancement of reasoning quality and perceptual grounding. By structuring inference, optimization, and evaluation from global abstractions to local detail, and by coupling these with hierarchical rewards and dynamic sample allocation, this approach matches or surpasses the state-of-the-art in knowledge-intensive and perception-critical settings in both open- and closed-source MLLMs (Wang et al., 18 May 2025, Wang et al., 22 Dec 2024, Qiang et al., 6 Aug 2025, Zhang et al., 24 Oct 2025). The paradigm provides a structured roadmap for future model development, with clearly separable performance targets and a suite of extensible techniques across modalities and domains.