Coarse-to-fine Attention Fusion

Updated 4 July 2026

Coarse-to-fine attention fusion is a strategy that first computes coarse representations to guide subsequent fine-grained processing, enhancing efficiency and accuracy.
It employs hierarchical, multi-scale, and cross-modal techniques to integrate features in systems such as image markup generation, vehicle re-identification, and video summarization.
This approach balances computational cost and performance through structured attention schemes and multi-level supervision in deep learning models.

Coarse-to-fine attention fusion denotes a family of attention and feature-integration schemes in which a model first constructs a coarse representation—such as a support region, a class/type hypothesis, a global token, a routed subset of regions, or a high-level contextual feature—and then uses that representation to constrain, weight, or refine a finer attention or fusion process. In published systems, this pattern appears as coarse support-region selection before fine image attention in image-to-markup generation (Deng et al., 2016), a two-step hierarchical classifier in which coarse vehicle model/type prediction guides fine vehicle-ID attention (Wei et al., 2018), and saliency-guided fusion of a local $\texttt{[FG]}$ token with CLIP’s global $\texttt{[CLS]}$ token for fine-grained classification (Silva et al., 2 Oct 2025).

1. Conceptual foundations and historical development

An early explicit formulation appears in image-to-markup generation, where a “coarse-to-fine attention layer” selects a support region before applying fine attention in order to reduce the $O(THW)$ cost of attending over all image cells at every decoding step (Deng et al., 2016). A related but semantically different formulation appears in action recognition, where a coarse-to-fine network extracts shared deep features at different action class granularities—top-5, top-3, and top-1 candidate groups—and progressively integrates them with an LSTM (Lin et al., 2017). In vehicle re-identification, the same idea becomes a hierarchical conditional process: humans “firstly determine one vehicle’s coarse-grained category” and then identify the specific vehicle from subtle visual cues, which the RNN-HA model encodes as a length-2 sequence consisting of a coarse model/type step followed by a fine vehicle-ID step (Wei et al., 2018).

Later work broadened the concept beyond explicit two-stage classifiers. Fovea Transformer introduced a structured fine-to-coarse attention pattern in which nearby context is represented at fine granularity and farther context at increasingly coarser granularity via a multi-scale tree, yielding a smooth change in contextual resolution rather than a local/global jump (He et al., 2023). MAFormer separated fine local window attention from coarse global learning with down-sampling and fused them through cross-scale attention, while CFSum formalized a two-stage multimodal design in which a coarse self-attentive fusion stage is followed by a fine query-guided cross-attention stage (Wang et al., 2022, Guo et al., 1 Mar 2025). This suggests that “coarse-to-fine attention fusion” is not a single operator but a recurrent structural principle: early global, routed, or downsampled context shapes later local, token-level, or high-resolution processing.

2. Canonical architectural patterns

A recurring mathematical template is hierarchical attention factorization. In image-to-markup generation, a coarse latent variable $z_t'$ indexes a cell in a coarse feature grid, and fine attention is expressed as

$p(z_t)=\sum_{z_t'} p(z_t')\,p(z_t\mid z_t'),$

so coarse attention first defines support and fine attention then refines within that support (Deng et al., 2016). In RNN-HA, the same hierarchy is encoded recurrently: the coarse descriptor $\mathbf{x}_1$ produces $\mathbf{o}_1$, which both predicts the coarse label and generates the attention guidance used to construct the fine descriptor $\mathbf{x}_2$; the fine decision is then made from $\mathbf{x}_2$ and the updated recurrent state (Wei et al., 2018). In TGFNet, the hierarchy is spatial rather than label-based: Key Region Routing selects top-$k$ regions by question–region similarity, Multi-head Cross-Attention operates on those routed regions, and Image Enhancement propagates that routed interaction back to all patches (Zhao et al., 2024).
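The factorization above can be made concrete with a small NumPy sketch. It assumes non-overlapping coarse cells that each cover the same number of fine cells, so every fine cell has exactly one parent and the sum over $z_t'$ reduces to a single term. The function names and array shapes are illustrative and are not taken from Deng et al.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_attention(coarse_scores, fine_scores):
    """Marginal fine attention p(z) = sum_{z'} p(z') p(z | z').

    coarse_scores: (C,) unnormalized scores over coarse cells.
    fine_scores:   (C, F) unnormalized scores over the F fine cells
                   covered by each coarse cell (assumed non-overlapping).
    Returns a (C, F) array of probabilities that sums to 1.
    """
    p_coarse = softmax(coarse_scores)                    # p(z')
    p_fine_given_coarse = softmax(fine_scores, axis=1)   # p(z | z')
    return p_coarse[:, None] * p_fine_given_coarse

def hard_coarse_attention(coarse_scores, fine_scores, rng):
    """Sample one coarse cell, then attend only inside its support."""
    p_coarse = softmax(coarse_scores)
    z_coarse = rng.choice(len(p_coarse), p=p_coarse)
    return z_coarse, softmax(fine_scores[z_coarse])
```

The soft variant still scores every fine cell; only the hard variant restricts fine computation to the sampled support, which is where the efficiency gain comes from.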

Other systems instantiate the same principle through token construction or cross-level modulation. microCLIP first derives a saliency-guided local $\texttt{[FG]}$ token from patch embeddings and then fuses it with the global $\texttt{[CLS]}$ token, so coarse global semantics and fine local evidence remain distinct until a later fusion step (Silva et al., 2 Oct 2025). FINE refines low-level high-resolution features with high-level contextual guidance before feature-pyramid fusion, treating the coarse map as Key/Value and the fine map as Query, and then applying the resulting modulation map back to the low-level feature (Lee et al., 12 Jun 2026). DuetGraph segregates local message passing and global attention into distinct pathways and fuses them only at the end, replacing stacked global-local mixing with a single late fusion of the two pathway outputs, which is a coarse-to-fine design at the level of candidate reasoning and representation fusion (Li et al., 15 Jul 2025).

Representative form	Coarse component	Fine component
Hierarchical support selection	Coarse grid or routed region	Fine attention within support
Recurrent semantic hierarchy	Coarse label/state	Fine descriptor and decision
Token-level global-local fusion	Global token or summary	Local token or patch attention
Cross-level feature alignment	High-level context	Low-level residual modulation
Dual-pathway reasoning	Global attention pathway	Local message-passing pathway

These forms differ in implementation, but all preserve an asymmetry: the coarse representation is not merely another feature map; it acts as a control signal for later, more specific computation.
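The control-signal role of the coarse stage is easiest to see in the routed-region form. The sketch below is a minimal illustration, not TGFNet's implementation: it assumes dot-product similarity for routing and a single-head scaled softmax for the fine step, and the names `route_then_attend`, `query`, and `regions` are hypothetical.

```python
import numpy as np

def route_then_attend(query, regions, k):
    """Coarse routing followed by fine attention over the routed subset.

    query:   (d,) query embedding (for example, a question vector).
    regions: (R, d) region embeddings.
    k:       number of regions kept by the coarse stage.
    Returns (indices of routed regions, attention weights, context vector).
    """
    sims = regions @ query                      # coarse stage: one score per region
    top = np.argsort(sims)[-k:]                 # indices of the k most similar regions
    scores = sims[top] / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # fine stage: softmax over routed regions only
    context = weights @ regions[top]
    return top, weights, context
```

Regions outside `top` never enter the fine softmax, so the coarse stage decides what the fine stage is allowed to look at rather than contributing one more feature to it.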

3. Attention operators and fusion mechanisms

The most direct coarse-to-fine mechanism is coarse-conditioned spatial attention. In RNN-HA, the coarse recurrent output $\mathbf{o}_1$ is transformed by a two-layer MLP into a guidance vector, each spatial descriptor is scored using that guidance vector, the scores are normalized globally over all spatial positions, and the descriptors are pooled with the normalized weights into $\mathbf{x}_2$ (Wei et al., 2018). The coarse representation therefore affects the fine step twice: through recurrent state propagation and through explicit spatial gating.
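One way such coarse-conditioned pooling could look is sketched below. The exact scoring function is not given in the text here, so the sketch assumes a softplus of the dot product between the guidance vector and each descriptor, followed by sum-normalization; the tanh hidden layer and the parameter names `W1` and `W2` are likewise assumptions.

```python
import numpy as np

def guided_spatial_pooling(o1, feats, W1, W2, eps=1e-8):
    """Pool spatial descriptors with weights driven by the coarse output.

    o1:    (h,) coarse recurrent output.
    feats: (N, d) spatial descriptors, one per location.
    W1:    (h, m) and W2: (m, d) weights of a two-layer MLP (assumed form).
    Returns (attention weights over locations, pooled fine descriptor x2).
    """
    guidance = np.tanh(o1 @ W1) @ W2                 # guidance vector, shape (d,)
    scores = np.logaddexp(0.0, feats @ guidance)     # softplus keeps scores positive
    weights = (scores + eps) / (scores + eps).sum()  # normalize over all locations
    x2 = weights @ feats                             # attended pooling
    return weights, x2
```

Because the weights depend on $\mathbf{o}_1$, any gradient reaching $\mathbf{x}_2$ also flows back into the coarse step.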

Transformer variants often replace recurrent conditioning by token construction and cross-attention. microCLIP’s SOAP module forms a saliency query and then uses single-head attention pooling over patch tokens to produce the local $\texttt{[FG]}$ token; TokenFusion subsequently performs symmetric logit-level fusion of the predictions from the two tokens, so the coarse global token and fine local token remain explicitly separable until the decision layer (Silva et al., 2 Oct 2025). CFSum instead uses coarse self-attention over the concatenated video–audio–text sequence to create globally fused modality streams and then applies fine cross-attention in which fused text attends separately to fused video and fused audio, with the resulting interactions combined by two learned weights (Guo et al., 1 Mar 2025).
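The token-level pattern can be sketched as attention pooling followed by late fusion of logits. This is an illustrative reading rather than microCLIP's actual formulation: it assumes "symmetric" means an equal-weight average of the two logit vectors, uses cosine-similarity logits against class embeddings, and takes the saliency query as a given input. The `scale` value is a placeholder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def attention_pool(query, patches):
    """Single-head attention pooling: (d,) query over (N, d) patch tokens."""
    weights = softmax(patches @ query / np.sqrt(query.shape[0]))
    return weights @ patches

def fused_logits(global_tok, local_tok, class_embs, scale=100.0):
    """Average the class logits of the global and local tokens.

    class_embs: (K, d) class embeddings. Returns (fused, global, local) logits.
    """
    classes = l2norm(class_embs)
    logits_global = scale * classes @ l2norm(global_tok)
    logits_local = scale * classes @ l2norm(local_tok)
    return 0.5 * (logits_global + logits_local), logits_global, logits_local
```

Fusing at the logit level keeps the two tokens inspectable on their own: each branch's prediction can be read off before the average is taken.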

A third family performs coarse-to-fine fusion across scales rather than across semantic labels or tokens. MAFormer computes a fine local representation with window attention and a coarse global representation via Global Learning with Down-sampling, then updates local tokens through cross-scale attention in which the local tokens act as queries over the coarse global tokens, so local queries selectively pull in coarse global context (Wang et al., 2022). FINE follows the same directionality with cross-level attention on sampled tokens, reshapes the attention output into a modulation map, upsamples it, and refines the low-level feature by residual gating with that map, thereby preserving localization while injecting context (Lee et al., 12 Jun 2026). In TGFNet, the coarse stage is question-guided region routing and the fine stage is patch-level reweighting across the full image via similarity enhancement, so region selection and full-image refinement are explicitly decoupled (Zhao et al., 2024).
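The cross-scale direction (fine queries, coarse keys and values) can be sketched as follows. Average pooling stands in for the learned down-sampling branch, and the residual update and projection names `Wq`, `Wk`, `Wv` are assumptions made for the illustration, not the published MAFormer or FINE operators.

```python
import numpy as np

def avg_pool_tokens(x, factor):
    """Down-sample (N, d) tokens to (N // factor, d) by averaging groups."""
    n, d = x.shape
    if n % factor != 0:
        raise ValueError("token count must be divisible by factor")
    return x.reshape(n // factor, factor, d).mean(axis=1)

def cross_scale_update(local, factor, Wq, Wk, Wv):
    """Local tokens query a coarse summary and add the result residually.

    local: (N, d) fine tokens; Wq, Wk, Wv: (d, d) projection matrices.
    Returns (updated local tokens, attention map of shape (N, N // factor)).
    """
    coarse = avg_pool_tokens(local, factor)      # coarse global representation
    q = local @ Wq                               # queries come from the fine scale
    k = coarse @ Wk                              # keys and values come from the coarse scale
    v = coarse @ Wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return local + attn @ v, attn
```

The attention map has N rows but only N / factor columns, so the cost of global context shrinks with the down-sampling factor while the output stays at full resolution.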

4. Optimization and supervision

Most differentiable coarse-to-fine attention-fusion systems are trained end-to-end with task losses applied at multiple levels. RNN-HA uses a coarse model/type cross-entropy and a fine vehicle-ID cross-entropy,

$\mathcal{L}=\mathcal{L}_{\text{coarse}}+\mathcal{L}_{\text{fine}},$

with both losses backpropagating through the GRU, the attention module, and the CNN backbone; the paper states that no weighting hyperparameter is used and the losses are simply summed (Wei et al., 2018). TGFNet applies cross-entropy to the Optical Expert, SAR Expert, Fusion Expert, and final fused prediction, combining the four terms into a single training objective, and reports that there are no separate auxiliary objectives for attention routing or expert sparsity (Zhao et al., 2024). EFN similarly supervises its coarse semantic path and fine boundary path with BCE losses on segmentation maps, boundary maps, and STN outputs, so the coarse-to-fine decomposition is trained only through downstream dense-prediction objectives (Feng et al., 2021).
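The unweighted two-level supervision described for RNN-HA amounts to adding two cross-entropy terms. The sketch below shows this for a single example; the function names are illustrative, and a real training loop would average over a batch and backpropagate through the shared network.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one example from unnormalized logits."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def two_level_loss(coarse_logits, coarse_label, fine_logits, fine_label):
    """Sum of coarse and fine cross-entropies, with no weighting term."""
    loss_coarse = cross_entropy(coarse_logits, coarse_label)
    loss_fine = cross_entropy(fine_logits, fine_label)
    return loss_coarse + loss_fine, loss_coarse, loss_fine
```

With uniform logits over 4 coarse classes and 10 fine classes, the two terms are ln 4 and ln 10, so the total is ln 40; this makes a convenient sanity check for an untrained model.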

When the coarse selection is discrete rather than differentiable, policy-gradient estimators are used. In image-to-markup generation, hard coarse attention samples one coarse cell $z_t'$ and trains the selector with REINFORCE using a discounted future reward and a moving-average baseline, with separate hyperparameters for discounting and for baseline updates (Deng et al., 2016). microCLIP combines differentiable TokenFusion with self-training: pseudo-labels are formed by Dynamic Knowledge Aggregation, and optimization then proceeds with a self-training cross-entropy plus fairness regularization (Silva et al., 2 Oct 2025). This division between fully differentiable refinement and discrete routing is one of the main training distinctions inside the literature.
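For the discrete case, the three ingredients named above (discounted future reward, moving-average baseline, score-function gradient) can be sketched generically. This is the textbook REINFORCE estimator for a categorical selector, not a reproduction of the Deng et al. training code; `gamma` and `beta` are hypothetical names for the discount and baseline-update rates.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def discounted_returns(rewards, gamma):
    """R_t = r_t + gamma * R_{t+1}, computed backwards over the sequence."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def reinforce_grad(scores, sampled, ret, baseline):
    """Gradient estimate of expected return w.r.t. the coarse scores.

    Uses d/ds log softmax(s)[sampled] = onehot(sampled) - softmax(s).
    """
    p = softmax(scores)
    onehot = np.zeros_like(p)
    onehot[sampled] = 1.0
    return (ret - baseline) * (onehot - p)

def update_baseline(baseline, ret, beta):
    """Exponential moving average of observed returns."""
    return beta * baseline + (1.0 - beta) * ret
```

Subtracting the baseline leaves the estimator unbiased but reduces its variance: when a sampled cell yields exactly the average return, it contributes no update.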

5. Empirical behavior across tasks

The reported results consistently attribute measurable gains to coarse-to-fine fusion, but the magnitude and trade-off depend on domain and operator. In vehicle re-identification, the full RNN-HA model outperforms both an attention-only baseline without recurrence and a hierarchical baseline without attention, indicating that hierarchy and attention contribute separately and cumulatively (Wei et al., 2018). In remote-sensing VQA, the full TGFNet outperforms a simple multi-modal baseline and also improves over versions that remove CFAR or replace RQAF, suggesting that coarse region routing, fine enhancement, and adaptive multi-expert fusion all matter (Zhao et al., 2024). In query-focused video summarization, coarse multimodal self-attention and fine text-guided interaction each improve over the autoencoder-only configuration, while the full CFSum model performs best (Guo et al., 1 Mar 2025).

Task	Comparison	Reported result
VeRi vehicle re-ID	FC-HA → RNN-HA	mAP 47.19 → 52.88; top-1 61.56 → 66.03
VehicleID test size=800	FC-HA → RNN-HA	top-1 56.7 → 68.8; top-5 74.5 → 81.9
OSVQA	Exp1 → Exp4	OA 70.37 → 71.81; AA 62.45 → 65.12
QVHighlights	AE only → full CFSum	mAP 24.53 → 41.18; HIT@1 33.23 → 66.37

The literature also shows that coarse-to-fine mechanisms can be motivated by efficiency rather than pure accuracy. In image-to-markup generation, standard attention attains 77.46 Match on Im2latex-100k, hierarchical attention attains 77.39, coarse-to-fine sparsemax attains 76.15, and coarse-to-fine hard attention attains 74.90, while the number of fine lookups per step drops from about 355 in standard attention to 74 for sparsemax and 16 for hard attention (Deng et al., 2016). Fovea Transformer makes the same efficiency-oriented argument at long-context scale, using cheaper structured attention rather than full attention by letting contextual granularity become coarser with distance (He et al., 2023). MAFormer shows the complementary accuracy case: MAFormer-L achieves 85.9% Top-1 on ImageNet, while microCLIP reports 68.68% average accuracy across 13 fine-grained datasets, with average gains over both DPA and zero-shot CLIP, tying coarse-fine fusion to performance gains in both pure vision and vision-language adaptation (Wang et al., 2022, Silva et al., 2 Oct 2025).
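The efficiency argument for hard support selection is simple counting, sketched below. The grid sizes are hypothetical and chosen only for illustration; they are not the feature-map dimensions used in the cited experiments.

```python
def lookups_per_step(height, width, cell):
    """Count attention lookups per decoding step on a height x width fine grid.

    cell: side length of a square coarse cell, in fine cells (assumed to
          divide both height and width).
    """
    if height % cell or width % cell:
        raise ValueError("cell must divide both height and width")
    return {
        "standard_fine": height * width,                # attend over every fine cell
        "coarse": (height // cell) * (width // cell),   # score every coarse cell
        "hard_fine": cell * cell,                       # fine cells inside one sampled coarse cell
    }

# Hypothetical 16 x 64 fine grid with 4 x 4 coarse cells.
counts = lookups_per_step(16, 64, 4)
```

For this hypothetical grid, standard attention reads 1024 fine cells per step, whereas the hard variant scores 64 coarse cells and then reads 16 fine cells.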

6. Limitations, misconceptions, and open directions

A common misconception is that coarse-to-fine attention fusion always means “global first, local second” in a single fixed sense. The surveyed systems do not support that reduction. Some methods use explicit sequential conditioning, such as RNN-HA’s coarse label path and image-to-markup’s support-region selection (Wei et al., 2018, Deng et al., 2016). Others use simultaneous multi-scale mixing, such as MAFormer’s parallel local and global branches or DuetGraph’s segregated local and global pathways (Wang et al., 2022, Li et al., 15 Jul 2025). Fovea Transformer reverses the wording entirely: it uses fine context nearby and increasingly coarse context farther away, yet still implements multi-scale attention fusion (He et al., 2023). The concept therefore concerns structural dependence between granularities, not a single mandatory operator order.

The main failure modes are equally heterogeneous. RNN-HA states that if coarse labels are noisy, ambiguous, or not informative, conditioning on them may hurt attention, and it also notes that its attention is single-source, purely spatial, and does not handle multi-scale, multi-head, or channel-wise variations (Wei et al., 2018). TGFNet reports only moderate gains on the most challenging relational question types, suggesting that coarse-to-fine spatial focusing and modality-adaptive fusion do not fully solve complex relational reasoning (Zhao et al., 2024). Fovea Transformer identifies fixed, hand-crafted scale–distance mapping, non-overlapping receptive fields, simple mean pooling, block-level approximations, and the absence of explicit inter-scale positional encoding as limitations of its current design (He et al., 2023). FINE shows that mis-specified alignment-aware sampling can be harmful: one of its sampling settings reduces cost but harms AP because of over-smoothing and ERF mismatch (Lee et al., 12 Jun 2026).

A plausible implication is that future coarse-to-fine attention-fusion systems will benefit most when they preserve the decisive asymmetry of the paradigm—coarse signals as control, fine signals as discrimination—while relaxing the rigid assumptions that current models often impose. The existing literature already points to the relevant axes: richer multi-head or multi-scale refinement instead of single-source spatial gating (Wei et al., 2018), learned rather than fixed scale schedules (He et al., 2023), and stronger robustness to noisy global priors in multimodal settings (Silva et al., 2 Oct 2025).