Cross-scale Feature Propagation Module
- Cross-scale Feature Propagation Module is a neural component that fuses high-resolution details with coarse semantic context to enhance feature representation.
- It leverages mechanisms like attention, deformable convolution, and KNN interpolation for effective multi-scale communication and alignment.
- Empirical results across segmentation, super-resolution, and compression tasks demonstrate improved accuracy and efficiency through explicit cross-resolution information flow.
A Cross-scale Feature Propagation Module is a neural-network component that transfers, aligns, selects, or fuses representations across feature resolutions rather than processing each scale in isolation. In the recent literature, the concept appears under paper-specific names such as Transformer Scale Gate, Relational Semantics Propagator, Multi-Stage Cross-Scale Attention, Cross-Scale Attention Module, Cross-level Fusion and Propagation Module, Cross-scale Feature Fusion Module, Cross-Scale Cross-Attention, cross-scale weighted prediction, Cross-scale Feature Propagation, and cross-bit-depth feature propagation. Across these formulations, the recurring objective is to combine the boundary sensitivity and local detail of finer features with the broader contextual support, semantic abstraction, or denser structural evidence available at coarser features (Shi et al., 2022, Bai et al., 2021, Feng et al., 2020, Yu et al., 28 Aug 2025, Kim et al., 18 Nov 2025).
1. Conceptual basis
The central premise is that different scales encode different but complementary information. In semantic segmentation, high-resolution features preserve boundaries, small objects, and local detail, whereas low-resolution features provide larger effective receptive fields and stronger semantic context; the same papers also argue that naive summation, concatenation, or pyramid fusion can be harmful because “features on sub-optimal scales may decrease the segmentation accuracy” (Shi et al., 2022). In point-cloud representation, the top branch has the highest resolution but lowest receptive field, while the bottom branch has the lowest resolution and largest receptive field, so cross-scale interaction becomes a mechanism for combining fine local geometry with broader context (Han et al., 2021). In image super-resolution, this distinction is formalized as the difference between feature propagation within a scale branch and cross-scale communication between branches (Feng et al., 2020).
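A useful abstraction is the two-scale formulation used in multi-scale convolution for super-resolution. In generic notation, with $x_H$ and $x_L$ denoting the high- and low-resolution branch features and $y_H$, $y_L$ the updated features:
$$y_H = f_{H \to H}(x_H) + f_{L \to H}(x_L), \qquad y_L = f_{L \to L}(x_L) + f_{H \to L}(x_H)$$
Here $f_{H \to H}$ and $f_{L \to L}$ are intra-scale transformations, while $f_{L \to H}$ and $f_{H \to L}$ are inter-scale transformations (Feng et al., 2020). This suggests that “Cross-scale Feature Propagation Module” is best treated as a functional category: some modules propagate raw features, some propagate relations or attention maps, and some propagate occupancy-conditioned latent states.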
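The split between intra-scale and inter-scale transformations can be illustrated with a minimal sketch on 1D signals. It assumes scalar weights in place of learned convolutions, nearest-neighbour upsampling, and average pooling; the names `two_scale_step`, `upsample2`, and `avgpool2` are hypothetical and not taken from any cited paper.

```python
def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2 (low -> high resolution)."""
    return [v for v in x for _ in range(2)]


def avgpool2(x):
    """Average pooling by a factor of 2 (high -> low resolution)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def two_scale_step(x_high, x_low, w_hh=1.0, w_ll=1.0, w_lh=0.5, w_hl=0.5):
    """One two-branch update: an intra-scale term plus an inter-scale term.

    The scalar weights are placeholders for learned operators (assumption).
    """
    y_high = [w_hh * h + w_lh * l for h, l in zip(x_high, upsample2(x_low))]
    y_low = [w_ll * l + w_hl * h for l, h in zip(x_low, avgpool2(x_high))]
    return y_high, y_low


x_high = [1.0, 3.0, 5.0, 7.0]   # fine branch, 4 positions
x_low = [2.0, 6.0]              # coarse branch, 2 positions
y_high, y_low = two_scale_step(x_high, x_low)
```

Setting `w_lh` and `w_hl` to zero recovers two independent branches, which is the scale-isolated baseline that cross-scale modules are meant to replace.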
2. Representative architectural patterns
The literature does not converge on a single canonical implementation. Instead, several recurring patterns appear across tasks and data modalities.
| Pattern | Representative module | Core mechanism |
|---|---|---|
| Patch-wise scale selection | TSG / TSGE / TSGD | Per-patch gates from transformer self-attention or cross-attention |
| Localized coarse-to-fine semantic transfer | RSE / RSP | Low-level pixel queries a region in an upsampled higher-level map |
| Multi-stage all-scale attention branch | MSCSA | Queries at one scale attend to keys/values from original and coarser scales |
| Adjacent-level gated propagation | CFPM | Adjacent encoder levels are fused, then spatial gates mix current and propagated signals |
| Alignment-before-fusion | CFFM | Deformable alignment, channel alignment, then dual cross-attention |
| Common-resolution point-set fusion | CSCA | KNN interpolation to original points, then intra-scale and inter-scale attention |
| Hierarchical occupancy-guided transfer | XFP / ELiC | Propagation follows octree or bit-depth hierarchy using occupancy masks |
| Attention-map reuse | CSAP | Attention computed at deepest stage is propagated to shallower stages |
These patterns indicate that “propagation” may refer to at least four distinct objects: feature tensors, local relation responses, attention weights, or structural occupancy-conditioned latent states. A plausible implication is that the defining property is not the algebraic form of the message, but the fact that cross-resolution information is made explicit and causally available to the target scale.
3. Attention-guided formulations
Attention-guided propagation is prominent in transformer-based dense prediction. Transformer Scale Gate (TSG) uses encoder self-attention and decoder cross-attention to predict a gate tensor that contains per-patch scale weights (Shi et al., 2022). In the encoder variant TSGE, coarse refined features are upsampled and fused progressively into finer scales; in the decoder variant TSGD, all refined encoder scales are aligned to a common resolution and combined under those gates.
The key distinction drawn in the paper is that TSG itself predicts the gates, whereas TSGE and TSGD are the actual cross-scale integration mechanisms.
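Per-patch scale gating can be sketched as follows. This is an illustration under stated assumptions, not the TSG formulation itself: the gate logits are supplied directly rather than derived from attention cues, the gates are normalized with a softmax over scales, and the combination is a weighted sum. The function name `gate_scales` is hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_scales(features, gate_logits):
    """Fuse multi-scale features patch by patch.

    features[s][p]   : feature vector of patch p at scale s, already resized
                       to a common resolution.
    gate_logits[p][s]: unnormalized score of scale s for patch p (assumed given).
    """
    num_scales = len(features)
    fused = []
    for p, logits in enumerate(gate_logits):
        weights = softmax(logits)          # per-patch weights over scales
        dim = len(features[0][p])
        fused.append([
            sum(weights[s] * features[s][p][c] for s in range(num_scales))
            for c in range(dim)
        ])
    return fused


features = [
    [[1.0, 0.0], [1.0, 0.0]],   # scale 0, two patches
    [[0.0, 1.0], [0.0, 1.0]],   # scale 1, two patches
]
gate_logits = [[0.0, 0.0], [50.0, -50.0]]
fused = gate_scales(features, gate_logits)
```

In this toy input the first patch averages both scales, while the second patch selects scale 0 almost exclusively, which is the kind of per-patch selection that uniform summation cannot express.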
MSCSA generalizes the same idea to a backbone add-on branch rather than a decoder gate. It first pools features from several backbone stages to a common spatial size, concatenates them into a multi-stage feature map, then constructs multiple scale views of that fused tensor. Queries are taken from the base scale, while keys and values are projected independently at the original resolution and at two coarser resolutions, then concatenated along the token dimension before attention (Shang et al., 2023). The paper explicitly interprets this as both multi-stage aggregation and cross-scale feature exchange.
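The query/key layout described above can be sketched on a short token sequence. This is a simplified illustration, not the MSCSA implementation: projections are replaced by the identity, the coarser views are built by average pooling, and the pooling factors `(1, 2, 4)` are assumed for the example. The names `pool_tokens` and `cross_scale_attention` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_tokens(tokens, factor):
    """Average groups of `factor` consecutive tokens to form a coarser view."""
    dim = len(tokens[0])
    return [
        [sum(t[c] for t in tokens[i:i + factor]) / factor for c in range(dim)]
        for i in range(0, len(tokens) - factor + 1, factor)
    ]


def cross_scale_attention(tokens, factors=(1, 2, 4)):
    """Base-scale queries attend to keys/values gathered from all scale views.

    Keys double as values here because projections are omitted (assumption).
    """
    keys = [t for f in factors for t in pool_tokens(tokens, f)]
    scale = 1.0 / math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * k[c] for wi, k in zip(w, keys))
                    for c in range(len(q))])
    return out, len(keys)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
out, num_keys = cross_scale_attention(tokens)
```

The output keeps the base-scale token count, while the key set grows only by the size of the pooled views, which is why this layout adds cross-scale context at modest cost.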
In multi-task dense prediction, CSAM uses the same cross-attention primitive as CTAM but changes the domain of interaction: CTAM exchanges information across tasks at fixed scale, whereas CSAM updates a target scale by attending to all coarser scales within the same task (Kim et al., 2022). The propagation is directional and coarse-to-fine, and the paper states that CSAM does not use a residual addition, unlike CTAM.
CrossFormer and CrossFormer++ shift the focus from decoder fusion to token construction and grouped attention. Cross-scale Embedding Layer (CEL) samples same-center patches at multiple sizes and concatenates their embeddings, while Long-Short Distance Attention (LSDA) alternates short-distance and long-distance grouped self-attention so that both local and distant dependencies are propagated through depth (Wang et al., 2023). CrossFormer++ then adds Progressive Group Size and Amplitude Cooling Layer, using a manually designed stage-wise group-size policy and periodic residual-free cooling to stabilize deep propagation (Wang et al., 2023).
A more radical reinterpretation appears in CSAP, where the propagated object is not a feature tensor but an attention map. CSAP computes attention once at the deepest decoder stage, compresses and transforms that attention, and reuses it at shallower stages so that only value projections remain stage-specific. The reported analysis shows a reduction in decoder matrix-multiplication FLOPs (Kang, 7 Apr 2026). This demonstrates that cross-scale propagation need not move features directly; it can also move the relational operator that governs feature aggregation.
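Attention-map reuse can be sketched in a few lines. The example is a simplification under stated assumptions: both stages share the same token grid, so the compression and transformation step that CSAP applies to the attention is omitted, and queries and keys are unprojected. The names `attention_map` and `apply_attention` are hypothetical.

```python
import math


def attention_map(queries, keys):
    """Row-stochastic attention weights softmax(Q K^T / sqrt(d))."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    rows = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


def apply_attention(attn, values):
    """Aggregate stage-specific values with a precomputed attention map."""
    dim = len(values[0])
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)]
            for row in attn]


deep = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = attention_map(deep, deep)            # computed once, at the deepest stage
shallow_values = [[2.0], [4.0], [6.0]]      # only the values are stage-specific
deep_out = apply_attention(attn, deep)
shallow_out = apply_attention(attn, shallow_values)
```

The query-key product, which is the quadratic part of attention, runs once; each additional stage pays only for the value aggregation.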
4. Relation-based, convolutional, and gated formulations
Not all cross-scale propagation modules are transformer-style attention blocks. In RSP-based segmentation, the lower-resolution feature is first upsampled, but the finer feature does not simply receive the corresponding pixel; instead, a low-level pixel feature queries a local region in the adjacent higher-level feature map through a cross-scale pixel-to-region relation. The extracted relational semantic feature is then added back to the low-level feature through a residual connection.
The paper explicitly distinguishes the Relational Semantics Extractor (RSE), which computes and extracts the cross-scale relation, from the Relational Semantics Propagator (RSP), which performs one propagation step, and from the stacked RSP head, which realizes progressive top-down distribution across the pyramid (Bai et al., 2021).
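A pixel-to-region relation with residual propagation can be sketched on a 1D feature row. The relation function used here, a softmax over dot products within a fixed window, is an assumption chosen for illustration; the cited paper's exact relation operator is not reproduced. The name `relation_propagate` and the `radius` parameter are hypothetical.

```python
import math


def relation_propagate(fine, coarse_up, radius=1):
    """Each fine-scale position queries a window of the upsampled coarse map.

    fine, coarse_up: lists of equal length holding feature vectors.
    Returns fine + extracted context (residual addition).
    """
    n, dim = len(fine), len(fine[0])
    out = []
    for i, q in enumerate(fine):
        region = coarse_up[max(0, i - radius): min(n, i + radius + 1)]
        scores = [sum(a * b for a, b in zip(q, k)) for k in region]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        context = [sum(e / z * k[c] for e, k in zip(exps, region))
                   for c in range(dim)]
        out.append([q[c] + context[c] for c in range(dim)])   # residual add
    return out


fine = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
coarse_up = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
out = relation_propagate(fine, coarse_up)
```

Stacking several such steps from the top of the pyramid downward mirrors the progressive top-down distribution attributed to the RSP head.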
The super-resolution literature offers a different perspective by separating propagation and communication in a linear operator view. MS$^3$-Conv keeps same-scale propagation on two branches but adds bidirectional high↔low exchange. In the proposed transform, the high-resolution branch receives shared intra-scale propagation plus upsampled low-scale communication, while the low-resolution branch receives the same shared intra-scale propagation plus average-pooled high-scale communication (Feng et al., 2020). The paper’s empirical conclusion is that bi-directional cross-scale connections are essential.
Adjacent-level gating appears explicitly in FAP-Net. CFPM first fuses adjacent aggregated encoder features, then combines the current fused feature and the previously propagated feature through two spatial gates, yielding a gated mixture of the current and propagated signals.
The paper frames this as adaptive encoder-to-decoder propagation rather than a simple skip connection (Zhou et al., 2022).
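The difference between gated propagation and a plain skip connection can be shown with a small sketch. It assumes each gate is a sigmoid of a scaled copy of its own input, which stands in for a learned gate unit; the actual CFPM gate computation is not reproduced here. The names `gated_propagation`, `w_cur`, and `w_prev` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_propagation(current, previous, w_cur=1.0, w_prev=1.0):
    """Mix the current fused feature with the previously propagated feature.

    Two per-position gates decide how much of each signal passes through.
    """
    out = []
    for c, p in zip(current, previous):
        g_cur = sigmoid(w_cur * c)     # gate on the current fused feature
        g_prev = sigmoid(w_prev * p)   # gate on the propagated feature
        out.append(g_cur * c + g_prev * p)
    return out


current = [0.0, 2.0, -2.0]
previous = [0.0, 0.0, 4.0]
mixed = gated_propagation(current, previous)
```

A plain skip connection would return `current[i] + previous[i]` at every position; the gates let the mixture vary by location.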
Alignment-centric designs are exemplified by CFFM in multi-contrast MRI super-resolution. There, adjacent encoder scales are fused only after spatial alignment by deformable convolution and channel alignment by pooled channel descriptors, followed by residual channel recalibration. After that, a dual cross-attention transformer integrates the fused source feature with the reference feature to produce a stage-wise texture feature (Yang et al., 2024). This suggests that, when scale mismatch is severe, explicit alignment may be a prerequisite for effective propagation.
5. Extensions beyond 2D dense prediction
The same functional idea appears in point clouds, neural compression, and time-series modeling, but with different structural constraints. In CLCSCANet, branch outputs at three point-cloud resolutions are first upsampled to the original point set by KNN interpolation and shared MLPs, then refined by intra-scale self-attention and fused by inter-scale cross-attention at common point resolution (Han et al., 2021). Here propagation is neither top-down nor bottom-up in the classical FPN sense; it is alignment to a shared point domain followed by joint multi-scale fusion.
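KNN interpolation to a shared point domain can be sketched as follows. The weighting scheme, inverse squared distance over the `k` nearest source points, is the common PointNet++-style choice and is assumed here rather than taken from the cited paper; the shared MLP is omitted. The name `knn_interpolate` is hypothetical.

```python
def knn_interpolate(src_xyz, src_feat, dst_xyz, k=3, eps=1e-8):
    """Carry features from a sparse point set onto a denser one.

    Each destination point takes a weighted average of the features of its
    k nearest source points, weighted by inverse squared distance.
    """
    dim = len(src_feat[0])
    out = []
    for p in dst_xyz:
        nearest = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), i)
            for i, q in enumerate(src_xyz)
        )[:k]
        weights = [1.0 / (d2 + eps) for d2, _ in nearest]
        total = sum(weights)
        out.append([
            sum(w * src_feat[i][c] for w, (_, i) in zip(weights, nearest)) / total
            for c in range(dim)
        ])
    return out


src_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
src_feat = [[1.0], [2.0], [3.0]]
dst_xyz = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
dense = knn_interpolate(src_xyz, src_feat, dst_xyz, k=2)
```

Once every branch has been interpolated onto the same destination points, features from different resolutions are index-aligned and can be fused by attention or any other per-point operator.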
In neural video compression, ENVC uses a reference feature pyramid, cross-scale flows, and cross-scale weight maps. At each output location, multiple samples are drawn from each scale and aggregated by softmax-computed weights, so the propagated quantity is effectively a weighted blend of multi-scale reference features rather than a single warped feature map (Guo et al., 2021). The paper explicitly argues that higher-resolution scales favor precise simple-motion prediction, while coarser scales are more robust for zoom, rotation, blur, and uncertain motion.
LiDAR geometry compression introduces a particularly explicit form of structural propagation. XFP propagates features along octree levels using occupancy cues from already encoded levels, with separate shallow and deep regimes; the deep regime incorporates re-densification before re-sparsifying features for prediction (Yu et al., 28 Aug 2025). ELiC makes the same idea even more explicit at bit-depth levels: the current-scale octant embedding is fused with the propagated lower-bit-depth feature by channel-wise gating, and propagation to the next level is then performed by exact replication and occupancy masking. This is coarse-to-fine propagation in a bit-depth hierarchy rather than in a 2D feature pyramid (Kim et al., 18 Nov 2025).
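Replication with occupancy masking, followed by channel-wise gating, can be sketched on a tiny octree level. Several details are assumptions made for illustration: occupancy is stored as an 8-bit code with bit `j` marking child `j`, and the gate is a convex per-channel mix with given gate values. The names `propagate_to_children` and `gate_fuse` are hypothetical.

```python
def propagate_to_children(parent_feats, occupancy):
    """Copy each parent feature to its occupied children only.

    parent_feats[i]: feature vector of occupied parent node i.
    occupancy[i]   : 8-bit code; bit j set means child j is occupied
                     (bit ordering is an assumed convention).
    """
    child_feats = []
    for feat, code in zip(parent_feats, occupancy):
        for j in range(8):
            if (code >> j) & 1:                  # occupancy mask
                child_feats.append(list(feat))   # exact replication
    return child_feats


def gate_fuse(embedding, propagated, gate):
    """Channel-wise convex mix of current-scale embedding and propagated feature."""
    return [g * e + (1.0 - g) * p
            for g, e, p in zip(gate, embedding, propagated)]


parent_feats = [[1.0, 2.0], [3.0, 4.0]]
occupancy = [0b00000101, 0b10000000]     # node 0: children 0 and 2; node 1: child 7
children = propagate_to_children(parent_feats, occupancy)
fused = gate_fuse([10.0, 10.0], children[0], gate=[1.0, 0.0])
```

Because only occupied children receive a feature, the propagated tensor stays as sparse as the geometry itself, with no interpolation or learned upsampling involved.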
A related but not identical case is the Conv-like ScaleFusion Time Series Transformer. Its Cross-Scale Attention Mechanism uses current coarser features as queries and previous finer features as keys/values, then applies residual fusion after re-patching has shortened temporal length and increased channel dimension (Zhang et al., 22 Sep 2025). This suggests that the concept extends naturally to temporal scales as well.
6. Empirical behavior, trade-offs, and scope boundaries
The empirical record is broadly consistent: explicit cross-scale propagation improves performance when naive fusion or per-level independence is limiting. In semantic segmentation, TSG improves the mIoU of the Swin Transformer + UperNet baseline on both Pascal Context and ADE20K, with Swin-Tiny as well as Swin-Large backbones (Shi et al., 2022). The RSP head outperforms DeeplabV3 by 0.7% with 75% fewer FLOPs in semantic segmentation (Bai et al., 2021). MSCSA improves both box AP and mask AP for ResNet-50 with Mask R-CNN on COCO at a modest FLOPs cost, and on ADE20K it yields mIoU gains for PVTv2-B0/B1/B2 with limited overhead (Shang et al., 2023). In MRI super-resolution, removing multi-scale feature alignment from ECFNet lowers both PSNR and SSIM relative to the full model (Yang et al., 2024). In LiDAR compression, XFP is part of a design that reports 26 FPS for both encoding and decoding at 12-bit quantization (Yu et al., 28 Aug 2025), while ELiC without BoE already reports D1 BD-rate improvements on Ford and SemanticKITTI relative to RENO (Kim et al., 18 Nov 2025).
The same literature also clarifies scope boundaries. CFPNet is primarily a cross-zone propagation design applied at several resolutions: DAPM forces outside-zone pixels to retrieve in-zone features, and LKPM propagates across spatial regions within a stage rather than directly between scales (Ding et al., 2024). TAFPNet contains both temporal propagation and multiscale pyramid enhancement, but its clearest propagator is the Temporal Query Propagator, while AAFP is better characterized as multiscale spatiotemporal enhancement than as explicit level-to-level propagation (Yuan et al., 18 Apr 2025). These cases indicate that the label “Cross-scale Feature Propagation Module” is most precise when scale transfer itself is the operative message-passing mechanism, rather than merely the deployment setting.
Taken together, the literature supports a stable technical characterization. A Cross-scale Feature Propagation Module is not defined by one canonical operator, but by a recurring structural move: use information already available at one scale to condition representation construction at another scale. The propagated object may be a feature tensor, a local relation response, an attention distribution, or an occupancy-conditioned latent state; the alignment may be learned by attention, deformable convolution, KNN interpolation, exact replication, or deterministic resampling; and the topology may be patch-wise, adjacent-level, all-scale, or hierarchical. What remains constant is the attempt to replace scale-isolated processing with explicit cross-resolution information flow.