Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction

Published 3 Apr 2026 in cs.CV | (2604.03339v1)

Abstract: Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training. Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 ($-7.4\%$) and RMSE to 0.316 ($-5.4\%$) on NYU Depth v2, while attaining near-perfect threshold accuracy ($\delta < 1.25^3 \approx 99.8\%$) on KITTI with only 194M parameters and 21 ms inference time.


Authors (4)

Summary

The paper introduces a novel dense depth estimation framework that integrates hierarchical adapters with hybrid pyramid feature fusion, achieving state-of-the-art performance.
It employs multi-scale pooling and biaxial feature aggregation to capture both local details and global context, leading to enhanced accuracy across benchmarks.
The fully-connected CRF decoder with dynamic attention refines spatial consistency and edge fidelity while maintaining computational efficiency.

Overview

The manuscript "Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction" (2604.03339) presents a monocular depth estimation method that pairs a conditional random field (CRF) decoder with a transformer encoder. The network uses the Swin Transformer for hierarchical feature extraction, adds a Hybrid Pyramid Feature Fusion (HPF) module for context aggregation, and introduces a Hierarchical Awareness Adapter (HA) to improve cross-level information propagation. The decoder is a fully-connected CRF with dynamic scaling attention for pixel-level spatial reasoning. The approach is evaluated on NYU Depth v2, KITTI, and MatterPort3D, where it reports state-of-the-art accuracy with 194M parameters and 21 ms inference time.

Technical Contributions

Hybrid Pyramid Feature Fusion (HPF)

HPF is designed to address the local-global contextual aggregation challenge by integrating spatial pyramid pooling with biaxial aggregation. The method operates as follows:

Multi-Scale Fusion: Features are pooled at multiple scales ($1 \times 1$, $2 \times 2$, $3 \times 3$), passed through depth-wise convolutions, then upsampled and concatenated, combining global priors with local detail.
Biaxial Feature Aggregation: Features are separately summarized along horizontal and vertical axes, allowing the model to capture elongated contextual relationships, which standard pooling often misses, especially on irregular or weakly-textured objects.

In the reported ablations, this combined fusion outperforms single-path and high-scale-only variants on both error metrics and inference time, balancing spatial coverage against parameter overhead.
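The two HPF pathways can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the depth-wise convolutions are omitted, nearest-neighbour upsampling and plain channel concatenation are assumptions, and the function names (`adaptive_avg_pool`, `upsample_nearest`, `hybrid_pyramid_fusion`) are hypothetical.

```python
import numpy as np


def adaptive_avg_pool(x, bins):
    """Average-pool a (C, H, W) map down to a (C, bins, bins) grid."""
    c, h, w = x.shape
    out = np.empty((c, bins, bins), dtype=np.float64)
    for i in range(bins):
        # Floor for the start index, ceiling for the end index of each bin.
        r0, r1 = (i * h) // bins, -(-((i + 1) * h) // bins)
        for j in range(bins):
            c0, c1 = (j * w) // bins, -(-((j + 1) * w) // bins)
            out[:, i, j] = x[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out


def upsample_nearest(x, h, w):
    """Nearest-neighbour upsampling of a (C, h0, w0) map to (C, h, w)."""
    _, h0, w0 = x.shape
    rows = (np.arange(h) * h0) // h
    cols = (np.arange(w) * w0) // w
    return x[:, rows][:, :, cols]


def hybrid_pyramid_fusion(x, scales=(1, 2, 3)):
    """Concatenate the input, pyramid-pooled context, and biaxial context."""
    x = np.asarray(x, dtype=np.float64)
    _, h, w = x.shape
    branches = [x]
    # Multi-scale pathway: pool to s x s, then upsample back to (H, W).
    for s in scales:
        branches.append(upsample_nearest(adaptive_avg_pool(x, s), h, w))
    # Biaxial pathway: summarize along the horizontal and vertical axes.
    row_ctx = x.mean(axis=2, keepdims=True)  # (C, H, 1), one value per row
    col_ctx = x.mean(axis=1, keepdims=True)  # (C, 1, W), one value per column
    branches.append(np.broadcast_to(row_ctx, x.shape))
    branches.append(np.broadcast_to(col_ctx, x.shape))
    return np.concatenate(branches, axis=0)


features = np.arange(2 * 6 * 6, dtype=np.float64).reshape(2, 6, 6)
fused = hybrid_pyramid_fusion(features)
print(fused.shape)  # (12, 6, 6)
```

With two input channels, the output stacks six branches (input, three pyramid scales, two axis summaries), so a real module would typically follow the concatenation with a projection back to the working channel width.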

Hierarchical Awareness Adapter (HA)

To mitigate the rigidity and potential over-complexity of standard hierarchically stacked transformers, an HA module is inserted at each encoder stage. It has three components:

Perception Module: Integrates down-projection and up-projection (convolution and transposed convolution) for channel recalibration, enhancing feature discrimination in both shallow (edge/detail-oriented) and deep (semantic) layers.
Broadcast Module: Facilitates global information flow via feature-wise average pooling and broadcast addition, modulated by learnable per-channel weighting for flexible attention allocation.
Dimensional Scaling: Provides per-channel scaling, improving representational sparsity and regularizing the gradients of multi-head self-attention (MSA), which supports stable and efficient training.

The HA module reduces computational redundancy while enriching both local and non-local perceptions across the encoder.
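One way to picture how the three HA components might interact is the NumPy sketch below. It rests on several assumptions not stated in the summary: linear projections stand in for the convolution and transposed convolution, a ReLU sits in the bottleneck, and the branches are combined additively with a residual connection. The function `hierarchical_adapter` and all parameter names are hypothetical.

```python
import numpy as np


def hierarchical_adapter(tokens, w_down, w_up, broadcast_weight, dim_scale):
    """Adapter sketch on a (N, C) token matrix.

    w_down: (C, r) down-projection, w_up: (r, C) up-projection,
    broadcast_weight, dim_scale: (C,) per-channel weights.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    # Perception module: bottleneck down-projection, then up-projection.
    hidden = np.maximum(tokens @ w_down, 0.0)  # ReLU is an assumption
    perceived = hidden @ w_up
    # Broadcast module: average over all tokens, weight per channel,
    # and add the same global vector back to every token.
    global_ctx = tokens.mean(axis=0, keepdims=True)  # (1, C)
    broadcast = broadcast_weight * global_ctx
    # Dimensional scaling: per-channel scale on the adapter output,
    # applied before the (assumed) residual connection.
    return tokens + dim_scale * (perceived + broadcast)


rng = np.random.default_rng(0)
n_tokens, channels, bottleneck = 16, 8, 2
tokens = rng.normal(size=(n_tokens, channels))
w_down = rng.normal(size=(channels, bottleneck)) * 0.1
w_up = rng.normal(size=(bottleneck, channels)) * 0.1
out = hierarchical_adapter(
    tokens, w_down, w_up,
    broadcast_weight=np.ones(channels),
    dim_scale=np.full(channels, 0.1),
)
print(out.shape)  # (16, 8)
```

In this formulation, setting `dim_scale` to zero turns the adapter into an identity mapping, which is a common way to make an inserted module safe at initialization.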

Fully-Connected CRF Decoder with Dynamic Attention

The decoder adopts a fully-connected CRF structure:

Energy Modeling: Combines convolutional unary potentials with pairwise terms parameterized by cosine similarity of feature embeddings, further conditioned on explicit positional encodings.
Dynamic Scaling Attention: Window-based attention, inspired by shifted-window transformers, keeps pairwise reasoning local and scalable. A learnable temperature $\tau$ adapts how sharply attention is distributed, and a bias learning unit prevents extreme-value collapse and stabilizes training.
Region-Based Partitioning and Pixel Shuffle: The feature map is partitioned into local regions for dense intra-region modeling, and pixel-shuffle upsampling recovers sharp edges in the output map.

This CRF-based inference directly optimizes spatial consistency and edge fidelity, addressing typical artifacts of convolutional and transformer-based decoders.
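The pairwise term can be sketched as cosine-similarity attention inside non-overlapping windows. This is a simplified illustration under assumptions: the temperature `tau` divides the similarity, the bias is added to the logits, and window shifting, positional encodings, the unary term, and pixel shuffle are all left out. The function names are hypothetical.

```python
import numpy as np


def window_partition(x, win):
    """Split an (H, W, C) map into (num_windows, win*win, C) token groups."""
    h, w, c = x.shape
    x = x.reshape(h // win, win, w // win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, c)


def cosine_window_attention(q, k, v, tau, bias, eps=1e-8):
    """Attention whose logits are cosine similarities scaled by 1/tau."""
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    # Pairwise cosine similarity within each window, plus an additive bias.
    logits = qn @ np.swapaxes(kn, -1, -2) / tau + bias
    # Numerically stable softmax over the keys of each window.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 3))        # toy (H, W, C) feature map
windows = window_partition(feat, win=2)  # 4 windows of 4 tokens each
bias = np.zeros((4, 4))                  # stand-in for a learned bias
out, weights = cosine_window_attention(windows, windows, windows, tau=0.5, bias=bias)
print(out.shape, weights.shape)  # (4, 4, 3) (4, 4, 4)
```

Because cosine similarity is bounded in $[-1, 1]$, the logits stay within $\pm 1/\tau$ plus the bias, so a smaller `tau` sharpens the attention and a larger one flattens it toward uniform.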

Empirical Evaluation

Results on Standard Benchmarks

The reported results are:

NYU Depth v2: Absolute Relative Error (Abs Rel) 0.088, down by 7.4% vs. NeWCRFs; RMSE 0.316, down by 5.4%; $\delta < 1.25^3$ accuracy 99.8%.
KITTI: Abs Rel 0.049; RMSE 2.062; $\delta < 1.25^2$ and $\delta < 1.25^3$ above 99.8%, consistently outperforming strong competitors such as PixelFormer, NeWCRFs, DDP, and ZoeDepth.
MatterPort3D: Abs Rel 0.0574, best among comparison methods, supporting robust domain generalization.

Significantly, these results are obtained with 194M parameters and rapid single-frame inference (21 ms), exhibiting favorable efficiency compared to prior art despite the added CRF component.
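For readers unfamiliar with the metrics quoted above, the sketch below computes them using the definitions that are standard in the depth estimation literature. The summary does not restate these definitions, so treating them as the ones used here is an assumption, as is masking out pixels with non-positive ground truth. The function name `depth_metrics` is hypothetical.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Abs Rel, RMSE, and threshold accuracies for positive depth values."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    # Abs Rel: mean of |pred - gt| / gt.
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    # RMSE: root of the mean squared error, in the units of the depth map.
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    # Threshold accuracy: share of pixels with max(pred/gt, gt/pred) < 1.25^k.
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = {f"delta{k}": float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return {"abs_rel": float(abs_rel), "rmse": float(rmse), **deltas}


metrics = depth_metrics(pred=[1.1, 2.0, 2.0], gt=[1.0, 2.0, 4.0])
print(metrics)
```

Read against these definitions, the quoted relative reductions imply baseline values of roughly $0.088 / (1 - 0.074) \approx 0.095$ for Abs Rel and $0.316 / (1 - 0.054) \approx 0.334$ for RMSE on NYU Depth v2. These baselines are back-computed from the figures above, not taken from the summary.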

Ablation and Complexity

Ablation studies confirm that HPF, HA, and CRF modules all contribute distinct, non-redundant gains. Fusion strategies that combine multi-scale and biaxial pathways are especially effective. Complexity analysis demonstrates lower computational and memory scaling at high resolutions, with windowed attention and selective adaptation in HA providing practical gains over $O(n^2)$ fully-connected attention schemes.
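The scaling argument can be made concrete by counting pairwise interactions. The sketch below compares fully-connected attention over $n = H \times W$ positions with attention restricted to non-overlapping $M \times M$ windows. The feature map size and window size are illustrative assumptions, not values from the paper.

```python
def full_attention_pairs(h, w):
    """Pairwise interactions when every position attends to every other."""
    n = h * w
    return n * n


def windowed_attention_pairs(h, w, win):
    """Pairwise interactions when attention stays inside win x win windows."""
    if h % win or w % win:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (h // win) * (w // win)
    tokens_per_window = win * win
    return num_windows * tokens_per_window ** 2


# Hypothetical 56 x 56 feature map with 7 x 7 windows.
full = full_attention_pairs(56, 56)
windowed = windowed_attention_pairs(56, 56, win=7)
print(full, windowed, full // windowed)  # 9834496 153664 64
```

The windowed count equals $n \cdot M^2$, which is linear in the number of positions for a fixed window size, so the saving factor $n / M^2$ grows with resolution.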

Implications and Future Directions

On the evaluated benchmarks, the reported results place this architecture at the state of the art for dense monocular depth estimation in both accuracy and efficiency. The combination of a transformer-based encoder, adaptive hierarchical cross-level communication, and explicit structured reasoning via CRFs marks a notable advance in low-level vision modeling.

Practical Impact: The speed, parameter efficiency, and edge-preserving quality render the method suitable for deployment in computationally constrained applications, ranging from mobile robotics to augmented reality.
Theoretical Significance: The work exemplifies the effective hybridization of self-attention, CRFs, and context-aware adapters, suggesting a general strategy for unifying local and global spatial reasoning in dense prediction tasks.
Broader Scope: The presented techniques could extend to other spatially structured vision problems, such as semantic and instance segmentation, and could support future integration with multimodal or temporal inputs for unified scene understanding.

Potential developments include joint training with multimodal supervision, learning more expressive pairwise potentials, or further optimizing inference for embedded contexts. Opportunities also exist for adaptation to video, large-scale outdoor datasets, or integration with probabilistic uncertainty quantification for autonomous systems.

Conclusion

The presented framework demonstrates that careful integration of hybrid pyramid fusion, hierarchical adapters, and a CRF-based decoder enables accurate and efficient monocular depth estimation. Its strong results on NYU Depth v2, KITTI, and MatterPort3D, together with its generalization and resource efficiency, establish a compelling new baseline for structured dense prediction, with promising implications for both future research and practical deployment in vision systems.
