DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation (2504.04701v1)

Published 7 Apr 2025 in cs.CV

Abstract: Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which will then be used as geometry priors to allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance in various RGBD semantic segmentation benchmarks. Code is available at: https://github.com/VCIP-RGBD/DFormer.

Summary

  • The paper introduces a geometry prior mechanism that fuses depth and spatial distances to inform Transformer self-attention.
  • It proposes Geometry Self-Attention with learnable decay rates and axes decomposition to efficiently handle high-resolution features.
  • The approach achieves state-of-the-art performance on RGB-D benchmarks while reducing computational overhead by eliminating separate depth encoders.

This paper introduces DFormerv2, a novel vision backbone architecture for RGB-D semantic segmentation that utilizes depth information as an explicit geometry prior rather than encoding it through dedicated neural network layers (2504.04701). The core idea is to leverage the 3D geometric relationships inherent in depth maps to guide the attention mechanism within a Transformer-based encoder.

Key Contributions:

  1. Geometry Prior: The paper proposes generating a "geometry prior" by combining depth distances and spatial distances between image patches. This prior encapsulates the 3D relationships within the scene.
  2. Geometry Self-Attention (GSA): A new self-attention mechanism is introduced where the calculated geometry prior modulates the standard attention weights. This allows the model to focus attention based on geometric proximity and structure.
  3. Efficient RGB-D Encoder (DFormerv2): A hierarchical Vision Transformer encoder is built using GSA blocks. Notably, it processes only the RGB image through the main network trunk, while the depth map is solely used to compute the geometry priors for the GSA modules at different stages, eliminating the need for a separate depth encoder or complex fusion modules.

Methodology:

  1. Geometry Prior Generation:
    • Depth Prior ($D$): For an input image divided into patches, the average depth value within each patch is calculated. The absolute difference in average depth between patches $(i,j)$ and $(i',j')$ gives the depth distance $D_{ij,i'j'} = |z_{ij} - z_{i'j'}|$.
    • Spatial Prior ($S$): The Manhattan distance between the spatial coordinates of patches $(i,j)$ and $(i',j')$ is calculated: $S_{ij,i'j'} = |i-i'| + |j-j'|$.
    • Fusion ($G$): The depth prior matrix $D$ and spatial prior matrix $S$ (both of size $HW \times HW$) are fused using a weighted summation with two learnable parameters (memories) to create the final geometry prior matrix $G$.
  2. Geometry Self-Attention (GSA):
    • Standard self-attention computes its output as $\mathrm{Softmax}(QK^T)V$.
    • GSA modifies this by introducing a decay factor based on the geometry prior: $\mathrm{GeoAttn}(Q,K,V,G) = (\mathrm{Softmax}(QK^T) \odot \beta^G)V$.
    • Here, $\beta \in (0,1)$ is a learnable decay rate, and $\beta^G$ is a matrix whose elements are $\beta$ raised to the power of the corresponding elements of $G$. This element-wise multiplication suppresses attention between geometrically distant patches and enhances attention between nearby ones. Different heads use different decay rates $\beta$ (sampled from a range such as $[0.75, 1.0)$) for diversity.
  3. Axes Decomposition: To reduce the $O((HW)^2)$ complexity of GSA, especially for high-resolution feature maps in early stages, the attention is decomposed along the horizontal ($x$) and vertical ($y$) axes. Separate geometry priors ($G^x$, $G^y$) are computed for rows and columns, and attention is applied sequentially: $\mathrm{GeoAttn} = \mathrm{GeoAttn}^y(\mathrm{GeoAttn}^x V)^T$ (see the sketch after this list).
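
A minimal sketch of this axes-decomposed attention, assuming a simplified single-head setting with no learned projections and with per-axis priors built directly from patch coordinates and mean depths (all function and variable names are illustrative, not taken from the official code):

import torch

def axis_geo_attention(q, k, v, prior, beta=0.9):
    # q, k, v: (..., N, C); prior: (..., N, N) geometry prior along one axis.
    # Softmax attention modulated by the per-axis decay beta ** prior.
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    attn = attn * beta ** prior
    return attn @ v

def decomposed_geo_attention(x, depth, beta=0.9):
    # x:     (B, H, W, C) feature map (used as q, k, and v in this sketch)
    # depth: (B, H, W) per-patch average depth
    B, H, W, C = x.shape

    # Horizontal prior G^x per row: |column distance| + |depth difference|
    cols = torch.arange(W, dtype=x.dtype, device=x.device)
    s_x = (cols[:, None] - cols[None, :]).abs()                  # (W, W)
    d_x = (depth.unsqueeze(-1) - depth.unsqueeze(-2)).abs()      # (B, H, W, W)
    out = axis_geo_attention(x, x, x, s_x + d_x, beta)           # attend along x

    # Vertical prior G^y per column: |row distance| + |depth difference|
    rows = torch.arange(H, dtype=x.dtype, device=x.device)
    s_y = (rows[:, None] - rows[None, :]).abs()                  # (H, H)
    depth_t = depth.transpose(1, 2)                              # (B, W, H)
    d_y = (depth_t.unsqueeze(-1) - depth_t.unsqueeze(-2)).abs()  # (B, W, H, H)
    out = out.transpose(1, 2)                                    # (B, W, H, C)
    out = axis_geo_attention(out, out, out, s_y + d_y, beta)     # attend along y
    return out.transpose(1, 2)                                   # back to (B, H, W, C)

The two sequential passes cost roughly $O(HW \cdot W) + O(HW \cdot H)$ instead of a single $O((HW)^2)$ pass over all token pairs, which is what makes GSA affordable at the high-resolution early stages.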

DFormerv2 Architecture:

  • Encoder: A standard hierarchical Transformer encoder with four stages, producing feature maps at 1/4, 1/8, 1/16, and 1/32 resolution. It uses GSA blocks instead of standard self-attention. Axes-decomposed GSA is used in the first three stages, while the last stage uses standard GSA.
  • Input: An RGB image is passed through a stem layer (two 3x3 convolutions) and then fed into the encoder. The corresponding depth map is downsampled via average pooling to match the resolution of each stage and is used solely to generate the geometry priors ($G$, $G^x$, $G^y$) for the GSA blocks; a per-stage sketch follows this list. No explicit depth feature extraction network is used.
  • Decoder: A lightweight decoder head (e.g., from (Junior et al., 2021)) takes features from the last three encoder stages to predict the final segmentation map.
  • Variants: DFormerv2-S, DFormerv2-B, and DFormerv2-L models are presented with varying sizes and computational costs.
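
As a complement to the bullets above, the following minimal sketch shows how the depth map alone could be turned into one geometry prior per encoder stage, assuming it is average-pooled to each stage's token grid (strides 4, 8, 16, 32, matching the 1/4 to 1/32 resolutions) and fused with Manhattan patch distances. The fixed weights w_depth and w_spatial stand in for the learnable fusion parameters, and all names are illustrative rather than taken from the official code:

import torch
import torch.nn.functional as F

def stage_geometry_priors(depth_map, stage_strides=(4, 8, 16, 32),
                          w_depth=0.5, w_spatial=0.5):
    # depth_map: (B, 1, H_img, W_img). For each stage, pool the depth map to
    # that stage's token grid and build a (B, HW, HW) geometry prior from
    # pairwise depth and spatial (Manhattan) distances. RGB features never
    # enter this function; only the priors are handed to the GSA blocks.
    B, _, H_img, W_img = depth_map.shape
    priors = []
    for stride in stage_strides:
        H, W = H_img // stride, W_img // stride
        z = F.adaptive_avg_pool2d(depth_map, (H, W)).reshape(B, H * W)
        depth_dist = (z.unsqueeze(2) - z.unsqueeze(1)).abs()          # (B, HW, HW)

        ys, xs = torch.meshgrid(torch.arange(H, device=depth_map.device),
                                torch.arange(W, device=depth_map.device),
                                indexing='ij')
        coords = torch.stack([ys, xs], dim=-1).reshape(H * W, 2).float()
        spatial_dist = (coords.unsqueeze(1) - coords.unsqueeze(0)).abs().sum(-1)  # (HW, HW)

        priors.append(w_depth * depth_dist + w_spatial * spatial_dist)
    return priors  # one (B, HW, HW) prior per stage

Each returned prior (or its row/column counterparts in the stages that use the axes decomposition) is what the GSA blocks consume; the RGB stream itself carries no depth channels.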

Implementation Details & Experiments:

  • Pretraining: Models are pretrained on ImageNet-1K using estimated depth maps, employing a standard cross-entropy loss.
  • Finetuning: Evaluated on NYU DepthV2, SUNRGBD, and Deliver datasets using cross-entropy loss and AdamW optimizer. Standard augmentations like random flipping and scaling are used.
  • Results: DFormerv2 achieves state-of-the-art results on all three benchmarks across different model scales (Small, Base, Large). Notably, DFormerv2-L achieves 58.4% mIoU on NYU DepthV2 with significantly lower FLOPs (124.1G) compared to the previous best (GeminiFusion-B5 at 57.7% mIoU with 256.1G FLOPs). Similar efficiency gains are observed on SUNRGBD and Deliver.
  • Ablations: Experiments confirm the effectiveness of both depth and spatial priors, the learnable fusion method for combining them, the axes decomposition for efficiency, and the multi-scale decay rate strategy. Visualizations show the geometry prior capturing object structure and the GSA focusing attention effectively.
  • Latency: Inference latency tests on an RTX 3090 show DFormerv2 offers a superior speed-accuracy trade-off compared to competitors.

Practical Implications:

DFormerv2 offers a more efficient way to leverage depth information in semantic segmentation models. By treating depth as a geometric prior for attention rather than a separate modality requiring explicit encoding and fusion, it reduces computational overhead and parameter count while achieving state-of-the-art accuracy. This makes it potentially well-suited for applications where computational resources are constrained, such as robotics or autonomous driving, without sacrificing performance. The core GSA mechanism could potentially be applied to other vision tasks involving depth or other auxiliary geometric data.

The implementation requires calculating pairwise distances (depth and spatial) and applying the decay modulation within the attention mechanism. The code is available at the GitHub link above. Key considerations include efficient calculation of the prior matrices and managing the different decay rates per head. The listing below sketches one possible realization of the prior generation and the GSA module.

import torch
import torch.nn as nn
import torch.nn.functional as F

def generate_geometry_prior(depth_map, patch_size, H, W):
    # depth_map: Input depth map (B, 1, H_img, W_img)
    # patch_size: Size of the patch (e.g., 16)
    # H, W: Number of patches along height and width

    B = depth_map.shape[0]

    # 1. Get patch depth representations
    # Use average pooling to get the average depth per patch
    avg_pool = nn.AvgPool2d(kernel_size=patch_size, stride=patch_size)
    patch_depths = avg_pool(depth_map) # Shape: (B, 1, H, W)
    patch_depths = patch_depths.reshape(B, H * W) # Shape: (B, HW)

    # 2. Calculate Depth Distance Matrix D
    # Expand dims to compute pairwise differences
    z_diff = patch_depths.unsqueeze(2) - patch_depths.unsqueeze(1) # Shape: (B, HW, HW)
    D = torch.abs(z_diff) # Shape: (B, HW, HW)

    # 3. Calculate Spatial Distance Matrix S
    coords_h = torch.arange(H, device=depth_map.device)
    coords_w = torch.arange(W, device=depth_map.device)
    coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing='ij'), dim=-1) # Shape: (H, W, 2)
    coords_flat = coords.view(H * W, 2) # Shape: (HW, 2)
    # Compute pairwise Manhattan distances
    s_diff = coords_flat.unsqueeze(1) - coords_flat.unsqueeze(0) # Shape: (HW, HW, 2)
    S = torch.abs(s_diff[..., 0]) + torch.abs(s_diff[..., 1]) # Shape: (HW, HW)
    S = S.unsqueeze(0).expand(B, -1, -1) # Shape: (B, HW, HW)

    # 4. Fuse D and S (Simplified: learnable weights w1, w2 per model)
    # In practice, these weights (memories) are learnable parameters
    w1 = 0.5 # Example weight
    w2 = 0.5 # Example weight
    G = w1 * D + w2 * S # Shape: (B, HW, HW)

    return G

class GeometrySelfAttention(nn.Module):
    def __init__(self, dim, num_heads, decay_min=0.75, decay_max=1.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Linearly spaced per-head decay rates in [decay_min, decay_max);
        # the upper endpoint is excluded so no head degenerates to beta = 1 (no decay).
        # Registered as a buffer so the rates move with the module across devices;
        # make this an nn.Parameter instead if the rates should be learnable.
        self.register_buffer('decay_rates', torch.linspace(decay_min, decay_max, num_heads + 1)[:-1])

    def forward(self, x, geometry_prior_G):
        # x: input features (B, HW, C)
        # geometry_prior_G: precomputed geometry prior (B, HW, HW)
        B, N, C = x.shape

        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2] # Shape: (B, num_heads, HW, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale # Shape: (B, num_heads, HW, HW)
        attn = attn.softmax(dim=-1)

        # Apply geometry prior decay
        # Reshape decay rates for broadcasting: (1, num_heads, 1, 1)
        decay_rates_b = self.decay_rates.to(geometry_prior_G.device).view(1, -1, 1, 1)
        # Reshape G for broadcasting: (B, 1, HW, HW)
        geometry_prior_G_b = geometry_prior_G.unsqueeze(1)
        # Calculate decay matrix: beta^G per head
        decay_matrix = decay_rates_b ** geometry_prior_G_b # Shape: (B, num_heads, HW, HW)

        # Modulate attention map
        attn = attn * decay_matrix # Element-wise multiplication

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        return x
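
A quick, illustrative way to exercise the two pieces above together; the batch size, image size, patch size, and head count below are arbitrary sanity-check values:

# Minimal smoke test for the sketches above (illustrative sizes only)
B, C, patch_size = 2, 64, 16
H_img = W_img = 128
H, W = H_img // patch_size, W_img // patch_size    # 8 x 8 patch grid

depth = torch.rand(B, 1, H_img, W_img)             # stand-in depth map
tokens = torch.randn(B, H * W, C)                  # stand-in patch features

G = generate_geometry_prior(depth, patch_size, H, W)   # (B, HW, HW)
gsa = GeometrySelfAttention(dim=C, num_heads=4)
out = gsa(tokens, G)
print(out.shape)  # torch.Size([2, 64, 64])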
