Cars Can't Fly up in the Sky: Improving Urban-Scene Segmentation via Height-driven Attention Networks (2003.05128v3)

Published 11 Mar 2020 in cs.CV

Abstract: This paper exploits the intrinsic features of urban-scene images and proposes a general add-on module, called height-driven attention networks (HANet), for improving semantic segmentation for urban-scene images. It emphasizes informative features or classes selectively according to the vertical position of a pixel. The pixel-wise class distributions are significantly different from each other among horizontally segmented sections in the urban-scene images. Likewise, urban-scene images have their own distinct characteristics, but most semantic segmentation networks do not reflect such unique attributes in the architecture. The proposed network architecture incorporates the capability exploiting the attributes to handle the urban scene dataset effectively. We validate the consistent performance (mIoU) increase of various semantic segmentation models on two datasets when HANet is adopted. This extensive quantitative analysis demonstrates that adding our module to existing models is easy and cost-effective. Our method achieves a new state-of-the-art performance on the Cityscapes benchmark with a large margin among ResNet-101 based segmentation models. Also, we show that the proposed model is coherent with the facts observed in the urban scene by visualizing and interpreting the attention map. Our code and trained models are publicly available at https://github.com/shachoi/HANet

Citations (151)

Summary

  • The paper introduces Height-driven Attention Networks (HANet), a novel module that improves urban-scene semantic segmentation by incorporating height-specific contextual information into existing architectures.
  • HANet works by generating a channel-wise attention map based on vertical position, allowing the network to selectively emphasize features relevant to specific height-dependent classes like roads or sky.
  • Experimental results on Cityscapes and BDD100K datasets show that adding HANet to various backbones consistently boosts segmentation performance, achieving a new state-of-the-art mIoU of 82.05% on Cityscapes with minimal overhead.

Improving Urban-Scene Segmentation with Height-Driven Attention Networks

Semantic segmentation is a core component of computer vision, particularly for urban-scene understanding in applications such as autonomous driving. This paper introduces a novel architectural module, Height-driven Attention Networks (HANet), designed to enhance semantic segmentation by leveraging the structural characteristics inherent in urban-scene images. HANet selectively emphasizes informative features and classes according to a pixel's vertical position, exploiting the distinct and predictable variation in pixel-wise class distributions across different vertical sections of urban-scene images.

Concept and Methodology

Urban-scene images exhibit consistent vertical class distributions: roads appear mostly at the bottom of the image and sky at the top. Most semantic segmentation architectures do not exploit these spatial priors, leading to suboptimal performance. HANet is proposed as a lightweight add-on module that incorporates height-specific contextual information to modulate the importance of features across different horizontal segments of the image.
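
To make this prior concrete, the snippet below computes per-band class frequencies from a ground-truth label map; on Cityscapes-style labels the top band is dominated by sky and building pixels and the bottom band by road. The function name, the three-band split, and the label handling are illustrative choices, not part of the paper's method.

```python
import numpy as np

def bandwise_class_distribution(label_map, num_classes, num_bands=3):
    """Split a label map into horizontal bands and return per-band class frequencies."""
    bands = np.array_split(label_map, num_bands, axis=0)      # top band first
    dist = np.zeros((num_bands, num_classes))
    for i, band in enumerate(bands):
        valid = band[(band >= 0) & (band < num_classes)]      # drop ignore labels (e.g. 255)
        counts = np.bincount(valid, minlength=num_classes)
        dist[i] = counts / max(counts.sum(), 1)
    return dist
```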

The HANet module processes an input feature map to produce a per-height, channel-wise attention map that rescales features according to their vertical position. The attention is computed through a structured pipeline: width-wise pooling of the feature map, followed by convolutional layers that compute the height-driven attention map, optionally augmented with sinusoidal positional encoding.
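
A minimal PyTorch sketch of such a module is given below, following the pipeline just described: width-wise pooling, 1D convolutions along the height axis, and a per-height, channel-wise attention map that rescales the target features. The class name, reduction ratio, pooled height, and the omission of the optional sinusoidal positional encoding are simplifications for illustration rather than the authors' exact configuration; the released code linked in the abstract contains the full implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeightDrivenAttention(nn.Module):
    """Illustrative height-driven attention: estimate per-row, per-channel weights
    from one feature map and use them to rescale another."""

    def __init__(self, in_channels, out_channels, reduction=16, pooled_height=16):
        super().__init__()
        mid = max(in_channels // reduction, 8)
        self.pooled_height = pooled_height
        # 1D convolutions operate along the (pooled) height axis.
        self.attn = nn.Sequential(
            nn.Conv1d(in_channels, mid, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(mid, out_channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, low_feat, high_feat):
        # low_feat:  (N, C_in,  H_l, W_l) features used to estimate the attention
        # high_feat: (N, C_out, H_h, W_h) features to be modulated
        pooled = low_feat.mean(dim=3)                        # width-wise pooling -> (N, C_in, H_l)
        pooled = F.interpolate(pooled, size=self.pooled_height,
                               mode='linear', align_corners=False)
        weights = self.attn(pooled)                          # (N, C_out, pooled_height) in [0, 1]
        weights = F.interpolate(weights, size=high_feat.shape[2],
                                mode='linear', align_corners=False)
        return high_feat * weights.unsqueeze(3)              # broadcast over width
```

In a segmentation network, `low_feat` might come from an early backbone stage and `high_feat` from a later stage or the decoder; several such attention modules can be inserted at different depths.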

Experimental Validation

Extensive experiments on two well-known urban-scene datasets, Cityscapes and BDD100K, demonstrate the effectiveness and broad applicability of HANet. Adding HANet to various backbone networks, such as ResNet-101, consistently improves mean Intersection over Union (mIoU) scores.

On the Cityscapes benchmark, models equipped with HANet outperform their baselines without a significant increase in computational cost, achieving a new state-of-the-art mIoU of 82.05%. The module generalizes across datasets and backbones while adding minimal parameter overhead.

Results and Implications

HANet delivers consistent performance improvements in segmenting urban-scene images, setting a new state of the art on the Cityscapes benchmark when combined with standard inference techniques such as multi-scale and sliding-window evaluation. The methodology capitalizes on the predictable vertical spatial structure of urban scenes, offering a cost-effective enhancement to existing segmentation architectures.
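
For reference, multi-scale inference typically averages class probabilities over several rescaled copies of the input. The sketch below assumes a model that maps an image batch to per-pixel logits; the scale set is an illustrative choice, not the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_predict(model, image, scales=(0.5, 1.0, 1.5, 2.0)):
    """Average softmax probabilities over rescaled copies of `image` (N, C, H, W)."""
    h, w = image.shape[2:]
    prob_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode='bilinear', align_corners=False)
        logits = model(scaled)
        logits = F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)
        prob_sum = prob_sum + logits.softmax(dim=1)
    return prob_sum / len(scales)
```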

Visual analyses show that HANet assigns varying degrees of attention to feature channels depending on vertical position, in line with the observed pixel-wise class distribution patterns. These findings support the hypothesis that height-wise information can effectively aid pixel classification.

Future Directions

The deployment of HANet opens pathways for further exploration in urban-scene segmentation and beyond. Its lightweight and scalable design invites integration with other architectures and domains where spatial priors play a significant role. Future research may explore adaptive mechanisms within HANet to dynamically learn positional priors in novel environments, potentially advancing beyond urban scenes.

In summary, Height-driven Attention Networks offer a scalable and effective way to augment semantic segmentation in urban contexts, highlighting the role of inherent structural priors in improving computer vision tasks.
