- The paper presents HRNet, a neural architecture that maintains high-resolution representations via parallel streams and repeated fusion for enhanced visual recognition.
- HRNet delivers significant performance gains, reaching 76.3 AP for human pose estimation on COCO val and superior mIoU scores on semantic segmentation benchmarks such as Cityscapes and PASCAL-Context.
- The approach’s continuous high-resolution design offers practical benefits for applications such as augmented reality and self-driving cars, while inspiring future deep network innovations.
Deep High-Resolution Representation Learning for Visual Recognition
The paper presents a novel neural network architecture, the High-Resolution Network (HRNet), designed specifically for position-sensitive vision tasks such as human pose estimation, semantic segmentation, and object detection. Traditional deep CNNs, including ResNet and VGGNet, encode input images into low-resolution representations and then employ various strategies to recover high-resolution representations. In contrast, HRNet maintains high-resolution representations throughout the entire network.
The key characteristics of HRNet that set it apart are:
- Parallel High-to-Low Resolution Convolution Streams: Rather than connecting high-to-low resolution streams in series, HRNet connects them in parallel, ensuring that high-resolution information is not lost early in the network.
- Repeated Multi-Resolution Fusion: Information across resolutions is repeatedly exchanged, making the resulting representation semantically richer and spatially more precise.
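The two ideas above can be illustrated with a minimal sketch. The snippet below is not the paper's implementation (HRNet uses strided 3x3 convolutions for downsampling and bilinear upsampling followed by 1x1 convolutions, all learned); here simple average pooling and nearest-neighbor upsampling stand in for those learned operators, just to show how parallel streams at different resolutions exchange information while each stream keeps its own resolution:

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling halves spatial resolution
    # (HRNet uses strided 3x3 convolutions instead)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Nearest-neighbor upsampling doubles resolution
    # (HRNet uses bilinear upsampling + 1x1 convolution)
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(streams):
    """Multi-resolution fusion: each output stream is the sum of
    every input stream resampled to that stream's resolution."""
    fused = []
    for i in range(len(streams)):
        acc = np.zeros_like(streams[i])
        for j, s in enumerate(streams):
            r = s
            for _ in range(i - j):   # stream j is higher-res: downsample
                r = downsample(r)
            for _ in range(j - i):   # stream j is lower-res: upsample
                r = upsample(r)
            acc += r
        fused.append(acc)
    return fused

# Three parallel streams at full, 1/2, and 1/4 resolution
streams = [np.ones((8, 8)), 2 * np.ones((4, 4)), 3 * np.ones((2, 2))]
out = fuse(streams)
print([o.shape for o in out])
```

Note that after fusion each stream retains its original resolution, so the high-resolution stream is never discarded; HRNet repeats this exchange at multiple stages rather than once.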
Numerical Results and Implications
Empirical evaluations illustrate that HRNet significantly outperforms existing state-of-the-art methods across several tasks:
- Human Pose Estimation: HRNet achieves an AP score of 76.3 on the COCO val dataset for human pose estimation, outperforming the previous best model, SimpleBaseline, which achieved an AP score of 74.3.
- Semantic Segmentation: HRNet achieves mIoU scores of 81.1 on Cityscapes val, 54.0 on PASCAL-Context, and 55.9 on LIP, outperforming models like DeepLabv3 and PSPNet.
- Object Detection: In the Cascade Mask R-CNN framework, HRNet outperforms common backbones like ResNet and ResNeXt. Specifically, HRNetV2p-W48 achieves an AP of 44.8 on COCO val vs. 43.1 for ResNet-101.
These improvements are consistent across various evaluation metrics, including strict evaluation criteria such as AP75 for pose estimation.
Practical and Theoretical Implications
The success of HRNet has several implications:
- Practical Applications: The ability to maintain high-resolution representations end-to-end could lead to more accurate and reliable vision systems, empowering applications ranging from augmented reality to autonomous driving.
- Theoretical Insight: The approach provides evidence supporting the idea that maintaining high-resolution features throughout the network pipeline can be more beneficial than the traditional low-to-high resolution recovery methods.
Future Developments
The HRNet conceptual framework opens several avenues for future research:
- Extended Architectures: Future network designs may employ deeper or wider sets of parallel resolution streams to further improve accuracy and computational efficiency.
- Application Beyond Vision Tasks: While HRNet is designed for vision tasks, its parallel processing and fusion mechanisms could be adapted for other domains such as audio processing and NLP.
- Integration with Other Techniques: Combining HRNet with other advancements in network design (e.g., attention mechanisms, GANs) could yield even more powerful models.
In conclusion, HRNet introduces a highly effective approach to visual recognition by leveraging parallel multi-resolution processing and repeated fusion, marking a significant advance in deep network design for computer vision.