Insights on Cross-Modal Learning for Pedestrian Detection
In the paper "Learning Cross-Modal Deep Representations for Robust Pedestrian Detection," the authors propose a cross-modal deep learning approach that improves pedestrian detection under challenging illumination. The work addresses a persistent problem in computer vision: detecting pedestrians reliably in poorly lit scenes. The proposed method is a two-phase framework of Convolutional Neural Networks (CNNs) that bridges RGB and thermal data, improving detection accuracy without requiring thermal input at test time.
Methodological Overview
The paper's innovation lies in its use of a cross-modality learning framework that comprises two primary phases:
- Region Reconstruction Network (RRN): In the first phase, a deep CNN learns a non-linear mapping between paired RGB and thermal images. This unsupervised process needs no pedestrian annotations: it operates on proposals produced by a generic detector (ACF), focusing on regions likely to contain pedestrians. Built on the VGG-13 architecture, the RRN learns to reconstruct thermal regions from RGB inputs, thereby modeling the correlation between the two modalities (see the first sketch after this list).
- Multi-Scale Detection Network (MSDN): Once the RRN is trained, its learned representations are transferred into a second CNN, the MSDN, which operates on RGB data alone and combines multi-scale information for robust pedestrian detection. Sub-Net B within the MSDN is initialized with weights transferred from the RRN and fuses them with multi-scale RGB features, improving detection accuracy particularly under adverse lighting (see the second sketch after this list).
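The following is a minimal, illustrative sketch of the first phase, assuming PyTorch. The layer configuration, the names `RegionReconstructionNet` and `rrn_step`, and the plain pixel-wise MSE loss are simplifying assumptions rather than the paper's exact VGG-based design; the point is only to show an RRN-style encoder-decoder regressing a thermal region from an RGB region, trained on aligned region pairs without pedestrian labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionReconstructionNet(nn.Module):
    """Hypothetical RRN-style encoder-decoder: RGB region in, thermal region out."""
    def __init__(self):
        super().__init__()
        # Encoder: a few VGG-style conv blocks (stand-in for the pretrained backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsample back to the region size and predict a 1-channel thermal map.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(128, 1, 3, padding=1),
        )

    def forward(self, rgb_region):
        feat = self.encoder(rgb_region)      # cross-modal representation to transfer later
        thermal_pred = self.decoder(feat)
        return thermal_pred, feat

def rrn_step(model, optimizer, rgb_regions, thermal_regions):
    """One training step on aligned RGB/thermal region pairs cropped around
    generic (e.g. ACF) proposals; no pedestrian labels are needed."""
    thermal_pred, _ = model(rgb_regions)
    loss = F.mse_loss(thermal_pred, thermal_regions)  # pixel-wise reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```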
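A companion sketch of the second phase, again with hypothetical module names and layer sizes (PyTorch assumed). It illustrates the mechanism described above, a detection branch whose convolutional weights come from the trained RRN encoder, fused with multi-scale RGB features for per-region scoring; the real MSDN's architecture, fusion scheme, and detection head differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDetectionNet(nn.Module):
    """Toy MSDN: plain RGB blocks provide multi-scale features, while sub_net_b
    carries the RRN-derived (cross-modal) weights; features are fused for scoring."""
    def __init__(self, rrn_encoder):
        super().__init__()
        # Sub-Net A: plain RGB backbone producing features at two scales.
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(2))
        # Sub-Net B: initialized from the trained RRN encoder.
        self.sub_net_b = rrn_encoder
        # Fusion + scoring head over pooled region features.
        self.head = nn.Sequential(
            nn.Linear((64 + 128 + 256) * 7 * 7, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 2),  # pedestrian / background logits
        )

    def forward(self, rgb_region):
        f1 = self.block1(rgb_region)     # fine scale
        f2 = self.block2(f1)             # coarse scale
        fb = self.sub_net_b(rgb_region)  # cross-modal features
        # Pool every feature map to a fixed spatial size before concatenation.
        pooled = [F.adaptive_avg_pool2d(f, 7).flatten(1) for f in (f1, f2, fb)]
        return self.head(torch.cat(pooled, dim=1))
```

In this simplified setting the transfer itself is just a copy of the trained encoder, e.g. `msdn = MultiScaleDetectionNet(copy.deepcopy(rrn.encoder))`, after which the whole network would be fine-tuned on RGB pedestrian annotations only.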
Experimental Evaluation
The authors evaluate the framework on two benchmarks: the KAIST multispectral pedestrian dataset and the Caltech pedestrian dataset. The experiments support the value of cross-modal representation learning, showing clear improvements over state-of-the-art methods. On KAIST, the proposed CMT-CNN achieves a notable reduction in miss rate compared with methods that rely solely on RGB data, and even with approaches that combine RGB and thermal input but lack the proposed learning structure.
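The miss-rate figures reported on these benchmarks are conventionally log-average miss rates: miss rate sampled at nine false-positives-per-image (FPPI) points spaced evenly in log space between 10^-2 and 10^0, then averaged in log space. Below is a minimal NumPy sketch of that computation under simplified curve-sampling assumptions (the official Caltech/KAIST evaluation code handles curve endpoints more carefully); the example curve values are made up purely for illustration.

```python
import numpy as np

def log_average_miss_rate(miss_rates, fppi):
    """Approximate log-average miss rate for one detector's miss-rate-vs-FPPI curve.

    miss_rates, fppi: arrays sorted so that fppi is increasing.
    """
    ref_points = np.logspace(-2.0, 0.0, num=9)  # nine FPPI references in [1e-2, 1]
    samples = []
    for ref in ref_points:
        # Take the miss rate at the largest FPPI not exceeding the reference point;
        # if the curve starts above that FPPI, fall back to its first value.
        idx = np.searchsorted(fppi, ref, side='right') - 1
        samples.append(miss_rates[max(idx, 0)])
    # Average in log space (epsilon guards against log(0)).
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))

# Example: a detector whose miss rate drops from 80% to 20% as FPPI grows.
fppi = np.array([0.01, 0.03, 0.1, 0.3, 1.0])
miss = np.array([0.80, 0.60, 0.45, 0.30, 0.20])
print(f"log-average miss rate: {log_average_miss_rate(miss, fppi):.3f}")
```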
Unlike techniques that exploit both RGB and thermal data at test time, the proposed framework requires no thermal input during testing, which lowers system cost and complexity while maintaining competitive performance. Results on the Caltech dataset further confirm the framework's transferability and robustness, notably without requiring manual pedestrian annotation in the thermal domain.
Implications and Future Directions
This research contributes significantly to pedestrian detection by presenting a scalable method that captures the benefits of multispectral data without incurring their typical deployment costs. The idea of unsupervised cross-modal learning in CNNs extends beyond pedestrian detection, suggesting applications wherever robustness to changing conditions is needed, such as depth image reconstruction for RGB-D tasks or transfer across sensor modalities in autonomous systems.
Future research might explore enhancing the framework's scalability to other object detection tasks and integrating additional modalities such as lidar or radar for autonomous vehicle and robotics applications. Further, optimizing the architecture for lower computational costs while maintaining high accuracy could be crucial for real-time applications. The potential for cross-domain representation learning to facilitate adaptive perception systems remains a fertile ground for innovation inspired by this work.