Scene recognition with CNNs: objects, scales and dataset bias (1801.06867v1)

Published 21 Jan 2018 in cs.CV

Abstract: Since scenes are composed in part of objects, accurate recognition of scenes requires knowledge about both scenes and objects. In this paper we address two related problems: 1) scale induced dataset bias in multi-scale convolutional neural network (CNN) architectures, and 2) how to combine effectively scene-centric and object-centric knowledge (i.e. Places and ImageNet) in CNNs. An earlier attempt, Hybrid-CNN, showed that incorporating ImageNet did not help much. Here we propose an alternative method taking the scale into account, resulting in significant recognition gains. By analyzing the response of ImageNet-CNNs and Places-CNNs at different scales we find that both operate in different scale ranges, so using the same network for all the scales induces dataset bias resulting in limited performance. Thus, adapting the feature extractor to each particular scale (i.e. scale-specific CNNs) is crucial to improve recognition, since the objects in the scenes have their specific range of scales. Experimental results show that the recognition accuracy highly depends on the scale, and that simple yet carefully chosen multi-scale combinations of ImageNet-CNNs and Places-CNNs, can push the state-of-the-art recognition accuracy in SUN397 up to 66.26% (and even 70.17% with deeper architectures, comparable to human performance).

Authors (3)
  1. Luis Herranz (46 papers)
  2. Shuqiang Jiang (30 papers)
  3. Xiangyang Li (58 papers)
Citations (205)

Summary

Analysis of "Scene Recognition with CNNs: Objects, Scales and Dataset Bias"

This paper by Herranz, Jiang, and Li addresses two intertwined challenges in scene recognition: the interaction between objects, scales, and dataset bias in convolutional neural networks (CNNs). It advances our understanding of how CNNs can be tailored for scene recognition by leveraging both object-centric and scene-centric datasets, namely ImageNet and Places.

Fundamental Insights

The central thesis rests on two primary issues: scale-induced dataset bias in multi-scale CNN architectures, and the effective integration of scene-centric and object-centric knowledge. The paper argues that because objects within a scene appear at widely varying scales, a one-size-fits-all feature extractor using a single CNN is suboptimal. Analyzing network responses at different scales, the authors find that ImageNet-CNNs and Places-CNNs operate in different scale ranges: object-centric networks respond best to local, zoomed-in patches, while scene-centric networks are better suited to global, whole-scene views, so applying one network across all scales induces dataset bias.

Scale-Specific CNNs

A notable contribution of the paper is the introduction of scale-specific CNNs to counteract dataset bias. The authors show how ImageNet-CNNs and Places-CNNs can each be adapted to their respective optimal scale ranges. This adaptation uses different CNNs for different scale ranges instead of a single generic network for all scales, a strategy that yielded considerable improvements in recognition performance.
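The routing idea can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the function names, the toy stand-in extractors, and the 0.5 scale threshold are all assumptions introduced here for clarity.

```python
import numpy as np

def extract_features(patch_scale, patch, object_cnn, scene_cnn, threshold=0.5):
    """Route a patch to the extractor matching its relative scale.

    patch_scale is the patch side length divided by the image side length.
    Small patches tend to contain isolated objects (ImageNet-like content),
    while large patches capture whole-scene layout (Places-like content).
    The threshold is illustrative; the paper selects ranges empirically.
    """
    if patch_scale < threshold:
        return object_cnn(patch)  # object-centric features for small scales
    return scene_cnn(patch)      # scene-centric features for large scales

# Toy stand-ins for the two pre-trained CNN feature extractors.
object_cnn = lambda x: np.tanh(x.mean(axis=(0, 1)))
scene_cnn = lambda x: np.tanh(x.max(axis=(0, 1)))

patch = np.random.rand(64, 64, 3)
f_small = extract_features(0.3, patch, object_cnn, scene_cnn)  # object-centric
f_large = extract_features(0.9, patch, object_cnn, scene_cnn)  # scene-centric
```

The key design point is that the choice of feature extractor depends only on the patch's scale, so each network is always applied within the scale range its training data resembles.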

Experimental Results

The authors conducted extensive experiments highlighting the limitations of a fixed CNN model applied across multiple scales. They demonstrate that a strategic combination of ImageNet-CNNs for object scales and Places-CNNs for scene scales yields substantial gains, achieving state-of-the-art accuracy on SUN397 of 66.26%, rising to 70.17% with deeper architectures, a level the authors describe as comparable to human performance. These outcomes validate the hypothesis that dataset bias is intricately tied to scale, and that mitigating this bias unlocks superior performance.
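The multi-scale combination described above amounts to late fusion of per-scale classifier outputs. The following is a minimal sketch under stated assumptions: the per-scale networks are random placeholders, the simple score averaging is one plausible fusion choice rather than the paper's exact procedure, and only the 397-class SUN397 setting is taken from the source.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes = 397  # SUN397 has 397 scene categories

# One score vector per scale, e.g. from ImageNet-CNNs at object scales and
# Places-CNNs at scene scales; random placeholders stand in for real networks.
scores_per_scale = [softmax(rng.normal(size=num_classes)) for _ in range(4)]

# Average the per-scale class distributions and take the top class.
fused = np.mean(scores_per_scale, axis=0)
prediction = int(np.argmax(fused))
```

Because each scale is handled by the network suited to it, the fused scores combine complementary object-level and scene-level evidence rather than averaging redundant predictions.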

Implications for Future Research

The incorporation of scale-specific features and fine-tuning strategies as outlined provides a roadmap for further advancements in computer vision tasks involving scene recognition. The authors' methodology opens avenues for exploring how multi-scale architectures can be further refined, potentially including even more sophisticated pooling methods or integrating different network architectures like residual networks.

Theoretical and Practical Implications

Theoretically, this research contributes to the discourse on dataset bias and scale invariance in neural networks, proposing a model that leverages hierarchical structuring of features. Practically, this approach is significant for real-world applications where scenes are complex and multi-scaled, such as autonomous navigation and surveillance, where robust scene understanding is essential.

Future Directions

Potential future directions could focus on expanding the scale-specific approach to explore its efficacy across other domains of visual recognition. Moreover, further research could investigate the integration of additional datasets or novel CNN architectures to enhance context awareness and scalability. Additionally, assessing the trade-offs between computational efficiency and recognition accuracy in scale-specific multi-scale CNNs would be valuable.

In summary, the paper presents a compelling argument for the use of scale-specific CNNs to address scale-induced dataset bias, providing both an empirical and theoretical foundation to enhance scene recognition capabilities in neural networks.