- The paper presents a novel CNN approach using a globally crowdsourced dataset of 110,988 images from 56 cities to quantify urban perception.
- The study introduces SS-CNN and RSS-CNN architectures, with RSS-CNN achieving a 73.5% prediction accuracy in pairwise safety judgments.
- The research offers practical insights for data-driven urban planning and socio-economic studies by linking perceptual attributes with urban design.
Deep Learning the City: Quantifying Urban Perception At A Global Scale
The paper "Deep Learning the City: Quantifying Urban Perception At A Global Scale" presents an innovative approach to analyzing urban environments using computer vision and crowdsourcing. Authored by Dubey, Naik, Parikh, Raskar, and Hidalgo, the paper addresses the challenge of quantifying urban perception at a global scale—a task traditionally constrained by small, localized datasets and labor-intensive field surveys.
Overview
The paper introduces an extensive crowdsourced dataset comprising 110,988 images from 56 cities across 28 countries, with over 1.17 million pairwise image comparisons provided by 81,630 online volunteers. The images were rated on six perceptual attributes: safety, liveliness, boredom, wealth, depression, and beauty. This dataset, called Place Pulse 2.0, serves as the foundation for training a convolutional neural network (CNN) model capable of predicting human perceptual judgments from pairwise image comparisons.
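Each datapoint in such a dataset is a pairwise judgment rather than an absolute rating. A minimal sketch of how one comparison might be represented and turned into a supervised training example is shown below (the class and field names are hypothetical illustrations, not the authors' actual data schema):

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One crowdsourced pairwise judgment: which image 'wins' on an attribute."""
    left_image: str    # identifier of the left image shown to the volunteer
    right_image: str   # identifier of the right image
    attribute: str     # e.g. "safety", "liveliness", "beauty"
    winner: str        # "left", "right", or "equal"

def to_training_pair(c: Comparison):
    """Convert a comparison into an (image_a, image_b, label) triple,
    where label = +1 if the left image was judged higher on the attribute
    and -1 if the right image was. Ties are skipped (returns None)."""
    if c.winner == "equal":
        return None
    label = 1 if c.winner == "left" else -1
    return (c.left_image, c.right_image, label)

c = Comparison("img_001", "img_002", "safety", "left")
print(to_training_pair(c))  # ('img_001', 'img_002', 1)
```

Framing the labels as signed pair outcomes like this is what lets a model be trained directly on comparisons instead of on noisy absolute scores.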
Methodology
Two network architectures are proposed in the paper: Streetscore-CNN (SS-CNN) and Ranking Streetscore-CNN (RSS-CNN). The SS-CNN architecture employs a Siamese-like structure for binary classification of image pairs, while RSS-CNN extends it to ordinal ranking by adding a ranking loss. The networks are initialized from pre-trained models such as AlexNet, PlacesNet, and VGGNet and fine-tuned on the Place Pulse 2.0 dataset to predict the perceptual attributes, outperforming features extracted from those pre-trained models alone.
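The combination of a pairwise classification loss with a ranking loss can be sketched numerically as follows. This is an illustrative NumPy sketch of the classification-plus-ranking idea; the paper's exact loss formulation, margin, and weighting are not reproduced here, and `lam` is an assumed hyperparameter:

```python
import numpy as np

def margin_ranking_loss(score_left, score_right, label, margin=1.0):
    """Hinge-style ranking loss: pushes the winning image's score above
    the losing image's by at least `margin`. label is +1 if the left
    image wins the comparison, -1 if the right image wins."""
    return np.maximum(0.0, -label * (score_left - score_right) + margin)

def pairwise_classification_loss(score_left, score_right, label):
    """Logistic loss on the score difference, treating each comparison
    as binary classification of which image wins."""
    diff = label * (score_left - score_right)
    return np.log1p(np.exp(-diff))  # equals -log(sigmoid(diff))

def joint_loss(score_left, score_right, label, lam=0.5):
    """Illustrative joint objective: classification plus weighted ranking."""
    return (pairwise_classification_loss(score_left, score_right, label)
            + lam * margin_ranking_loss(score_left, score_right, label))

# A pair where the left image scores higher and was also judged safer:
print(round(joint_loss(2.0, 0.5, 1), 4))  # 0.2014
```

Both losses are computed on the scalar scores produced by the two (weight-shared) branches of the Siamese network, so minimizing them trains a single scoring function that is consistent with the crowdsourced comparisons.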
Numerical Results
The RSS-CNN model, when initialized with VGGNet and trained with both classification and ranking losses, achieved a prediction accuracy of 73.5% on pairwise safety comparisons—a notable result given the complexity of the task and the diversity of the images. The larger size and geographic diversity of Place Pulse 2.0 also proved advantageous over the original Place Pulse 1.0 dataset: models trained on it achieved higher prediction accuracy.
Implications and Future Directions
The implications of this research are multifaceted. From a practical standpoint, the methodology could facilitate global-scale urban studies, enabling policymakers and planners to make data-driven decisions regarding urban design and resource allocation. Theoretically, it advances the discourse on the relationship between urban appearance and socio-economic outcomes, supporting studies that delve into social behavior influenced by environmental factors.
In terms of future developments, the paper points to potential investigations into the determinants of perceptual attributes and the exploration of generalizations across different geographical regions. The scaling and application of the proposed approach could extend beyond urban perception, suggesting broader utility in the perception analysis of any scene or object type, thus expanding the role of computer vision in urban studies.
In summary, the research presents a significant step toward leveraging AI and crowdsourcing for urban perception analysis, demonstrating both the potential for cost efficiencies in urban studies and the foundational ability to inform interdisciplinary explorations linking urban design with social outcomes.