Urban Visual Perception Survey
- Urban Visual Perception Survey is a framework that quantifies how individuals perceive urban streetscapes, including attributes like safety, beauty, and liveliness.
- It employs controlled annotation protocols, comprehensive image sampling, and crowdsourced ratings to generate statistically robust perceptual scores.
- The methodology informs urban planning and participatory design while addressing challenges in model bias, dynamic imagery, and interpretability.
Urban Visual Perception Survey is a methodological and computational framework for quantifying, modeling, and analyzing how individuals and groups perceive urban streetscapes, primarily via image-based stimuli. These surveys operationalize subjective experience—such as safety, beauty, liveliness, and greenery—through structured annotation protocols, large-scale crowdsourcing, and benchmarking of artificial intelligence models against varied human ratings, thereby informing urban analytics, participatory design, and planning intervention at city scale (Mushkani, 18 Sep 2025).
1. Survey Design Principles and Protocols
Urban Visual Perception Surveys systematically elicit multidimensional perceptual judgments about urban scenes, leveraging photographic or photorealistic street imagery. Surveys typically employ a controlled annotation protocol structured around fixed dimensions. For example, Mushkani et al. (Mushkani, 18 Sep 2025) specified 30 perceptual “dimensions” organized into thematic families:
- Physical Setting: Space Typology, Spatial Configuration, Lighting, Vegetation, Maintenance, Signage, Barriers
- Human Presence & Activity: Human Presence, Types of Activities, Economic Activities, Accessibility Features, Visibility
- Built Form & Aesthetics: Built Environment, Architectural Style, Aesthetic and Cultural Elements
- Subjective Impressions: Overall Impression, Sustainability, Public Amenities
Assigning each dimension to single-choice (exactly one label) or multi-label (any subset) annotation disambiguates objective from composite properties. Annotation campaigns frequently recruit local participants—e.g., twelve from seven Montreal community organizations (Mushkani, 18 Sep 2025) or balanced, demographically diverse samples across 45 nationalities in SPECS (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025)—with overlapping assignments ensuring every image is rated by multiple respondents.
Annotation is often conducted in the local dominant language, with later normalization to a canonical coding scheme. Consensus labels for quantitative evaluation are obtained either via majority vote (for single-choice items) or ≥50% threshold agreement (for multi-label items), with ties and “Not applicable” selections excluded from scoring.
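A minimal sketch of these consensus rules, assuming annotations are already grouped per image and dimension (function and field names are hypothetical, not from the cited protocols):

```python
from collections import Counter

def single_choice_consensus(labels):
    """Majority vote over one image-dimension; exact ties return None (excluded from scoring)."""
    votes = Counter(l for l in labels if l != "Not applicable")
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie -> ambiguous, dropped from quantitative evaluation
    return ranked[0][0]

def multi_label_consensus(label_sets, threshold=0.5):
    """Keep labels selected by at least `threshold` (here >=50%) of annotators."""
    n = len(label_sets)
    counts = Counter(l for s in label_sets for l in s if l != "Not applicable")
    return {label for label, c in counts.items() if c / n >= threshold}
```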
Instrument design includes both relative judgments (pairwise comparisons, e.g., “Which place looks greener?” (Quintana et al., 19 Dec 2025, Quintana et al., 19 May 2025)) and absolute judgments (e.g., 1–5 Likert ratings (Danish et al., 29 Feb 2024)); the two protocols trade off differently in terms of scale stability and annotation throughput.
2. Image Data Sources, Sampling, and Preprocessing
Modern surveys operate over large, spatially representative corpora leveraging street-view imagery (SVI) from global providers—Google (Mushkani, 18 Sep 2025, Dubey et al., 2016, Muller et al., 2022), Mapillary (Danish et al., 29 Feb 2024, Quintana et al., 19 Dec 2025, Quintana et al., 19 May 2025), Baidu (Liu et al., 2016, Lan, 5 Jun 2025)—or city-custom synthetic scenes (Mushkani, 18 Sep 2025). Datasets may include:
- Uniform spatial sampling at 20–200 m intervals along street networks to ensure fine-grained coverage (Muller et al., 2022, Danish et al., 29 Feb 2024); see the sampling sketch after this list
- Rich visual heterogeneity, spanning seasonality, lighting, activity levels, and built-form diversity (Mushkani, 18 Sep 2025)
- Balanced inclusion of real and synthetic (photorealistic render) images to probe model generalization (Mushkani, 18 Sep 2025)
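As referenced above, a minimal sketch of uniform point sampling along a street centerline, assuming shapely geometries in a projected (metric) coordinate system; the 50 m interval is one illustrative value within the 20–200 m range:

```python
from shapely.geometry import LineString

def sample_points(street: LineString, interval_m: float = 50.0):
    """Return points every `interval_m` metres along a projected street geometry."""
    distances = range(0, int(street.length) + 1, int(interval_m))
    return [street.interpolate(d) for d in distances]

# Toy street centerline in metres (projected CRS assumed).
street = LineString([(0, 0), (300, 0), (300, 200)])
points = sample_points(street, 50.0)
print([(p.x, p.y) for p in points])
```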
Preprocessing pipelines frequently apply semantic segmentation to extract scene elements (vegetation, road, sidewalk, sky, building, etc.) (Malekzadeh et al., 4 Nov 2025, Quintana et al., 19 Dec 2025), filter low-quality or ambiguous images, and perform context-aware cropping (e.g., road-center detection, field-of-view filtering) (Danish et al., 29 Feb 2024). Open pipelines are advocated for reproducibility and FAIR compliance, with released scripts and standard schema for spatial tiling, normalization, and metadata retention (Danish et al., 29 Feb 2024).
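A minimal sketch of segmentation-based element extraction, assuming an integer-valued label mask from a semantic segmentation model; the class-ID-to-name mapping below is illustrative and depends on the model's actual label map:

```python
import numpy as np

# Illustrative class IDs; real IDs depend on the segmentation model's label map.
CLASS_NAMES = {8: "vegetation", 0: "road", 1: "sidewalk", 10: "sky", 2: "building"}

def element_fractions(mask: np.ndarray) -> dict:
    """mask: HxW array of integer class IDs; returns the pixel fraction of each scene element."""
    return {name: float(np.mean(mask == cid)) for cid, name in CLASS_NAMES.items()}

# A "green view"-style indicator is then element_fractions(mask)["vegetation"].
```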
3. Annotation Aggregation, Scoring, and Human Agreement Metrics
Perception surveys transform raw annotations into statistically robust scores at image-by-dimension or image-by-indicator resolution:
- Consensus Mechanisms: Single-choice items use majority vote; multi-label items adopt inclusion thresholds (e.g., ≥50% of annotators). Exact ties are marked as ambiguous or missing (Mushkani, 18 Sep 2025).
- Continuous Scoring: For large-scale pairwise protocols, TrueSkill (Dubey et al., 2016, Muller et al., 2022, Nadai et al., 2016) and Q-scores with a strength-of-schedule correction (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025) generate continuous latent scores for each image along each perceptual attribute, bounded to fixed ranges (e.g., [0,10] or normalized [0,1]); see the sketch after this list.
- Reliability Indices: Inter-annotator agreement is measured by Krippendorff’s alpha (α) for nominal items, pairwise Jaccard overlap for multilabels, and Cronbach’s α for scale consistency (Mushkani, 18 Sep 2025, Malekzadeh et al., 4 Nov 2025, Danish et al., 29 Feb 2024). Dimensions with low α correspond to more subjective, ambiguous appraisal and weaker human–model alignment (Mushkani, 18 Sep 2025).
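As referenced above, a minimal sketch of turning pairwise votes into continuous per-image scores, using the open-source trueskill package as a stand-in for the TrueSkill-based scoring cited here; the rescaling to [0, 10] is illustrative:

```python
import trueskill

def trueskill_scores(comparisons, images):
    """comparisons: iterable of (winner_id, loser_id) pairwise votes for one attribute."""
    env = trueskill.TrueSkill(draw_probability=0.0)
    ratings = {img: env.create_rating() for img in images}
    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = env.rate_1vs1(ratings[winner], ratings[loser])
    # Conservative point estimate (mu - 3*sigma), then min-max rescale to [0, 10].
    raw = {img: r.mu - 3 * r.sigma for img, r in ratings.items()}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0
    return {img: 10 * (v - lo) / span for img, v in raw.items()}
```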
Empirical benchmarks typically report aggregate metrics: mean accuracy (for single-choice properties), mean Jaccard index (for multi-label), and macro-averages across all dimensions (Mushkani, 18 Sep 2025, Lan, 5 Jun 2025).
4. Modeling and Evaluation of Human-Machine Alignment
State-of-the-art evaluations leverage zero-shot large vision-language models (VLMs) and multimodal LLMs (MLLMs)—including claude-sonnet, gpt-4.1, openai-o4-mini, gemini-2.5-pro, llama-4-maverick, etc.—to probe their alignment with human urban scene perception without task-specific fine-tuning (Mushkani, 18 Sep 2025, Lan, 5 Jun 2025, He et al., 26 Sep 2025). Key workflow elements:
- Image-Prompt Encoding: Images encoded as base64 strings with structured prompts enumerating all perception dimensions and definitions (Mushkani, 18 Sep 2025).
- Output Parsing: Deterministic parsers enforce format compliance, map tokens to canonical labels, and handle non-conforming responses (Mushkani, 18 Sep 2025).
- Agreement Scoring: Single-choice attributes scored by accuracy (fraction of answers matching consensus); multi-label by Jaccard overlap between predicted and consensus label sets (Mushkani, 18 Sep 2025). Human–model agreement is macro-averaged.
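A minimal sketch of this agreement scoring, assuming parsed model outputs and consensus labels keyed by dimension (field names are hypothetical):

```python
def jaccard(pred: set, consensus: set) -> float:
    """Jaccard overlap between predicted and consensus label sets."""
    union = pred | consensus
    return len(pred & consensus) / len(union) if union else 1.0

def macro_agreement(records):
    """records: dicts with 'dimension', 'type' ('single'/'multi'), 'pred', 'consensus'.
    Returns per-dimension means and the macro-average across dimensions."""
    per_dim = {}
    for r in records:
        if r["type"] == "single":
            score = float(r["pred"] == r["consensus"])
        else:
            score = jaccard(set(r["pred"]), set(r["consensus"]))
        per_dim.setdefault(r["dimension"], []).append(score)
    dim_means = {d: sum(v) / len(v) for d, v in per_dim.items()}
    macro = sum(dim_means.values()) / len(dim_means)
    return dim_means, macro
```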
Benchmark results consistently show higher model–human agreement on objective, visually-grounded dimensions (vegetation, spatial configuration, seating) than on subjective, diffuse appraisals (cultural elements, overall impression) (Mushkani, 18 Sep 2025, Lan, 5 Jun 2025). Model scores correlate positively with inter-annotator reliability, indicating alignment is easier where human consensus is stronger.
Multi-city and cross-cultural studies (e.g., UrbanFeel (He et al., 26 Sep 2025), SPECS (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025), Place Pulse (Dubey et al., 2016)) demonstrate that spatial, temporal, and demographic coverage is critical for both human and model generalizability; however, current models exhibit systematic biases (overestimating positive, underestimating negative indicators; sensitivity to “city identity” cues) (Quintana et al., 19 May 2025, He et al., 26 Sep 2025).
5. Sociodemographic, Cognitive, and Contextual Moderators
Urban Visual Perception Surveys increasingly account for observer heterogeneity. The SPECS dataset, for example, rigorously sampled 1,000 participants across five cities and 45 nationalities, capturing balanced distributions of gender, age, income, education, and the full Big Five personality inventory (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025). Results show:
- Demographics: Age and gender produced the most frequent and significant differences across all perception indicators. For example, elderly and female participants consistently rate scenes as less safe, more boring, or more depressing (Quintana et al., 19 May 2025, Beneduce et al., 1 Mar 2025).
- Personality: Conscientiousness and extraversion moderated several indicators (e.g., safety, beauty, liveliness), though effect sizes are commonly weaker than for demographic variables (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025).
- Geographic anchor: Where participants live emerges as a dominant predictor of certain perceptions—e.g., greenness—suggesting that cultural/experiential baselines modulate human interpretations more than nominal demographic membership (Quintana et al., 19 Dec 2025).
- Location-based sentiment transfer: Residents rate their own cities through a familiarity or positivity lens; cross-city benchmarking must account for z-score shifts in baseline appraisal (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025).
- Model bias: Machine-learning models trained on pooled datasets tend to overpredict positive attributes (safe, lively, wealthy, beautiful) and underpredict negative ones (boring, depressing) relative to local or demographically specific human judgments (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025).
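A minimal sketch of the kind of group-specific bias check implied by these findings, assuming a long-format pandas DataFrame with model and human scores plus respondent demographics (column names are hypothetical):

```python
import pandas as pd

def signed_bias_by_group(df: pd.DataFrame, indicator: str, group_col: str) -> pd.Series:
    """Mean (model - human) score per demographic group for one perception indicator.
    Positive values indicate the model over-predicts relative to that group's ratings."""
    sub = df[df["indicator"] == indicator]
    return (sub["model_score"] - sub["human_score"]).groupby(sub[group_col]).mean()

# e.g. signed_bias_by_group(ratings, indicator="safe", group_col="age_band")
```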
6. Implications for Participatory Design, Urban Analytics, and Model Deployment
These methodological advances have concrete implications for urban analytics, planning, participatory governance, and AI-assisted decision support:
- Participatory review and pre-annotation: VLM outputs on visually grounded items can be deployed as pre-annotations to accelerate audit and review; subjective and low-consensus dimensions require participatory correction (Mushkani, 18 Sep 2025, Lan, 5 Jun 2025).
- Uncertainty and reliability surfacing: Reporting both model uncertainty (e.g., “Not applicable” predictions) and observed human agreement rates for each dimension supports participatory evaluation and reflexive model usage (Mushkani, 18 Sep 2025).
- Hybrid workflows: Integrating scalable SVI-based feature extraction with in situ, context-rich methods such as Public Participation GIS (PPGIS) captures both broad visual patterns and nuanced lived experience. Hybrid weighting schemes, e.g. a weighted combination S = w·v + (1 − w)·e, combine visual model predictions (v) and normalized experiential survey scores (e) for given locations (Malekzadeh et al., 4 Nov 2025); see the sketch after this list.
- Spatial analytics and city benchmarking: Aggregate predicted perception maps at fine spatial scale (e.g., per Output Area in Greater London (Muller et al., 2022)) expose inequities and temporal dynamics. Cross-city comparisons highlight the context specificity of perceptual baselines (Quintana et al., 19 May 2025).
- Equity and inclusion: Relying solely on global models risks missing subpopulation needs; planning interventions must be tailored to local, demographically stratified perceptions (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025).
- Open-source, reproducible pipelines: Toolkits implementing open SVI ingestion, rigorous image preparation, and web-based annotation support rapid deployment and external audit (Danish et al., 29 Feb 2024).
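As referenced under "Hybrid workflows" above, a minimal sketch of the weighted combination, assuming both inputs are min-max normalized to [0, 1]; the weight value and function name are illustrative rather than the cited paper's exact formulation:

```python
def hybrid_score(visual_pred: float, survey_score: float, w: float = 0.5) -> float:
    """Weighted combination of an SVI-model prediction and a normalized PPGIS survey score.
    Both inputs are assumed min-max normalized to [0, 1]; w is an illustrative weight."""
    return w * visual_pred + (1.0 - w) * survey_score
```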
Extensions proposed include scaling surveys across more cities, languages, and temporal datasets; integrating multi-turn, conversational prompting; leveraging uncertainty-aware objectives; and developing interaction-aware explainability modules (Mushkani, 18 Sep 2025, Lan, 5 Jun 2025, He et al., 26 Sep 2025).
7. Limitations and Future Research Directions
Key limitations of current Urban Visual Perception Surveys include:
- Sample size and annotation diversity: Many benchmarked datasets remain modest (e.g., 100 images in Montreal (Mushkani, 18 Sep 2025)), and more extensive participatory annotation is needed for robust generalization.
- Static imagery: Most pipelines evaluate perception from single-timepoint, static images, omitting temporal cues, dynamic activity, and multisensory features (noise, smell, crowd density) that shape human experience (Malekzadeh et al., 4 Nov 2025, Lan, 5 Jun 2025).
- Subjectivity in complex dimensions: Dimensions with low inter-rater reliability (e.g., “Design,” “Cultural Elements”) remain elusive for both human consensus and machine modeling, limiting interpretability and usefulness for some planning targets (Mushkani, 18 Sep 2025).
- Bias and generalization: Current global models trained on existing datasets (e.g., Place Pulse, SPECS) exhibit regional, demographic, and rating biases, overestimating positives and underestimating negatives. Systematic efforts to debias, fine-tune, and validate across local populations and sociotechnical contexts remain priorities (Quintana et al., 19 May 2025, Quintana et al., 19 Dec 2025, He et al., 26 Sep 2025).
- Interpretability: While post-hoc explainability modules (e.g., attention heatmaps, natural-language rationales) improve transparency, subjective assessments are not guaranteed to align with end-user understanding (Lan, 5 Jun 2025).
Ongoing and future research directions include expanding to longitudinal/temporal evaluations, modeling user–scene interaction in real time, refining semantic segmentation to richer perceptual attributes, and exploring hybrid human–AI workflows for participatory, inclusive, and uncertainty-aware urban analytics (Mushkani, 18 Sep 2025, Lan, 5 Jun 2025, Malekzadeh et al., 4 Nov 2025, He et al., 26 Sep 2025).