UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction (2403.16831v3)
Abstract: Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face two challenges. First, concentrating solely on macro-level patterns from satellite data may introduce bias, because such data lacks nuanced micro-level details such as the architecture of a specific place. Second, the text generated by the precursor work UrbanCLIP, which draws on the extensive knowledge of LLMs, frequently exhibits hallucination and homogenization, making its quality unreliable. In response, we devise a novel framework named UrbanVLP, based on Vision-Language Pretraining. UrbanVLP integrates multi-granularity information from both the macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, which helps ensure high-quality text descriptions of urban imagery. Experiments across six socioeconomic indicator prediction tasks demonstrate its superior performance.