
On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications (2312.17016v1)

Published 23 Dec 2023 in cs.CV and cs.AI

Abstract: The advent of large language models (LLMs) has heightened interest in their potential for multimodal applications that integrate language and vision. This paper explores the capabilities of GPT-4V in the realms of geography, environmental science, agriculture, and urban planning by evaluating its performance across a variety of tasks. Data sources comprise satellite imagery, aerial photos, ground-level images, field images, and public datasets. The model is evaluated on a series of tasks including geo-localization, textual data extraction from maps, remote sensing image classification, visual question answering, crop type identification, disease/pest/weed recognition, chicken behavior analysis, agricultural object counting, urban planning knowledge question answering, and plan generation. The results indicate the potential of GPT-4V in geo-localization, land cover classification, visual question answering, and basic image understanding. However, there are limitations in several tasks requiring fine-grained recognition and precise counting. While zero-shot learning shows promise, performance varies across problem domains and image complexities. The work provides novel insights into GPT-4V's capabilities and limitations for real-world geospatial, environmental, agricultural, and urban planning challenges. Further research should focus on augmenting the model's knowledge and reasoning for specialized domains through expanded training. Overall, the analysis demonstrates foundational multimodal intelligence, highlighting the potential of multimodal foundation models (FMs) to advance interdisciplinary applications at the nexus of computer vision and language.

Authors (18)
  1. Chenjiao Tan
  2. Qian Cao
  3. Yiwei Li
  4. Jielu Zhang
  5. Xiao Yang
  6. Huaqin Zhao
  7. Zihao Wu
  8. Zhengliang Liu
  9. Hao Yang
  10. Nemin Wu
  11. Tao Tang
  12. Xinyue Ye
  13. Lilong Chai
  14. Ninghao Liu
  15. Changying Li
  16. Lan Mu
  17. Tianming Liu
  18. Gengchen Mai
Citations (5)