The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation (2311.16782v1)

Published 28 Nov 2023 in cs.CV and cs.AI

Abstract: Remote sensing visual question answering (RSVQA) opens new opportunities for the use of overhead imagery by the general public by enabling human-machine interaction in natural language. Building on recent advances in natural language processing and computer vision, the goal of RSVQA is to answer a question formulated in natural language about a remote sensing image. Language understanding is essential to the success of the task, but has not yet been thoroughly examined in RSVQA. In particular, the problem of language biases is often overlooked in the remote sensing community; such biases can impact model robustness and lead to wrong conclusions about the performance of a model. The present work therefore highlights the problem of language biases in RSVQA with a threefold analysis strategy: visual blind models, adversarial testing, and dataset analysis. This analysis focuses on both models and data. Moreover, we motivate the use of more informative and complementary evaluation metrics sensitive to the issue. The severity of language biases in RSVQA is then exposed across all of these methods by training models that discard the image data and by manipulating the visual input during inference. Finally, a detailed analysis of the question-answer distribution demonstrates that the root of the problem lies in the data itself. Through this analytical study, we observe that biases in remote sensing are more severe than in standard VQA, likely due to the specifics of existing remote sensing datasets for the task, e.g. geographical similarities and sparsity, as well as a simpler vocabulary and question-generation strategies. While new, improved, and less-biased datasets appear necessary for the development of the promising field of RSVQA, we demonstrate that more informed, relative evaluation metrics remain much needed to transparently communicate the results of future RSVQA methods.
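Since the abstract compresses the methodology, a minimal sketch may help clarify the probes it describes: evaluating with the visual input destroyed (the inference-time analogue of a visual-blind model), mismatching the image at inference (adversarial testing), and reporting a relative metric rather than raw accuracy. Everything below is an illustrative assumption, not the authors' code: the `language_bias_probe` helper, the `model(image, question)` call signature, and the `relative_image_dependence` metric are hypothetical stand-ins for the paper's actual setup.

```python
import random
import torch

def language_bias_probe(model, samples, mode="full"):
    """Re-run inference with the visual input destroyed or mismatched.

    If accuracy barely drops relative to the normal setting, the model is
    largely answering from the question alone, i.e. exploiting language
    biases in the dataset.
    """
    correct = 0
    for img, question, answer in samples:
        if mode == "blank":
            probe_img = torch.zeros_like(img)         # visual-blind input
        elif mode == "mismatch":
            probe_img = random.choice(samples)[0]     # adversarial image swap
        else:
            probe_img = img                           # unmodified baseline
        pred = model(probe_img, question)             # assumed (image, text) API
        correct += int(pred == answer)
    return correct / len(samples)

def relative_image_dependence(acc_full, acc_probe):
    """A relative metric in the spirit the paper advocates: the share of
    the model's accuracy that actually depends on seeing the image."""
    return (acc_full - acc_probe) / acc_full

# Hypothetical usage: a value near 0 means the image barely matters.
# acc_full  = language_bias_probe(vqa_model, test_set, mode="full")
# acc_blind = language_bias_probe(vqa_model, test_set, mode="blank")
# print(relative_image_dependence(acc_full, acc_blind))
```

A visual-blind model trained from scratch without images, as in the paper, is the stronger version of the `mode="blank"` probe; the inference-time variant sketched here requires no retraining but tests the same dependence.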

Authors (6)
  1. Christel Chappuis (2 papers)
  2. Eliot Walt (1 paper)
  3. Vincent Mendez (1 paper)
  4. Sylvain Lobry (16 papers)
  5. Bertrand Le Saux (59 papers)
  6. Devis Tuia (81 papers)
Citations (3)