Measuring Vision-Language STEM Skills of Neural Models (2402.17205v3)
Abstract: We introduce a new challenge to test the STEM skills of neural models. Real-world problems often require solutions that combine knowledge across STEM (science, technology, engineering, and math). Unlike existing datasets, ours requires understanding multimodal vision-language information about STEM. It is one of the largest and most comprehensive datasets for this challenge, comprising 448 skills and 1,073,146 questions spanning all STEM subjects. Whereas existing datasets often focus on examining expert-level ability, ours includes fundamental skills and questions designed around the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark. Results show that recent model advances help master only a very limited number of lower grade-level skills (2.5% in the third grade). In fact, these models still fall well below the performance of elementary students (averaging 54.7%), not to mention near expert-level performance. To understand and improve performance on our dataset, we train the models on a training split. Although performance improves, it remains low compared to that of average elementary students. Solving STEM problems will require novel algorithmic innovations from the community.
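To make the evaluation setting concrete, below is a minimal sketch of how a foundation model like CLIP can be scored zero-shot on a multimodal multiple-choice question, in the spirit of the benchmark described above. The checkpoint name, the (image, question, choices) fields, and the "question + choice" prompt format are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: zero-shot multiple-choice answering with CLIP.
# Assumptions (not from the paper): the "openai/clip-vit-base-patch32"
# checkpoint, a question represented as (image, question text, answer
# choices), and "question + choice" as the text to embed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def answer(image: Image.Image, question: str, choices: list[str]) -> int:
    """Return the index of the answer choice CLIP scores highest."""
    # Embed each "question + choice" string together with the image;
    # logits_per_image holds the image-text similarity for each choice.
    texts = [f"{question} {choice}" for choice in choices]
    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(choices))
    return int(logits.argmax(dim=-1).item())

# Example: a hypothetical geometry question with three candidate answers.
# pred = answer(Image.open("shape.png"),
#               "How many sides does this shape have?",
#               ["three", "four", "five"])
```

Benchmark accuracy is then the fraction of questions where the predicted index matches the gold answer; the paper's training experiments would additionally update the model on the training split before running this evaluation.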
References:
- Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086, 2018.
- VQA: visual question answering. In ICCV, pp. 2425–2433, 2015.
- Ixl design principles. 2021.
- PIQA: reasoning about physical commonsense in natural language. In AAAI, pp. 7432–7439, 2020.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020a.
- UNITER: Universal image-text representation learning. In ECCV, pp. 104–120, 2020b.
- PaLM: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022.
- Agent instructs large language models to be general zero-shot reasoners. arXiv preprint arXiv:2310.03710, 2023.
- VirTex: Learning visual representations from textual annotations. In CVPR, pp. 11162–11173, 2021.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186, 2019.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
- Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pp. 6325–6334, 2017.
- On calibration of modern neural networks. In ICML, pp. 1321–1330, 2017.
- Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
- Measuring massive multitask language understanding. In ICLR, 2021a.
- Measuring mathematical problem solving with the MATH dataset. In NeurIPS, 2021b.
- MathPrompter: Mathematical reasoning using large language models. CoRR, abs/2303.05398, 2023.
- IXL. Understanding the IXL SmartScore. https://blog.ixl.com/wp-content/uploads/2014/11/SmartScore-guide.pdf, a.
- IXL. How does the SmartScore work? https://www.ixl.com/help-center/article/1272663/how_does_the_smartscore_work, b.
- Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In ICLR, 2023.
- CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pp. 1988–1997, 2017.
- ReferItGame: Referring to objects in photographs of natural scenes. In ACL, pp. 787–798, 2014.
- UnifiedQA: Crossing format boundaries with a single QA system. In Findings of EMNLP, pp. 1896–1907, 2020.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, pp. 32–73, 2017.
- IXL Learning. The impact of IXL Math and IXL ELA on student achievement in grades pre-K to 12, pp. 1–27, 2019.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pp. 12888–12900, 2022a.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Grounded language-image pre-training. In CVPR, pp. 10955–10965, 2022b.
- Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014.
- FIMO: A challenge formal dataset for automated theorem proving, 2023.
- ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23, 2019.
- 12-in-1: Multi-task vision and language representation learning. In CVPR, 2020.
- Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In ACL-IJCNLP, pp. 6774–6786, 2021a.
- IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. In NeurIPS, 2021b.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.
- Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, pp. 427–436, 2015.
- GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pp. 16784–16804, 2022.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Preparing lessons for progressive training on language models. arXiv preprint arXiv:2401.09192, 2024a.
- Reusing pretrained models by multi-linear operators for efficient training. Advances in Neural Information Processing Systems, 36, 2024b.
- Kosmos-2: Grounding multimodal large language models to the world, 2023.
- GloVe: Global vectors for word representation. In EMNLP, pp. 1532–1543, 2014.
- Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pp. 2641–2649, 2015.
- Improving language understanding by generative pre-training. 2018.
- Language models are unsupervised multitask learners. OpenAI blog, 2019.
- Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763, 2021.
- Analysing mathematical reasoning abilities of neural models. In ICLR, 2019.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565, 2018.
- Benchmarking language models for code syntax understanding. In EMNLP, 2022a.
- PALT: Parameter-lite transfer of language models for knowledge graph completion. In EMNLP, 2022b.
- A corpus of natural language for visual reasoning. In ACL, pp. 217–223, 2017.
- EVA-CLIP: Improved training techniques for CLIP at scale, 2023.
- YFCC100M: the new data in multimedia research. Commun. ACM, pp. 64–73, 2016.
- Attention is all you need. In NeurIPS, 2017.
- Language models are open knowledge graphs. arXiv preprint arXiv:2010.11967, 2020.
- DeepStruct: Pretraining of language models for structure prediction. In ACL, 2022a.
- DT-Solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In ACL, pp. 12632–12646, 2023.
- OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, pp. 23318–23340, 2022b.
- DQ-LoRe: Dual queries with low rank approximation re-ranking for in-context learning, 2023a.
- TRIGO: Benchmarking formal mathematical proof reduction for generative language models. In EMNLP, pp. 11594–11632, 2023b.
- Yin and yang: Balancing and answering binary visual questions. In CVPR, pp. 5014–5022, 2016.
- MiniF2F: A cross-system benchmark for formal Olympiad-level mathematics. In ICLR, 2022.
- Visual7w: Grounded question answering in images. In CVPR, pp. 4995–5004, 2016.
Authors: Jianhao Shen, Ye Yuan, Srbuhi Mirzoyan, Ming Zhang, Chenguang Wang