TransportationGames: Benchmarking Transportation Knowledge of (Multimodal) Large Language Models (2401.04471v1)
Abstract: Large language models (LLMs) and multimodal LLMs (MLLMs) have shown excellent general capabilities and adaptability across many professional domains, such as law, economics, transportation, and medicine. Many domain-specific benchmarks have been proposed to verify the performance of (M)LLMs in specific fields. Among these domains, transportation plays a crucial role in modern society, as it affects the economy, the environment, and the quality of life of billions of people. However, it remains unclear how much transportation knowledge (M)LLMs possess and whether they can reliably perform transportation-related tasks. To address this gap, we propose TransportationGames, a carefully designed and thorough benchmark for evaluating (M)LLMs in the transportation domain. Drawing on real-world application scenarios and the first three levels of Bloom's Taxonomy, we test how well various (M)LLMs memorize, understand, and apply transportation knowledge through the selected tasks. The experimental results show that although some models perform well on some tasks, there is still much room for improvement overall. We hope that the release of TransportationGames can serve as a foundation for future research, thereby accelerating the implementation and application of (M)LLMs in the transportation domain.
- Qwen technical report. arXiv preprint arXiv:2309.16609.
- Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
- Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.
- LAiW: A Chinese legal large language models benchmark (a technical report).
- LawBench: Benchmarking legal knowledge of large language models.
- MME: A comprehensive evaluation benchmark for multimodal large language models.
- Explanatory argument extraction of correct answers in resident medical exams.
- LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models.
- What can large language models do in chemistry? a comprehensive benchmark on eight tasks.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models.
- Tjalling C Koopmans. 1949. Optimum utilization of the transportation system. Econometrica: Journal of the Econometric Society, pages 136–146.
- David R Krathwohl. 2002. A revision of Bloom's taxonomy: An overview. Theory into Practice, 41(4):212–218.
- CMMLU: Measuring massive multitask language understanding in Chinese.
- Improved baselines with visual instruction tuning.
- MMBench: Is your multi-modal model an all-around player?
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
- Wang Peng. 2023. DUOMO/TransGPT.
- Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
- Compositional task representations for large language models. In The Eleventh International Conference on Learning Representations.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
- George R Taylor. 2015. The transportation revolution, 1815-60. Routledge.
- InternLM Team. 2023. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM.
- LLaMA: Open and efficient foundation language models.
- Xiang Wei. 2023. DUOMO/TransGPT.
- mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration.
- GLM-130B: An open bilingual pre-trained model.
- InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv preprint arXiv:2305.18703.
- Xue Zhang
- Xiangyu Shi
- Xinyue Lou
- Rui Qi
- Yufeng Chen
- Jinan Xu
- Wenjuan Han