DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation (2404.07917v2)

Published 11 Apr 2024 in cs.AI and cs.CL

Abstract: This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal LLMs (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Unlike many existing MLLM benchmarks, DesignQA contains document-grounded visual questions in which the input image and the input document come from different sources. The benchmark features automatic evaluation metrics and is divided into three segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on the tasks engineers perform when designing according to requirements. We evaluate state-of-the-art models (at the time of writing) such as GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5 against the benchmark, and our study uncovers gaps in MLLMs' abilities to interpret complex engineering documentation. The MLLMs tested, while promising, struggle to reliably retrieve relevant rules from the Formula SAE documentation, to recognize technical components in CAD images, and to analyze engineering drawings. These findings underscore the need for multimodal models that can better handle the multifaceted questions characteristic of designing according to technical documentation. The benchmark lays a foundation for future advances in AI-supported engineering design. DesignQA is publicly available at: https://github.com/anniedoris/design_qa/.
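
Because each benchmark question is paired with a ground-truth answer, model outputs can be scored automatically. As a minimal sketch of what such scoring could look like (this is not the authors' evaluation harness; the JSONL layout, field names, and the token-level F1 metric below are illustrative assumptions, not DesignQA's actual format or metrics), a scoring loop might be:

```python
import json
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and split into alphanumeric tokens (keeps rule numbers like "T.3.2").
    return re.findall(r"[a-z0-9.]+", text.lower())

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1 between a model answer and the ground truth,
    # in the style of SQuAD-like QA metrics (an assumed metric, not DesignQA's).
    pred, ref = Counter(normalize(prediction)), Counter(normalize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def score_file(path: str) -> float:
    # Average F1 over a JSONL file of {"prediction": ..., "answer": ...} rows.
    # This file layout is hypothetical.
    scores = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            scores.append(token_f1(row["prediction"], row["answer"]))
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    print(f"Mean F1: {score_file('rule_extraction_predictions.jsonl'):.3f}")
```

An overlap metric of this kind fits segments whose answers are near-verbatim spans of the rule document (such as Rule Extraction) better than open-ended compliance explanations, which is one reason a benchmark like this would pair different automatic metrics with different segments.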
