T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering (2305.03453v4)

Published 5 May 2023 in cs.CL

Abstract: LLMs have recently demonstrated exceptional performance in various NLP tasks. They have also shown the ability to perform chain-of-thought (CoT) reasoning to solve complex problems. Recent studies have explored CoT reasoning in complex multimodal scenarios, such as the science question answering task, by fine-tuning multimodal models with high-quality human-annotated CoT rationales. However, collecting high-quality CoT rationales is usually time-consuming and costly. Moreover, annotated rationales are often inaccurate because essential external information is missing. To address these issues, we propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. The T-SciQ approach generates high-quality CoT rationales as teaching signals and uses them to train much smaller models to perform CoT reasoning in complex modalities. Additionally, we introduce a novel data mixing strategy to produce more effective teaching data samples for simple and complex science question answering problems. Extensive experimental results show that our T-SciQ method achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%. Moreover, our approach outperforms the most powerful fine-tuned baseline by 4.5%. The code is publicly available at https://github.com/T-SciQ/T-SciQ.

A Critical Analysis of "T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed LLM Signals for Science Question Answering"

The paper "T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed LLM Signals for Science Question Answering" presents an innovative approach to enhance the performance of smaller models on complex multimodal science question answering tasks. By leveraging LLMs to generate high-quality chain-of-thought (CoT) rationales, the authors propose an advanced method that addresses the inefficiencies and inaccuracies inherent in human-annotated rationales. This essay provides a detailed examination of the key methodologies, results, and implications of this research.

Introduction and Context

The motivation behind this work arises from both the limitations of existing datasets and the potential of LLMs in generating CoT reasoning. Previous datasets for scientific problem-solving are limited in scale, and while LLMs have demonstrated remarkable CoT reasoning abilities in NLP tasks, extending these capabilities to multimodal scenarios presents significant challenges. The ScienceQA dataset, which includes diverse topics and skills, serves as the benchmark for evaluating the proposed method.

Methodology

The T-SciQ approach is structured into three main components: generating teaching data, mixing teaching data, and fine-tuning smaller models.

  1. Generating Teaching Data: The authors employ zero-shot prompting with LLMs to generate two types of teaching data: QA-CoT samples and QA-PCoT samples. While QA-CoT samples involve traditional CoT rationales, the QA-PCoT samples are derived via a three-step prompting process that decomposes complex problems into simpler subproblems, enhancing the LLM's reasoning output.
  2. Mixing Teaching Data: This data mixing strategy combines the strengths of the QA-CoT and QA-PCoT datasets. Using the validation set, the method selects the more effective teaching signal for each data example based on problem complexity (a minimal selection sketch follows this list).
  3. Fine-Tuning: Following a two-stage fine-tuning framework, the student models are first trained to generate rationales and then to infer answers. This structured approach ensures that the models can effectively use the high-quality rationales generated by LLMs (a sketch of the two-stage pipeline also follows this list).
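
The selection step in the data mixing strategy can be pictured with a short sketch. This is a minimal illustration of the idea in step 2, not the authors' released implementation; the function names, dictionary keys, and per-skill granularity are assumptions made for clarity.

```python
# Illustrative sketch of per-skill teaching-signal selection (step 2 above).
# All names and data structures are placeholders, not the paper's actual API.

def choose_teaching_signal(val_acc_cot: float, val_acc_pcot: float) -> str:
    """Pick the signal whose rationales gave higher validation accuracy."""
    return "QA-PCoT" if val_acc_pcot > val_acc_cot else "QA-CoT"

def build_mixed_teaching_set(examples, val_acc_cot_by_skill, val_acc_pcot_by_skill):
    """Assemble mixed teaching data: each example keeps the rationale from
    whichever signal worked better for its skill on the validation set."""
    mixed = []
    for ex in examples:
        signal = choose_teaching_signal(
            val_acc_cot_by_skill.get(ex["skill"], 0.0),
            val_acc_pcot_by_skill.get(ex["skill"], 0.0),
        )
        rationale = ex["pcot_rationale"] if signal == "QA-PCoT" else ex["cot_rationale"]
        mixed.append({**ex, "rationale": rationale, "signal": signal})
    return mixed
```

The two-stage framework in step 3 can likewise be sketched at inference time. The snippet below assumes a generic text-generating student model; `student_generate` is a placeholder rather than a function from the T-SciQ repository, and the prompt format is invented for illustration.

```python
# Hedged sketch of the two-stage rationale-then-answer pipeline (step 3 above).
# `student_generate` stands in for the fine-tuned multimodal student model.

def student_generate(prompt: str) -> str:
    raise NotImplementedError("plug in the fine-tuned student model here")

def answer_two_stage(question: str, context: str, options) -> str:
    base = f"Question: {question}\nContext: {context}\nOptions: {options}"

    # Stage 1: the student is trained to generate a CoT rationale.
    rationale = student_generate(base + "\nRationale:")

    # Stage 2: the rationale is appended to the input and the student
    # produces the final answer.
    return student_generate(base + f"\nRationale: {rationale}\nAnswer:")
```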

Experimental Results

The empirical evaluation demonstrates substantial improvements over existing state-of-the-art models. Notably, the model trained using T-SciQ signals achieves an accuracy of 96.18% on the ScienceQA benchmark, outperforming the strongest fine-tuned baseline by 4.5% and exceeding human performance by 7.78%. These results confirm that the proposed method is highly effective for multimodal science question answering tasks.

Implications and Future Developments

The implications of this research are both practical and theoretical. Practically, the ability to train smaller models with high-quality CoT rationales generated by LLMs provides a cost-effective and scalable solution for complex AI tasks. Theoretically, the success of the data mixing strategy and the decomposition approach highlights the untapped potential of LLMs in improving multi-step reasoning processes.

Future research could extend this work by exploring the integration of additional LLM architectures and parameter-efficient fine-tuning techniques. Moreover, applying T-SciQ to other domains and reasoning tasks could further validate its versatility and robustness.

Conclusion

In summary, the "T-SciQ" approach presents a significant advancement in teaching multimodal CoT reasoning by leveraging LLMs. The rigorous methodology, combined with compelling experimental results, showcases the potential of this approach to enhance the capabilities of smaller models in science question answering tasks. This work sets a new benchmark and opens avenues for future exploration in the field of AI-driven education and multimodal reasoning.

The publicly available code fosters transparency and reproducibility, encouraging further research to build upon these foundational findings. The results indicate that high-quality CoT signals generated from LLMs can substantially improve AI models' performance in multimodal and complex problem-solving scenarios.

Authors (7)
  1. Lei Wang
  2. Yi Hu
  3. Jiabang He
  4. Xing Xu
  5. Ning Liu
  6. Hui Liu
  7. Heng Tao Shen