LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles (2308.10855v3)
Abstract: With the continuous evolution and refinement of LLMs, they are endowed with impressive logical reasoning, or vertical thinking, capabilities. But can they think outside the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses a model's lateral thinking within an interactive framework. Our benchmark challenges LLMs on two aspects: the quality of the questions the model poses and its ability to integrate information for problem-solving. We find that nearly all LLMs struggle to employ lateral thinking during interactions. For example, even the most advanced model, GPT-4, shows some advantage over the others, yet still falls noticeably short of human performance. This benchmark presents LLMs with a highly challenging and distinctive task, one that is crucial for an effective AI assistant.
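To make the interactive setup concrete, the sketch below shows one plausible shape of such an evaluation loop: a solver model asks yes/no questions about a puzzle "surface", a host answers from the hidden solution, and the solver then proposes a final explanation. This is a minimal illustration only; the function names (`ask_model`, `answer_as_host`, `judge`), turn limit, and scoring interface are assumptions, not the LatEval implementation, whose exact prompts and metrics are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces an evaluator would plug LLM calls into.
# They are placeholders, not part of the LatEval release.
AskFn = Callable[[str, List[str]], str]        # (surface, transcript) -> next question / final guess
HostFn = Callable[[str, str, str], str]        # (surface, solution, question) -> "yes" / "no" / "irrelevant"
JudgeFn = Callable[[str, str], bool]           # (solution, guess) -> solved?


@dataclass
class PuzzleResult:
    questions: List[str]    # used to rate the quality of the questions posed
    transcript: List[str]   # Q/A pairs the solver must integrate
    solved: bool


def run_puzzle(surface: str, solution: str,
               ask_model: AskFn, answer_as_host: HostFn, judge: JudgeFn,
               max_turns: int = 10) -> PuzzleResult:
    """Interactive loop: the solver questions the host about the puzzle
    surface, then integrates the answers into a final explanation."""
    transcript: List[str] = []
    questions: List[str] = []
    for _ in range(max_turns):
        question = ask_model(surface, transcript)
        answer = answer_as_host(surface, solution, question)
        questions.append(question)
        transcript.append(f"Q: {question} A: {answer}")
    # After the dialogue, the solver states its full reconstruction of the story.
    final_guess = ask_model(surface + "\nNow state the full solution.", transcript)
    return PuzzleResult(questions, transcript, judge(solution, final_guess))
```

Under this reading, the two evaluation aspects map onto the loop naturally: the `questions` list supports judging question quality, while the final guess measures how well the model integrates the accumulated yes/no information.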
Authors: Shulin Huang, Shirong Ma, Yinghui Li, Mengzuo Huang, Wuhe Zou, Weidong Zhang, Hai-Tao Zheng