Chain-of-Thought Hub: Evaluating Reasoning in LLMs
The paper "Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance" introduces an open-source evaluation suite for assessing the reasoning capabilities of LLMs. The emergence and rapid development of LLMs such as the GPT, PaLM, and Claude model families have underscored the need for robust evaluation mechanisms, particularly ones that can discern the complex reasoning capabilities that differentiate these models by scale and other design choices.
The authors present two primary motivations for their work: first, the need for comprehensive benchmarks that can distinguish between stronger and weaker models based on their reasoning ability, and second, the vision of LLMs as future computational platforms necessitating advanced reasoning for linguistic and logical operations. The Chain-of-Thought Hub (CoT Hub) addresses these needs by compiling a suite of challenging reasoning benchmarks and creating a continuous evaluation ecosystem for LLMs.
Methodological Approach
The CoT Hub compiles reasoning benchmarks spanning a broad range of tasks, including mathematics, symbolic reasoning, and coding. Key datasets incorporated in CoT Hub include GSM8k, MATH, MMLU, BigBench Hard, HumanEval, and C-Eval. Each serves a specific purpose (a sketch of how such a suite might be organized follows the list):
- GSM8k tests multi-step arithmetic and compositional reasoning on grade-school math word problems.
- MATH covers advanced, competition-style math and science reasoning.
- MMLU gauges high-school and college-level knowledge across 57 subjects.
- BigBench Hard (BBH) evaluates challenging language and symbolic reasoning tasks where chain-of-thought prompting is particularly beneficial.
- HumanEval assesses coding capabilities through Python programming tasks.
- C-Eval extends the suite's scope to Chinese, with multiple-choice questions across a wide range of disciplines.
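To make the composition of the suite concrete, here is a minimal illustrative sketch of how such a benchmark registry might be organized in Python. The `Benchmark` class, the metric labels, and `macro_average` are assumptions for illustration and are not taken from the CoT Hub codebase.

```python
from dataclasses import dataclass

# Hypothetical registry of reasoning benchmarks; names and metrics are
# illustrative and do not reflect the actual CoT Hub implementation.
@dataclass
class Benchmark:
    name: str       # dataset identifier
    task_type: str  # e.g. "math", "knowledge", "symbolic", "coding"
    metric: str     # how per-benchmark scores are reported

SUITE = [
    Benchmark("GSM8k",         "grade-school math",           "exact-match accuracy"),
    Benchmark("MATH",          "advanced math",               "exact-match accuracy"),
    Benchmark("MMLU",          "multi-subject knowledge",     "multiple-choice accuracy"),
    Benchmark("BigBench Hard", "language/symbolic reasoning", "exact-match accuracy"),
    Benchmark("HumanEval",     "Python coding",               "pass@1"),
    Benchmark("C-Eval",        "Chinese multi-subject",       "multiple-choice accuracy"),
]

def macro_average(scores: dict[str, float]) -> float:
    """Average per-benchmark scores into a single headline number (illustrative only)."""
    covered = [scores[b.name] for b in SUITE if b.name in scores]
    return sum(covered) / len(covered) if covered else 0.0
```

Any aggregation scheme here is an assumption; the point is simply that each dataset contributes a distinct task type and scoring convention to the suite.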
Few-shot chain-of-thought prompting forms the core of the evaluation method, distinguishing CoT Hub from efforts such as HELM and Chatbot Arena, which respectively rely on answer-only prompting and on user preferences in dialogue. The choice of few-shot prompting stems from its applicability to both pre-trained base models and instruction-tuned models, offering a balanced evaluation approach.
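To illustrate what few-shot chain-of-thought prompting looks like in practice, the sketch below builds a GSM8k-style prompt from worked exemplars and extracts the final numeric answer from a model's completion. The exemplar text, the `generate` call, and the answer-extraction regex are illustrative assumptions, not the paper's exact prompts or scoring code.

```python
import re

# Made-up few-shot chain-of-thought exemplars in the GSM8k style
# (question followed by a worked rationale ending in a final answer).
EXEMPLARS = [
    ("Tom has 3 boxes with 4 apples each. How many apples does he have?",
     "Each box has 4 apples and there are 3 boxes, so 3 * 4 = 12. The answer is 12."),
    ("A shirt costs $20 and is discounted by 25%. What is the sale price?",
     "25% of 20 is 5, so the sale price is 20 - 5 = 15. The answer is 15."),
]

def build_cot_prompt(question: str) -> str:
    """Concatenate worked exemplars (question + reasoning) before the test question."""
    parts = [f"Q: {q}\nA: {rationale}" for q, rationale in EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def extract_answer(completion: str) -> str | None:
    """Pull the last number after 'The answer is', mirroring a common CoT scoring heuristic."""
    matches = re.findall(r"The answer is\s*(-?[\d,\.]+)", completion)
    return matches[-1].rstrip(".").replace(",", "") if matches else None

# Usage, assuming some generate(prompt) -> str call to the model under evaluation:
# completion = generate(build_cot_prompt("Sara has 5 pencils and buys 7 more. How many now?"))
# predicted = extract_answer(completion)
# correct = (predicted == "12")
```

Because the exemplars demonstrate intermediate reasoning rather than bare answers, this prompting style can be applied uniformly to base and instruction-tuned models, which is the balance the evaluation aims for.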
Experimental Insights
The paper presents a detailed analysis of various prominent LLMs, assessing their performance on the CoT Hub benchmarks. Significant observations include:
- A demonstrated correlation between model size and reasoning performance, with larger models generally exhibiting superior capabilities.
- Leading models such as GPT-4, Claude-v1.3, and PaLM-2 topping the leaderboard, while open-source models such as LLaMA and FlanT5 lag behind.
The experimental results highlight the effectiveness of Reinforcement Learning from Human Feedback (RLHF) as a post-training step, as evidenced by the superior performance of RLHF-tuned models over their base versions. Despite this, a marked performance gap remains between closed-source and open-source models.
Implications and Future Directions
The CoT Hub serves as a valuable resource for the LLM community, providing a structured platform to guide subsequent development efforts. The insights regarding RLHF and its potential to narrow performance gaps present a compelling avenue for future research, particularly within open-source ecosystems. Notably, the paper highlights the potential of the LLaMA 65B model, suggesting that, if aligned effectively with RLHF, it could reach performance parity with ChatGPT-3.5.
Moving forward, the authors envisage expanding CoT Hub with additional reasoning datasets, deeper analysis of model alignment through RLHF, and evaluation of how models interact with external computational tools. This trajectory positions CoT Hub not only as a benchmark suite but also as a facilitator of progress in AI reasoning, with the potential to drive methodological improvements and foster competitive advancement across diverse modeling frameworks.