Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (2210.09261v1)

Published 17 Oct 2022 in cs.CL and cs.AI

Abstract: BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current LLMs. LLMs have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do LLMs fall short of average human-rater performance, and are those tasks actually unsolvable by current LLMs? In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior LLM evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of LLMs, which is better captured via CoT prompting. As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.

Analysis of "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them"

The paper "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them" investigates the efficacy of Chain-of-Thought (CoT) prompting applied to a subset of complex tasks from the BIG-Bench benchmark, termed BIG-Bench Hard (BBH). The paper provides a nuanced view of CoT prompting's capacity to enhance LLM performance on tasks where existing models have underperformed relative to human raters.

Overview

BIG-Bench is an evaluation suite designed to push the boundaries of LLM capabilities, testing on tasks presumed beyond the reach of current models. This paper zeroes in on 23 specific tasks, collectively labeled BBH, where prior models failed to meet average human-rater benchmarks. These tasks include diverse challenges requiring multifaceted reasoning skills, such as logical deduction and arithmetic reasoning. By employing CoT prompting, the authors explore whether these tasks can be made more tractable for contemporary models like PaLM and Codex.
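
The selection rule behind BBH is simple enough to state in code. The sketch below is a hypothetical illustration of that rule, not the authors' tooling; the task names and scores are invented placeholders.

```python
# Hypothetical sketch of the BBH selection rule: keep a BIG-Bench task only if the
# best prior model score reported for it falls short of the average human-rater
# score. The task names and numbers below are invented placeholders.

def select_bbh_tasks(task_scores):
    """Return the names of tasks where the best reported model trails the average human rater."""
    return [
        name
        for name, scores in task_scores.items()
        if scores["best_model"] < scores["avg_human_rater"]
    ]

example_scores = {
    "logical_deduction": {"best_model": 36.2, "avg_human_rater": 42.0},
    "boolean_expressions": {"best_model": 68.5, "avg_human_rater": 79.4},
    "hypothetical_easy_task": {"best_model": 91.0, "avg_human_rater": 80.0},
}

print(select_bbh_tasks(example_scores))  # -> ['logical_deduction', 'boolean_expressions']
```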

Key Findings

  1. Performance Gains with CoT Prompting: CoT prompting yields notable performance improvements across BBH tasks. For Codex (code-davinci-002), CoT prompting enables the model to exceed average human-rater scores on 17 of the 23 tasks, a significant leap over standard answer-only prompting, which surpasses human-rater performance on only five tasks (see the prompt sketch after this list).
  2. Scale and Emergent Abilities: The paper examines the relationship between model scale and CoT effectiveness. Results indicate that CoT prompting pays off only at sufficient model scale: several BBH tasks whose scaling curves are flat under answer-only prompting show emergent performance once larger models are prompted with CoT.
  3. Task-Specific Insights: The paper breaks results down per task, with the largest CoT gains on algorithmic and multi-step reasoning tasks, where CoT helps decompose the problem. However, CoT prompting did not universally enhance performance; tasks that hinge on world knowledge or humor and sarcasm, such as Ruin Names or Snarks, benefited less from CoT.
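
To make the contrast in finding 1 concrete, the sketch below shows the two prompting styles side by side for a BBH-style multiple-choice question, along with the exact-match scoring on the final answer that such evaluations typically use. It is a minimal, hypothetical illustration: the exemplar text, answer-extraction heuristic, and helper names are assumptions, not the released BBH prompts or the paper's evaluation code.

```python
# Hypothetical answer-only vs. chain-of-thought few-shot prompts for a BBH-style
# multiple-choice question, plus a rough exact-match scorer on the final answer.

ANSWER_ONLY_EXEMPLAR = (
    'Q: Is the following sentence plausible? "The chef baked the violin."\n'
    "Options: (A) plausible (B) implausible\n"
    "A: (B) implausible\n"
)

COT_EXEMPLAR = (
    'Q: Is the following sentence plausible? "The chef baked the violin."\n'
    "Options: (A) plausible (B) implausible\n"
    "A: Let's think step by step. Chefs bake food, and a violin is an instrument, "
    "not food, so the sentence is implausible. So the answer is (B).\n"
)

def build_prompt(exemplar, question):
    """Prepend the few-shot exemplar(s) to the test question."""
    return f"{exemplar}\nQ: {question}\nA:"

def extract_answer(completion):
    """Rough heuristic: take the option letter after 'the answer is', else scan the completion."""
    tail = completion.lower().rsplit("the answer is", 1)[-1]
    for token in tail.replace("(", " ").replace(")", " ").split():
        if token.upper() in {"A", "B", "C", "D"}:
            return token.upper()
    return completion.strip()

def exact_match(completion, gold):
    """Credit the model only if the extracted final answer equals the gold option letter."""
    return extract_answer(completion) == gold
```

In the paper's setup the completion would come from PaLM or Codex; here any text-generation call can stand in. Note that both prompting styles are scored the same way, on the final answer only, so any gain from CoT comes from the intermediate reasoning the model writes before committing to an option.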

Implications

The findings have significant implications for the future direction of NLP research. The success of CoT in enhancing model reasoning capabilities suggests a promising avenue for refining prompt crafting strategies, especially for complex, multi-step reasoning tasks. This also indicates that future research might focus on refining CoT and related strategies or developing tailored approaches for tasks where CoT is less effective.

Future Prospects in AI Development

The paper raises vital considerations for both theoretical exploration and practical deployment of AI systems:

  • Theoretical: Understanding the underpinnings of CoT prompting and its interaction with model architecture could illuminate aspects of model interpretability and learning mechanisms.
  • Practical: As AI systems integrate into more decision-critical applications, effective techniques like CoT prompting promise to enhance model reliability and performance across diverse, real-world scenarios.

Conclusion

The paper makes a valuable contribution by elucidating the potential of CoT prompting to solve ostensibly intractable tasks within BBH. The detailed analysis of task-specific gains and the interaction between prompting strategies and model scale enriches the discourse on the capabilities and limitations of modern LLMs. As models continue to scale, CoT prompting may become pivotal in ensuring these models can tackle an increasingly complex array of challenges with human-like proficiency.

Authors (11)
  1. Mirac Suzgun (23 papers)
  2. Nathan Scales (8 papers)
  3. Nathanael Schärli (8 papers)
  4. Sebastian Gehrmann (48 papers)
  5. Yi Tay (94 papers)
  6. Hyung Won Chung (30 papers)
  7. Aakanksha Chowdhery (19 papers)
  8. Quoc V. Le (128 papers)
  9. Ed H. Chi (74 papers)
  10. Denny Zhou (65 papers)
  11. Jason Wei (49 papers)
Citations (786)