
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark (2103.13009v1)

Published 24 Mar 2021 in cs.CL

Abstract: Commonsense AI has long been seen as a near impossible goal -- until recently. Now, research interest has sharply increased with an influx of new benchmarks and models. We propose two new ways to evaluate commonsense models, emphasizing their generality on new tasks and building on diverse, recently introduced benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote research on commonsense models that generalize well over multiple tasks and datasets. Second, we propose a novel evaluation, the cost equivalent curve, that sheds new insight on how the choice of source datasets, pretrained LLMs, and transfer learning methods impacts performance and data efficiency. We perform extensive experiments -- over 200 experiments encompassing 4800 models -- and report multiple valuable and sometimes surprising findings, e.g., that transfer almost always leads to better or equivalent performance if following a particular recipe, that QA-based commonsense datasets transfer well with each other, while commonsense knowledge graphs do not, and that perhaps counter-intuitively, larger models benefit more from transfer than smaller ones. Last but not least, we introduce a new universal commonsense reasoning model, UNICORN, that establishes new state-of-the-art performance across 8 popular commonsense benchmarks, aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA (90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%) and CommonsenseQA (79.3%).

An Evaluation and Advancement in Commonsense AI: RAINBOW and UNICORN

This paper makes two significant contributions to commonsense reasoning in AI: a multitask benchmark, RAINBOW, and a universal commonsense reasoning model, UNICORN. As AI capabilities have grown, particularly via pretrained LLMs, understanding and evaluating how well commonsense transfers across varied tasks has emerged as a vital yet challenging frontier. The authors address this by emphasizing systematic generalization across tasks, backed by extensive experimentation.

RAINBOW: A New Benchmark for Commonsense Evaluation

RAINBOW is a multitask benchmark designed to advance research on commonsense models that generalize effectively across numerous tasks and datasets. It comprises six commonsense reasoning tasks (aNLI, CosmosQA, HellaSWAG, PIQA, SocialIQa, and WinoGrande), each chosen because it probes a distinct facet of commonsense knowledge, ranging from social interactions to physical reasoning. This collection of tasks is pivotal for investigating transfer learning across diverse domains within commonsense reasoning.
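Training one model on six heterogeneous datasets implies a shared input/output format. Below is a minimal sketch of a plausible text-to-text serialization in the spirit of the T5-style setup UNICORN builds on; the task tags, templates, and example are illustrative assumptions, not the authors' exact preprocessing.

```python
# Hypothetical serialization of RAINBOW-style multiple-choice examples into
# a unified text-to-text format; templates here are assumptions.
RAINBOW_TASKS = ["anli", "cosmosqa", "hellaswag", "piqa", "socialiqa", "winogrande"]

def serialize_example(task: str, context: str, question: str,
                      choices: list[str], answer_idx: int) -> tuple[str, str]:
    """Flatten one multiple-choice example into an (input_text, target_text) pair."""
    options = " ".join(f"({i}) {choice}" for i, choice in enumerate(choices))
    input_text = f"[{task}] context: {context} question: {question} choices: {options}"
    return input_text, str(answer_idx)

# Illustrative PIQA-style example (physical commonsense).
src, tgt = serialize_example(
    task="piqa",
    context="",
    question="To separate egg whites from the yolk, you can",
    choices=["use a squeezed water bottle to suck up the yolk",
             "pour the mixture through a fork's tines"],
    answer_idx=0,
)
print(src)  # [piqa] context:  question: To separate ... choices: (0) ... (1) ...
print(tgt)  # 0
```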

Cost Equivalent Curves: Innovating Evaluation

In tandem with RAINBOW, the authors introduce a novel evaluation methodology, the cost equivalent curve. It measures data efficiency by asking how much training data different learning methods need to reach the same performance level. The curves reveal the impact of transfer learning across tasks, especially in low-data regimes, where multitask learning consistently outperforms single-task training.
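Concretely, a cost equivalent curve can be estimated from two learning curves: for each data budget given to the new method, invert the baseline's learning curve to find how many baseline examples would reach the same score. A minimal sketch with hypothetical measurements:

```python
# A minimal sketch of computing a cost equivalent curve from two learning
# curves. The (dataset_size, dev_score) measurements below are hypothetical.
import numpy as np

baseline_sizes  = np.array([100, 400, 1600, 6400, 25600])
baseline_scores = np.array([0.55, 0.62, 0.70, 0.76, 0.80])
transfer_sizes  = np.array([100, 400, 1600, 6400, 25600])
transfer_scores = np.array([0.63, 0.70, 0.75, 0.79, 0.82])

def cost_equivalent_curve(b_sizes, b_scores, t_sizes, t_scores):
    """For each transfer-method budget, estimate how many baseline examples
    reach the same score, by inverting the baseline learning curve."""
    # np.interp maps score -> size along the baseline curve (scores increasing).
    equivalent = np.interp(t_scores, b_scores, b_sizes,
                           left=b_sizes[0], right=b_sizes[-1])
    return t_sizes, equivalent

x, y = cost_equivalent_curve(baseline_sizes, baseline_scores,
                             transfer_sizes, transfer_scores)
for t, b in zip(x, y):
    print(f"{t:>6d} transfer examples ~ {b:,.0f} baseline examples")
```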

UNICORN: Universal Commonsense Reasoning Model

The paper proposes UNICORN, a model trained on the RAINBOW benchmark that sets new state-of-the-art results across the tasks within RAINBOW and beyond, including 87.3% on aNLI, 91.8% on CosmosQA, and 94.0% on CycIC. The model's training leverages insights from the cost equivalent curves, highlighting the performance and data-efficiency gains attainable through intermediate-task transfer, particularly sequential training (multitask training on RAINBOW followed by target-task fine-tuning) over pure multitask training.
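A minimal sketch of that sequential recipe, using a small Hugging Face T5 checkpoint as a stand-in for the far larger model the paper trains; the training loop, helper names, and toy data are illustrative assumptions:

```python
# A sketch of sequential (intermediate-task) transfer: multitask training on
# the RAINBOW mixture first, then fine-tuning on the target task alone.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

def finetune(model, tokenizer, examples, epochs=1, lr=1e-4):
    """Tiny batch-size-1 training loop over (input_text, target_text) pairs."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for src, tgt in examples:
            batch = tokenizer(src, return_tensors="pt", truncation=True)
            labels = tokenizer(tgt, return_tensors="pt").input_ids
            loss = model(**batch, labels=labels).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model

tokenizer = T5TokenizerFast.from_pretrained("t5-small")  # stand-in for T5-11B
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Toy serialized examples (see the serialization sketch above); real training
# would use the full RAINBOW mixture and target-task training sets.
rainbow_mixture = [
    ("[piqa] question: how to open a stuck jar? choices: (0) twist with a grip pad (1) freeze it", "0"),
    ("[socialiqa] question: Alex returned a lost wallet. why? choices: (0) honesty (1) boredom", "0"),
]
target_task = [
    ("[commonsenseqa] question: where is milk kept cold? choices: (0) fridge (1) pantry", "0"),
]

# Step 1: intermediate multitask training on the RAINBOW mixture.
model = finetune(model, tokenizer, rainbow_mixture)
# Step 2: sequential fine-tuning on the target task only.
model = finetune(model, tokenizer, target_task)
```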

Empirical Insights and Transfer Learning

The authors report extensive empirical insights drawn from over 200 experiments. A consistent finding is that intermediate-task transfer, when following the recommended recipe, yields equal or better performance, with larger model architectures benefiting the most. Notably, QA-based commonsense datasets transfer well to one another, whereas commonsense knowledge graphs do not show similar transferability. These insights are valuable for refining training strategies for LLMs aimed at commonsense reasoning.

Implications and Future Direction

This research lays a foundation for future advancements in more generalized and efficient commonsense reasoning AI. By introducing RAINBOW as a benchmark and cost equivalent curves as an evaluation tool, the paper paves the way for models that reach high performance with fewer resources. The approach could extend to other AI domains where multitask and transfer learning play crucial roles.

In conclusion, this paper contributes significantly to the AI community by not only setting new standards in commonsense reasoning through UNICORN but also by introducing a systematic approach to evaluate efficiency and transferability across models and datasets through RAINBOW and the cost equivalent curves. Future exploration could delve into better exploiting knowledge graphs and examining the implications of these strategies across other AI domains.

Authors (4)
  1. Nicholas Lourie
  2. Ronan Le Bras
  3. Chandra Bhagavatula
  4. Yejin Choi
Citations (130)