
COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences (2106.00969v1)

Published 2 Jun 2021 in cs.CL and cs.AI

Abstract: Commonsense reasoning is intuitive for humans but has been a long-term challenge for AI. Recent advancements in pretrained LLMs have shown promising results on several commonsense benchmark datasets. However, the reliability and comprehensiveness of these benchmarks for assessing models' commonsense reasoning ability remain unclear. To this end, we introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs. We propose a pairwise accuracy metric to reliably measure an agent's ability to perform commonsense reasoning over a given situation. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, we design our dataset along the dimensions of knowledge domains, reasoning scenarios and numeracy. Experimental results demonstrate that our strongest baseline (UnifiedQA-3B), after fine-tuning, achieves ~71% standard accuracy and ~51% pairwise accuracy, well below human performance (~95% for both metrics). The dataset is available at https://github.com/PlusLabNLP/Com2Sense.
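
The gap between standard and pairwise accuracy reported above can be illustrated with a minimal sketch. It assumes that a pair counts as correct only when both the statement and its complementary counterpart are classified correctly; the abstract does not spell out the definition, so treat this as an assumption. The function names and the toy data are hypothetical and are not taken from the Com2Sense repository.

```python
from typing import Sequence, Tuple


def standard_accuracy(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of individual statements classified correctly."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def pairwise_accuracy(
    preds: Sequence[bool],
    labels: Sequence[bool],
    pairs: Sequence[Tuple[int, int]],
) -> float:
    """Fraction of complementary pairs with BOTH statements correct.

    Assumption: `pairs` holds (index, index) tuples linking each statement
    to its complementary counterpart.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    both_correct = sum(
        preds[i] == labels[i] and preds[j] == labels[j] for i, j in pairs
    )
    return both_correct / len(pairs)


# Toy data: four statements forming two complementary pairs.
labels = [True, False, True, False]
preds = [True, True, True, False]   # one mistake, in the first pair
pairs = [(0, 1), (2, 3)]

print(standard_accuracy(preds, labels))         # 0.75
print(pairwise_accuracy(preds, labels, pairs))  # 0.5
```

Under this definition, pairwise accuracy can never exceed standard accuracy, because a single error invalidates the whole pair. This is consistent with the drop from ~71% to ~51% reported for the fine-tuned baseline.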

Authors (7)
  1. Shikhar Singh (8 papers)
  2. Nuan Wen (7 papers)
  3. Yu Hou (43 papers)
  4. Pegah Alipoormolabashi (6 papers)
  5. Te-Lin Wu (18 papers)
  6. Xuezhe Ma (50 papers)
  7. Nanyun Peng (205 papers)
Citations (49)