Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs (2304.11164v1)

Published 22 Apr 2023 in cs.CL and cs.AI

Abstract: LLMs have become very popular recently and many claims have been made about their abilities, including for commonsense reasoning. Given the increasingly better results of current LLMs on previous static benchmarks for commonsense reasoning, we explore an alternative dialectical evaluation. The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system. Dialoguing with the system gives the opportunity to check for consistency and get more reassurance of these boundaries beyond anecdotal evidence. In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning (which is a fundamental aspect of commonsense reasoning). We conclude with some suggestions for future work both to improve the capabilities of LLMs and to systematise this kind of dialectical evaluation.
