LMentry: A Language Model Benchmark of Elementary Language Tasks (2211.02069v2)

Published 3 Nov 2022 in cs.CL, cs.AI, and cs.LG

Abstract: As the performance of LLMs rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of LLMs. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for LLMs, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of LLMs, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.
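To illustrate the "unit test" framing, here is a minimal sketch of how tasks like those in the abstract admit fully automatic scoring. This is not the paper's actual evaluation code; the function names, regex heuristic, and substring check are illustrative assumptions.

```python
import re

def contains_word(output: str, word: str) -> bool:
    """Verify the 'write a sentence containing a specific word' task:
    check that the target word appears as a whole word in the output."""
    pattern = rf"\b{re.escape(word.lower())}\b"
    return re.search(pattern, output.lower()) is not None

def picked_longer_word(output: str, word_a: str, word_b: str) -> bool:
    """Verify the 'which of two words is longer' task.
    Assumes the two words differ in length; uses a simple
    substring check on the model's answer."""
    expected = word_a if len(word_a) > len(word_b) else word_b
    return expected.lower() in output.lower()

# Hypothetical usage:
print(contains_word("The cat sat on the mat.", "cat"))              # True
print(picked_longer_word("The longer word is elephant.", "ant", "elephant"))  # True
```

Because each task has a mechanically checkable success criterion, a benchmark built from them can run as a quick pass/fail suite without human grading or large-scale model-based judging.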

Citations (18)