YapBench: Evaluating LLM Verbosity
- YapBench is a reference-anchored benchmarking framework that quantifies user-visible over-generation in LLM responses through minimal sufficient baselines.
- It comprises 304 curated prompts across three categories—ambiguous, factual, and coding—to isolate and measure verbosity in concise answer scenarios.
- Using metrics like YapScore and YapIndex, the framework enables objective cross-model comparisons and informs optimization for concise LLM behavior.
YapBench is a reference-anchored benchmarking framework designed to quantify user-visible over-generation in assistant LLMs on brevity-ideal prompts. It provides a suite of single-turn prompts paired with minimal sufficient baseline answers and precise categorization, enabling the objective evaluation of excessive verbosity in model responses. YapBench addresses a previously unmeasured aspect of assistant LLM deployment: the tendency of models to produce unnecessarily long outputs, which can increase cognitive load, inference latency, and token-based costs, especially in cases where conciseness is normatively preferred (Borisov et al., 2 Jan 2026).
1. Motivation and Scope
The problem of over-generation in assistant LLMs manifests as unnecessary boilerplate, restatements, generic caveats, and explanations on queries where terse answers suffice. This “yapping” not only burdens users with additional cognitive effort due to excess reading and scrolling, but also inflates latency and token-based API costs, as increased output length directly drives computation. Preference-based post-training and LLM-evaluated reward models systematically favor longer responses under conditions of comparable quality. Existing evaluation protocols entangle verbosity with correctness and safety, offering no direct mechanism to isolate and measure verbosity in intendedly brief scenarios. YapBench introduces a tokenization-agnostic, baseline-referenced benchmark to fill this gap, quantifying excess response length for brevity-ideal tasks and enabling precise, cross-model comparisons independent of any judge or tokenizer.
2. Dataset Structure and Prompt Categorization
YapBench v0.1 comprises 304 English prompts, each functioning as a single-turn, brevity-ideal request. Every prompt is paired with a curated “minimal sufficient” baseline answer and assigned to one of three rigorously defined categories:
| Category | Description | Example Prompt | Baseline Answer |
|---|---|---|---|
| A | Minimal or ambiguous input | :):):) |
"Could you clarify your question?" (31 chars) |
| B | Closed-form factual question | Who wrote 1984? |
"George Orwell" (13 chars) |
| C | One-line coding/command | Count lines in file.txt using wc |
"wc -l file.txt" (14 chars) |
- Category A includes queries with unclear or minimal semantic content (e.g., empty input, keyboard mashes, phatic tokens, nondescript requests). The ideal model behavior is a minimal clarification or acknowledgment.
- Category B contains factual prompts yielding unique, stable answers—typically a single word or short phrase without exposition.
- Category C encompasses requests for single-line shell commands, regexes, SQL statements, or code snippets, for which maximal conciseness (no extra formatting or explanation) is warranted.
Baselines are constructed by the following rules: unambiguous intent, absolute minimum token count for clarity and correctness, atomicity and self-containment, idiomatic selection when possible, and reproducibility across versions.
3. Benchmark Metrics
YapBench employs two primary metrics, both referenced to minimal sufficient baselines:
- YapScore (per-prompt): Measures excess response length over the baseline in characters, agnostic to tokenization. For prompt and model :
where is the response length and is the baseline length.
- YapIndex (aggregate): A uniformly weighted average of category-level median YapScores:
A lower YapIndex signifies greater overall conciseness for brevity-oriented tasks. 95% bootstrap percentile confidence intervals are estimated via category-wise prompt resampling.
4. Evaluation Protocol and Model Coverage
The evaluation encompasses 76 assistant LLM endpoints, covering a spectrum of contemporary models (e.g., OpenAI's GPT-3.5-turbo, GPT-4, GPT-5; Anthropic's Claude Sonnet/Opus/Haiku; Google Gemini 2.x/3.x; Grok; GLM-4.x; Qwen3; Llama 3/4; Mistral). Inference is performed via the OpenRouter API, eschewing system prompts and using (deterministic decoding) wherever permitted. All tests are single-turn, context-free, with reasoning chains excluded from output length measurement when separately reported. Both model responses and baseline answers are evaluated strictly in character length via the YapScore definition.
5. Key Findings and Model Behaviors
YapBench results reveal order-of-magnitude variation in verbosity. YapIndex spans from ≈23 (GPT-3.5-turbo, 2023) to >1,400 (GLM-4.5), indicating substantial diversity in concise response rates across model and provider. Notably, recent “frontier” models such as GPT-5 and Gemini-Pro often outperform older models in language capability but are more verbose on brevity-ideal queries—a non-monotonic relationship between capability and verbosity.
Distinct category-specific failure modes emerge:
- Category A: Models frequently exhibit “vacuum-fill,” producing generic help text or attempted answers rather than requesting clarification.
- Category B: Responses are padded with restatements, cautionary statements, or bullet points that obscure atomic facts.
- Category C: Common overhead materializes in the form of markdown formatting, explanatory prose, extraneous headings, or repeated code blocks that fail the one-line criterion.
Enabling chain-of-thought reasoning does not consistently raise median YapIndex but can cause “verbosity bursts” in edge cases. This suggests that explicit reasoning modes introduce intermittent, rather than systematic, length inflation.
6. Benchmark Resources and Leaderboard
The YapBench dataset (prompts, baselines, categories) is publicly available on Hugging Face: https://huggingface.co/datasets/tabularisai/yapbench_dataset. Benchmark scripts and metric implementations are hosted in an open-source repository linked from Tabularis.ai. A continuously updated live leaderboard is maintained at https://huggingface.co/spaces/tabularisai/YapBench, tracking new and updated models for YapIndex and YapTax metrics.
7. Limitations and Prospective Extensions
YapBench baseline construction involves some degree of annotator subjectivity, particularly in defining minimal sufficiency for ambiguous cases. Periodic audits are required for time-sensitive or evolving factual prompts. Provider-specific API preprocessing can further affect the treatment of blank or semantically null inputs. YapBench is explicitly scoped to brevity-ideal settings and does not penalize detailed responses where extended explanation is warranted.
Planned extensions include: measurement of brevity in safety/refusal prompts, brevity in brief corrections for misconception-trap scenarios, and construction of verbosity-control sets where detailed answers are desirable—to ensure that conciseness optimization does not result in pathological brevity.
YapBench delivers an interpretable framework for quantifying unnecessary verbosity on prompts where succinctness is optimal, thereby enabling more targeted development, selection, and fine-tuning of assistant LLMs for concise behaviors (Borisov et al., 2 Jan 2026).