YapScore: Evaluating LLM Verbosity

Updated 8 January 2026

YapScore is a metric that quantifies excessive verbosity in large language models by comparing the output length to a minimal sufficient baseline.
It isolates over-generation by measuring extra characters across diverse prompts, thereby enabling direct benchmarking of model conciseness.
The metric provides actionable insights into cost, energy consumption, and cognitive load, influencing future improvements in LLM efficiency.

YapScore is a model evaluation metric that quantifies excess response length in LLM outputs on prompts where brevity is ideal, allowing verbosity to be compared directly across assistant LLMs. Developed within the YapBench benchmark framework, YapScore isolates over-generation (unnecessary or redundant response text) by counting the characters in a model's output that exceed a curated minimal-sufficient baseline for each prompt. Because it counts characters rather than tokens, the metric is tokenizer-agnostic. It supports ranking LLMs by their tendency to produce unnecessarily verbose answers, a behavior with significant implications for cognitive load, cost, and energy consumption (Borisov et al., 2 Jan 2026).

1. Conceptual Motivation and Problem Definition

YapScore addresses the problem of systematic verbosity in assistant LLMs (e.g., ChatGPT, Claude, Gemini), which frequently over-generate content in response to prompts where concise answers suffice. Over-generation includes restating the prompt, excessive explanations, hedging, and inclusion of boilerplate, which imposes additional cognitive load on users, increases per-token API costs, and raises energy consumption in deployed systems. Existing benchmarks predominantly reward correctness and safety, failing to measure concise sufficiency. YapScore, as the core metric of YapBench, isolates verbosity, explicitly quantifying unnecessary response length on single-turn, brevity-ideal prompts. A plausible implication is that YapScore enables targeted evaluation and improvement of model alignment with user preferences regarding succinctness (Borisov et al., 2 Jan 2026).

2. Dataset Construction and Prompt Categories

YapBench employs a collection of 304 single-turn English prompts, partitioned into three brevity-ideal categories that reflect common use cases where minimal answers are optimal:

Category	Size	Prompt Characterization
A	60	Minimal/ambiguous (e.g., “…”, “help”, “error”)
B	126	Closed-form factual (“What is the capital…?”)
C	118	One-line coding (single command/snippet)

Each prompt is paired with a manually curated, minimal-sufficient baseline answer. Baselines are iteratively reviewed to ensure that every character is necessary for correctness, clarity, and self-containment, avoiding extraneous explanation, formatting, or context. Prompts exhibiting ambiguity or instability are excluded. This dataset composition ensures YapScore remains robust and strictly diagnostic of verbosity, not correctness or factual coverage.
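The pairing of prompts, categories, and baselines described above can be pictured as a simple record type. The following is a minimal sketch only: the class name `BenchItem`, its field names, and the example item are illustrative assumptions and are not taken from the released dataset; only the category sizes come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchItem:
    """One benchmark entry (hypothetical schema, for illustration only)."""
    category: str  # "A", "B", or "C"
    prompt: str    # single-turn English prompt
    baseline: str  # manually curated minimal-sufficient answer


# Category sizes as reported in the table above.
CATEGORY_SIZES = {"A": 60, "B": 126, "C": 118}
TOTAL_PROMPTS = sum(CATEGORY_SIZES.values())

# Made-up example item; not an actual YapBench prompt.
example = BenchItem(category="B", prompt="What is 2 + 2?", baseline="4")

print(TOTAL_PROMPTS)          # 304
print(len(example.baseline))  # 1
```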

3. YapScore and Aggregate Metrics

For each prompt $i$ and model $M$:

Let $B_i$ denote the baseline answer length in characters, and $L_i(M)$ the corresponding model output length.
The per-prompt metric is:

$\text{YapScore}_i(M) = \max\{0,\, L_i(M) - B_i\}$

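A YapScore of zero denotes alignment with baseline brevity; positive values indicate excess text. Because of the floor at zero, an output shorter than the baseline also scores zero rather than earning a negative value.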
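As a minimal sketch of the per-prompt formula, the function below computes the character-level excess of an output over its baseline. The function name `yap_score` and the example strings are illustrative assumptions, not part of the YapBench code. Python's `len` counts Unicode code points, which is assumed here to match the benchmark's notion of a character.

```python
def yap_score(model_output: str, baseline: str) -> int:
    """Excess characters of the output over the baseline, floored at zero."""
    return max(0, len(model_output) - len(baseline))


# Hypothetical example with a 5-character baseline.
baseline = "Paris"
concise = "Paris"
verbose = "The capital of France is Paris."  # 31 characters

print(yap_score(concise, baseline))  # 0
print(yap_score(verbose, baseline))  # 26
print(yap_score("", baseline))       # 0 (shorter outputs are not rewarded)
```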

Aggregate performance employs category medians to reduce outlier impact:

For the prompt set $I_c$ of category $c \in \{A, B, C\}$, define $Y_c(M) = \text{median}\{\text{YapScore}_i(M) : i \in I_c\}$.
Uniform category weights $w_c = \frac{1}{3}$ yield the overall index:

$\text{YapIndex}(M) = \sum_{c} w_c\, Y_c(M) = \frac{Y_A(M) + Y_B(M) + Y_C(M)}{3}$

Lower YapIndex values indicate less verbosity across all categories, allowing direct comparison and leaderboard ranking of model conciseness (Borisov et al., 2 Jan 2026).
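The aggregation can be sketched as follows, assuming per-prompt YapScores have already been computed and grouped by category. The function name `yap_index` and all score values are made up for illustration. The example also shows why medians are used: the single 3000-character outlier in category C does not move that category's median.

```python
from statistics import median


def yap_index(scores_by_category: dict[str, list[int]]) -> float:
    """Uniformly weighted mean of the per-category median YapScores."""
    medians = [median(scores) for scores in scores_by_category.values()]
    return sum(medians) / len(medians)


# Made-up per-prompt YapScores for one model (not real benchmark data).
scores = {
    "A": [0, 40, 400],   # median 40
    "B": [0, 10, 20],    # median 10
    "C": [5, 10, 3000],  # median 10, unaffected by the outlier
}

print(yap_index(scores))  # 20.0
```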

4. Evaluation Protocol and Procedures

YapBench evaluates 76 assistant LLMs from providers including OpenAI, Anthropic, Google DeepMind, xAI, Meta, and Mistral. Inference uses the OpenRouter API, with no system prompt, and temperature set to zero when available. Each test comprises a single-turn evaluation, excluding chain-of-thought and hidden reasoning traces from length counts. Response lengths are measured in raw character count, avoiding tokenizer dependencies. For every prompt-model pair, YapScore is calculated, followed by category medians and global YapIndex computation. Leaderboards rank models by YapIndex, incentivizing lower verbosity without reference to correctness or safety.
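The protocol above can be sketched end to end as a small loop. This is an assumption-laden illustration, not the benchmark's actual harness: the `generate` callables stand in for API calls (no OpenRouter request is made), the three prompts and baselines are toy data, and the model names `terse` and `chatty` are hypothetical. Each callable is assumed to return only the visible answer, with any reasoning trace already removed.

```python
from statistics import median
from typing import Callable

# Toy (category, prompt, baseline) triples; not actual YapBench items.
PROMPTS = [
    ("A", "help", "What do you need help with?"),
    ("B", "What is 2 + 2?", "4"),
    ("C", "Shell command to list files", "ls"),
]
BASELINES = {prompt: baseline for _, prompt, baseline in PROMPTS}
SUFFIX = " Hope that helps!"  # boilerplate appended by the chatty stub


def evaluate(generate: Callable[[str], str]) -> float:
    """Compute a YapIndex-style score for one model over PROMPTS."""
    per_category: dict[str, list[int]] = {}
    for category, prompt, baseline in PROMPTS:
        output = generate(prompt)  # visible answer only
        excess = max(0, len(output) - len(baseline))
        per_category.setdefault(category, []).append(excess)
    medians = [median(values) for values in per_category.values()]
    return sum(medians) / len(medians)


# Stub "models": one echoes the baseline, one adds boilerplate.
models: dict[str, Callable[[str], str]] = {
    "chatty": lambda prompt: BASELINES[prompt] + SUFFIX,
    "terse": lambda prompt: BASELINES[prompt],
}

results = {name: evaluate(fn) for name, fn in models.items()}
# Lower score ranks higher on the leaderboard.
leaderboard = sorted(results.items(), key=lambda item: item[1])
for name, score in leaderboard:
    print(f"{name}: {score}")
```

Passing the generation step in as a callable keeps the scoring logic independent of any particular provider or client library.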

5. Empirical Observations and Failure Modes

YapBench reveals a spread of more than an order of magnitude in aggregate verbosity among evaluated models, with YapIndex ranging from approximately 23 characters (gpt-3.5-turbo) to 1,427 characters (glm-4.5), roughly a 60-fold difference. Category-specific failure modes are observed:

Category A (“vacuum filling”): Models frequently generate unsolicited, lengthy content where a brief clarification is optimal.
Category B: Closed-form queries elicit unnecessary definitions, restatements, or caveats in lieu of minimal factual responses.
Category C: One-line technical prompts provoke formatting overhead, explanatory prose, code-block duplication, or multi-line outputs.

Notably, conciseness does not correlate monotonically with release date or perceived capability; older models (e.g., gpt-3.5-turbo) may outperform recent releases in prompt brevity. Variants incorporating explicit reasoning may decrease median verbosity but risk increased worst-case excess (Borisov et al., 2 Jan 2026).
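The distinction between median and worst-case excess can be made concrete with a small sketch. The two score lists below are invented purely to illustrate the pattern; they are not measurements from the benchmark, and the variable names are hypothetical.

```python
from statistics import median

# Invented per-prompt YapScores for two hypothetical model variants.
standard = [30, 40, 50, 60, 70]
reasoning = [0, 5, 10, 20, 900]

for name, scores in (("standard", standard), ("reasoning", reasoning)):
    print(f"{name}: median={median(scores)}, worst-case={max(scores)}")
# standard: median=50, worst-case=70
# reasoning: median=10, worst-case=900
```

A median-based index ranks the second variant as more concise, even though its single worst response is far longer, which is why worst-case excess is worth inspecting alongside the median.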

6. Benchmark Availability, Utility, and Limitations

YapBench v0.1, including prompts, baselines, and category labels, is publicly released via HuggingFace. Leaderboards are continuously updated, and code for metric computation is open-source. Limitations include annotator subjectivity in baseline construction, potential prompt obsolescence (particularly for factual Q&A), and provider-specific API preprocessing of nonstandard inputs. Because the benchmark contains only brevity-ideal prompts, it does not, by design, measure or penalize verbosity in settings where longer answers are warranted. Proposed future extensions include new prompt types such as brevity-ideal refusals, misconception correction, and control cases requiring genuine elaboration, so that conciseness is rewarded only where brevity is genuinely appropriate (Borisov et al., 2 Jan 2026).


References (1)

Borisov et al. (2026). Do Chatbot LLMs Talk Too Much? The YapBench Benchmark.
