GSSBench: LLM Generation Space Benchmark
- GSSBench is a benchmark framework that quantifies Generation Space Size (GSS) to evaluate the diversity and factuality of LLM outputs.
- It employs a suite of paired prompt tasks and metrics like EigenScore to assess and compare model output spaces for both creative and factual generations.
- The framework offers diagnostic tools for prompt ambiguity detection, reasoning analysis, and output diversity steering, demonstrating clear advantages over conventional diversity metrics.
GSSBench refers to a benchmark and analysis framework for evaluating and calibrating the open-endedness of outputs generated by LLMs through the formal notion of Generation Space Size (GSS). GSSBench provides both a conceptual schema for understanding diverse failure modes in LLMs and a concrete suite of prompt-based tasks for empirical assessment. The methodology unifies the study of diversity collapse in creative generation and hallucination in factual generation by focusing on a model's internal effective generation space relative to the ground-truth task requirement.
1. Definition and Theoretical Foundations
Generation Space Size (GSS) is defined as the cardinality of the set of semantically distinct outputs a model "considers" when given a prompt. For any given prompt $x$, there exists a ground-truth generation space $S^*(x)$ encapsulating all correct or appropriate responses, and a model-implied generation space $\hat{S}(x)$. The framework models the relationship as:

$$|\hat{S}(x)| = |S^*(x)| + \epsilon,$$

where $\epsilon$ is an error term quantifying the mismatch between the model's output diversity and the true requirement of the task. Thus, GSS formalizes both the collapse to homogeneous outputs in creative tasks (where $|S^*(x)|$ is large but $|\hat{S}(x)|$ is too small) and over-diverse, hallucinated outputs in factual settings (where $|S^*(x)|$ is small but $|\hat{S}(x)|$ is too large).
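As an illustration only (the paper estimates effective GSS via the metrics discussed in Section 3, not this procedure), one crude way to approximate $|\hat{S}(x)|$ is to sample several responses, embed them, and count semantic-equivalence clusters. The greedy threshold clustering and similarity cutoff below are hypothetical stand-ins:

```python
import numpy as np

def approx_effective_gss(embeddings: np.ndarray, sim_threshold: float = 0.9) -> int:
    """Greedy threshold clustering over response embeddings (shape (n, d));
    the cluster count is a crude proxy for the number of semantically
    distinct outputs the model produces for a prompt."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    exemplars: list[np.ndarray] = []
    for vec in normed:
        # Start a new cluster only if vec is not close to any existing exemplar.
        if not any(float(vec @ e) >= sim_threshold for e in exemplars):
            exemplars.append(vec)
    return len(exemplars)
```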
2. GSSBench Evaluation Suite
GSSBench is instantiated as a task suite of prompt pairs, each constructed such that the ground-truth generation space sizes are ordered, i.e., $|S^*(x_1)| > |S^*(x_2)|$ for each pair $(x_1, x_2)$. Six synthetic dataset types are constructed to facilitate this:
| Dataset Type | Construction Principle | Example Relationship |
|---|---|---|
| Complement | One prompt is the complement of another | "Generate a poem about the moon" vs. its complement |
| FactualQA | Single-answer factual vs. multi-answer prompt | "What is 2+2?" vs. "Name a country" |
| Random Choice | Explicitly enumerated options, varying counts | List random animals |
| Subset | Appending constraints to shrink the output set | "Write a haiku" vs. "Write a haiku about frogs" |
| Union / Intersection | Union of prompts enlarges GSS relative to a base prompt; intersection shrinks it | "Write a poem or a joke" (union) |
Prompt pairs in GSSBench are evaluated via the pairwise accuracy of candidate metrics: a metric is counted as correct on a pair when it assigns the higher value to the prompt with the strictly larger $|S^*(x)|$, as in the sketch below.
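A minimal sketch of this evaluation protocol, assuming a `metric` callable that returns a scalar proxy for effective GSS; the `(larger, smaller)` pair ordering convention is an assumption for illustration:

```python
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(
    pairs: Iterable[Tuple[str, str]],   # (larger-GSS prompt, smaller-GSS prompt)
    metric: Callable[[str], float],     # scalar proxy for effective GSS
) -> float:
    """Fraction of pairs where the metric ranks the larger-GSS prompt higher."""
    results = [metric(larger) > metric(smaller) for larger, smaller in pairs]
    return sum(results) / len(results)
```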
3. Metrics for Estimating Effective Generation Space
Multiple metrics are tested as proxies for $|\hat{S}(x)|$:
- Perplexity
- Energy
- Length-normalized entropy
- Lexical similarity
- Semantic entropy
- EigenScore (and its variants)
The general relationship is modeled as:

$$m(x) = |\hat{S}(x)| + \delta,$$

where $m(x)$ is the candidate metric and $\delta$ is its approximation error. EigenScore, calculated as the log-determinant of the covariance matrix of sentence embeddings derived from sampled generations, exhibits the strongest separation between prompts with different GSS in empirical evaluations, for example yielding clearly bimodal score distributions that align with the ground-truth relationships.
A more detailed EigenScore computation uses:

$$\mathrm{EigenScore} = \frac{1}{K} \log \det\left(Z^{\top} J_d\, Z + \alpha I_K\right),$$

where $\{y_1, \dots, y_K\}$ is a set of $K$ response samples, $Z = [z_1, \dots, z_K] \in \mathbb{R}^{d \times K}$ denotes their layer-wise embeddings, $J_d = I_d - \frac{1}{d}\mathbf{1}\mathbf{1}^{\top}$ is a centering matrix, $\alpha$ is a regularization parameter, and $d$ is the number of embedding dimensions.
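A minimal NumPy sketch of this computation, assuming the layer-wise embeddings have already been extracted (e.g., one hidden-state vector per sampled response); only the centering and log-determinant steps are shown:

```python
import numpy as np

def eigenscore(Z: np.ndarray, alpha: float = 1e-3) -> float:
    """EigenScore of an embedding matrix Z with shape (d, K):
    d embedding dimensions, K sampled responses."""
    d, K = Z.shape
    J_d = np.eye(d) - np.ones((d, d)) / d   # centering matrix
    sigma = Z.T @ J_d @ Z                   # (K, K) covariance of responses
    # slogdet is numerically safer than log(det(...)) for near-singular matrices.
    _, logdet = np.linalg.slogdet(sigma + alpha * np.eye(K))
    return logdet / K
```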
4. Applications and Interpretive Utilities
GSS and GSSBench enable several diagnostic and practical applications:
- Prompt Ambiguity Detection and Clarification Prediction: Higher EigenScore values correspond to more ambiguous prompts (i.e., larger $|S^*(x)|$), making GSS a diagnostic for flagging queries that may need clarification. Experiments with the RIFTS dataset show that EigenScore can reliably distinguish ambiguous from unambiguous prompts and predict whether clarification should be issued.
- Reasoning Model Analysis: GSS is used to interpret "overthinking" (when $|\hat{S}(x)|$ is excessively large relative to task complexity) and "underthinking" (when $|\hat{S}(x)|$ is too small to cover the task's requirements). EigenScore correlates with the number and richness of reasoning trajectories, and with explicit reasoning token counts in controlled logic/reasoning tasks.
- Steering for Output Diversity: The Leave-One-Out EigenScore (LOOE) measures the decrement in EigenScore when each candidate response is removed from the sampled set, providing a semantically sensitive, per-sample diversity score (see the sketch after this list). Applying LOOE within the Direct Preference Optimization (DPO) framework (specifically DivPO) allows the model to be explicitly steered toward outputs that are both high quality and appropriately diverse.
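A minimal sketch of LOOE, reusing the `eigenscore` function from the sketch in Section 3; interpreting a large drop as a high per-sample diversity contribution follows the description above:

```python
import numpy as np

def loo_eigenscores(Z: np.ndarray, alpha: float = 1e-3) -> np.ndarray:
    """Per-response diversity scores via Leave-One-Out EigenScore.
    Z has shape (d, K); column i is the embedding of response i.
    Reuses eigenscore() from the earlier sketch."""
    full = eigenscore(Z, alpha)
    K = Z.shape[1]
    drops = np.empty(K)
    for i in range(K):
        reduced = np.delete(Z, i, axis=1)   # remove response i's embedding
        drops[i] = full - eigenscore(reduced, alpha)
    return drops  # larger drop => response contributes more semantic diversity
```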
5. Empirical Findings and Metric Performance
The GSSBench empirical evaluation demonstrates that hallucination-detection metrics, especially EigenScore variants, consistently outperform standard lexical/semantic diversity and uncertainty quantification metrics in tracking ground-truth GSS ordering across prompt pairs. For example, while perplexity and lexical entropy yield overlapping score distributions for prompts that differ in GSS, EigenScore produces clear, bimodal separations aligned with the designed relationships.
This suggests that conventional diversity metrics are insufficient to capture the nuanced calibration of model output space required for open-ended and factual tasks, whereas model-internal, embedding-based metrics afford both sharper discriminative power and interpretability of a model’s internal representation of task space.
6. Limitations and Future Directions
Several limitations and open directions are articulated:
- GSS is a content-agnostic calibration notion: a model can be well-calibrated in GSS yet consistently generate incorrect content.
- There is an observed "inverse scaling" effect, where larger, instruction-tuned models are sometimes less well-calibrated in GSS.
- The search for improved proxy metrics—beyond EigenScore—for even finer-grained GSS estimation remains open.
- A plausible implication is that aligning LLMs to be "GSS-aware" during pretraining or alignment phases could yield controllable trade-offs between factuality (low GSS) and creativity (high GSS).
- Integration of GSS concepts with content-sensitive evaluation could provide more holistic adequacy and diversity diagnostics.
7. Significance and Impact
By unifying analysis of output diversity collapse and hallucination within a single framework, GSSBench provides an actionable methodology for diagnosing and controlling model generation space calibration. The clear empirical advantage of EigenScore-based metrics demonstrates the relevance of model-internal, semantic-space geometry to the qualitative control of LLM outputs. The framework facilitates future work in the alignment, evaluation, and steering of LLMs, particularly in open-ended generation settings where the calibration of creativity and factuality is essential (Yu et al., 14 Oct 2025).