A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis

Published 13 Feb 2025 in cs.CL (arXiv:2502.09316v1)

Abstract: Evaluating the open-ended text generation of LLMs is challenging because of the lack of a clear ground truth and the high cost of human or LLM-based assessments. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 question and reference answer sets, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness. Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities.

Summary

  • The paper proposes a novel benchmark that leverages the distributional hypothesis to assess LLM outputs without human judgment.
  • It employs character-level n-gram statistics focusing on fluency, truthfulness, and helpfulness, achieving a 0.9896 correlation with GPT-4o evaluations.
  • The approach offers a scalable and resource-efficient alternative validated against existing LLM benchmarks, particularly for Japanese language models.

A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis

Introduction

The paper presents a novel approach to evaluating LLMs for open-ended text generation using a benchmark grounded in the distributional hypothesis. The motivation stems from the high cost and variability associated with human or LLM-as-a-judge evaluations, which the proposed benchmark aims to alleviate by employing deterministic n-gram statistics. This benchmark is designed to assess LLMs across three metrics: Fluency, Truthfulness, and Helpfulness. Importantly, the benchmark strongly correlates with GPT-4o-based evaluations while being significantly more resource-efficient.

Methodology

Benchmark Design

The benchmark is built around 50 carefully curated questions spanning multiple disciplines, such as language, social studies, and the sciences. Each question is paired with a reference answer set constructed using state-of-the-art Japanese-specific LLMs. The evaluation framework is structured to yield scalable and reliable performance metrics based on n-gram distributions, without relying on subjective judgment (Figure 1).

Figure 1: Evaluation outline illustrating the workflow of generating and evaluating responses using the benchmark framework.
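A minimal sketch of how such benchmark entries and the scoring loop could be organized, assuming a plausible data layout; BenchmarkEntry, evaluate_model, and the field names below are illustrative assumptions, not the authors' released schema:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkEntry:
    """One benchmark item: a question, its reference answers, and key phrases."""
    question: str                     # one of the 50 curated questions
    reference_answers: list[str]      # answers produced by strong Japanese LLMs
    key_phrases: list[str] = field(default_factory=list)  # used for Helpfulness

def evaluate_model(generate, entries, score_response):
    """Run generate(question) -> answer over all entries and average the
    per-metric scores. score_response is a scoring function such as the one
    sketched under Evaluation Metrics below."""
    totals = {"fluency": 0.0, "truthfulness": 0.0, "helpfulness": 0.0}
    for entry in entries:
        answer = generate(entry.question)
        for name, value in score_response(answer, entry).items():
            totals[name] += value
    return {name: value / len(entries) for name, value in totals.items()}
```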

Evaluation Metrics

  • Fluency is measured through the inner product of 10-gram occurrence ratios against a reference set, ensuring that generated text aligns closely with expected language patterns.
  • Truthfulness assesses the prevalence of reliable factual information in responses, ensuring that low-frequency hallucinations are minimized.
  • Helpfulness evaluates the informational completeness of responses based on predefined key phrases for each question (a code sketch of all three metrics follows this list).
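
Since the summary does not give the exact formulas, the following is a minimal sketch of how the three metrics could be computed from character-level n-gram statistics. The Truthfulness proxy in particular (n-gram overlap with the reference set) is an assumption rather than the paper's exact rule, and score_response expects an entry object like the BenchmarkEntry sketched above:

```python
from collections import Counter

def char_ngram_ratios(text: str, n: int = 10) -> Counter:
    """Occurrence ratios of character-level n-grams (counts normalized by total)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return Counter({g: c / total for g, c in grams.items()})

def fluency_score(answer: str, references: list[str], n: int = 10) -> float:
    """Inner product of the answer's n-gram ratios with the pooled reference ratios."""
    ref_ratios = char_ngram_ratios("".join(references), n)
    ans_ratios = char_ngram_ratios(answer, n)
    return sum(r * ref_ratios.get(g, 0.0) for g, r in ans_ratios.items())

def truthfulness_score(answer: str, references: list[str], n: int = 10) -> float:
    """Illustrative proxy: share of the answer's n-gram mass that also occurs in
    the references, so unsupported (potentially hallucinated) spans score low."""
    ref_grams = set(char_ngram_ratios("".join(references), n))
    ans_ratios = char_ngram_ratios(answer, n)
    if not ans_ratios:
        return 0.0
    return sum(r for g, r in ans_ratios.items() if g in ref_grams)

def helpfulness_score(answer: str, key_phrases: list[str]) -> float:
    """Fraction of predefined key phrases covered by the answer."""
    if not key_phrases:
        return 0.0
    return sum(phrase in answer for phrase in key_phrases) / len(key_phrases)

def score_response(answer, entry):
    """Score one answer against a BenchmarkEntry-like object."""
    return {
        "fluency": fluency_score(answer, entry.reference_answers),
        "truthfulness": truthfulness_score(answer, entry.reference_answers),
        "helpfulness": helpfulness_score(answer, entry.key_phrases),
    }
```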

Results

The study confirms the benchmark's robustness by comparing its scores against an LLM-as-a-judge system (GPT-4o), achieving a high correlation coefficient of 0.9896 and demonstrating comparable reliability with significantly reduced computational overhead (Figure 2).

Figure 2: Comparison between our benchmark and GPT-4o's LLM-as-a-judge evaluation method.
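To illustrate how such an agreement check can be reproduced, the Pearson correlation between the two sets of per-model scores can be computed directly; the score arrays below are placeholders, not the paper's data:

```python
import numpy as np

# Placeholder per-model scores: the paper reports r = 0.9896 between its
# benchmark and GPT-4o judgments, but these numbers are illustrative only.
benchmark_scores = np.array([0.62, 0.71, 0.55, 0.80, 0.67])
gpt4o_scores = np.array([6.1, 7.0, 5.4, 8.1, 6.6])

r = np.corrcoef(benchmark_scores, gpt4o_scores)[0, 1]
print(f"Pearson correlation: {r:.4f}")
```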

Further analyses against existing benchmarks such as the Nejumi LLM Leaderboard 3 and Japanese MT-Bench reveal correlations above 0.7, indicating that the method captures a broad range of LLM capabilities (Figure 3).

Figure 3: Comparison between our benchmark and Nejumi LLM Leaderboard 3 shows strong alignment in performance evaluations.

Discussion

The paper posits that the adoption of character-level n-grams, tailored to the Japanese language's unique attributes, provides a stable and scalable evaluation mechanism for open-ended text generation tasks. This approach is a significant shift from LLM-as-a-judge models, offering a cost-effective alternative that maintains high fidelity to established performance metrics.

Conclusion

By leveraging the distributional hypothesis, this judge-free benchmark offers a viable, resource-efficient method to assess LLM capabilities on open-ended tasks. Its strong correlations with existing high-cost evaluation methods affirm its validity and its potential as a new standard for LLM evaluation, particularly in resource-constrained settings or for evaluations tailored to specific models.

The full implementation details and materials are accessible on GitHub, allowing for community engagement and further development of this approach.
