Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora (2505.08905v2)

Published 13 May 2025 in cs.AI and cs.CL

Abstract: Language Models (LMs) continue to advance, improving response quality and coherence. Given Internet-scale training datasets, LMs have likely encountered much of what users may ask them to generate in some form during their training. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. However, the human effort required for benchmark construction is rapidly being outpaced by the size and scope of the models under evaluation. Having humans build a benchmark for every possible domain of interest is impractical. Therefore, we propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations. This work leverages the same LMs to evaluate domain-specific knowledge automatically, using only grounding documents (e.g., a textbook) as input. This synthetic data benchmarking approach corresponds well with human-curated questions, producing a Spearman ranking correlation of 0.97 and a benchmark evaluation Pearson accuracy correlation of 0.75. This novel approach supports generating both multiple choice and open-ended synthetic data questions to gain diagnostic insight into LM capability. We apply this methodology to evaluate model performance on two recent arXiv preprints, discovering a surprisingly strong performance from Gemma-3 models on open-ended questions. Code is available at https://github.com/mmajurski/grounded-synth-lm-benchmark


Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora

The paper "Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora" by Michael Majurski and Cynthia Matuszek addresses a central challenge in evaluating modern language models (LMs): human-built benchmarks cannot keep pace with the growing size and scope of the models under evaluation. Traditional benchmarks, though insightful, do not scale to every domain in which LMs are applied. The authors propose a methodology that automates benchmark creation using synthetic data grounded in document corpora, making evaluation more scalable and easier to adapt to a specific domain.

Methodology

The methodology uses LMs themselves to generate synthetic evaluation data grounded in factual source documents such as textbooks or professional manuals. Because building a human-curated benchmark for every conceivable domain is impractical, the grounding documents serve as the only input from which diagnostic questions about LM capabilities are generated.

Key Processes Include:

Document Chunking: Breaking down authoritative documents into manageable sections to serve as context for question generation.
Topic Extraction: Using LMs to identify pertinent topics within each document section, ensuring the questions are relevant and cover the document's breadth.
Question and Answer Generation: LMs generate multiple-choice and open-ended questions, along with correct answers and explanations, grounded in the document chunks.
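The first step and the hand-off to question generation can be sketched as follows. This is a minimal illustration and not the authors' implementation: the word-based chunking rule, the chunk size and overlap defaults, the function names, and the prompt wording are all assumptions made for the example. The returned prompt would be sent to whichever LM client is in use.

```python
from typing import List


def chunk_document(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word-based chunks, repeating `overlap` words between neighbors.

    Hypothetical helper: the chunking rule and default sizes are assumptions.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the document
    return chunks


def build_question_prompt(chunk: str, topic: str, multiple_choice: bool = True) -> str:
    """Assemble a question-generation prompt grounded in one chunk and one topic.

    Hypothetical prompt wording, shown only to illustrate the grounding idea.
    """
    style = "a multiple-choice question with four options" if multiple_choice \
        else "an open-ended question"
    return (
        f"Using only the context below, write {style} about the topic "
        f"'{topic}'. Also give the correct answer and a short explanation.\n\n"
        f"Context:\n{chunk}"
    )


if __name__ == "__main__":
    sample = "word " * 450  # stand-in for a grounding document
    pieces = chunk_document(sample, max_words=200, overlap=20)
    print(len(pieces), "chunks")
    print(build_question_prompt(pieces[0][:60], topic="example topic"))
```

Keeping the prompt builder separate from the LM call makes it possible to inspect exactly which context each generated question was grounded in.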

The generated synthetic benchmarks correlate well with human-curated ones, with a Spearman ranking correlation of 0.97 and a benchmark accuracy Pearson correlation of 0.75. These results indicate that the approach can approximate human-curated evaluations while substantially reducing the human effort required.
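To make the two statistics concrete, the sketch below computes a Pearson correlation on raw accuracies and a Spearman correlation on their ranks, using only the standard library. The per-model accuracy values are invented for illustration and are not results from the paper; the function names are likewise assumptions.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented accuracies for five hypothetical models (not data from the paper).
    human_benchmark = [0.62, 0.71, 0.55, 0.80, 0.67]
    synthetic_benchmark = [0.58, 0.75, 0.50, 0.83, 0.70]
    print("Spearman:", spearman(human_benchmark, synthetic_benchmark))
    print("Pearson:", pearson(human_benchmark, synthetic_benchmark))
```

Spearman compares only the ordering of models, so it answers whether the synthetic benchmark ranks models the same way the human one does, while Pearson is sensitive to how closely the accuracy values themselves track each other.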

Implications and Evaluation

This automated methodology addresses a gap in scalable, domain-specific LM evaluation. The paper reports that the synthetic questions, although sometimes longer and more detailed than human-written ones, retain comparable diagnostic value. It also notes a potential bias: questions that contain excessive detail may inadvertently change how difficult they are to answer.

To demonstrate practical utility, the methodology was applied to evaluate LM performance on two recent arXiv preprints, which revealed surprisingly strong performance from Gemma-3 models on open-ended questions. This application to recent research documents illustrates the approach's use outside curated benchmark settings.

Future Directions

Because the method relies on high-quality grounding documents, the factual accuracy and domain relevance of its benchmarks depend on those documents, which could help accelerate LM adoption in professional environments. Future work may refine topic extraction and question generation to further improve semantic diversity and alignment. Integrating tables and non-textual document elements also holds promise for richer evaluations.

In summary, Majurski and Matuszek's paper provides a scalable framework for LM benchmarking that adapts to domain-specific needs through the choice of grounding documents. It offers a pathway toward more efficient and relevant LM evaluations and a foundation for future work on automated benchmarking.


Authors (2)

Michael Majurski (5 papers)
Cynthia Matuszek (23 papers)
