LLM-Driven Sample Construction
- The paper introduces a post-hoc LLM-driven sample construction method using round-trip consistency and prompt-based scoring to significantly improve question quality over greedy and random sampling.
- Selection is framed as optimization over candidate sets using metrics such as F₁ and ROUGE-L, combined with ensemble techniques for robust, zero-shot evaluation.
- The approach offers practical applications in educational content, FAQ systems, and automated testing, reducing dependency on human-annotated references.
An LLM-driven sample construction method refers to a set of post-hoc strategies for selecting high-quality samples from the stochastic outputs of modern LLMs. In the context of question generation, and specifically under realistic constraints—where the LLM is a black box and human-annotated references are not available at inference time—the approach comprises prompt-based selection mechanisms that operate over candidate sets of model outputs. These methods have demonstrated marked improvements over greedy selection and random sampling, both in automatic metrics and human evaluations (Yuan et al., 2022).
1. Prompt-Based Selection Strategies
The methodology centers on two main post-sampling selection strategies:
- Round-Trip Consistency: This strategy leverages cycle-consistent evaluation by passing each candidate question through a QA system (also LLM-based), thereby assessing whether the predicted answer matches the provided gold answer. The selection minimizes the divergence between the answer produced by the QA system and the known reference, typically operationalized via metrics such as F₁ (for SQuAD) or ROUGE-L (for FairytaleQA):

  Selection: $q^{*} = \arg\max_{q \in Q}\, \mathrm{sim}\big(\mathrm{QA}(c, q),\, a\big)$, where $Q$ is the candidate set, $c$ the context, $a$ the gold answer, and $\mathrm{sim}$ the F₁ or ROUGE-L agreement (a minimal sketch of this scorer follows the list below).
- Prompt-Based Scoring: In this approach, the model is prompted (in two steps) to self-evaluate candidate questions along several human-aligned quality dimensions such as grammatical correctness, clarity, offensiveness, relevance, importance, specificity, and answerability. The prompt sequence asks for open-ended reasoning first, followed by a closed rating on a predefined scale. Two key scoring schemes emerge:
- OPS (Overall Prompt Score): One final rating per candidate.
- APS (Averaged Prompt Score): Dimension-wise ratings, averaged to provide the overall score.
In some configurations, these scores are ensembled with other evaluation signals (e.g., n-gram similarity, round-trip scores) after normalization to improve robustness.
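To make the two selection strategies concrete, the following Python sketch scores a set of candidate questions with both mechanisms. It assumes a black-box `llm(prompt)` function wrapping whichever completion API is available, a SQuAD-style token-level F₁ for answer agreement, and a 1–5 rating scale for the prompt-based dimensions; these names, prompts, and the scale are illustrative assumptions, not the exact prompts used by Yuan et al. (2022).

```python
import re
from collections import Counter

def llm(prompt: str) -> str:
    """Black-box LLM call (assumption): wrap whatever API client is available."""
    raise NotImplementedError("Plug in an API client here.")

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def round_trip_score(question: str, context: str, gold_answer: str) -> float:
    """Round-trip consistency: ask an LLM-based QA system to answer the
    candidate question, then measure agreement with the gold answer."""
    predicted = llm(f"Context: {context}\nQuestion: {question}\nAnswer briefly:")
    return token_f1(predicted, gold_answer)

QUALITY_DIMENSIONS = [
    "grammatical correctness", "clarity", "non-offensiveness",
    "relevance", "importance", "specificity", "answerability",
]

def prompt_based_scores(question: str, context: str) -> dict:
    """Two-step self-evaluation: open-ended reasoning first, then a closed
    1-5 rating. Returns per-dimension ratings (for APS) and one overall
    rating (OPS)."""
    ratings = {}
    for dim in QUALITY_DIMENSIONS + ["overall quality"]:
        reasoning = llm(
            f"Context: {context}\nQuestion: {question}\n"
            f"Discuss the {dim} of this question."
        )
        rating = llm(
            f"{reasoning}\nOn a scale from 1 to 5, rate the {dim} "
            f"of the question. Reply with a single number."
        )
        match = re.search(r"[1-5]", rating)
        ratings[dim] = float(match.group()) if match else 3.0  # mid-scale fallback
    ops = ratings.pop("overall quality")
    aps = sum(ratings.values()) / len(ratings)
    return {"OPS": ops, "APS": aps, "per_dimension": ratings}
```

Selecting the round-trip winner is then `max(candidates, key=lambda q: round_trip_score(q, context, gold_answer))`; the OPS/APS scores can be used the same way, or normalized and ensembled with other signals as in the sketch at the end of Section 3.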
2. Mathematical Formulation and Selection Objectives
The selection process is formalized as an optimization problem. Given a candidate set $Q = \{q_1, \ldots, q_k\}$ and a scoring metric $m$, the objective is

$$q^{*} = \arg\max_{q \in Q} m(q).$$

For the n-gram similarity measures used in certain baselines, each candidate is scored by the overlap between its unique n-grams and those of the other candidates, where $G_n(\cdot)$ extracts unique n-grams. For round-trip consistency, the process is

$$m_{\text{rt}}(q_i) = \mathrm{sim}\big(\mathrm{QA}(c, q_i),\, a\big),$$

with $\mathrm{sim}$ instantiated as F₁ (SQuAD) or ROUGE-L (FairytaleQA). The selection metric can be the minimum, mean, or maximum over an ensemble of candidate-specific metrics.
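As a concrete instance of the selection objective, the sketch below implements an n-gram similarity score together with the generic arg-max selection. The average Jaccard overlap of unique tri-grams is an illustrative assumption, not necessarily the paper's exact overlap measure or choice of n.

```python
from typing import Callable, List, Set, Tuple

def unique_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """G_n(.): the set of unique n-grams of a question."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity_score(candidate: str, others: List[str], n: int = 3) -> float:
    """Average Jaccard overlap between the candidate's unique n-grams and
    those of every other candidate (consensus-style scoring)."""
    grams = unique_ngrams(candidate, n)
    overlaps = []
    for other in others:
        other_grams = unique_ngrams(other, n)
        union = grams | other_grams
        overlaps.append(len(grams & other_grams) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

def select_best(candidates: List[str], metric: Callable[[str], float]) -> str:
    """q* = argmax_{q in Q} m(q): the generic selection objective."""
    return max(candidates, key=metric)

# Usage: pick the candidate whose tri-grams agree most with the rest of the sample.
# best = select_best(
#     candidates,
#     lambda q: ngram_similarity_score(q, [c for c in candidates if c != q]),
# )
```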
3. Empirical Results and Evaluation Protocols
Evaluation involves both automatic and human-centered metrics:
- Automatic:
- BLEU-4 for SQuAD and ROUGE-L for the FairytaleQA task.
- All proposed selection methods (n-gram, round-trip, prompt-based scores, and ensembles) outperform naive baselines (greedy, sample mean).
- Human:
- Questions rated on seven quality dimensions and an overall score.
- Prompt-based scores (APS), targeting multiple evaluative dimensions, demonstrate superior sample selection for datasets with complex, multi-paragraph contexts (FairytaleQA) compared to simple n-gram similarity, which excels in extractive, sentence-level contexts (SQuAD).
- Notably, ground-truth questions frequently do not receive top ratings under human evaluation.
Ensemble approaches—such as combining round-trip consistency with APS or tri-gram similarity—consistently yield the highest performance across both automatic and human judgments.
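A minimal sketch of such an ensemble, assuming min-max normalization per metric and a simple mean as the combination rule (both illustrative choices rather than the paper's exact recipe):

```python
from typing import Dict, List

def minmax_normalize(scores: List[float]) -> List[float]:
    """Rescale one metric's scores over the candidate set to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)  # all candidates tie on this metric
    return [(s - lo) / (hi - lo) for s in scores]

def ensemble_select(candidates: List[str],
                    metric_scores: Dict[str, List[float]]) -> str:
    """Combine several per-candidate metrics (e.g., round-trip F1, APS,
    tri-gram similarity) after normalization; return the top candidate."""
    normalized = [minmax_normalize(scores) for scores in metric_scores.values()]
    composite = [sum(col) / len(col) for col in zip(*normalized)]
    best_index = max(range(len(candidates)), key=composite.__getitem__)
    return candidates[best_index]
```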
4. Domain Constraints and Real-World Deployment
The method is explicitly designed for settings where:
- The LLM is accessed as a black box (API only, no parameter tuning permissible).
- No human-labeled reference data is available for post-processing.
- Selection must be fully automated and zero-shot, which is critical for commercial or low-resource deployment scenarios.
No model fine-tuning occurs; all improvements are attributable purely to post-hoc sample selection.
5. Limitations and Future Directions
Key limitations include:
- Dependence on high-capability APIs (e.g., GPT-3), which may be cost-prohibitive or unavailable in some environments.
- Performance variability as a function of domain: sample selection mechanisms yield stronger returns in specific context types (e.g., prompt-based scoring for abstractive multi-paragraph answers versus n-gram similarity for sentence-level extractive tasks).
- Risk of divergence or generation of harmful content in the absence of robust safeguards—all sampled questions are generated without direct human oversight, raising concerns for sensitive applications.
Future work aims to extend these techniques to open-source alternatives (GPT-J, BLOOM), improve robustness across varied question generation domains, and develop enhanced selection metrics that further mimic nuanced human judgment.
6. Practical Applications and Implications
The presented LLM-driven sample construction methods are suitable for applications such as:
- Automated educational content creation (quiz/exam generation)
- FAQ systems and interactive customer support
- Any application where high-quality, diverse natural language generation enhances user experience and comprehension
The methods require no additional training or labeled data at inference, enabling practical adoption for organizations with only API access to proprietary LLMs. By demonstrating that post-hoc selection mechanisms are capable of substantially improving quality, the work establishes a blueprint for robust, scalable LLM deployment in real-world generative language tasks.
7. Summary of Key Strategies and Insights
| Strategy | Mechanism | Selection Signal |
|---|---|---|
| Round-trip consistency | Cycle-consistent QA with an LLM | F₁ / ROUGE-L agreement with the gold answer |
| Prompt-based scoring | Meta-prompts over quality dimensions | OPS (single rating) or APS (averaged ratings) |
| Ensemble selection | Combined normalized metrics | Highest composite score |
In summary, LLM-driven sample construction by post-hoc prompt-based selection (round-trip consistency and meta-question scoring), without fine-tuning or reference data, raises the quality of generated outputs well above standard greedy or stochastic methods. Ensemble strategies further improve robustness and serve as reference architectures for real-world, black-box LLM deployment in question generation and related natural language tasks.