UQ-Dataset: Benchmark for Unsolved Questions
- UQ-Dataset is a curated benchmark of 500 unsolved real-world questions from Stack Exchange, designed to assess high-difficulty reasoning and factuality in language models.
- It employs a three-stage curation process—automated extraction and rule-based filtering, LLM-based screening, and expert human review—to ensure only objectively challenging questions are selected.
- The evaluation framework uses oracle-free multi-level validation and community feedback, ensuring robust and transparent benchmarking of generative model performance.
The UQ-Dataset is a curated benchmark of 500 challenging, real-world unsolved questions sourced from Stack Exchange, constructed specifically to evaluate LLM performance on open-ended, high-difficulty reasoning and factuality tasks. It is designed to overcome the limitations of conventional exam-style and user-query benchmarks by targeting questions that are both difficult (having defied community solution for years) and naturally posed in authentic usage settings. Evaluation on the UQ-Dataset relies on oracle-free, LLM-driven validation pipelines, followed by human expert verification, to gauge model capability in producing objectively correct solutions to questions for which no ground-truth answers exist. This paradigm enables rigorous, real-world assessment of generative models at the frontier of open-domain question answering (Nie et al., 25 Aug 2025).
1. Dataset Construction Pipeline
The UQ-Dataset is assembled via a three-stage, multi-agent pipeline:
- Automated Extraction and Rule-Based Filtering: Millions of unanswered questions are scraped from >80 Stack Exchange network sites via the official API. Rule-based postprocessing removes low-value candidates using criteria such as age (>2 years), minimum views, per-site upvote quotas, inclusion in the top 10% of unanswered posts by votes, and absence of community answers—for example, reducing a raw pool of roughly 3 million questions to ∼33,916 candidates (see the filtering sketch below).
- LLM-Based Screening: Each question is paired with an LLM-generated answer (e.g., GPT-4o). A secondary reasoning LLM then scores the pair using a multidimensional rubric: clarity (well-defined, unambiguous), approachability, objective answerability, plus direct difficulty estimation metrics—such as "answer correctness" and "expert solvability" judged on a [0,1] scale. Questions must satisfy all binary criteria and fall below average thresholds for correctness and solvability to advance, ensuring selection of objectively difficult items.
- Human Expert Review: PhD-level annotators, with domain expertise, conduct a final round of curation. Content is filtered for duplication, triviality, policy compliance, and quality control, with a "diamond subset" of 25 high-engagement questions flagged for special consideration. In high-volume domains, crowdsourced moderation supplements expert review.
This sequential pipeline ensures that the resulting 500-question dataset is composed of questions that are demonstrably hard (as judged by both models and humans), realistic, and suitable for rigorous benchmarking.
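The first two stages reduce to simple threshold logic. The following is a minimal Python sketch of the rule-based filter and the LLM-screening gate described above; the field names, threshold constants, and rubric keys are illustrative assumptions rather than values from the paper (only the >2-year age rule and the [0,1] scoring scale are stated in the source).

```python
from dataclasses import dataclass

# Placeholder thresholds; the paper specifies per-site upvote quotas and a
# top-10%-by-votes cutoff that would be computed from the scraped data.
MIN_VIEWS = 100
CORRECTNESS_THRESHOLD = 0.5   # assumed "below average" cutoff on [0, 1]
SOLVABILITY_THRESHOLD = 0.5

@dataclass
class Candidate:
    """Minimal view of a scraped Stack Exchange question (illustrative fields)."""
    age_years: float
    views: int
    score: int            # net upvotes
    answer_count: int
    site_vote_quota: int  # per-site upvote quota, assumed precomputed

def passes_rule_filter(q: Candidate) -> bool:
    """Stage 1: keep old, visible, well-voted questions with no answers."""
    return (
        q.age_years > 2.0
        and q.views >= MIN_VIEWS
        and q.score >= q.site_vote_quota
        and q.answer_count == 0
    )

def passes_llm_screen(rubric: dict) -> bool:
    """Stage 2: advance questions judged clear, approachable, and objectively
    answerable, yet scored as hard (low correctness / expert solvability)."""
    binary_ok = all(rubric[k] for k in ("clear", "approachable", "objectively_answerable"))
    hard_enough = (
        rubric["answer_correctness"] < CORRECTNESS_THRESHOLD
        and rubric["expert_solvability"] < SOLVABILITY_THRESHOLD
    )
    return binary_ok and hard_enough
```

Candidates passing both gates would then proceed to the human expert review stage.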
2. Scope and Structure of Questions
The dataset covers a wide range of domains—mathematics, computer science theory, history, natural sciences, science fiction, mythology, and engineering—capturing the diversity of authentic information-seeking by knowledgeable users. Each entry includes metadata such as site of origin, markdown-formatted text, tags, and context.
A defining characteristic is that all questions are actively unsolved: they have remained unanswered on Stack Exchange despite significant attention, marking a structural difference from artificially generated or low-stakes questions. This design ensures a prevalence of complex, sometimes open-ended tasks (e.g., proofs, derivations, cross-domain synthesis) and, in non-STEM domains, challenging browsing- and identification-style tasks.
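For concreteness, a single dataset entry can be pictured along the following lines; this is a hypothetical schema sketched from the metadata listed above, and the field names do not come from the released files.

```python
from typing import List, TypedDict

class UQRecord(TypedDict):
    """Hypothetical per-question record; field names are illustrative."""
    question_id: str
    site: str            # Stack Exchange site of origin, e.g. "cstheory"
    title: str
    body_markdown: str   # question text in markdown
    tags: List[str]
    context: str         # any extra context captured during curation
    diamond: bool        # member of the 25-question high-engagement subset

example: UQRecord = {
    "question_id": "uq-0001",
    "site": "mathoverflow",
    "title": "…",
    "body_markdown": "…",
    "tags": ["nt.number-theory"],
    "context": "",
    "diamond": False,
}
```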
3. Oracle-Free Validation and Compound Evaluation
With no preexisting "gold answers" by construction, UQ-Dataset implements a multi-layered, oracle-free evaluation framework:
- LLM-Based Validators: Multiple LLM-based strategies are deployed: (i) correctness checking (does the answer fully resolve the question?), (ii) factual and logical consistency checking, and (iii) cycle consistency—using the answer to reconstruct the question and checking alignment with the original.
- Iterated and Ensemble Reflection: Repeated LLM sampling and iterative self-critique aggregate into a final judgment via majority or unanimous voting. For example, the acceptance criterion is formalized as
$$\mathrm{Accept}(a) = \mathbb{1}\!\left[\sum_{i=1}^{N} v_i(a) \ge \tau\right],$$
where $v_i(a) \in \{0,1\}$ is the $i$-th LLM validator's binary verdict and $\tau$ is an acceptance threshold.
- Compound Aggregation: Strategies such as majority vote, unanimous vote, and sequential pipeline validation are used to synthesize aggregate labels, improving robustness.
- Generator–Validator Gap: Models are, empirically, more accurate at validating answers than generating them (e.g., answer accuracy ∼20%, validation accuracy ∼65%), enabling compounded signal even in the absence of perfect oracle verification.
A plausible implication is that benchmark scores on the UQ-Dataset are more stable and less susceptible to gaming, since passing multi-stage validation is non-trivial for any single model.
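A minimal sketch of the validator-side aggregation follows, assuming a generic `judge_once` wrapper around an LLM call (a hypothetical helper; the paper's exact prompts and voting parameters are not reproduced here).

```python
from typing import Callable, List, Optional

Verdict = int  # 1 = accept, 0 = reject

def aggregate(verdicts: List[Verdict], mode: str = "majority",
              tau: Optional[int] = None) -> bool:
    """Combine binary validator verdicts into one label, implementing
    Accept(a) = 1[sum_i v_i >= tau] with common choices of tau."""
    n = len(verdicts)
    if mode == "unanimous":
        threshold = n
    elif mode == "majority":
        threshold = n // 2 + 1
    else:
        threshold = tau if tau is not None else n
    return sum(verdicts) >= threshold

def validate_answer(question: str, answer: str,
                    judge_once: Callable[[str, str], Verdict],
                    iterations: int = 3, mode: str = "unanimous") -> bool:
    """Oracle-free check: sample the LLM judge several times (e.g. correctness,
    consistency, and cycle-consistency prompts each yield one verdict) and
    aggregate the votes."""
    verdicts = [judge_once(question, answer) for _ in range(iterations)]
    return aggregate(verdicts, mode=mode)
```

Stricter aggregation (unanimous voting, larger $\tau$) trades recall for precision, which matters when a validator's accept decision determines whether a question is declared resolved.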
4. Platform, Continuous Improvement, and Community Verification
The UQ-Dataset is maintained on an open platform (https://uq.stanford.edu), which also provides community infrastructure:
- Each question is paired with candidate answers from models and all evaluation metadata.
- Experts (including original Stack Exchange participants and domain specialists) are invited to issue secondary judgments, provide corrections, or confirm high-quality solutions.
- This enables perpetual updating: as models evolve and new answers are proposed, previously unsolved questions may move to "resolved" status given community consensus.
- All dataset state (including answer provenance and judgment history) is documented for transparency and reproducibility.
5. Performance Metrics and Empirical Findings
Quantitative analysis includes:
- Pass rates: Only 15% of questions receive verified correct answers with current top models (e.g., o3-pro), with most models under 10% pass rates using the 3-iteration validator pipeline—emphasizing the dataset's challenge.
- Surrogate evaluations: Metrics such as accuracy, precision, and recall are calculated on transfer sets (e.g., Humanity’s Last Exam), demonstrating improved validator precision and accuracy when using ensemble or iterative schemes (at some cost to recall).
- Human verification: Preliminary manual review has already confirmed some correct model-generated answers, demonstrating the protocol’s practical effectiveness and ability to push the knowledge frontier.
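As an illustration of the surrogate evaluation above, validator quality on a transfer set with known labels can be summarized with standard accuracy, precision, and recall; the function below is a sketch, not code from the paper.

```python
from typing import Dict, List

def validator_metrics(predicted: List[int], gold: List[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall of an answer validator on a transfer
    set with known labels (1 = correct answer, 0 = incorrect)."""
    tp = sum(p == 1 and g == 1 for p, g in zip(predicted, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(predicted, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(predicted, gold))
    hits = sum(p == g for p, g in zip(predicted, gold))
    return {
        "accuracy": hits / len(gold),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```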
6. Real-World Impact and Scientific Significance
By design, answering UQ-Dataset questions provides intrinsic real-world value: questions are harvested from genuine user queries on technical forums, and solutions immediately benefit the source community. In mathematics and computer science, for example, correct answers may resolve open research-level queries.
The continuously updated, community-verified design ensures ongoing scientific impact: as models improve, the testbed itself evolves, and progress is directly linked to solving authentic, information-rich tasks.
7. Notation and Mathematical Rigor
Throughout, rigorous mathematical notation is employed to express evaluation strategies, dataset curation metrics, and, in some cases, the content of the questions themselves. Typical formulas include:
- Validator aggregation: $\mathrm{Accept}(a) = \mathbb{1}\!\left[\sum_{i=1}^{N} v_i(a) \ge \tau\right]$ for accept/reject thresholds $\tau$.
- Averaged correctness criteria: $\frac{1}{K}\sum_{k=1}^{K} s_k < \theta$, where $s_k \in [0,1]$ are LLM-judged correctness (or solvability) scores and $\theta$ is the screening threshold used in LLM-based filtering.
- Technical question content: some entries themselves feature research-level integral expressions and derivations, reflecting the advanced level of many questions.
In aggregate, the UQ-Dataset establishes an unprecedented, verifiable, and community-driven paradigm for benchmarking open-domain question solving in frontier LLMs. It fuses high difficulty and authenticity, leverages rigorous multi-level oracle-free validation, and promotes direct knowledge advancement through expert-verified solutions (Nie et al., 25 Aug 2025).