
QuantSightBench: Benchmarking LLM Uncertainty

Updated 3 July 2026
  • QuantSightBench is a benchmark designed to assess LLMs’ capacity to generate calibrated prediction intervals for continuous quantitative forecasting tasks.
  • It utilizes rigorous evaluation metrics including empirical coverage, interval sharpness, and the Mean Log Interval Score to measure uncertainty quantification.
  • The benchmark supports stratified evaluations across diverse domains and retrieval settings, highlighting both performance strengths and calibration challenges.

QuantSightBench is a benchmark designed for rigorous evaluation of LLMs on quantitative forecasting tasks requiring explicit uncertainty quantification through prediction intervals. Unlike traditional LLM benchmarks that rely on discrete formats such as binary or multiple-choice questions, QuantSightBench assesses models on their capacity to generate calibrated numerical estimates over continuous domains, emphasizing scale awareness and internal consistency across confidence levels. The benchmark formalizes empirical coverage and sharpness metrics, introduces a proper scoring rule (Mean Log Interval Score), and provides stratified evaluation protocols across various settings and domain difficulties, delivering granular insights into LLM performance under uncertainty (Qin et al., 17 Apr 2026).

1. Motivation and Scope

Existing LLM evaluation benchmarks (e.g., ForecastBench, FutureX, Metaculus) generally reduce forecasting to binary or multiple-choice formats, which are insufficient for real-world decision-making contexts that require quantitative and continuous predictions. These legacy formats do not assess models' ability to:

  • Produce numerical point or interval estimates for quantities such as commodity deficits or population projections.
  • Quantify and communicate uncertainty explicitly.
  • Exhibit sensitivity to the scale or order of magnitude of the predicted variable.
  • Demonstrate internal consistency across different levels of prediction confidence (e.g., 90% intervals should contain 50% intervals).

Prediction intervals are adopted as the evaluation interface: intervals $[L_\alpha(x), U_\alpha(x)]$ such that the model asserts the true outcome $t$ lies within the range with probability $1-\alpha$. This design (a) ensures explicit and testable uncertainty quantification via empirical coverage, (b) demands scale-adaptive interval sizing, and (c) enforces internal nesting consistency across multiple confidence levels. Prediction intervals carry richer information than point estimates while being easier to elicit and evaluate than full predictive distributions (Qin et al., 17 Apr 2026).

2. Benchmark Construction and Dataset

The QuantSightBench dataset is constructed using the OpenForecast scraping and automatic question-generation pipeline. News articles scraped from CommonCrawl (January–August 2025) provide the source material, from which forecasting questions targeting future outcomes with resolution dates between September 2025 and January 2026 are generated. Only questions requiring non-trivial quantitative reasoning are retained. Each sample solicits a continuous-value numerical answer—e.g., annual silver supply deficit, national population, major accident fatalities—spanning diverse domains (see table).

| Domain | Count | % |
|---|---|---|
| Business, Finance & Technology | 258 | 25.8 |
| Politics & Geopolitics | 185 | 18.5 |
| Infrastructure & Transport | 122 | 12.2 |
| Culture & Entertainment | 113 | 11.3 |
| Crime & Legal | 101 | 10.1 |
| Sports | 96 | 9.6 |
| Science, Health & Environment | 70 | 7.0 |
| Education | 55 | 5.5 |
| Total | 1000 | 100.0 |

Forecasting settings include:

  • Zero-shot: model receives only the question; tests memorized knowledge.
  • Background-context: question plus a curated informational paragraph.
  • Agentic: iterative retrieval from a ∼320K chunk corpus (embedded via text-embedding-3-large), allowing multi-step evidence gathering prior to interval production (Qin et al., 17 Apr 2026).

3. Formal Evaluation Framework

Prediction Interval Definition

A prediction interval at nominal confidence $1-\alpha$ is $\mathrm{PI}_{\alpha}(x) = [L_\alpha(x), U_\alpha(x)]$. Models are required to explicitly report both bounds.

Empirical Coverage

Empirical coverage at level $1-\alpha$ is defined as

$$\mathrm{Coverage}(\alpha) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\bigl(t_i \in [L_\alpha(x_i), U_\alpha(x_i)]\bigr),$$

where $t_i$ is the ground truth for question $x_i$ and $N$ is the number of questions. Ideally, $\mathrm{Coverage}(\alpha) \approx 1-\alpha$.
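The coverage definition above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's released code: the function name `empirical_coverage` and the toy data are assumptions made for this example.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def empirical_coverage(intervals: Sequence[Interval], truths: Sequence[float]) -> float:
    """Fraction of ground-truth values that fall inside their closed interval."""
    if len(intervals) != len(truths) or not truths:
        raise ValueError("intervals and truths must be non-empty and equally long")
    hits = sum(1 for (lo, hi), t in zip(intervals, truths) if lo <= t <= hi)
    return hits / len(truths)


# Toy data, invented for illustration only (not taken from the benchmark).
intervals = [(10.0, 20.0), (0.5, 0.9), (1000.0, 5000.0), (3.0, 4.0)]
truths = [15.0, 0.95, 4200.0, 3.5]

coverage = empirical_coverage(intervals, truths)
print(coverage)  # 0.75: three of the four truths lie inside their intervals
```

If these were nominal 90% intervals, a coverage of 0.75 would indicate under-coverage of 15 percentage points.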

Interval Sharpness

Sharpness quantifies informativeness and is given by mean interval width:

$$\mathrm{Sharpness}(\alpha) = \frac{1}{N}\sum_{i=1}^N \bigl(U_\alpha(x_i) - L_\alpha(x_i)\bigr).$$
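A minimal sketch of the sharpness computation follows; the function name `sharpness` and the toy intervals are assumptions for illustration. The example also shows why raw mean width is hard to compare across questions of different magnitude: a single large-scale question dominates the average.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def sharpness(intervals: Sequence[Interval]) -> float:
    """Mean interval width; smaller values mean sharper (more informative) intervals."""
    if not intervals:
        raise ValueError("need at least one interval")
    return sum(hi - lo for lo, hi in intervals) / len(intervals)


# Toy intervals on very different scales (invented for illustration).
intervals = [(10.0, 20.0), (0.5, 1.0), (1000.0, 5000.0), (3.0, 4.0)]

mean_width = sharpness(intervals)
small_scale_only = sharpness([iv for iv in intervals if iv[1] < 100.0])

print(mean_width)        # 1002.875, driven almost entirely by the (1000, 5000) interval
print(small_scale_only)  # about 3.83 once the large-scale question is excluded
```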

Proper Scoring: Mean Log Interval Score (MLIS)

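To incentivize both calibration and tight intervals, QuantSightBench adopts the Winkler interval score on log-transformed values. For a single interval $[L, U]$ at level $1-\alpha$ and target $t$:

$$\mathrm{LIS}_\alpha(L, U; t) = (\log U - \log L) + \frac{2}{\alpha}(\log L - \log t)\,\mathbf{1}(t < L) + \frac{2}{\alpha}(\log t - \log U)\,\mathbf{1}(t > U).$$

The aggregate MLIS is:

$$\mathrm{MLIS}(\alpha) = \frac{1}{N}\sum_{i=1}^N \mathrm{LIS}_\alpha\bigl(L_\alpha(x_i), U_\alpha(x_i); t_i\bigr).$$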

The log transform (1) normalizes scales, (2) ensures properness, and (3) mitigates dominance of large-magnitude questions.
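The scoring rule can be sketched as follows, assuming the standard Winkler form (interval width plus a $2/\alpha$ penalty on the distance by which the target misses the interval) applied to natural logarithms of strictly positive bounds and targets. The choice of natural log, the positivity requirement, and the function names are assumptions of this sketch; the benchmark's exact transform and its handling of zero or negative values may differ.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def log_interval_score(lo: float, hi: float, t: float, alpha: float) -> float:
    """Winkler interval score computed on log-transformed values (lower is better)."""
    if not (0.0 < lo <= hi) or t <= 0.0:
        raise ValueError("this sketch assumes strictly positive bounds and target")
    if not (0.0 < alpha < 1.0):
        raise ValueError("alpha must lie strictly between 0 and 1")
    log_lo, log_hi, log_t = math.log(lo), math.log(hi), math.log(t)
    score = log_hi - log_lo                          # width term rewards sharpness
    if log_t < log_lo:
        score += (2.0 / alpha) * (log_lo - log_t)    # target below the interval
    elif log_t > log_hi:
        score += (2.0 / alpha) * (log_t - log_hi)    # target above the interval
    return score


def mean_log_interval_score(
    intervals: Sequence[Interval], truths: Sequence[float], alpha: float
) -> float:
    """Average of the per-question log interval scores."""
    if len(intervals) != len(truths) or not truths:
        raise ValueError("intervals and truths must be non-empty and equally long")
    scores = [log_interval_score(lo, hi, t, alpha) for (lo, hi), t in zip(intervals, truths)]
    return sum(scores) / len(scores)


# A 90% interval (alpha = 0.1) spanning one order of magnitude.
inside = log_interval_score(10.0, 100.0, 50.0, alpha=0.1)     # width only: ln 10
outside = log_interval_score(10.0, 100.0, 1000.0, alpha=0.1)  # ln 10 + 20 * ln 10
mlis = mean_log_interval_score([(10.0, 100.0), (10.0, 100.0)], [50.0, 1000.0], alpha=0.1)
```

Under these assumptions, missing by one order of magnitude at the 90% level costs twenty times the width of a one-order-of-magnitude interval, which is what pushes a model toward honest widths rather than narrow, overconfident ones.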

Calibration Curves and Internal Consistency

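Calibration curves plot empirical coverage against nominal coverage $1-\alpha$ across the evaluated values of $\alpha$; perfect calibration yields a diagonal plot. Internal consistency is required: for $\alpha_1 < \alpha_2$,

$$\mathrm{PI}_{\alpha_2}(x) \subseteq \mathrm{PI}_{\alpha_1}(x).$$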

Empirically, intervals should widen monotonically with confidence and avoid crossings (Qin et al., 17 Apr 2026).
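The nesting requirement is straightforward to check mechanically. The sketch below assumes each model answer is stored as a dictionary mapping a confidence level (for example 0.8, 0.9, 0.95) to a `(lower, upper)` pair; that data layout and the function names are hypothetical, chosen only for this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def is_nested(by_conf: Dict[float, Interval]) -> bool:
    """True if every higher-confidence interval contains each lower-confidence one."""
    levels = sorted(by_conf)  # ascending confidence, e.g. 0.8, 0.9, 0.95
    for lower, higher in zip(levels, levels[1:]):
        lo_in, hi_in = by_conf[lower]
        lo_out, hi_out = by_conf[higher]
        if lo_out > lo_in or hi_out < hi_in:
            return False  # the wider-confidence interval fails to contain the narrower one
    return True


def violation_rate(answers: List[Dict[float, Interval]]) -> float:
    """Share of questions whose intervals cross instead of nesting."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(not is_nested(a) for a in answers) / len(answers)


# Toy answers (invented): the second one raises its lower bound at 90%, a crossing.
answers = [
    {0.8: (40.0, 60.0), 0.9: (35.0, 65.0), 0.95: (30.0, 70.0)},
    {0.8: (40.0, 60.0), 0.9: (45.0, 70.0), 0.95: (30.0, 80.0)},
]
rate = violation_rate(answers)
print(rate)  # 0.5
```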

4. Model Evaluation Protocol

The evaluation pool includes both proprietary and open-weight LLMs, all with knowledge cut-off before September 2025. Notable models: Google Gemini 3.1 Pro, Anthropic Claude x2/Grok 4, OpenAI GPT-5.4, GPT-5.1, Gemini 3 Pro, Sonnet 4.5, GLM-4.7, DeepSeek v3.2, Opus 4.5/4.6, and KimiX. Each is tested at confidence levels of 80%, 90%, 95%. Ablations cover:

  • Prompt type: zero-shot, background-context, agentic.
  • Required reasoning effort: low, medium, high.
  • Inclusion/omission of explicit $1-\alpha$ (confidence) in the prompt.

Primary metrics are empirical coverage and MLIS at each confidence level $1-\alpha$. Secondary analyses stratify by ground-truth magnitude (e.g., 0–1, 1–10, 10–100, ..., 100K+), by number of agentic retrieval steps, and by the relationship between relative interval width and mean relative error. The full protocol and agentic prompting templates are publicly available (Qin et al., 17 Apr 2026).
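Stratifying coverage by ground-truth magnitude can be sketched as below. The source lists only the first few ranges and the final 100K+ bucket, so the intermediate bucket labels, the half-open boundary convention (a value of exactly 10 falls in 10-100), and the function names are all assumptions of this example.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)

# Hypothetical decade edges; only "0-1", "1-10", "10-100" and "100K+" appear in the text.
EDGES = [1, 10, 100, 1_000, 10_000, 100_000]
LABELS = ["0-1", "1-10", "10-100", "100-1K", "1K-10K", "10K-100K", "100K+"]


def magnitude_bucket(t: float) -> str:
    """Map a ground-truth value to its order-of-magnitude bucket (by absolute value)."""
    return LABELS[bisect_right(EDGES, abs(t))]


def coverage_by_bucket(
    intervals: Sequence[Interval], truths: Sequence[float]
) -> Dict[str, float]:
    """Empirical coverage computed separately within each magnitude bucket."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for (lo, hi), t in zip(intervals, truths):
        bucket = magnitude_bucket(t)
        totals[bucket] += 1
        hits[bucket] += int(lo <= t <= hi)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy data (invented): two small-magnitude questions and one very large one.
result = coverage_by_bucket(
    [(2.0, 8.0), (3.0, 4.0), (1e5, 2e5)],
    [5.0, 9.0, 3e5],
)
print(result)  # {'1-10': 0.5, '100K+': 0.0}
```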

5. Empirical Results

Coverage and Sharpness

None of the evaluated models achieve the nominal 90% coverage target in the agentic setting:

| Model | Coverage @ 90% | Gap to 90% (pp) |
|---|---|---|
| Gemini 3.1 Pro | 79.1% | −10.9 |
| Grok 4 | 76.4% | −13.6 |
| GPT-5.4 | 75.3% | −14.7 |
| GPT-5.1 | 74.6% | −15.4 |
| Opus 4.6 | 73.6% | −16.4 |
| DeepSeek v3.2 | 61.5% | −28.5 |
| GLM-4.7 | 62.7% | −27.3 |

Systematic overconfidence is observed: models assign too-narrow intervals that under-cover true outcomes.
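The gap column is simply observed coverage minus the nominal 90% target, expressed in percentage points. The short sketch below recomputes it from the coverage figures reported above; the variable names are illustrative.

```python
NOMINAL = 90.0  # nominal coverage target, in percent

# Agentic-setting coverage at the 90% level, as reported in the table above.
coverage_at_90 = {
    "Gemini 3.1 Pro": 79.1,
    "Grok 4": 76.4,
    "GPT-5.4": 75.3,
    "GPT-5.1": 74.6,
    "Opus 4.6": 73.6,
    "DeepSeek v3.2": 61.5,
    "GLM-4.7": 62.7,
}

# Negative gaps mean under-coverage, i.e. intervals that are too narrow.
gaps = {model: round(cov - NOMINAL, 1) for model, cov in coverage_at_90.items()}
best = max(coverage_at_90, key=coverage_at_90.get)

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:+.1f} pp")
```

Every gap is negative, and even the best-covered model in the table falls more than ten percentage points short of the target.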

Pareto analysis (Coverage vs. MLIS) reveals that frontier models cluster favorably (higher coverage, lower MLIS). GPT-5.1, despite slightly lower coverage than GPT-5.4, produces tighter (lower-MLIS) intervals.

Effects of Scale and Question Difficulty

Coverage decays significantly as the ground-truth magnitude increases: in the 1–10 range, coverage exceeds 80% for many models, while for questions with magnitudes above 100K, some models fall below 50%. MLIS is elevated for both fractional ($<1$) and very large ($>100$K) targets, highlighting limited model calibration at scale extremes.

Questions resolved in a single agentic retrieval achieve ∼86% coverage with low MLIS; those requiring five retrievals see drops to ∼65% coverage with higher MLIS. The need for many retrievals indicates question difficulty, rather than an adverse effect of retrieval on calibration.

There is a positive correlation between the relative interval width and the mean relative error of the interval midpoint, demonstrating that models conditionally widen intervals when they sense greater uncertainty, but not enough to attain nominal coverage (Qin et al., 17 Apr 2026).
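One way to probe this relationship is sketched below. The definitions used here are assumptions of the example rather than the benchmark's own: relative width is taken as interval width divided by the absolute midpoint, relative error as the midpoint's absolute error divided by the absolute ground truth, and the association is measured with a per-question Pearson correlation.

```python
import math
from typing import Sequence


def relative_width(lo: float, hi: float) -> float:
    """Interval width divided by the absolute midpoint (midpoint assumed nonzero)."""
    mid = (lo + hi) / 2.0
    return (hi - lo) / abs(mid)


def relative_error(lo: float, hi: float, t: float) -> float:
    """Absolute error of the interval midpoint relative to the truth (truth assumed nonzero)."""
    mid = (lo + hi) / 2.0
    return abs(mid - t) / abs(t)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy data (invented): wider relative intervals coincide with larger midpoint errors.
intervals = [(9.0, 11.0), (80.0, 120.0), (500.0, 1500.0)]
truths = [10.5, 110.0, 1600.0]

widths = [relative_width(lo, hi) for lo, hi in intervals]
errors = [relative_error(lo, hi, t) for (lo, hi), t in zip(intervals, truths)]
r = pearson(widths, errors)
print(r > 0)  # True for this toy sample
```

A positive correlation alone does not imply good calibration: in the toy sample the third interval is the widest and still misses its target.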

Ablation Studies

  • Providing background-context improves both coverage and MLIS compared to zero-shot. The agentic setting benefits open-weight models more than proprietary ones.
  • Higher reasoning effort (medium/high) yields improved calibration and sharper intervals, especially for weaker models.
  • Omitting the “90% interval” instruction in the prompt sharply reduces coverage (e.g., GPT-5.4 drops from 75.3% to 68.2%) and inflates MLIS.
  • As the nominal confidence level rises (80%, 90%, 95%), models widen their intervals, but systematic under-coverage persists. MLIS may improve at higher targets due to fewer out-of-interval penalties (Qin et al., 17 Apr 2026).

6. Analysis and Recommendations

QuantSightBench surfaces several pathologies and improvement vectors in current LLM quantitative reasoning:

  • Scale insensitivity: Models consistently underestimate uncertainties, especially for extreme magnitudes, failing to proportionally widen intervals for very large or small quantities.
  • Domain prior limitations: Iterative retrieval aids open-weight/weaker models substantially; frontier models appear to memorize priors yet fail at calibration in unseen regimes.
  • Instruction tuning bias: Models tuned via RLHF to favor point predictions tend to exhibit degraded calibration, particularly in distributional tails.

Calibration is noticeably worse at extreme magnitudes and at higher nominal confidence levels. Internal consistency violations (interval crossings) arise, indicating some models fail to properly nest intervals as confidence requirements change.

Improvement recommendations include:

  • Proper-scoring fine-tuning: Integrating MLIS or Winkler interval score as RL reward targets to incentivize more honest interval widths.
  • Scale-aware pretraining: Explicitly training with tasks spanning diverse orders of magnitude.
  • Hierarchical elicitation: Jointly eliciting intervals at multiple confidence levels with enforced monotonicity constraints (Qin et al., 17 Apr 2026).
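As an illustration of the monotonicity constraint in the last recommendation, the sketch below repairs a set of jointly elicited intervals after the fact by widening each higher-confidence interval until it contains the one below it. This post-processing step is one possible reading of "enforced monotonicity", not a method described by the benchmark authors; the function name and data layout are hypothetical.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def enforce_nesting(by_conf: Dict[float, Interval]) -> Dict[float, Interval]:
    """Widen intervals, in order of increasing confidence, so that they nest."""
    repaired: Dict[float, Interval] = {}
    prev: Optional[Interval] = None
    for conf in sorted(by_conf):
        lo, hi = by_conf[conf]
        if prev is not None:
            # Never shrink: take the union with the previous (lower-confidence) interval.
            lo, hi = min(lo, prev[0]), max(hi, prev[1])
        repaired[conf] = (lo, hi)
        prev = (lo, hi)
    return repaired


# Toy elicitation (invented) with two crossings: the 90% lower bound and the 95% upper bound.
raw = {0.8: (40.0, 60.0), 0.9: (45.0, 70.0), 0.95: (30.0, 65.0)}
fixed = enforce_nesting(raw)
print(fixed)  # {0.8: (40.0, 60.0), 0.9: (40.0, 70.0), 0.95: (30.0, 70.0)}
```

Because the repair only ever widens intervals, it cannot reduce coverage at any level, though it may cost some sharpness.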

7. Leaderboard, Tooling, and Ongoing Evaluation

QuantSightBench provides all code, prompting templates (including the agentic template), and data-generation scripts at https://github.com/aisa-group/quantsightbench, with a continuously updated leaderboard at https://quantsightbench.com/. The benchmark pipeline supports ongoing evaluation of emerging LLMs for reproducibility and extensibility, offering a unified protocol for quantitative uncertainty assessment in LLMs (Qin et al., 17 Apr 2026).
