
Future-as-Label: Scalable Supervision from Real-World Outcomes

Published 9 Jan 2026 in cs.LG and cs.AI | (2601.06336v2)

Abstract: Time creates free supervision: forecasts about real-world events resolve to verifiable outcomes. The passage of time provides labels that require no annotation. To exploit this structure, we extend reinforcement learning with verifiable rewards to real-world prediction over time. We train LLMs to make probabilistic forecasts from causally masked information, using proper scoring rules as the reward function once events resolve. Learning is driven entirely by realized outcomes, enabling scalable outcome-based supervision in open-world prediction. On real-world forecasting benchmarks, Qwen3-32B trained using Foresight Learning improves Brier score by 27% and halves calibration error relative to its pretrained baseline, and outperforms Qwen3-235B on both constructed future-event prediction tasks and the Metaculus benchmark despite a 7x parameter disadvantage.

Summary

  • The paper introduces outcome-based supervision that uses post-event outcomes as labels, overcoming delayed and sparse feedback in forecasting.
  • It distinguishes between a predictor and a resolver, using Group Relative Policy Optimization to assign credit and reduce variance in forecast training.
  • Empirical evaluations demonstrate a 27% Brier score improvement and superior calibration on both synthetic benchmarks and external datasets.

Future-as-Label: Outcome-Based Supervision for Real-World Prediction

Problem Setting and Motivation

Forecasting real-world events with LLMs fundamentally suffers from the absence of immediate, contemporaneously available supervision. In many domains, such as judicial decisions, geopolitical events, and corporate actions, ground truth arrives only after a lag, so labels are available only post hoc and conventional supervised learning is inapplicable. Standard RL with verifiable rewards (RLVR) methods rely on reward signals that can be computed contemporaneously with prediction, naturally limiting them to domains with closed-form, automatable verification (e.g., mathematics, code, or formal proofs).

The paper "Future-as-Label: Scalable Supervision from Real-World Outcomes" (2601.06336) formalizes and operationalizes outcome-based supervision derived retroactively from the actualization of future events. This formulation generalizes RLVR to settings with open-world uncertainty, sparse feedback, and intrinsic delays between prediction and resolution, establishing a framework for supervised policy optimization aligned with the true causal structure of real-world forecasting.

Methodology

Causal Structure and Roles

The learning protocol distinguishes between a "predictor" and a "resolver":

  • Predictor: The trainable LLM, which, at time t, only observes a temporally masked information state (all data timestamped no later than t). It emits a probabilistic prediction p ∈ (0, 1).
  • Resolver: An external, frozen process (in this work, a frozen LLM) with full access to post-t information, tasked exclusively with resolving ground truth y ∈ {0, 1} after event actualization at some s > t.

There is strict causal separation: the predictor receives no outcome information during prediction or learning, and the resolver cannot access predictor outputs, ensuring no leakage or endogenous reward shaping.
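The temporal masking that underlies this separation amounts to a timestamp filter over the evidence corpus. The sketch below is illustrative; the document structure and field names are assumptions, not the paper's implementation:

```python
from datetime import date

def mask_information(docs, cutoff):
    """Keep only documents knowable at forecast time t (timestamp <= cutoff).

    The predictor is trained on the filtered view; post-cutoff documents are
    reserved for the resolver, which never sees the predictor's outputs.
    """
    return [d for d in docs if d["timestamp"] <= cutoff]

docs = [
    {"timestamp": date(2025, 3, 1), "text": "pre-cutoff report"},
    {"timestamp": date(2025, 6, 1), "text": "post-cutoff outcome article"},
]

visible = mask_information(docs, cutoff=date(2025, 4, 15))
# Only the pre-cutoff report survives; the outcome article is withheld.
```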

Objective and Optimization

Supervision is derived exclusively from post-resolution outcomes, with the per-episode terminal reward given by the log-score (proper scoring rule):

R(p, y) = y log p + (1 − y) log(1 − p)

While this superficially resembles supervised likelihood optimization, the formulation is fundamentally RL-style: the predictor samples reasoning trajectories under masked state, credit is attributed by the terminal outcome with no intermediate rewards, and optimization proceeds via policy gradient, specifically Group Relative Policy Optimization (GRPO), which reduces estimator variance by groupwise subtraction of the within-group mean reward.
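The reward and group-relative credit assignment can be sketched as below. This assumes mean-only baseline subtraction as described above, plus the probability clipping to [0.001, 0.999] noted later in the summary; the function names and `eps` default are illustrative:

```python
import math

def log_score(p: float, y: int, eps: float = 1e-3) -> float:
    """Proper scoring rule used as the terminal reward (probabilities clipped)."""
    p = min(max(p, eps), 1.0 - eps)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def grpo_advantages(probs, y):
    """Group-relative advantages: each reward minus the within-group mean.

    Subtracting the groupwise mean centers the rewards, reducing the variance
    of the policy-gradient estimate without changing its expectation.
    """
    rewards = [log_score(p, y) for p in probs]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# A group of sampled forecasts for one event that resolved positive (y = 1):
adv = grpo_advantages([0.9, 0.6, 0.3], y=1)
# The most confident correct forecast gets a positive advantage; the
# least confident gets a negative one, and advantages sum to zero.
```

Note that some GRPO variants also divide by the within-group reward standard deviation; only mean subtraction is described here, so the sketch follows that.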

Training and Data Generation Protocol

  • Dataset Construction: Events are paired with all information available up to a strict cutoff t, prohibiting any post hoc leakage. Outcomes are resolved automatically using a frozen, post-hoc LLM given exhaustive post-t sources. Training, test, and evaluation sets are strictly partitioned in time.
  • Masking: All model inputs are timestamp-masked so that the predictor sees only information that would have been knowable at the forecast moment.
  • Evaluation: Metrics include log-score, Brier score, and expected calibration error (ECE), across both a synthetic, tightly controlled future-event benchmark and an independent Metaculus dataset.
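The two headline metrics can be computed as follows. This is a generic sketch of the Brier score and one common equal-width-binned variant of ECE, not the paper's exact evaluation code:

```python
def brier(preds, labels):
    """Brier score: mean squared error between probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)

def ece(preds, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins.

    Within each bin, compare the average forecast probability to the
    empirical outcome frequency; weight by the bin's share of forecasts.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(preds)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        freq = sum(y for _, y in b) / len(b)
        err += len(b) / total * abs(avg_p - freq)
    return err

# Toy example: well-ordered but slightly under-confident forecasts.
preds = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
```

Lower is better for both; a perfectly calibrated forecaster has ECE of zero regardless of accuracy, which is why the paper reports the two metrics separately.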

Empirical Findings

The Qwen3-32B model trained with Foresight Learning under outcome-based supervision demonstrates:

  • 27% improvement in Brier score and a halving of calibration error relative to the base pretrained model when evaluated on held-out synthetic event benchmarks.
  • Consistent outperformance relative to both model scale and sampling-based baselines: Qwen3-32B with Foresight Learning exceeds the accuracy and calibration of the much larger Qwen3-235B model, indicating that gains stem from the training objective rather than scale or ensembling.
  • Generalization to external benchmarks: On independently sourced Metaculus questions, Foresight-trained models retain strong calibration and accuracy, despite domain shift.
  • Variance ablation: Pure ensembling/generation of multiple probabilistic outputs does not close the gap, underscoring the learning benefit of outcome-aligned supervision.

Significance and Implications

This work operationalizes RL-style training under real-world, outcome-based feedback, bridging the full temporal gap between forecast and resolution—a regime that is unavoidable in open-world deployment. The demonstrated gains in both accuracy and (notably) calibration highlight the centrality of outcome-centric policy shaping for reliable forecasting, as proper scoring rules effectively penalize overconfidence and reward uncertainty quantification.

From a theoretical perspective, Foresight Learning establishes a framework for scalable RL in environments with delayed, sparse, and exogenous feedback. By leveraging group-relative credit assignment, it mitigates the sample inefficiency that plagues naive policy gradients under sparse rewards. The method is robust to scale and outperforms sampling heuristics, suggesting that—even in large-model regimes—training objective selection can have impacts exceeding those from model scaling alone.

Limitations and Future Work

  • Offline training: All feedback is derived from pre-resolved events. Extending to online and continually updating forecasters that autonomously collect and resolve labels would require integrating deployment-time feedback.
  • Automated Question/Resolution Biases: The fully automated pipeline for event and outcome specification, while reproducibility-friendly, is subject to data coverage and resolver error risks.
  • Outcome Space: The work is restricted to binary outcomes; extension to continuous, categorical, or complex structured labels (e.g., distributions, free text) remains open.

Conclusion

The paper provides a formal and empirical advance in outcome-aligned model training, showing that direct supervision from temporally resolved, real-world outcomes yields superior accuracy and calibration over both scaling and sampling-based SFT alternatives. This paradigm is essential for scaling LLMs into open-world domains, and points toward the broader adoption of outcome-based RL objectives for robust, deployable future-event prediction models. Further extension to richer event structures and online learning scenarios is an important next step for the field.


Knowledge Gaps

Unresolved knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains uncertain or unexplored in the paper and where future work can concretely extend or validate the approach:

  • Missing head-to-head with supervised learning. The paper argues RL with proper scoring rules differs from supervised likelihood training but does not compare against simple MLE/SFT baselines trained on the same resolved outcomes; run controlled experiments (same data, prompts, and masking) comparing RL vs MLE on log score, Brier, and calibration.
  • Optimizer and algorithmic ablations. GRPO is used without ablations; evaluate PPO/REINFORCE, different baselines, group sizes K, and per-trajectory sampling counts to quantify stability, variance reduction, and sample efficiency.
  • Reward choice sensitivity. Training uses the log score but evaluates also with Brier; assess training with Brier, spherical score, mixed objectives, or risk-sensitive utilities to understand trade-offs in calibration vs sharpness.
  • Effect of clamping probabilities. Predictions are clipped to [0.001, 0.999]; measure how clipping affects calibration, extreme-event performance, and log-score gradients, and test softer regularization alternatives.
  • Dataset scale and scaling laws. Only ~5.1k training events are used; map performance vs dataset size (e.g., 1k/2k/5k/10k+) to estimate sample complexity and returns to additional resolved events.
  • Horizon length generalization. Horizons are “days to several weeks”; evaluate months-to-years horizons and analyze performance as a function of horizon length to identify where outcome-based supervision breaks down.
  • Outcome space limitations. Experiments are binary; implement and benchmark continuous (e.g., real-valued targets), multi-class, ordinal, survival/time-to-event, and free-text outcomes with appropriate scoring rules and resolvers.
  • Resolver reliability and bias. The resolver (Gemini-2.5-Flash) determines ground-truth outcomes, but error rates and biases are unquantified; perform human adjudication audits, inter-rater agreement studies, and sensitivity analyses to label noise.
  • Selection bias from discarded cases. Low-confidence or unresolved events are removed; quantify and report characteristics of discarded vs retained events to assess bias in difficulty, domain, salience, and horizon.
  • Temporal masking fidelity. Masking relies on publisher timestamps and exclusion rules; audit for leakage via late updates, timezone errors, syndicated content, or cached pages, and publish a leakage test harness.
  • Information-state construction under-specified. How “relevant pre-t text” is selected (retrieval strategy, filters, context budget) is not specified; document and ablate retrieval choices, and compare to retrieval-augmented baselines at inference.
  • Limited baseline coverage. Baselines omit strong alternatives such as post-hoc calibration (Platt/isotonic/temperature scaling), retrieval-augmented prompting, and supervised training on curated historical forecasts; add these for a stricter test.
  • Ensemble baseline design. Using a median of 7 samples may understate ensembling; compare to means, trimmed means, quantile aggregation, and larger sample budgets (e.g., 32–128) with calibration layers.
  • Architecture dependence. Results are shown for Qwen3-32B (and a prompted 235B); replicate on diverse base models (e.g., Llama, Mixtral, GPT-style) and sizes to assess generality and dependence on pretraining quality.
  • Compute and efficiency reporting. Training cost (tokens, FLOPs, wall-clock) and reward throughput are not reported; provide cost vs gain analyses to evaluate practical viability versus SFT or calibration methods.
  • Statistical significance and uncertainty. Tables lack confidence intervals and hypothesis tests; report bootstrap CIs and significance for all metrics, across multiple seeds, to rule out luck or seed sensitivity.
  • Generalization beyond English news. Training uses English-language news; evaluate cross-lingual settings, low-resource languages, non-news sources (e.g., filings, transcripts), and multimodal evidence to test breadth.
  • Distribution shift robustness. Test sets are temporally disjoint but similar in construction; evaluate OOD shifts by topic, region, event rarity, and surprise level, and report per-domain error analyses.
  • Online learning and deployment feedback. Training is offline; prototype and evaluate online outcome-driven updates, delayed credit assignment pipelines, and safeguards for non-stationarity.
  • Reflexivity and feedback loops. In deployment, forecasts can influence outcomes; study stability of outcome-based rewards under reflexive environments and propose mitigations (e.g., counterfactual labeling, intervention detection).
  • Sequential updating and multi-step episodes. Episodes are single-shot; extend to rolling forecasts with multiple updates before resolution and assess whether outcome-based RL improves update discipline and coherency.
  • Sharpness vs calibration trade-offs. ECE improves, but sharpness/resolution is not analyzed; report Murphy decomposition of Brier, reliability diagrams, and average confidence to ensure gains aren’t from overly conservative probabilities.
  • Error analysis by horizon/domain/base rate. Provide breakdowns of gains and failures by event type, base rate bins, and horizon to identify where the method helps or harms.
  • Adversarial and ambiguous events. Stress-test with ill-posed, adversarially phrased, or low-salience events to measure resolver and policy robustness.
  • Reproducibility of training. Datasets and weights are released, but training code and exact prompts/pipelines for generator, predictor, and resolver are not clearly provided; release full code, prompts, and config to enable faithful replication.
  • Ethical and governance considerations. Outcome verification via a single LLM may embed systematic biases; evaluate fairness across topics (e.g., legal outcomes, geopolitics) and consider multi-resolver adjudication protocols.
  • Utility-aware training. Only proper scoring rules are considered; explore asymmetric or application-specific utilities where false positives/negatives incur different costs, and test whether outcome-based RL can optimize such utilities.
  • Extreme-event forecasting. Assess performance on low-probability, high-impact events and tail calibration, including whether clipping and log-score training under- or over-penalize extremes.
  • Interactions with chain-of-thought. The model generates internal trajectories, but their role is not probed; ablate visible vs hidden reasoning, trajectory length constraints, and reasoning regularizers to see what drives gains.

