Future-as-Label: Scalable Supervision from Real-World Outcomes
Abstract: Time creates free supervision: forecasts about real-world events resolve to verifiable outcomes. The passage of time provides labels that require no annotation. To exploit this structure, we extend reinforcement learning with verifiable rewards to real-world prediction over time. We train LLMs to make probabilistic forecasts from causally masked information, using proper scoring rules as the reward function once events resolve. Learning is driven entirely by realized outcomes, enabling scalable outcome-based supervision in open-world prediction. On real-world forecasting benchmarks, Qwen3-32B trained using Foresight Learning improves Brier score by 27% and halves calibration error relative to its pretrained baseline, and outperforms Qwen3-235B on both constructed future-event prediction tasks and the Metaculus benchmark despite a 7x parameter disadvantage.
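As a concrete reference for the reward setup the abstract describes, here is a minimal sketch of proper-scoring-rule rewards for a binary forecast. This is not the paper's code: the function names are illustrative, and the clip bounds [0.001, 0.999] simply mirror the values reported for the method.

```python
import math

def log_score_reward(p_yes: float, outcome: int, eps: float = 1e-3) -> float:
    """Log-score reward for a binary forecast, computed once the event
    resolves (outcome is 0 or 1). The probability is clipped away from
    0 and 1 so the log score and its gradient stay bounded."""
    p = min(max(p_yes, eps), 1.0 - eps)
    p_outcome = p if outcome == 1 else 1.0 - p
    return math.log(p_outcome)  # maximized at a perfectly confident, correct forecast

def brier_score(p_yes: float, outcome: int) -> float:
    """Brier score (lower is better): squared error of the forecast."""
    return (p_yes - outcome) ** 2
```

Because the log score is a strictly proper scoring rule, the expected reward is maximized only by reporting one's true probability, which is what makes it usable as an RL reward rather than a classification loss.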
Knowledge Gaps
Unresolved knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains uncertain or unexplored in the paper and where future work can concretely extend or validate the approach:
- **Missing head-to-head with supervised learning.** The paper argues that RL with proper scoring rules differs from supervised likelihood training but does not compare against simple MLE/SFT baselines trained on the same resolved outcomes; run controlled experiments (same data, prompts, and masking) comparing RL vs. MLE on log score, Brier score, and calibration.
- **Optimizer and algorithmic ablations.** GRPO is used without ablations; evaluate PPO/REINFORCE, different baselines, group sizes K, and per-trajectory sampling counts to quantify stability, variance reduction, and sample efficiency.
- **Reward-choice sensitivity.** Training uses the log score but evaluation also uses the Brier score; assess training with the Brier score, spherical score, mixed objectives, or risk-sensitive utilities to understand trade-offs between calibration and sharpness.
- **Effect of clamping probabilities.** Predictions are clipped to [0.001, 0.999]; measure how clipping affects calibration, extreme-event performance, and log-score gradients, and test softer regularization alternatives.
- **Dataset scale and scaling laws.** Only ~5.1k training events are used; map performance against dataset size (e.g., 1k/2k/5k/10k+) to estimate sample complexity and the returns to additional resolved events.
- **Horizon-length generalization.** Horizons are "days to several weeks"; evaluate months-to-years horizons and analyze performance as a function of horizon length to identify where outcome-based supervision breaks down.
- **Outcome-space limitations.** Experiments are binary; implement and benchmark continuous (e.g., real-valued targets), multi-class, ordinal, survival/time-to-event, and free-text outcomes with appropriate scoring rules and resolvers.
- **Resolver reliability and bias.** The resolver (Gemini-2.5-Flash) determines ground-truth outcomes, but its error rates and biases are unquantified; perform human adjudication audits, inter-rater agreement studies, and sensitivity analyses under label noise.
- **Selection bias from discarded cases.** Low-confidence or unresolved events are removed; quantify and report the characteristics of discarded versus retained events to assess bias in difficulty, domain, salience, and horizon.
- **Temporal-masking fidelity.** Masking relies on publisher timestamps and exclusion rules; audit for leakage via late updates, timezone errors, syndicated content, or cached pages, and publish a leakage-test harness.
- **Under-specified information-state construction.** How "relevant pre-t text" is selected (retrieval strategy, filters, context budget) is not specified; document and ablate retrieval choices, and compare against retrieval-augmented baselines at inference time.
- **Limited baseline coverage.** Baselines omit strong alternatives such as post-hoc calibration (Platt scaling, isotonic regression, temperature scaling), retrieval-augmented prompting, and supervised training on curated historical forecasts; add these for a stricter test.
- **Ensemble baseline design.** Taking the median of 7 samples may understate ensembling; compare against means, trimmed means, quantile aggregation, and larger sample budgets (e.g., 32–128) with calibration layers.
- **Architecture dependence.** Results are shown for Qwen3-32B (and a prompted 235B); replicate on diverse base models (e.g., Llama, Mixtral, GPT-style) and sizes to assess generality and dependence on pretraining quality.
- **Compute and efficiency reporting.** Training cost (tokens, FLOPs, wall-clock time) and reward throughput are not reported; provide cost-versus-gain analyses to evaluate practical viability against SFT or calibration methods.
- **Statistical significance and uncertainty.** Tables lack confidence intervals and hypothesis tests; report bootstrap CIs and significance for all metrics, across multiple seeds, to rule out luck or seed sensitivity.
- **Generalization beyond English news.** Training uses English-language news; evaluate cross-lingual settings, low-resource languages, non-news sources (e.g., filings, transcripts), and multimodal evidence to test breadth.
- **Distribution-shift robustness.** Test sets are temporally disjoint but similar in construction; evaluate out-of-distribution shifts by topic, region, event rarity, and surprise level, and report per-domain error analyses.
- **Online learning and deployment feedback.** Training is offline; prototype and evaluate online outcome-driven updates, delayed-credit-assignment pipelines, and safeguards against non-stationarity.
- **Reflexivity and feedback loops.** In deployment, forecasts can influence outcomes; study the stability of outcome-based rewards in reflexive environments and propose mitigations (e.g., counterfactual labeling, intervention detection).
- **Sequential updating and multi-step episodes.** Episodes are single-shot; extend to rolling forecasts with multiple updates before resolution and assess whether outcome-based RL improves update discipline and coherence.
- **Sharpness-versus-calibration trade-offs.** ECE improves, but sharpness/resolution is not analyzed; report the Murphy decomposition of the Brier score, reliability diagrams, and average confidence to ensure gains do not come from overly conservative probabilities.
- **Error analysis by horizon, domain, and base rate.** Provide breakdowns of gains and failures by event type, base-rate bin, and horizon to identify where the method helps or harms.
- **Adversarial and ambiguous events.** Stress-test with ill-posed, adversarially phrased, or low-salience events to measure resolver and policy robustness.
- **Reproducibility of training.** Datasets and weights are released, but training code and the exact prompts/pipelines for the generator, predictor, and resolver are not clearly provided; release full code, prompts, and configs to enable faithful replication.
- **Ethical and governance considerations.** Outcome verification via a single LLM may embed systematic biases; evaluate fairness across topics (e.g., legal outcomes, geopolitics) and consider multi-resolver adjudication protocols.
- **Utility-aware training.** Only proper scoring rules are considered; explore asymmetric or application-specific utilities in which false positives and false negatives incur different costs, and test whether outcome-based RL can optimize them.
- **Extreme-event forecasting.** Assess performance on low-probability, high-impact events and tail calibration, including whether clipping and log-score training under- or over-penalize extremes.
- **Interactions with chain-of-thought.** The model generates internal reasoning trajectories, but their role is not probed; ablate visible versus hidden reasoning, trajectory-length constraints, and reasoning regularizers to determine what drives the gains.
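The GRPO ablation point above turns on group-relative advantage estimation, which replaces a learned value baseline with within-group reward normalization. A minimal sketch of that computation, assuming standard mean/std normalization over a group of sampled trajectories (not necessarily the paper's exact implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled trajectory's reward
    against the mean and standard deviation of its own group of K samples,
    so no separate learned value function (critic) is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All samples earned the same reward: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

An ablation over group size K would vary the length of this list per prompt and measure the variance of the resulting gradient estimates against sample cost.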
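For the sharpness-versus-calibration point, the Murphy decomposition can be computed with simple binning. This sketch is an assumption-laden illustration (the function name is not from the paper), and the decomposition is exact only when all forecasts within a bin are identical; with real-valued forecasts it is approximate:

```python
def murphy_decomposition(probs: list[float], outcomes: list[int],
                         n_bins: int = 10) -> dict[str, float]:
    """Murphy decomposition of the Brier score:
        Brier = reliability - resolution + uncertainty.
    A low Brier score driven mainly by low resolution would indicate
    overly conservative, hedged forecasts rather than genuine skill."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins: dict[int, list[int]] = {}
    for i, p in enumerate(probs):
        bins.setdefault(min(int(p * n_bins), n_bins - 1), []).append(i)
    reliability = resolution = 0.0
    for idx in bins.values():
        f_k = sum(probs[i] for i in idx) / len(idx)    # mean forecast in bin
        o_k = sum(outcomes[i] for i in idx) / len(idx)  # event frequency in bin
        reliability += len(idx) * (f_k - o_k) ** 2
        resolution += len(idx) * (o_k - base_rate) ** 2
    return {
        "reliability": reliability / n,   # lower is better (calibration)
        "resolution": resolution / n,     # higher is better (sharpness)
        "uncertainty": base_rate * (1.0 - base_rate),  # fixed by the data
    }
```

Reporting all three terms, rather than ECE alone, distinguishes a model that got sharper and better calibrated from one that merely retreated toward the base rate.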
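For the post-hoc calibration baselines suggested above, temperature scaling in the binary case is a one-parameter rescaling of the logit. A minimal sketch with illustrative names, fitting T by grid search over held-out resolved events (this is a generic baseline, not anything from the paper):

```python
import math

def apply_temperature(p: float, T: float) -> float:
    """Temperature-scale a binary probability: divide the logit by T,
    then map back through the sigmoid. T > 1 softens forecasts toward
    0.5; T < 1 sharpens them; T = 1 leaves them unchanged."""
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / T))

def fit_temperature(probs: list[float], outcomes: list[int]) -> float:
    """Pick T minimizing mean negative log score on held-out data."""
    grid = [0.25 * i for i in range(1, 17)]  # T in 0.25 .. 4.0
    def nll(T: float) -> float:
        total = 0.0
        for p, y in zip(probs, outcomes):
            q = min(max(apply_temperature(p, T), 1e-6), 1.0 - 1e-6)
            total += -math.log(q if y == 1 else 1.0 - q)
        return total / len(probs)
    return min(grid, key=nll)
```

A baseline like this is cheap relative to RL training, so showing that outcome-based training beats the base model *after* temperature scaling would be a substantially stricter test.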