Grading Methodologies for Long-Form Forecasts

Develop rigorous, reliable grading and scoring procedures for long-form natural-language forecasts to enable their evaluation and comparison across systems and datasets.

Background

The work focuses on short-answer, open-ended forecasting and adapts Brier-style scoring to free-form responses, enabling reinforcement learning with outcome-based rewards.
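As a concrete reference point for the scoring described above, the following is a minimal sketch of Brier scoring over resolved binary questions and its use as an outcome-based reward. The `Forecast` type, the probability-extraction step it presupposes, and the `1 - Brier` reward shape are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class Forecast:
    question: str
    probability: float  # predicted probability the event occurs, in [0, 1]
    outcome: int        # resolved outcome: 1 if it occurred, 0 otherwise


def brier_score(f: Forecast) -> float:
    """Squared error between predicted probability and resolved outcome.

    Lower is better; a perfect forecast scores 0.0, the worst scores 1.0.
    """
    return (f.probability - f.outcome) ** 2


def mean_brier(forecasts: list[Forecast]) -> float:
    """Average Brier score over a set of resolved questions."""
    return sum(brier_score(f) for f in forecasts) / len(forecasts)


def outcome_reward(f: Forecast) -> float:
    """One plausible RL reward: higher when the forecast was better calibrated.

    This (1 - Brier) shaping is an assumption for illustration, not the
    paper's confirmed reward function.
    """
    return 1.0 - brier_score(f)
```

Adapting this to free-form responses would additionally require mapping a natural-language answer to a probability and a resolution, which is exactly the step that remains unresolved for long-form forecasts.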

However, evaluating long-form forecasts—extended natural-language predictions together with their supporting reasoning—remains an open problem: no established method exists for grading such outputs, which limits the scope of current forecasting research and system development.

References

We also do not consider long-form forecasts, as it is unclear how to grade these.

Scaling Open-Ended Reasoning to Predict the Future (2512.25070 - Chandak et al., 31 Dec 2025), Conclusion, Section 6