This paper introduces Mirai, a benchmark designed to evaluate the capabilities of LLM agents in forecasting international events (Ye et al., 1 Jul 2024). The authors argue that while LLM agents show promise in autonomously gathering information and reasoning, their effectiveness in the complex domain of international event forecasting lacks rigorous evaluation. Existing methods often rely on a single data modality (knowledge graphs or text) and lack transparency in their reasoning.
Mirai Benchmark:
- Data and Task:
- Mirai uses data derived from the Global Database of Events, Language, and Tone (GDELT), carefully pre-processed and cleaned. It includes structured events and textual news articles from January 1, 2023, to November 30, 2023.
- Events are represented as quadruples (t, s, r, o), where t is the timestamp (date), s and o are the subject and object countries (ISO-3166 codes), and r is the relation type based on the CAMEO ontology (using both first-level two-digit and second-level three-digit codes).
- The forecasting task is defined as predicting the set of relations between countries s and o that will occur l days in the future, given all historical information up to time t. Queries are formulated as (t + l, s, ?, o), with the relation slot left to be predicted (see the sketch below).
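To make the quadruple and query formulation concrete, here is a minimal Python sketch; the field names and string encodings are illustrative assumptions, not Mirai's actual data classes.

```python
from dataclasses import dataclass

# Hypothetical sketch of the event quadruple and query described above;
# field names are illustrative and may differ from Mirai's data classes.
@dataclass(frozen=True)
class Event:
    date: str        # timestamp t, e.g. "2023-11-03"
    subject: str     # ISO-3166 country code s, e.g. "USA"
    relation: str    # CAMEO code r, e.g. "042" (second-level) or "04" (first-level)
    obj: str         # ISO-3166 country code o, e.g. "CHN"

@dataclass(frozen=True)
class Query:
    date: str        # future date t + l
    subject: str     # country s
    obj: str         # country o
    # the relation slot is the unknown to be predicted: (t + l, s, ?, o)

query = Query(date="2023-11-18", subject="USA", obj="CHN")
# An agent must predict the set of CAMEO relation codes r such that
# (t + l, s, r, o) occurs, using only information available up to time t.
```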
- Agentic Environment and APIs:
- Mirai provides an environment where LLM agents interact with the database using a code-based interface through APIs.
- Agents use a ReAct-style (Think, Act, Observe) iterative process.
- The API includes data classes (Date, ISOCode, CAMEOCode, Event, etc.) and functions to query historical events, news articles, country/relation information, and distributions. Functions support filtering by date range, entities, relations, and text descriptions.
- Two action types are supported for the "Act" step:
- Single Function: Executes a single, predefined API function call.
- Code Block: Executes a multi-line Python code snippet, allowing complex logic, loops, conditionals, and use of libraries such as `numpy`, `networkx`, and `scikit-learn`.
- The environment executes the generated code in a sandbox and returns the output (or an error message) as the "Observation" (see the sketch below).
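The following is a self-contained sketch of the "Code Block" action step, assuming agent actions arrive as Python source strings and that the query APIs (here a stubbed `get_event`) are injected into the execution namespace; the real sandbox and API signatures are more involved than this.

```python
import io, contextlib, traceback

def execute_code_block(code: str, api_namespace: dict) -> str:
    """Sketch of the 'Code Block' action step: run agent-generated Python in a
    namespace that exposes the query APIs, returning stdout (or the error text)
    as the next Observation. The real Mirai sandbox is more restrictive."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, dict(api_namespace))  # fresh namespace per action
        return buffer.getvalue() or "(no output)"
    except Exception:
        return "Error: " + traceback.format_exc(limit=1)

# Stub standing in for the benchmark's event-query API; the real function's
# signature and return type are assumptions made for this illustration.
def get_event(head_entity, tail_entity, date_range):
    return [("2023-10-05", head_entity, "042", tail_entity)]

agent_action = """
events = get_event(head_entity="USA", tail_entity="CHN",
                   date_range=("2023-10-01", "2023-11-17"))
print(f"retrieved {len(events)} events with relations {sorted({e[2] for e in events})}")
"""
print(execute_code_block(agent_action, {"get_event": get_event}))
```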
- Database Construction:
- GDELT data was filtered to January-November 2023, standardized (ISO country codes, second-level CAMEO codes), cleaned (removing low-quality and domestic events and keeping only events above a 50-mention threshold), and aligned with news publish dates.
- News articles were downloaded and cleaned using the OBELICS protocol.
- The final database contains ~992k GDELT records (59k unique events) and ~297k news articles.
- A test set of 705 queries was constructed from November 2023 data using stricter filtering (at least 100 mentions and at least 5 news articles per event), resulting in 2,136 unique events as ground-truth answers. A balanced subset of 100 queries was also created (see the filtering sketch below).
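A hedged sketch of the test-set filtering described above, assuming the cleaned events sit in a pandas DataFrame with hypothetical column names (`num_mentions`, `num_articles`, etc.); the paper's actual construction pipeline is more detailed.

```python
import pandas as pd

# Hedged sketch of the stricter test-set filtering; the column names
# ("num_mentions", "num_articles", ...) are assumptions, not the GDELT schema.
def build_test_answers(events: pd.DataFrame) -> pd.DataFrame:
    nov = events[(events["date"] >= "2023-11-01") & (events["date"] <= "2023-11-30")]
    strict = nov[(nov["num_mentions"] >= 100) & (nov["num_articles"] >= 5)]
    # One ground-truth set of relation codes per (date, subject, object) query.
    return (strict.groupby(["date", "subject", "object"])["relation"]
                  .apply(lambda codes: sorted(set(codes)))
                  .reset_index(name="answer_relations"))
```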
- Evaluation Metrics:
- Forecasts are evaluated using Precision, Recall, and F1 scores for both first-level and second-level predicted CAMEO codes against the ground truth.
- Kullback-Leibler (KL) divergence is used to measure the discrepancy between the predicted and ground-truth distributions over binary (Conflict vs. Cooperation) and quad-class (Verbal/Material Conflict/Cooperation) relation groupings (see the sketch below).
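As a rough illustration of the metrics, the sketch below computes set-based precision/recall/F1 over predicted CAMEO codes and a smoothed KL divergence over class distributions; the paper's exact aggregation across queries (e.g., micro vs. macro averaging) is not reproduced here.

```python
import math

def set_prf1(predicted: set, truth: set):
    """Precision, recall, and F1 between predicted and ground-truth CAMEO code sets."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with light smoothing, e.g. over the binary Conflict/Cooperation split."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Example with first-level codes: truth {"04", "05"}, prediction {"04", "19"}.
print(set_prf1({"04", "19"}, {"04", "05"}))   # (0.5, 0.5, 0.5)
print(kl_divergence([0.7, 0.3], [0.5, 0.5]))  # discrepancy between class distributions
```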
Experiments and Findings:
- Agent Performance:
- Temporal event forecasting in Mirai is challenging; the best agent (GPT-4o with full API access) achieved only a 29.6 F1 score on second-level relation prediction.
- Predicting fine-grained (second-level) relations is harder than predicting first-level relations. Long-term forecasting (a larger horizon l) also significantly degrades performance.
- Tool-use (ReAct agents) significantly outperforms non-tool-use baselines (Direct IO, ZS-CoT), highlighting the need for grounding forecasts in retrieved data.
- Access to both structured event data and textual news data yields the best results, though news data alone performs poorly, potentially due to noise and long context issues.
- Base LLMs and Action Types:
- GPT-4o consistently outperformed GPT-4-Turbo, GPT-3.5-Turbo, and Mistral-7B.
- The "Code Block" action type improved performance for GPT-4 models but hurt performance for GPT-3.5 and Mistral-7B, indicating that effective use of this flexible but complex action space requires strong code generation capabilities.
- Code execution errors (e.g., invalid dates, invalid attributes) were frequent, especially for weaker models like Mistral-7B. GPT-4o exhibited significantly fewer errors.
- Analysis:
- Self-Consistency: Applying self-consistency sampling significantly boosted the performance of Mistral-7B, showing potential for inference-time search methods (see the aggregation sketch after this list).
- Temporal Distance: Performance degrades as the forecasting horizon l increases (tested for l = 1, 7, 30, and 90 days). Long-term forecasting (30/90 days) poses a greater challenge.
- Relation Types: Agents performed better at predicting "verbal cooperation" (more frequent) and "material conflict" (more persistent) events compared to the more abrupt and less predictable "material cooperation" and "verbal conflict" events.
- Tool-Use Ordering: Analysis of GPT-4o's action sequences revealed common patterns (e.g., starting with `get_relation_distribution` or `get_event`, and ending with `browse_news_article`). Strategic sequences (e.g., `get_news_articles` followed by `browse_news_article`) led to better outcomes, emphasizing the importance of planning in tool use.
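For the self-consistency result above, the following sketch shows one plausible aggregation scheme, keeping relation codes that recur across several sampled forecasts; the paper's exact voting rule is an assumption here.

```python
from collections import Counter

def self_consistent_relations(sampled_predictions, min_votes):
    """Hedged sketch of self-consistency: sample several forecasts from the
    agent and keep the relation codes that appear in at least `min_votes`
    samples. The paper's exact aggregation rule may differ."""
    votes = Counter(code for sample in sampled_predictions for code in set(sample))
    return sorted(code for code, n in votes.items() if n >= min_votes)

# Five sampled forecasts for one query; codes appearing in >= 3 samples survive.
samples = [{"04", "05"}, {"04"}, {"04", "19"}, {"04", "05"}, {"05"}]
print(self_consistent_relations(samples, min_votes=3))  # ['04', '05']
```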
Conclusion:
Mirai provides a challenging benchmark for evaluating LLM agents on temporal event forecasting. Current agents struggle, especially with fine-grained, long-term predictions and complex code generation for tool use. The benchmark highlights the need for improvements in temporal reasoning, robust tool use, and strategic planning for LLM agents. The authors provide the dataset, code, and an interactive demo to facilitate further research.