Large Event Models (LEMs) Overview
- Large Event Models (LEMs) are structured computational systems designed to capture and represent complex real-world events through schema induction, integrating techniques from language models and probabilistic reasoning.
- LEMs employ methodologies such as zero-shot schema generation, one-pass transformer models, and neurosymbolic integration to generate hierarchical, logic-grounded event representations.
- They are applied across diverse domains—from narrative understanding and human mobility forecasting to sensor-based activity detection—while addressing challenges like hallucination and evaluation misalignment.
Large Event Models (LEMs) are structured computational systems designed to capture, represent, generalize, and reason about the complex architectures of real-world events. Integrating techniques from large language models (LLMs), probabilistic graphical models, structured knowledge representation, and innovative data acquisition methods, LEMs enable automated schema induction, event prediction, and high-level inference that align with both symbolic knowledge and latent statistical regularities. LEMs have been realized in diverse domains, including commonsense narrative understanding, human mobility forecasting under public events, sensor-based activity abstraction, and large-scale event forecasting. This article presents a detailed review of methodologies, representational choices, training paradigms, and limitations that define the state of the art in LEM research.
1. Schema Induction and Logical Representation
A foundational principle for LEMs is schema induction: the discovery and formalization of structured templates describing the core constituents, participants, and temporal or causal relations of complex events. Early systems such as NESL ("Mining Logical Event Schemas From Pre-Trained LLMs" (Lawley et al., 2022)) demonstrated a modular pipeline integrating:
- Prompt-based sampling from an LLM (e.g., GPT-J 6B), generating “situation samples” or short narratives,
- FrameNet-driven neural semantic parsing (e.g., LOME) to extract event frames,
- Mapping into Episodic Logic (EL): an event representation in which an episode $E$ is characterized by a formula $\phi$, written $[\phi \,{**}\, E]$, supporting first-class reasoning over temporal, causal, and participant roles,
- Bootstrapped protoschemas that serve as seed behaviors for generalized event schema construction,
- Vector embedding and clustering of logical schemata via aggregation over argument embeddings, e.g., $v(S) = \tfrac{1}{|\mathrm{args}(S)|} \sum_{a \in \mathrm{args}(S)} \mathrm{emb}(a)$, where $\mathrm{emb}(\cdot)$ denotes a word embedding and the aggregation is an elementwise mean over the schema's arguments (a minimal sketch appears at the end of this subsection).
This pipeline enables moving beyond surface-level event tuples, yielding hierarchically organized and logic-grounded event schemas that support inference over temporal and causal structures.
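To make the clustering step concrete, the following is a minimal sketch (not the NESL implementation) of argument-averaged schema embedding followed by agglomerative clustering; the `embed` lookup and the distance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # requires scikit-learn >= 1.2 for `metric`

def schema_embedding(schema_args, embed):
    """Embed a logical schema as the elementwise mean of its argument word embeddings."""
    return np.stack([embed(a) for a in schema_args]).mean(axis=0)

def cluster_schemas(schemas, embed, distance_threshold=0.5):
    """Group near-duplicate schemas so they can be merged into generalized event schemas."""
    X = np.stack([schema_embedding(args, embed) for args in schemas])
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold,
        metric="cosine", linkage="average",
    )
    return clustering.fit_predict(X)

# Hypothetical usage with random vectors standing in for pretrained word embeddings.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in ["dog", "eat", "food", "cat", "drink", "water"]}
schemas = [["dog", "eat", "food"], ["cat", "eat", "food"], ["cat", "drink", "water"]]
labels = cluster_schemas(schemas, vocab.__getitem__)
```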
2. Zero-Shot and Retrieval-Augmented Schema Generation
Recent approaches have shifted toward “on-the-fly” and zero-shot schema induction, removing dependencies on manually curated corpora or ontologies. In "Zero-Shot On-the-Fly Event Schema Induction" (Dror et al., 2022), a framework is introduced in which:
- Large pretrained LMs (e.g., GPT-3) are prompted with diverse templates (“Write a headline…”, “List the steps for…”) to produce textual accounts of a given event type.
- A multi-stage IE pipeline employing SRL, NER, coreference resolution, and constituency parsing extracts events, arguments, and relation candidates.
- Efficient “One-Pass” transformers (BigBird backbone) replace traditional pairwise relation classifiers, computing all temporal and hierarchical relations in a single forward pass: the document is encoded once, contextualized representations $h_{e_i}$ are read off at event trigger positions, and each pair $(e_i, e_j)$ is classified via an MLP over the concatenation $[h_{e_i}; h_{e_j}]$, with cross-entropy loss over all event pairs (a minimal sketch follows this list).
- Schematic graphs are constructed by aggregating events, chaining temporal tuples, incorporating hierarchical organization, and reconciling conflicting orderings using logical operators (e.g., AND/OR).
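The one-pass design can be sketched as follows; this is a simplified illustration, assuming a Hugging Face-style long-document encoder (e.g., BigBird) and an equal number of event triggers per document, neither of which is guaranteed by the original system:

```python
import torch
import torch.nn as nn

class OnePassRelationClassifier(nn.Module):
    """Classify temporal/hierarchical relations for all event pairs from one encoder pass."""

    def __init__(self, encoder, hidden=768, num_relations=4):
        super().__init__()
        self.encoder = encoder  # assumed: BigBird/Longformer-style encoder with .last_hidden_state
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_relations)
        )

    def forward(self, input_ids, attention_mask, trigger_positions):
        # Single forward pass over the full document.
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Contextualized event representations gathered at trigger token positions:
        # (batch, n_events, hidden), assuming the same n_events per document in the batch.
        ev = torch.stack([h[i, pos] for i, pos in enumerate(trigger_positions)])
        n = ev.size(1)
        left = ev.unsqueeze(2).expand(-1, n, n, -1)   # e_i broadcast over columns
        right = ev.unsqueeze(1).expand(-1, n, n, -1)  # e_j broadcast over rows
        # Shared MLP scores every ordered pair; train with cross-entropy over all pairs.
        return self.mlp(torch.cat([left, right], dim=-1))  # (batch, n, n, num_relations)
```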
Empirical validation against human-curated schemas demonstrates not only competitive coverage and relational fidelity but also surprising completeness and generality in the induced schema graphs. Efficient extraction of event relations via the One-Pass paradigm enables scalable schema generation for unseen event types and rapid expansion of coverage.
3. Event Sequence Modeling and Probabilistic Structures
LEMs are increasingly coupled with probabilistic frameworks that facilitate event prediction, pattern mining, and scenario simulation. In "Distilling Event Sequence Knowledge From LLMs" (Wadhwa et al., 14 Jan 2024):
- Sequence generation is formalized as autoregressive sampling of the next event conditioned on the events generated so far, $e_{t+1} \sim P_{\mathrm{LLM}}(\cdot \mid e_1, \dots, e_t)$, using iterative in-context few-shot prompting guided by a knowledge graph (KG) of event concepts and partial causal relations.
- LLMs are restricted to a fixed KG vocabulary, and prompts are constructed based on causal relations (e.g., "What usually follows an earthquake?").
- The resulting sequences are mined for frequent causal patterns (e.g., "Famine → Refugee Crisis → PTSD") using classic mining algorithms (GSP, SPADE).
- Downstream, probabilistic sequence models, specifically summary Markov models (SuMMs) such as Binary and Ordinal SuMMs, estimate the likelihood of the next event given its recent history, conditioning either on the set of influencing events present in that history (Binary SuMM) or additionally on their order (Ordinal SuMM); a minimal sketch appears below.
This distillation of latent sequence knowledge fills gaps in event KGs and supports predictive analytics in domains requiring inference over event chains, such as forecasting in finance or healthcare.
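As referenced above, the Binary SuMM idea can be illustrated with a small frequency-based estimator; the sequences and the influencing set below are hypothetical, and the real models are fit with more care (smoothing, influencing-set selection):

```python
from collections import defaultdict

def fit_binary_summ(sequences, target, influencing_set, window=5):
    """Estimate P(next event == target | which influencing events occur in the last `window` steps).
    Only set membership matters (Binary SuMM); an Ordinal SuMM would also key on their order."""
    counts = defaultdict(lambda: [0, 0])  # history signature -> [times target followed, total]
    for seq in sequences:
        for t in range(1, len(seq)):
            history = set(seq[max(0, t - window):t])
            signature = frozenset(e for e in influencing_set if e in history)
            counts[signature][1] += 1
            counts[signature][0] += int(seq[t] == target)
    return {sig: hits / total for sig, (hits, total) in counts.items() if total}

# Hypothetical LLM-distilled sequences over a fixed KG vocabulary:
seqs = [["Earthquake", "Famine", "Refugee Crisis", "PTSD"],
        ["Flood", "Famine", "Refugee Crisis"],
        ["Earthquake", "Aid Delivery"]]
model = fit_binary_summ(seqs, target="Refugee Crisis", influencing_set={"Famine", "Earthquake"})
```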
4. Extraction Accuracy, Hallucination, and Semantic Evaluation
As LEMs scale, evaluation criteria and extraction robustness are central. Token-level exact match is recognized as a poor proxy for semantic correctness, especially given the varied phrasings produced by generative models. The RAEE framework ("Beyond Exact Match: Semantically Reassessing Event Extraction by LLMs" (Lu et al., 12 Oct 2024)) proposes:
- LLM-judged, chain-of-thought–prompted semantic evaluation, determining if predicted event triggers/arguments are semantically equivalent to gold labels, regardless of lexical/exact span match.
- Adaptive criteria embedded in prompts: e.g., accepting core-word matches, coreferent mentions, or contextually valid alternative spans.
- RAEE’s metrics: precision $P = |C_{\mathrm{pred}}| / N_{\mathrm{pred}}$ and recall $R = |C_{\mathrm{gold}}| / N_{\mathrm{gold}}$, with $F_1$ their harmonic mean, where $|C_{\mathrm{pred}}|$ and $|C_{\mathrm{gold}}|$ count predictions and gold items judged semantically correct rather than exact-span matched (a minimal sketch follows this list).
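As noted above, the scoring step can be sketched as follows; it assumes the LLM judge has already returned per-item semantic-equivalence decisions, and the prompt that elicits those decisions is not shown:

```python
def raee_scores(pred_matched, gold_matched):
    """Compute RAEE-style precision/recall/F1 from semantic-equivalence judgments.

    pred_matched: list of booleans, one per prediction (judge found an equivalent gold item).
    gold_matched: list of booleans, one per gold item (judge found a covering prediction).
    """
    precision = sum(pred_matched) / len(pred_matched) if pred_matched else 0.0
    recall = sum(gold_matched) / len(gold_matched) if gold_matched else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical judge output for 4 predicted and 3 gold arguments:
print(raee_scores([True, True, False, True], [True, True, False]))
```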
Experiments indicate that EM metrics significantly underestimate LEM performance—especially for LLMs—and that RAEE scores align more closely with human judgments. Fine-grained error analyses (e.g., frequent WrongType errors) reveal systemic challenges in span typing and argument classification.
Mitigating hallucination in event extraction has motivated decomposed pipelines ("Decompose, Enrich, and Extract! Schema-aware Event Extraction using LLMs" (Shiri et al., 3 Jun 2024)), where Event Detection (ED) and Event Argument Extraction (EAE) are sequentially solved with tailored, retrieval-augmented prompts, reducing context confusion and hallucination rates compared to monolithic prompting.
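A schematic of such a decomposed, retrieval-augmented pipeline is sketched below; the prompt wording, the `llm` callable, and the `retrieve_schema` helper are illustrative assumptions rather than the paper's exact interface:

```python
def extract_events(document, llm, retrieve_schema):
    """Two-stage extraction: detect events first, then extract arguments per event."""
    # Stage 1: Event Detection (ED) -- triggers and event types only, to limit context confusion.
    ed_prompt = (
        "Identify every event trigger in the text and its event type.\n"
        f"Text: {document}\nAnswer as 'trigger -> type' lines:"
    )
    results = []
    for line in llm(ed_prompt).splitlines():
        if " -> " not in line:
            continue
        trigger, _, event_type = line.partition(" -> ")
        # Stage 2: Event Argument Extraction (EAE) -- one focused, schema-grounded prompt per event.
        schema = retrieve_schema(event_type.strip())  # retrieval-augmented role definitions
        eae_prompt = (
            f"Event type: {event_type}\nRoles: {schema}\n"
            f"Text: {document}\nTrigger: {trigger}\n"
            "Fill each role with a span from the text, or 'None'."
        )
        results.append({"trigger": trigger.strip(), "type": event_type.strip(),
                        "arguments": llm(eae_prompt)})
    return results
```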
5. Integration with Multimodal and Real-World Data
LEMs have demonstrated practical utility beyond text, including sensor-driven environments and human mobility analysis. In "LLM-based event abstraction and integration for IoT-sourced logs" (Shirali et al., 5 Sep 2024):
- Binary sensor readings are mapped to high-level activity labels via few-shot chain-of-thought prompting, with the LLM classifying state deltas (e.g., a binary sensor toggling from 0 to 1) as activities or null events (see the sketch after this list).
- Multi-modality integration is achieved by sequentially batching sensor outputs (e.g., ambient, wristband, smartphone) and aligning them into a single, process-mining–ready event log.
- Alignment with ground-truth event logs demonstrates high accuracy (up to 94% EDA alignment for certain days), and the approach generalizes well to real-time and scalable settings given appropriate prompt engineering.
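As referenced in the first item above, the abstraction step can be sketched as follows; the prompt text, the `llm` callable, and the delta encoding are hypothetical stand-ins for the paper's prompt engineering:

```python
def abstract_sensor_events(state_deltas, llm, demonstrations):
    """Map low-level binary sensor state changes to high-level activities (or drop them as null)."""
    events = []
    for delta in state_deltas:  # e.g., {"sensor": "kitchen_motion", "from": 0, "to": 1, "ts": "07:32"}
        prompt = (
            "You label smart-home sensor changes as activities.\n"
            f"Examples:\n{demonstrations}\n"              # few-shot chain-of-thought demonstrations
            f"Sensor change: {delta}\n"
            "Think step by step, then answer with one activity label or NULL."
        )
        lines = llm(prompt).strip().splitlines()
        label = lines[-1].strip() if lines else "NULL"    # keep only the final answer line
        if label != "NULL":
            events.append({"timestamp": delta["ts"], "activity": label})
    return events  # ready to merge with other modalities into a process-mining-ready event log
```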
Similarly, in human mobility forecasting under public events ("Exploring LLMs for Human Mobility Prediction under Public Events" (Liang et al., 2023), "Event-aware analysis of cross-city visitor flows using LLMs and social media data" (Wang et al., 5 May 2025)), LLM pipelines extract structured event information and online popularity metrics (e.g., overall, promotional, and word-of-mouth social-media popularity), which serve as predictors in rolling GBDT models achieving R² above 0.85 for daily visitor flows. This event-aware modeling enables granular policy guidance for transportation and public event management.
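A simplified sketch of the event-aware forecasting setup follows; the feature names, the one-step rolling split, and the default GBDT hyperparameters are illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def rolling_event_aware_forecast(df: pd.DataFrame, horizon: int = 7):
    """Predict the last `horizon` days one step at a time, refitting on all earlier days.

    `df` holds daily rows with lagged visitor counts plus LLM-extracted event features
    (event held, overall / promotional / word-of-mouth popularity) and a `visitors` target.
    """
    features = ["visitors_lag1", "visitors_lag7", "event_held",
                "overall_popularity", "promo_popularity", "wom_popularity"]
    preds = []
    for cutoff in range(len(df) - horizon, len(df)):
        train, test = df.iloc[:cutoff], df.iloc[cutoff:cutoff + 1]
        model = GradientBoostingRegressor().fit(train[features], train["visitors"])
        preds.append(float(model.predict(test[features])[0]))
    return preds
```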
6. Event Reasoning, Symbolic Synergy, and Future Forecasting
LEMs have evolved from descriptive schema induction to support complex event reasoning, counterfactual inference, and probabilistic forecasting:
- Schema-level event graphs $G_s = (V_s, R_s)$, with instance-level graphs $G_e = (V_e, R_e)$, support reasoning over {Causes, IsResult, Before, After, IsSubevent, HasSubevent} relations (Tao et al., 26 Apr 2024). Two core paradigms, Contextual Event Classification (CEC) and Contextualized Relation Reasoning (CRR), are formalized over these graphs: CEC judges whether a candidate event holds under a given context and relation, while CRR predicts the relation type between two contextualized events. Evaluations show moderate instance-level event reasoning by LLMs (e.g., ~63% accuracy for GPT-4), with significant imbalances across relation types and improved performance when guided by explicated schema information (Direct Guidance, Chain-of-Thought Guidance).
- Synergistic neurosymbolic paradigms emerge in "Structured Event Reasoning with LLMs" (Zhang, 28 Aug 2024), combining:
- Language-based LEMs (fine-tuned on sub-event relations and temporal order, as in wikiHow-derived datasets, with likelihood-change scores that compare an outcome's probability with and without an intervening event; a minimal sketch follows this list),
- Semi-symbolic LEMs (predicting entity states via few-shot prompting, e.g., resolving whether an event is possible based on the predicted states of the entities it involves),
- Fully symbolic LEMs (generating PDDL code from text, enabling use of external planners and verifiable world models).
- Empirically, these structured approaches consistently outperform end-to-end LLMs, with higher accuracy and improved interpretability on domain-relevant benchmarks (e.g., CREPE macro F1, planning task solution rates).
- In event forecasting (e.g., "Advancing Event Forecasting through Massive Training of LLMs: Challenges, Solutions, and Broader Impacts" (Lee et al., 25 Jul 2025)), the focus is on scaling training with market/public/crawled datasets, resolving issues of noisy/sparse outcomes, knowledge cut-off, and reward drift. Hypothetical event Bayesian networks, counterfactual scenarios, auxiliary subquestion rewards, and nontrivial benchmarking (including KL-regularization and process-guided RL) are key research directions. Societal impacts include improved decision support, policy analysis, and algorithmic trading, contingent on robust probabilistic calibration and reasoning capabilities.
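As referenced in the language-based item above, likelihood-change scoring can be sketched with any LM that exposes conditional log-probabilities; the `loglikelihood(prefix, continuation)` helper below is an assumed interface, not a specific library call:

```python
def likelihood_change(loglikelihood, context, event, outcome):
    """Score how an intervening event shifts the plausibility of an outcome.

    `loglikelihood(prefix, continuation)` is assumed to return log P(continuation | prefix)
    under the language model. A positive return value means the event makes the outcome
    more likely; a negative value means it makes it less likely.
    """
    with_event = loglikelihood(f"{context} {event}", outcome)
    without_event = loglikelihood(context, outcome)
    return with_event - without_event

# Hypothetical usage:
# delta = likelihood_change(lm_ll, "Tom filled a pot with water.",
#                           "He put the pot on the stove.", "The water boiled.")
```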
7. Applications and Limitations
LEMs have found application in numerous domains:
- Soccer analytics: sequential event prediction and match simulation using WyScout data ("Estimating Player Performance in Different Contexts Using Fine-tuned Large Events Models" (Mendes-Neves et al., 9 Feb 2024); "Forecasting Events in Soccer Matches Through Language" (Mendes-Neves et al., 9 Feb 2024)), enabling context-dependent evaluation of player transfers and tactical adjustments, and event-driven analytic pipelines built on sequence models that parallel LLM generation frameworks (a minimal sketch follows this list).
- Biomedical and BCI settings: large EEG models fine-tuned for stress detection (LaBraM, arXiv:2505.23042) have demonstrated high balanced accuracy (up to 90.47%) and robust transfer from pretraining to real-world data, underscoring a paradigm shift from model-centric to data-centric BCI design.
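As referenced in the soccer item above, the sequence-modeling view can be sketched as an autoregressive rollout; the event tokenization, the `next_event_dist` model interface, and the terminal token are illustrative assumptions:

```python
import random

def simulate_match(next_event_dist, start_events, max_events=200):
    """Roll out a match by repeatedly sampling the next event, analogous to LLM text generation.

    `next_event_dist(history)` is assumed to return {event_token: probability} from a model
    fine-tuned on event sequences (e.g., WyScout-style pass/shot/foul tokens).
    """
    history = list(start_events)  # e.g., ["kickoff_home", "pass_home_midfield", ...]
    while len(history) < max_events and history[-1] != "full_time":
        dist = next_event_dist(history)
        tokens, probs = zip(*dist.items())
        history.append(random.choices(tokens, weights=probs, k=1)[0])
    return history  # a simulated event sequence for context-dependent what-if analysis
```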
Key limitations include hallucination in generative extraction, misalignment between LEM inference behavior and human commonsense, overfitting or overconfidence in forecasting when trained with simple reward structures, and domain-specific evaluation gaps when token-level metrics are misapplied.
In summary, Large Event Models (LEMs) synthesize schema-based induction, probabilistic modeling, reasoning, and efficient data-driven extraction to support interpretable, scalable, and high-fidelity modeling of complex events. Advances in decomposition, retrieval-augmented prompting, semantic evaluation, and symbolic integration continue to define the trajectory of this field, as do efforts to overcome inherent limitations in calibration, representation, and ground-truth alignment. These systems represent a convergence of symbolic logic, neural sequence learning, and knowledge representation, establishing a toolkit for robust event understanding and predictive reasoning in AI.