Large Emotional World Model (LEWM)
- Large Emotional World Model (LEWM) is a framework that integrates both physical and emotional state transitions for sequential decision-making.
- The model employs a two-stage factorization and coupled transition heads to predict emotion-driven changes alongside physical world dynamics.
- Empirical results on the Emotion-Why-How (EWH) dataset demonstrate that incorporating explicit emotional cues improves prediction stability and social reasoning accuracy.
Large Emotional World Model (LEWM) is a modeling framework that extends conventional world models by systematically incorporating emotion as a core explanatory and predictive factor in sequential decision-making environments. Unlike standard models that prioritize physical-world regularities, LEWM models transitions jointly over world states and emotional states, enabling high-fidelity prediction of both objective and subjective social behaviors across complex multimodal scenes (Song et al., 30 Dec 2025).
1. Problem Formulation and Mathematical Framework
The LEWM formalism augments the standard agent–world interaction loop. At discrete time $t$, the agent's observation $s_t$ (video, audio, image), action $a_t$ (natural language), and emotional state $e_t$ (a compact embedding derived from facial expression and context) jointly mediate the transition to the next world and emotional states, governed by the joint distribution

$$p(s_{t+1}, e_{t+1} \mid s_t, a_t, e_t).$$

LEWM factorizes this conditional joint over transitions as:

$$p(s_{t+1}, e_{t+1} \mid s_t, a_t, e_t) = p(e_{t+1} \mid h_t)\; p(s_{t+1} \mid e_{t+1}, h_t),$$

with $h_t$ the fused latent conditioning vector encoding $(s_t, a_t, e_t)$. This two-stage factorization models emotion as an explicit modulator of world-state progression, capturing theory-of-mind intuitions central to social reasoning.
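Read as a generative procedure, the factorization prescribes ancestral sampling: draw the next emotion first, then the next world state conditioned on it. A minimal sketch in PyTorch, where `emotion_head` and `state_head` are hypothetical placeholder modules rather than the paper's implementation:

```python
import torch
from torch.distributions import Categorical

def sample_transition(h_t, emotion_head, state_head):
    """Ancestral sampling from p(e_{t+1} | h_t) * p(s_{t+1} | e_{t+1}, h_t).
    `emotion_head` and `state_head` are hypothetical callables, e.g. MLPs."""
    emo_logits = emotion_head(h_t)                       # stage 1: p(e_{t+1} | h_t)
    e_next = Categorical(logits=emo_logits).sample()     # draw a discrete emotion
    e_onehot = torch.nn.functional.one_hot(
        e_next, num_classes=emo_logits.shape[-1]).float()
    z_next = state_head(torch.cat([h_t, e_onehot], -1))  # stage 2: condition on e_{t+1}
    return e_next, z_next
```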
2. Emotion-Why-How (EWH) Dataset Construction
LEWM leverages the Emotion-Why-How (EWH) dataset, constructed from a large corpus of real-world multimodal social scenes including movie clips, TV shows, and first-person recordings. Data annotation circumvents manual labeling by utilizing a pretrained large multimodal model (LMM) to:
- Identify segments where emotion causally drives behavior.
- Generate natural-language behavioral descriptions ($a_t$).
- Infer the emotional state ($e_t$) using semantic and facial-expression cues.
The state $s_t$ aggregates synchronized key frames ($o^v_t$), audio ($o^a_t$), and image ($o^i_t$). Each tuple is constructed as $(s_t, a_t, e_t, s_{t+1}, e_{t+1})$, with causal annotations:
- Why: a natural-language rationale for the action, linking the emotional state $e_t$ to the behavior $a_t$.
- How: a natural-language account of the mechanism by which the emotion transitions from $e_t$ to $e_{t+1}$.
Emotion categories span the canonical discrete set: joy, sadness, anger, fear, surprise, and disgust. The dataset comprises a large collection of such tuples across diverse social contexts, capturing both the causality behind actions ("why") and the mechanism of emotional transitions ("how").
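As an illustration of the tuple schema, one EWH record might be represented as below; the field names are readability assumptions, not the dataset's actual keys:

```python
from dataclasses import dataclass

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]

@dataclass
class EWHTuple:
    """One EWH sample: multimodal state, action, emotion, successors,
    plus the 'why'/'how' causal annotations. Field names are illustrative."""
    key_frames: list      # o^v_t: synchronized video key frames
    audio: bytes          # o^a_t: audio segment
    image: bytes          # o^i_t: representative image
    action: str           # a_t: natural-language behavioral description
    emotion: str          # e_t: one of EMOTIONS
    next_emotion: str     # e_{t+1}
    why: str              # causal rationale: emotion -> action
    how: str              # mechanism of the e_t -> e_{t+1} transition
```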
3. Model Architecture and Module Design
LEWM consists of the following principal components:
- Visual Encoder ($E_s$): processes multimodal input via
  - a CNN/transformer for images,
  - a 1D-CNN or transformer for audio,
  - a 3D-CNN or ViT for video frames. Outputs are pooled into a latent vector $z_t$.
- Action Encoder ($E_a$): encodes natural-language descriptions using a transformer or Bi-LSTM into a vector $u_t$.
- Emotion Encoder ($E_e$): integrates the emotion label $e_t$ and a facial-expression embedding via a two-layer MLP to produce $m_t$.
- Coupled Transition Heads:
  - Emotion Transition Head ($f_e$): $\hat{e}_{t+1} = f_e(h_t)$.
  - State Transition Head ($f_s$): $\hat{z}_{t+1} = f_s(h_t, \hat{e}_{t+1})$.
- State Decoder ($D$): $\hat{s}_{t+1} = D(\hat{z}_{t+1})$.
Attention layers fuse the per-modality embeddings into the conditioning vector $h_t = \mathrm{Fuse}(z_t, u_t, m_t)$. The architecture avoids explicit recurrence, instead stacking single-step predictions for sequential rollouts (see the sketch below).
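A minimal single-step forward pass, assuming simple linear stand-ins for the encoders above and a one-layer self-attention fusion; all dimensions and module choices here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LEWMStep(nn.Module):
    """One LEWM transition step: encode, fuse via attention, then apply the
    coupled heads f_e and f_s and decode. Dimensions are illustrative."""
    def __init__(self, d=256, n_emotions=6):
        super().__init__()
        self.E_s = nn.Linear(1024, d)   # stand-in visual encoder -> z_t
        self.E_a = nn.Linear(768, d)    # stand-in action encoder -> u_t
        self.E_e = nn.Sequential(nn.Linear(n_emotions + 128, d), nn.ReLU(),
                                 nn.Linear(d, d))  # two-layer MLP -> m_t
        self.fuse = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.f_e = nn.Linear(d, n_emotions)              # emotion transition head
        self.f_s = nn.Linear(d + n_emotions, d)          # state transition head
        self.D = nn.Linear(d, 1024)                      # decoder to obs features

    def forward(self, s_feat, a_feat, e_feat):
        tokens = torch.stack([self.E_s(s_feat), self.E_a(a_feat),
                              self.E_e(e_feat)], dim=1)      # (B, 3, d)
        fused, _ = self.fuse(tokens, tokens, tokens)         # attention fusion
        h_t = fused.mean(dim=1)                              # pooled vector h_t
        emo_logits = self.f_e(h_t)                           # \hat e_{t+1}
        e_next = torch.softmax(emo_logits, dim=-1)
        z_next = self.f_s(torch.cat([h_t, e_next], dim=-1))  # \hat z_{t+1}
        return emo_logits, z_next, self.D(z_next)            # \hat s_{t+1} features
```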
4. Training Objectives and Loss Structure
Training minimizes a composite objective integrating reconstruction fidelity and emotional accuracy:
- State Reconstruction Loss:
$$\mathcal{L}_{\mathrm{state}} = \big\| \psi(s_{t+1}) - \psi(\hat{s}_{t+1}) \big\|_2^2,$$
where $\psi$ is a learned feature mapping (e.g., via a visual backbone).
- Emotion Prediction Loss:
$$\mathcal{L}_{\mathrm{emo}} = \mathrm{CE}\big(e_{t+1}, \hat{e}_{t+1}\big)$$
(cross-entropy over the discrete emotion classes).
Combined joint loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{state}} + \lambda\, \mathcal{L}_{\mathrm{emo}},$$
with $\lambda$ controlling the relative task weight.
Additionally, an emotion-consistency regularizer,
$$\mathcal{L}_{\mathrm{cons}} = \big\| f_s(h_t, \hat{e}_{t+1}) - f_s(h_t, \hat{e}_{t+1} + \delta) \big\|_2^2$$
(where $\delta$ is a small perturbation), discourages spurious world-state changes under minor emotional shifts. Final objective:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{state}} + \lambda\, \mathcal{L}_{\mathrm{emo}} + \beta\, \mathcal{L}_{\mathrm{cons}}$$
($\beta$ regulates the strength of state–emotion coupling).
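A sketch of the composite loss, assuming the single-step module interface from Section 3, a frozen feature extractor for $\psi$, and placeholder hyperparameter values:

```python
import torch
import torch.nn.functional as F

def lewm_loss(model, psi, h_t, s_next, e_next_label, lam=0.5, beta=0.1, eps=0.01):
    """Composite objective: state reconstruction + emotion CE + consistency.
    `model` exposes f_e, f_s, D as in the architecture sketch; `psi` is a
    frozen feature mapping applied to observation features."""
    emo_logits = model.f_e(h_t)
    e_soft = torch.softmax(emo_logits, dim=-1)
    z_next = model.f_s(torch.cat([h_t, e_soft], -1))
    s_pred = model.D(z_next)

    loss_state = F.mse_loss(psi(s_pred), psi(s_next))      # ||psi(s) - psi(s_hat)||^2
    loss_emo = F.cross_entropy(emo_logits, e_next_label)   # CE over discrete classes

    delta = eps * torch.randn_like(e_soft)                 # small emotional perturbation
    z_pert = model.f_s(torch.cat([h_t, e_soft + delta], -1))
    loss_cons = F.mse_loss(z_pert, z_next)                 # consistency regularizer

    return loss_state + lam * loss_emo + beta * loss_cons
```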
5. Inference, Rollouts, and Evaluation Procedure
Inference with LEWM follows an autoregressive schema. Given an initial pair $(s_0, e_0)$ and an action sequence $\{a_t\}_{t=0}^{T-1}$, the rollout for $t = 0$ to $T-1$ entails (see the sketch after this list):
- Encode $h_t = \mathrm{Fuse}\big(E_s(s_t), E_a(a_t), E_e(e_t)\big)$.
- Predict the emotion $\hat{e}_{t+1} = f_e(h_t)$.
- Predict the world-state latent $\hat{z}_{t+1} = f_s(h_t, \hat{e}_{t+1})$.
- Decode the multimodal output $\hat{s}_{t+1} = D(\hat{z}_{t+1})$.
- Set $(s_{t+1}, e_{t+1}) \leftarrow (\hat{s}_{t+1}, \hat{e}_{t+1})$ and continue.
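A compact rollout loop under the same assumed interfaces (`model` implements the single-step module from Section 3; feature extraction from raw media and the re-embedding of the predicted emotion are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rollout(model, s0_feat, e0_feat, action_feats, n_emotions=6, face_dim=128):
    """Autoregressive LEWM rollout: each prediction is fed back as the next
    input. All *_feat tensors are pre-extracted features."""
    s_feat, e_feat = s0_feat, e0_feat
    trajectory = []
    for a_feat in action_feats:                          # one step per action a_t
        emo_logits, z_next, s_next = model(s_feat, a_feat, e_feat)
        e_next = emo_logits.argmax(-1)                   # discrete \hat e_{t+1}
        trajectory.append((s_next, e_next))
        s_feat = s_next                                  # \hat s_{t+1} -> s_{t+1}
        onehot = F.one_hot(e_next, n_emotions).float()
        # assumed emotion-input format: label one-hot + zeroed face embedding
        e_feat = torch.cat([onehot, torch.zeros(*onehot.shape[:-1], face_dim)], -1)
    return trajectory
```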
This yields predicted audiovisual–emotional trajectories. Evaluation benchmarks include:
| Task Type | Metric | Observed Impact (LEWM) |
|---|---|---|
| Emotion-driven prediction | Emotion accuracy / macro-F1 | +6% accuracy, +4% macro-F1 over SoTA |
| World-model rollout | MSE, next-sentence accuracy | Within 1% accuracy and 0.5% rollout error of SoTA |
Ablation experiments on MELD, HellaSwag, and MMLU demonstrate that removing emotional cues degrades subjective accuracy by up to 8–10% and objective accuracy by 1–3%, establishing emotion as a systematic modulator across reasoning domains.
6. Key Results, Insights, and Comparative Analysis
Empirical findings from model ablations and qualitative analysis show:
- Emotion-aware transition factorization improves social-behavior prediction by 4–6%.
- The emotion-consistency regularizer ($\mathcal{L}_{\mathrm{cons}}$) enhances rollout stability and visual coherence.
- Purely physics-based models fail to predict affect-driven scenarios (e.g., impulsive spending under sadness, affiliative gestures under joy), whereas LEWM robustly forecasts both physical and emotional world transitions.
A plausible implication is that integrating affective states at the core of world modeling brings predictions closer to actual human reasoning, particularly in social and psychologically complex environments.
7. Limitations, Open Challenges, and Future Prospects
Several open issues remain:
- EWH dataset construction relies on weak supervision via a pretrained LMM, limiting the granularity of emotion annotation (intensity and mixed emotions are not covered).
- Long-horizon planning and interactive evaluation (human-in-the-loop) are underdeveloped.
- Proposed future directions include integrating LLM-based world knowledge, modeling affect with richer valence/arousal dimensions, and enabling real-time adaptation for interactive agents.
This suggests that extending LEWM for fine-grained emotional representations and interactive applications may yield further gains in social reasoning fidelity. Overall, LEWM rigorously demonstrates that emotion, when explicitly modeled alongside physical dynamics, enables superior prediction and understanding of both what unfolds in the world and how subjective states evolve (Song et al., 30 Dec 2025).