Large Emotional World Model (LEWM)

Updated 4 January 2026
  • Large Emotional World Model (LEWM) is a framework that integrates both physical and emotional state transitions for sequential decision-making.
  • The model employs a two-stage factorization and coupled transition heads to predict emotion-driven changes alongside physical world dynamics.
  • Empirical results on the EWH dataset demonstrate that incorporating explicit emotional cues improves prediction stability and social reasoning accuracy.

Large Emotional World Model (LEWM) is a modeling framework that extends conventional world models by systematically incorporating emotion as a core explanatory and predictive factor in sequential decision-making environments. Unlike standard models that prioritize physical-world regularities, LEWM represents state transitions as involving both world states and emotional states, enabling high-fidelity prediction of both objective and subjective social behaviors across complex multimodal scenes (Song et al., 30 Dec 2025).

1. Problem Formulation and Mathematical Framework

The LEWM formalism augments the standard agent-world interaction loop. At discrete time $t$, the agent's observation $s_t \in S$ (video, audio, image), action $a_t \in A$ (natural language), and emotional state $e_t \in E$ (a compact embedding derived from facial expression and context) jointly mediate the transition to the next states $(s_{t+1}, e_{t+1})$, governed by:

$$(s_{t+1}, e_{t+1}) = f_\theta(s_t, a_t, e_t)$$

LEWM factorizes the conditional joint over transitions as:

$$p(s_{t+1}, e_{t+1} \mid s_t, a_t, e_t) = p(e_{t+1} \mid h_t) \cdot p(s_{t+1} \mid h_t, e_{t+1})$$

with $h_t = [\mathrm{Enc}_S(s_t); \mathrm{Enc}_A(a_t); \mathrm{Enc}_E(e_t)]$ as the fused latent conditioning vector. This two-stage factorization models emotion as an explicit modulator of world-state progression, capturing theory-of-mind intuitions central to social reasoning.
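
To make the factorization concrete, the sketch below (PyTorch-style) first samples the next emotion from $p(e_{t+1} \mid h_t)$ and then conditions the world-state prediction on it. The linear heads, dimensions, and one-hot conditioning are illustrative assumptions, not details given by the paper.

```python
# Minimal sketch of LEWM's two-stage factorization (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

h_dim, n_emotions, z_dim = 768, 6, 512
emotion_head = nn.Linear(h_dim, n_emotions)        # models p(e_{t+1} | h_t)
state_head = nn.Linear(h_dim + n_emotions, z_dim)  # models p(s_{t+1} | h_t, e_{t+1})

h_t = torch.randn(1, h_dim)                        # fused conditioning vector
e_probs = emotion_head(h_t).softmax(dim=-1)
e_next = torch.distributions.Categorical(e_probs).sample()  # stage 1: draw e_{t+1}
e_onehot = F.one_hot(e_next, n_emotions).float()
z_next = state_head(torch.cat([h_t, e_onehot], dim=-1))     # stage 2: next-state latent
```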

2. Emotion-Why-How (EWH) Dataset Construction

LEWM leverages the Emotion-Why-How (EWH) dataset, constructed from a large corpus of real-world multimodal social scenes including movie clips, TV shows, and first-person recordings. Annotation avoids manual labeling by using a pretrained large multimodal model (LMM) to:

  1. Identify segments where emotion causally drives behavior.
  2. Generate natural-language behavioral descriptions ($A$).
  3. Infer emotional state ($E_t$) using semantic and facial-expression cues.

State $S_t$ aggregates synchronized key frames ($V_t$), audio ($A_t$), and image ($I_t$). Each tuple is constructed as $\langle (S_0, E_0), A, (S_1, E_1) \rangle$, with causal annotations:

  • Why: $S_0 \rightarrow A \rightarrow S_1$
  • How: $S_0 \rightarrow E_0 \rightarrow A \rightarrow E_1 \rightarrow S_1$

Emotion categories span the canonical discrete set: joy, sadness, anger, fear, surprise, disgust. The dataset comprises $10^5$–$10^6$ tuples across diverse social contexts, capturing both the causality behind actions ("why") and the mechanism of emotional transitions ("how").
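
A hypothetical schema for a single EWH tuple is sketched below; the field names and types are assumptions for exposition, since the summary does not specify the dataset's storage format.

```python
# Illustrative schema for one EWH tuple <(S0, E0), A, (S1, E1)> (assumed fields).
from dataclasses import dataclass

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]

@dataclass
class State:
    key_frames: list[str]  # V_t: paths to synchronized key frames
    audio: str             # A_t: path to the audio segment
    image: str             # I_t: path to a representative image

@dataclass
class EWHTuple:
    s0: State              # initial multimodal state S_0
    e0: str                # initial emotion E_0, one of EMOTIONS
    action: str            # A: natural-language behavioral description
    s1: State              # resulting state S_1
    e1: str                # resulting emotion E_1
    why: str               # causal chain S_0 -> A -> S_1
    how: str               # mechanism S_0 -> E_0 -> A -> E_1 -> S_1
```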

3. Model Architecture and Module Design

LEWM consists of four principal components:

  1. Visual Encoder ($\mathrm{Enc}_S$): Processes multimodal input via
    • CNN/transformer for images,
    • 1D-CNN or transformer for audio,
    • 3D-CNN or ViT for video frames. Outputs are pooled into a latent vector $z_t$.
  2. Action Encoder ($\mathrm{Enc}_A$): Encodes natural-language descriptions using a transformer or Bi-LSTM into vector $a_t$.
  3. Emotion Encoder ($\mathrm{Enc}_E$): Integrates the emotion label and facial-expression embedding via a two-layer MLP to produce $e_t$.
  4. Coupled Transition Heads:
    • Emotion Transition Head ($\mathcal{T}_E$): $h_t \mapsto \hat{e}_{t+1}$.
    • State Transition Head ($\mathcal{T}_S$): $[h_t; \hat{e}_{t+1}] \mapsto \hat{z}_{t+1}$.
    • State Decoder ($\mathrm{Dec}_S$): $\hat{z}_{t+1} \mapsto \hat{S}_{t+1}$.

Attention layers facilitate multimodal fusion. The architecture avoids explicit recurrence, instead stacking predictions for sequential rollouts.
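
A compact sketch of how these components could compose, treating each encoder output as a token and using a single attention layer for fusion, is given below. The encoders are passed in as arbitrary modules; all dimensions, the head count, and the mean-pooling step are assumptions rather than the paper's specification.

```python
# Sketch of LEWM's module composition (PyTorch); interfaces are assumed.
import torch
import torch.nn as nn

class LEWM(nn.Module):
    def __init__(self, enc_s, enc_a, enc_e, d=512, e_classes=6):
        super().__init__()
        self.enc_s, self.enc_a, self.enc_e = enc_s, enc_a, enc_e
        # Attention layer for multimodal fusion over the three modality tokens
        self.fuse = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.t_e = nn.Linear(d, e_classes)       # T_E: h_t -> emotion logits
        self.t_s = nn.Linear(d + e_classes, d)   # T_S: [h_t; e_hat] -> z_hat
        self.dec_s = nn.Linear(d, d)             # Dec_S: stub decoder to S_hat

    def forward(self, s_t, a_t, e_t):
        toks = torch.stack(
            [self.enc_s(s_t), self.enc_a(a_t), self.enc_e(e_t)], dim=1)  # (B, 3, d)
        fused, _ = self.fuse(toks, toks, toks)
        h_t = fused.mean(dim=1)                              # pooled h_t
        e_hat = self.t_e(h_t)                                # \hat{e}_{t+1} logits
        z_hat = self.t_s(torch.cat([h_t, e_hat.softmax(-1)], dim=-1))
        return e_hat, z_hat, self.dec_s(z_hat)               # + \hat{S}_{t+1}
```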

4. Training Objectives and Loss Structure

Training minimizes a composite objective integrating reconstruction fidelity and emotional accuracy:

  • State Reconstruction Loss:

$$\mathcal{L}_{\mathrm{state}} = \|\Phi(S_{t+1}) - \Phi(\hat{S}_{t+1})\|^2$$

where $\Phi(\cdot)$ is a learned feature mapping (e.g., via a visual backbone).

  • Emotion Prediction Loss:

$$\mathcal{L}_{\mathrm{emotion}} = -\sum_c \mathbb{1}[E_{t+1} = c] \log p(\hat{e}_{t+1} = c)$$

(cross-entropy over discrete classes).

Combined joint loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{state}} + \lambda\, \mathcal{L}_{\mathrm{emotion}}$$

with $\lambda$ controlling the task weighting.
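
Assuming $\Phi$ is a frozen feature extractor and the emotion head outputs class logits, the composite loss reduces to a few lines; the names below are illustrative.

```python
# Sketch of the composite LEWM training loss (assumed interfaces).
import torch.nn.functional as F

def lewm_loss(phi, s_pred, s_true, e_logits, e_true, lam=1.0):
    # L_state: feature-space reconstruction error under Phi
    l_state = F.mse_loss(phi(s_pred), phi(s_true))
    # L_emotion: cross-entropy over the six discrete emotion classes
    l_emotion = F.cross_entropy(e_logits, e_true)
    return l_state + lam * l_emotion   # L = L_state + lambda * L_emotion
```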

Additionally, an emotion-consistency regularizer,

$$\mathcal{C} = \|\mathcal{T}_S(h_t, \hat{e}_{t+1} + \delta) - \mathcal{T}_S(h_t, \hat{e}_{t+1})\|$$

(where $\delta$ is a small perturbation), discourages spurious world-state changes under minor emotional shifts. Final objective:

$$\min\; \mathcal{L} + \beta\, \mathcal{C}$$

(with $\beta$ regulating the strength of state–emotion coupling).
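
A minimal sketch of the regularizer, assuming `t_s` is the state transition head acting on a concatenated input and `sigma` is an assumed perturbation scale:

```python
# Sketch of the emotion-consistency regularizer C (assumed interfaces).
import torch

def consistency_reg(t_s, h_t, e_hat, sigma=0.01):
    delta = sigma * torch.randn_like(e_hat)          # small perturbation delta
    z = t_s(torch.cat([h_t, e_hat], dim=-1))         # T_S(h_t, e_hat)
    z_pert = t_s(torch.cat([h_t, e_hat + delta], dim=-1))
    return (z_pert - z).norm(dim=-1).mean()          # ||T_S(h, e+d) - T_S(h, e)||

# Combined objective: loss = lewm_loss(...) + beta * consistency_reg(...)
```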

5. Inference, Rollouts, and Evaluation Procedure

Inference with LEWM follows an autoregressive schema. Given an initial $(S_0, E_0)$ and an action sequence $\{A_0, \dots, A_{k-1}\}$, progression for $t = 0$ to $k-1$ entails:

  1. Encode $h_t = [\mathrm{Enc}_S(S_t); \mathrm{Enc}_A(A_t); \mathrm{Enc}_E(E_t)]$.
  2. Predict emotion $\hat{e}_{t+1} = \mathcal{T}_E(h_t)$.
  3. Predict world-state latent $\hat{z}_{t+1} = \mathcal{T}_S(h_t, \hat{e}_{t+1})$.
  4. Decode multimodal output $\hat{S}_{t+1} = \mathrm{Dec}_S(\hat{z}_{t+1})$.
  5. Set $(S_{t+1}, E_{t+1}) = (\hat{S}_{t+1}, \hat{e}_{t+1})$.
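
These five steps translate directly into a rollout loop. The sketch below assumes the model exposes the encoders, fusion, and heads named in Section 3 as callables; this interface is illustrative rather than the paper's actual API.

```python
# Minimal autoregressive rollout sketch following steps 1-5 above.
def rollout(model, s0, e0, actions):
    traj, (s_t, e_t) = [], (s0, e0)
    for a_t in actions:
        h_t = model.fuse(model.enc_s(s_t), model.enc_a(a_t), model.enc_e(e_t))
        e_next = model.t_e(h_t)            # step 2: \hat{e}_{t+1}
        z_next = model.t_s(h_t, e_next)    # step 3: \hat{z}_{t+1}
        s_next = model.dec_s(z_next)       # step 4: \hat{S}_{t+1}
        traj.append((s_next, e_next))
        s_t, e_t = s_next, e_next          # step 5: feed predictions back
    return traj
```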

This yields predicted audiovisual–emotional trajectories. Evaluation benchmarks include:

| Task Type | Metric | Observed Impact (LEWM) |
|---|---|---|
| Emotion-driven prediction | Emotion accuracy / macro-F1 | +6% accuracy, +4% macro-F1 over SoTA |
| World-model rollout | MSE, next-sentence accuracy | within ±1% accuracy and ±0.5% rollout error of SoTA |

Ablation experiments on MELD, HellaSwag, and MMLU demonstrate that removing emotional cues degrades subjective accuracy by up to 8–10% and objective accuracy by 1–3%, establishing emotion as a systematic modulator across reasoning domains.

6. Key Results, Insights, and Comparative Analysis

Empirical findings from model ablations and qualitative analysis show:

  • Emotion-aware transition factorization improves social-behavior prediction by 4–6%.
  • Emotion-consistency regularization ($\mathcal{C}$) enhances rollout stability and visual coherence.
  • Purely physics-based models fail to predict affect-driven scenarios (e.g., impulsive spending under sadness, affiliative gestures under joy), whereas LEWM robustly forecasts both physical and emotional world transitions.

A plausible implication is that integrating affective states at the core of world modeling brings predictions closer to actual human reasoning, particularly in social and psychologically complex environments.

7. Limitations, Open Challenges, and Future Prospects

Several open issues remain:

  • EWH dataset construction relies on weak supervision from an LMM, limiting the granularity of emotion annotation (intensity and mixed emotions are not covered).
  • Long-horizon planning and interactive evaluation (human-in-the-loop) are underdeveloped.
  • Future directions proposed include integration with LLM-based world knowledge, affective modeling using richer valence/arousal dimensions, and enabling real-time adaptation for interactive agents.

This suggests that extending LEWM for fine-grained emotional representations and interactive applications may yield further gains in social reasoning fidelity. Overall, LEWM rigorously demonstrates that emotion, when explicitly modeled alongside physical dynamics, enables superior prediction and understanding of both what unfolds in the world and how subjective states evolve (Song et al., 30 Dec 2025).
