Large Emotional World Model (LEWM)

Updated 4 January 2026
  • Large Emotional World Model (LEWM) is a framework that integrates both physical and emotional state transitions for sequential decision-making.
  • The model employs a two-stage factorization and coupled transition heads to predict emotion-driven changes alongside physical world dynamics.
  • Empirical results on the EWH dataset demonstrate that incorporating explicit emotional cues improves prediction stability and social reasoning accuracy.

Large Emotional World Model (LEWM) is a modeling framework that extends conventional world models by systematically incorporating emotion as a core explanatory and predictive factor in sequential decision-making environments. Unlike standard models that prioritize physical-world regularities, LEWM represents state transitions as involving both world states and emotional states, enabling high-fidelity prediction of both objective and subjective social behaviors across complex multimodal scenes (Song et al., 30 Dec 2025).

1. Problem Formulation and Mathematical Framework

The LEWM formalism augments the standard agent-world interaction loop. At discrete time $t$, the agent's observation $s_t \in S$ (video, audio, image), action $a_t \in A$ (natural language), and emotional state $e_t \in E$ (a compact embedding derived from facial expression and context) jointly mediate the transition to the next states $(s_{t+1}, e_{t+1})$, governed by:

$$(s_{t+1}, e_{t+1}) = f_\theta(s_t, a_t, e_t)$$

LEWM factorizes the conditional joint over transitions as:

$$p(s_{t+1}, e_{t+1} \mid s_t, a_t, e_t) = p(e_{t+1} \mid h_t) \cdot p(s_{t+1} \mid h_t, e_{t+1})$$

with $h_t = [\mathrm{Enc}_S(s_t); \mathrm{Enc}_A(a_t); \mathrm{Enc}_E(e_t)]$ as the fused latent conditioning vector. This two-stage factorization models emotion as an explicit modulator of world-state progression, capturing theory-of-mind intuitions central to social reasoning.
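
To make the factorization concrete, the sketch below (PyTorch-style) first samples the next emotion from $p(e_{t+1} \mid h_t)$ and then conditions the world-state prediction on it. The linear heads, dimensions, and one-hot conditioning are illustrative assumptions, not details given by the paper.

```python
# Minimal sketch of LEWM's two-stage factorization (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

h_dim, n_emotions, z_dim = 768, 6, 512
emotion_head = nn.Linear(h_dim, n_emotions)        # models p(e_{t+1} | h_t)
state_head = nn.Linear(h_dim + n_emotions, z_dim)  # models p(s_{t+1} | h_t, e_{t+1})

h_t = torch.randn(1, h_dim)                        # fused conditioning vector
e_probs = emotion_head(h_t).softmax(dim=-1)
e_next = torch.distributions.Categorical(e_probs).sample()  # stage 1: draw e_{t+1}
e_onehot = F.one_hot(e_next, n_emotions).float()
z_next = state_head(torch.cat([h_t, e_onehot], dim=-1))     # stage 2: next-state latent
```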

2. Emotion-Why-How (EWH) Dataset Construction

LEWM leverages the Emotion-Why-How (EWH) dataset, constructed from a large corpus of real-world multimodal social scenes including movie clips, TV shows, and first-person recordings. Annotation avoids manual labeling by using a pretrained large multimodal model (LMM) to:

  1. Identify segments where emotion causally drives behavior.
  2. Generate natural-language behavioral descriptions ($A$).
  3. Infer emotional state ($E_t$) using semantic and facial-expression cues.

State $S_t$ aggregates synchronized key frames ($V_t$), audio ($A_t$), and image ($I_t$). Each tuple is constructed as $\langle (S_0, E_0), A, (S_1, E_1) \rangle$, with causal annotations:

  • Why: $S_0 \rightarrow A \rightarrow S_1$
  • How: $S_0 \rightarrow E_0 \rightarrow A \rightarrow E_1 \rightarrow S_1$

Emotion categories span the canonical discrete set: joy, sadness, anger, fear, surprise, disgust. The dataset comprises $10^5$–$10^6$ tuples across diverse social contexts, capturing both the causality behind actions ("why") and the mechanism of emotional transitions ("how").
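
A hypothetical schema for a single EWH tuple is sketched below; the field names and types are assumptions for exposition, since the summary does not specify the dataset's storage format.

```python
# Illustrative schema for one EWH tuple <(S0, E0), A, (S1, E1)> (assumed fields).
from dataclasses import dataclass

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]

@dataclass
class State:
    key_frames: list[str]  # V_t: paths to synchronized key frames
    audio: str             # A_t: path to the audio segment
    image: str             # I_t: path to a representative image

@dataclass
class EWHTuple:
    s0: State              # initial multimodal state S_0
    e0: str                # initial emotion E_0, one of EMOTIONS
    action: str            # A: natural-language behavioral description
    s1: State              # resulting state S_1
    e1: str                # resulting emotion E_1
    why: str               # causal chain S_0 -> A -> S_1
    how: str               # mechanism S_0 -> E_0 -> A -> E_1 -> S_1
```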

3. Model Architecture and Module Design

LEWM consists of four principal components:

  1. Visual Encoder ($\mathrm{Enc}_S$): Processes multimodal input via
    • CNN/transformer for images,
    • 1D-CNN or transformer for audio,
    • 3D-CNN or ViT for video frames. Outputs are pooled into a latent vector $z_t$.
  2. Action Encoder ($\mathrm{Enc}_A$): Encodes natural-language descriptions using a transformer or Bi-LSTM into vector $a_t$.
  3. Emotion Encoder ($\mathrm{Enc}_E$): Integrates the emotion label and facial-expression embedding via a two-layer MLP to produce $e_t$.
  4. Coupled Transition Heads:
    • Emotion Transition Head ($\mathcal{T}_E$): $h_t \mapsto \hat{e}_{t+1}$.
    • State Transition Head ($\mathcal{T}_S$): $[h_t; \hat{e}_{t+1}] \mapsto \hat{z}_{t+1}$.
    • State Decoder ($\mathrm{Dec}_S$): $\hat{z}_{t+1} \mapsto \hat{S}_{t+1}$.

Attention layers facilitate multimodal fusion. The architecture avoids explicit recurrence, instead stacking predictions for sequential rollouts.
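
A compact sketch of how these components could compose, treating each encoder output as a token and using a single attention layer for fusion, is given below. The encoders are passed in as arbitrary modules; all dimensions, the head count, and the mean-pooling step are assumptions rather than the paper's specification.

```python
# Sketch of LEWM's module composition (PyTorch); interfaces are assumed.
import torch
import torch.nn as nn

class LEWM(nn.Module):
    def __init__(self, enc_s, enc_a, enc_e, d=512, e_classes=6):
        super().__init__()
        self.enc_s, self.enc_a, self.enc_e = enc_s, enc_a, enc_e
        # Attention layer for multimodal fusion over the three modality tokens
        self.fuse = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.t_e = nn.Linear(d, e_classes)       # T_E: h_t -> emotion logits
        self.t_s = nn.Linear(d + e_classes, d)   # T_S: [h_t; e_hat] -> z_hat
        self.dec_s = nn.Linear(d, d)             # Dec_S: stub decoder to S_hat

    def forward(self, s_t, a_t, e_t):
        toks = torch.stack(
            [self.enc_s(s_t), self.enc_a(a_t), self.enc_e(e_t)], dim=1)  # (B, 3, d)
        fused, _ = self.fuse(toks, toks, toks)
        h_t = fused.mean(dim=1)                              # pooled h_t
        e_hat = self.t_e(h_t)                                # \hat{e}_{t+1} logits
        z_hat = self.t_s(torch.cat([h_t, e_hat.softmax(-1)], dim=-1))
        return e_hat, z_hat, self.dec_s(z_hat)               # + \hat{S}_{t+1}
```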

4. Training Objectives and Loss Structure

Training minimizes a composite objective integrating reconstruction fidelity and emotional accuracy:

  • State Reconstruction Loss:

$$\mathcal{L}_{\mathrm{state}} = \|\Phi(S_{t+1}) - \Phi(\hat{S}_{t+1})\|^2$$

where $\Phi(\cdot)$ is a learned feature mapping (e.g., via a visual backbone).

  • Emotion Prediction Loss:

$$\mathcal{L}_{\mathrm{emotion}} = -\sum_c \mathbb{1}[E_{t+1} = c] \log p(\hat{e}_{t+1} = c)$$

(cross-entropy over discrete classes).

Combined joint loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{state}} + \lambda\, \mathcal{L}_{\mathrm{emotion}}$$

with $\lambda$ controlling the task weighting.
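
Assuming $\Phi$ is a frozen feature extractor and the emotion head outputs class logits, the composite loss reduces to a few lines; the names below are illustrative.

```python
# Sketch of the composite LEWM training loss (assumed interfaces).
import torch.nn.functional as F

def lewm_loss(phi, s_pred, s_true, e_logits, e_true, lam=1.0):
    # L_state: feature-space reconstruction error under Phi
    l_state = F.mse_loss(phi(s_pred), phi(s_true))
    # L_emotion: cross-entropy over the six discrete emotion classes
    l_emotion = F.cross_entropy(e_logits, e_true)
    return l_state + lam * l_emotion   # L = L_state + lambda * L_emotion
```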

Additionally, an emotion-consistency regularizer,

$$\mathcal{C} = \|\mathcal{T}_S(h_t, \hat{e}_{t+1} + \delta) - \mathcal{T}_S(h_t, \hat{e}_{t+1})\|$$

(where $\delta$ is a small perturbation), discourages spurious world-state changes under minor emotional shifts. Final objective:

$$\min\; \mathcal{L} + \beta\, \mathcal{C}$$

(with $\beta$ regulating the strength of state–emotion coupling).
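
A minimal sketch of the regularizer, assuming `t_s` is the state transition head acting on a concatenated input and `sigma` is an assumed perturbation scale:

```python
# Sketch of the emotion-consistency regularizer C (assumed interfaces).
import torch

def consistency_reg(t_s, h_t, e_hat, sigma=0.01):
    delta = sigma * torch.randn_like(e_hat)          # small perturbation delta
    z = t_s(torch.cat([h_t, e_hat], dim=-1))         # T_S(h_t, e_hat)
    z_pert = t_s(torch.cat([h_t, e_hat + delta], dim=-1))
    return (z_pert - z).norm(dim=-1).mean()          # ||T_S(h, e+d) - T_S(h, e)||

# Combined objective: loss = lewm_loss(...) + beta * consistency_reg(...)
```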

5. Inference, Rollouts, and Evaluation Procedure

Inference with LEWM follows an autoregressive schema. Given an initial $(S_0, E_0)$ and an action sequence $\{A_0, \dots, A_{k-1}\}$, progression for $t = 0$ to $k-1$ entails:

  1. Encode $h_t = [\mathrm{Enc}_S(S_t); \mathrm{Enc}_A(A_t); \mathrm{Enc}_E(E_t)]$.
  2. Predict emotion $\hat{e}_{t+1} = \mathcal{T}_E(h_t)$.
  3. Predict world-state latent $\hat{z}_{t+1} = \mathcal{T}_S(h_t, \hat{e}_{t+1})$.
  4. Decode multimodal output $\hat{S}_{t+1} = \mathrm{Dec}_S(\hat{z}_{t+1})$.
  5. Set $(S_{t+1}, E_{t+1}) = (\hat{S}_{t+1}, \hat{e}_{t+1})$.
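
These five steps translate directly into a rollout loop. The sketch below assumes the model exposes the encoders, fusion, and heads named in Section 3 as callables; this interface is illustrative rather than the paper's actual API.

```python
# Minimal autoregressive rollout sketch following steps 1-5 above.
def rollout(model, s0, e0, actions):
    traj, (s_t, e_t) = [], (s0, e0)
    for a_t in actions:
        h_t = model.fuse(model.enc_s(s_t), model.enc_a(a_t), model.enc_e(e_t))
        e_next = model.t_e(h_t)            # step 2: \hat{e}_{t+1}
        z_next = model.t_s(h_t, e_next)    # step 3: \hat{z}_{t+1}
        s_next = model.dec_s(z_next)       # step 4: \hat{S}_{t+1}
        traj.append((s_next, e_next))
        s_t, e_t = s_next, e_next          # step 5: feed predictions back
    return traj
```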

This yields predicted audiovisual–emotional trajectories. Evaluation benchmarks include:

| Task Type | Metric | Observed Impact (LEWM) |
|---|---|---|
| Emotion-driven prediction | Emotion accuracy / macro-F1 | +6% accuracy, +4% macro-F1 over SoTA |
| World-model rollout | MSE, next-sentence accuracy | within ±1% accuracy and ±0.5% rollout error of SoTA |

Ablation experiments on MELD, HellaSwag, and MMLU demonstrate that removing emotional cues degrades subjective accuracy by up to 8–10% and objective accuracy by 1–3%, establishing emotion as a systematic modulator across reasoning domains.

6. Key Results, Insights, and Comparative Analysis

Empirical findings from model ablations and qualitative analysis show:

  • Emotion-aware transition factorization improves social-behavior prediction by 4–6%.
  • Emotion-consistency regularization ($\mathcal{C}$) enhances rollout stability and visual coherence.
  • Purely physics-based models fail to predict affect-driven scenarios (e.g., impulsive spending under sadness, affiliative gestures under joy), whereas LEWM robustly forecasts both physical and emotional world transitions.

A plausible implication is that integrating affective states at the core of world modeling brings predictions closer to actual human reasoning, particularly in social and psychologically complex environments.

7. Limitations, Open Challenges, and Future Prospects

Several open issues remain:

  • EWH dataset construction relies on weak supervision from an LMM, limiting the granularity of emotion annotation (intensity and mixed emotions are not covered).
  • Long-horizon planning and interactive evaluation (human-in-the-loop) are underdeveloped.
  • Future directions proposed include integration with LLM-based world knowledge, affective modeling using richer valence/arousal dimensions, and enabling real-time adaptation for interactive agents.

This suggests that extending LEWM for fine-grained emotional representations and interactive applications may yield further gains in social reasoning fidelity. Overall, LEWM rigorously demonstrates that emotion, when explicitly modeled alongside physical dynamics, enables superior prediction and understanding of both what unfolds in the world and how subjective states evolve (Song et al., 30 Dec 2025).
