Situational Evaluation of Social Intelligence (SESI)

Updated 19 November 2025
  • SESI is a comprehensive framework that defines and quantifies social intelligence using dynamic, context-rich scenarios.
  • It employs recursive Bayesian Theory of Mind and multi-order reasoning to model and evaluate adaptive social cognition.
  • SESI drives innovation through interactive tasks, multimodal benchmarks, and precise metrics aligned with psychological and game-theoretic constructs.

The Situational Evaluation of Social Intelligence (SESI) is a research paradigm, theoretical framework, and operational protocol for rigorously assessing social intelligence—both in humans and artificial agents—in contextually rich, dynamic, and interactive scenarios. SESI was devised to overcome the limitations of static, trait-based, or purely academic social reasoning tests by grounding the evaluation in concrete situations that require adaptive, multi-order, and multi-agent social cognition. It prescribes specific task structures, computational models of mental-state reasoning (notably, recursive Bayesian ToM), and quantitative metrics tightly aligned with psychological and game-theoretic constructs. SESI is widely adopted in recent benchmarks tailored for evaluating social intelligence in LLMs, vision-LLMs, and multimodal agents.

1. Theoretical Foundations and Core Constructs

SESI draws from both cognitive science and multi-agent systems theory to formalize social intelligence as the agent’s expected performance across a weighted distribution of social environments populated by other agents of varying policies and abilities. Two intersecting theoretical strands underlie SESI:

  • Multi-Order Theory of Mind (ToM): Social cognition is modeled as a hierarchy of increasingly recursive mental-state attributions—zeroth-order (egocentric planning), first-order (attribution of another’s beliefs or desires), and higher orders (beliefs about beliefs). The N-Minds framework captures this recursive structure and underpins task design for evaluating both basic and advanced ToM competencies (Wang et al., 20 May 2024).
  • Parametric Social Intelligence Definition: The formal SESI score for an agent πᵢ is given as a weighted average reward across sampled environments, line-ups of other agents, and slots:

$$\mathrm{SESI}(\pi_i) = \sum_{\mu \in \mathcal{M}} w_M(\mu) \sum_{i=1}^{N(\mu)} w_S(i,\mu) \sum_{\ell \in L_{-i}(\Pi)} w_I(\ell)\, \overline{R}_i\bigl(\mu[\ell[i \leftarrow \pi_i]]\bigr)$$

Here, the dependencies on chosen environments, team structures, and opposing/cooperating agent mixes are made explicit (Insa-Cabrera et al., 2014, Insa-Cabrera et al., 2012).
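The triple sum above translates directly into code. The following sketch is illustrative only: the weight functions, the `mean_reward` evaluator, and the environment descriptor shape are all placeholder assumptions, not interfaces from the cited papers.

```python
def sesi_score(pi, environments, w_M, w_S, w_I, lineups, mean_reward):
    """Illustrative SESI aggregate following the formula above.

    environments  -- iterable of environment descriptors mu (here assumed to
                     expose the number of evaluable slots via mu["n_slots"])
    w_M, w_S, w_I -- weight functions over environments, slots, and line-ups
    lineups(i, mu) -- line-ups of the remaining agents for slot i
    mean_reward(mu, l, i, pi) -- average reward of pi when inserted into
                                 slot i of line-up l within environment mu
    """
    total = 0.0
    for mu in environments:
        for i in range(mu["n_slots"]):
            for l in lineups(i, mu):
                total += (w_M(mu) * w_S(i, mu) * w_I(l)
                          * mean_reward(mu, l, i, pi))
    return total
```

In practice `mean_reward` would itself average over many sampled episodes; the weights determine how heavily complex, competitive, or cooperative settings count toward the final score.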

  • Fundamental SESI Properties: The quality of a social intelligence test is assessed via formal properties: action-dependence (AD), reward-dependence (RD), slot-reward-dependence (SRD), competitive anticipation (AComp), and cooperative anticipation (ACoop). These ensure sensitivity to social context and interaction (Insa-Cabrera et al., 2014).

2. Task Designs and Operationalizations

SESI provides the scaffolding for a spectrum of evaluation protocols. The most influential instantiations are:

  • Inverse Reasoning (IR): The observer must infer the actor’s preferences based on observed behavior in a partially observable environment.
  • Inverse–Inverse Planning (IIP): The actor selects actions not only to achieve its own goal but to guide the observer toward a correct inference of its intent, necessitating second-order ToM-level planning (Wang et al., 20 May 2024).
  • Multi-Turn Dialogue and Social Scenario MCQs: Real-world, crowd-validated scenarios are cast as multiple-choice questions probing five subcapacities—empathy, social cognition, self-presentation, influence, and concern—with answer selection guided by social consensus rather than pure logic (Xu et al., 11 Mar 2024).
  • Multimodal and Situated Tasks: Recent SESI-aligned benchmarks utilize video clips (SIV-Bench) or interactive grid-worlds (ToM-SSI) to force integration of visual, spatial, and textual cues and to stress-test context tracking, belief updating, and intention attribution across multiple actors and shifting knowledge partitions (Kong et al., 5 Jun 2025, Bortoletto et al., 5 Sep 2025).
  • Multiagent, Cooperative-Competitive Decision Processes: The Darwin–Wallace multi-generational formalism is adopted to generate tasks with controlled distributions over agent abilities, team structures, and evolving social complexity, extending applicability to reinforcement-learning and model-based agents (Insa-Cabrera et al., 2012).
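The IR/IIP distinction can be made concrete with a minimal sketch: the observer infers a posterior over the actor's goals (IR), while the IIP actor chooses the action that makes its true goal most legible to that observer. The likelihood model and all names here are illustrative assumptions, not the benchmark's actual implementation.

```python
def observer_posterior(action, goals, prior, likelihood):
    # Inverse reasoning (IR): P(goal | action) ∝ P(action | goal) * P(goal)
    unnorm = {g: likelihood(action, g) * prior[g] for g in goals}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

def iip_action(actions, goals, true_goal, prior, likelihood):
    # Inverse-inverse planning (IIP): pick the action that maximizes the
    # observer's inferred probability of the actor's true goal, i.e. plan
    # against the observer's inference (second-order ToM).
    return max(actions,
               key=lambda a: observer_posterior(a, goals, prior, likelihood)[true_goal])
```

Note the recursion: the IIP actor runs the observer's IR computation inside its own action selection, which is exactly the one-level-deeper mental-state attribution the framework tests for.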

3. Computational and Scoring Frameworks

Advanced SESI implementations employ recursive Bayesian inference to model belief, desire, and intention updates:

  • Recursive Bayesian ToM: Posterior distributions over hidden goals or state variables are updated according to observed actions and prior task structure, with order-k updates enabling alternation between agent and observer perspectives:

$$P^{2k+1}(h \mid \gamma), \qquad P^{2k+2}(\gamma \mid h)$$

The likelihood M(γ, h) integrates urgency decay, cost-sensitivity, and signal intensity functions, specialized for IR and IIP task regimes (Wang et al., 20 May 2024).
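A first-order slice of this recursion is an ordinary sequential Bayesian filter over the hidden variable, applied once per observed action; higher orders iterate the same update from alternating perspectives. The sketch below is a simplified assumption: it uses a single stand-in likelihood `M` and omits the papers' urgency-decay and cost-sensitivity terms.

```python
def normalize(dist):
    z = sum(dist.values())
    return {k: v / z for k, v in dist.items()}

def sequential_posterior(prior, M, observations):
    """Sequential Bayesian update over a hidden variable h:
    after each observation, posterior(h) ∝ M(obs, h) * belief(h).
    This is the building block that the order-(2k+1)/(2k+2) updates
    apply from the observer's and actor's perspectives in turn."""
    belief = dict(prior)
    for obs in observations:
        belief = normalize({h: M(obs, h) * p for h, p in belief.items()})
    return belief
```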

  • Outcome and Process Scoring:
    • Goal Achievement Evaluation (GAE): Fraction of scenarios in which the agent achieves the designated social goal, computed as terminal rewards in an episodic MDP.
    • Interpersonal Ability Evaluation (IAE): Accuracy in selecting utterances or actions representing specified interpersonal skills or social capacities (e.g. empathy, persuasion) (Zhou et al., 1 Jun 2025).
    • Dialogue Process/Reply Quality: Ground-truth–referenced ratings of process reasoning (e.g. motivation, emotional inference, communicative strategy) versus final reply effect, using multi-dimensional rubrics in dialogue benchmarks (Huang et al., 27 Oct 2025).
  • Fine-Grained Metrics: Exact-match accuracy (multiple-choice), Precision/Recall/F₁ (for intention inference tasks), and multi-dimensional statistical correlation with academic intelligence scores are standard performance indices (Xu et al., 11 Mar 2024, Liu et al., 18 Jun 2024).
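The outcome metrics above reduce to standard formulas. A minimal sketch follows; the set-based shape of the intention-inference scoring is an assumption about task format, not a specification from the benchmarks.

```python
def goal_achievement_rate(episode_successes):
    # GAE: fraction of scenarios whose terminal reward marks the social goal met
    return sum(episode_successes) / len(episode_successes)

def exact_match_accuracy(predictions, golds):
    # Multiple-choice scoring: exact match on the selected option
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def precision_recall_f1(predicted, gold):
    # Intention-inference scoring over predicted vs. gold label sets
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```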

4. Experimental Protocols and Representative Benchmarks

SESI’s rigorous situational approach is manifest in several major benchmarks:

| Benchmark | Modality/Context | Core Social Intelligence Facets |
| --- | --- | --- |
| SESI-Han | Grid MDPs | Recursive ToM (orders 0–2), IR/IIP, Bayesian model |
| SESI-Goleman | Reddit MCQs | Empathy, Cognition, Presentation, Influence, Concern |
| SocialIQA | Textual MCQ | Motivation, Emotion, Intention, Preconditions |
| SIV-Bench | Video QA | SSU, SSR, SDP (scene, state, dynamics prediction) |
| SocialEval | Script Trees | GAE, IAE, ability-trait tree traversal |
| ToM-SSI | Grid world (visual) | Multimodal, multi-agent, perspective/belief tracking |
| SI-Bench | Natural dialogue | Contextual reasoning, reply quality, communicative strategy |

Human baselines are established for all domains, and leading LLMs are evaluated in zero-shot and few-shot settings. LLMs consistently fall short of humans on high-order ToM and dynamic social inference: SOTA models are constrained to order-0 reasoning (pattern recognition or cost minimization) in SESI-Han and show marked reply-quality deficits in authentic social dialogues (Wang et al., 20 May 2024, Huang et al., 27 Oct 2025).

5. Key Empirical Findings and Model Limitations

Quantitative analyses reveal:

  • Order Limitation: Even GPT-4 variants consistently fail to exhibit genuine ToM order ≥2, manifesting in an inability to engage in recursive belief reasoning, counterfactual planning, or context-adaptive signaling (Wang et al., 20 May 2024).
  • Pattern Recognition Shortcuts: LLMs, particularly those fine-tuned on SESI-style data, can achieve near-perfect performance in seen trajectory/route types but catastrophically fail on modified or out-of-distribution social scenarios—indicative of superficial pattern matching, not abstract generalization.
  • Error Modes:
  1. Superficial Friendliness: Models default to generically positive, context-insensitive advice, a by-product of RLHF alignment that undercuts nuanced social reasoning (Xu et al., 11 Mar 2024).
  2. Question Sidestepping: Models frequently give plausible answers that address the wrong subcapacity.
  3. Literalism in Dialogue: Over-literal execution of communicative strategies when prompted to do explicit reasoning, leading to decreased reply quality relative to direct humanlike responses (Huang et al., 27 Oct 2025).
  • Prosocial Biases: LLMs preferentially select positive or agreeable behaviors, often to the detriment of achieving hard goals or navigating antagonistic social intent; humans exhibit more flexible polarity adaptation (Zhou et al., 1 Jun 2025).

6. Methodological and Design Properties

SESI evaluation emphasizes critical test properties:

  • Action- and Reward-Dependence: Agents are evaluated on the extent to which their choices and outcomes depend functionally on the presence and policies of others, operationalized by formal dependencies (AD, RD, SRD) (Insa-Cabrera et al., 2014).
  • Discriminability, Validity, and Reliability: Task structures are constructed for fine and coarse grading (pairwise, total, and partial orderings of agent performance) and are subjected to subsampling and repeated-measurement analyses for psychometric robustness.
  • Scenario Diversity: Environment distributions are parameterized for agent number, team arrangement, reward-sharing schemes, complexity, and communication topologies, ensuring coverage from simple dyads to nested, multi-party adversarial and cooperative setups (Insa-Cabrera et al., 2012, Bortoletto et al., 5 Sep 2025).
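As a rough illustration of action- and reward-dependence, one can measure how much an agent's mean return shifts when the co-players' policies change. This is a simplified proxy, not the formal AD/RD definitions of Insa-Cabrera et al.; the function names are hypothetical.

```python
def action_dependence_proxy(mean_return, agent, co_player_lineups):
    """Spread of the agent's mean return across co-player line-ups.

    A value near zero means outcomes barely depend on the other agents
    (the test is not meaningfully social); a large spread indicates a
    genuinely interaction-sensitive test. mean_return(agent, lineup) is
    assumed to already average over enough episodes to be stable.
    """
    returns = [mean_return(agent, lineup) for lineup in co_player_lineups]
    return max(returns) - min(returns)
```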

7. Gaps, Challenges, and Future Directions

Current SESI formulations expose persistent chasms between human and artificial social intelligence:

  • Generalization and Robustness: LLMs lack zero-shot and few-shot flexibility in novel or perturbed social contexts, in contrast to human adaptability (Wang et al., 20 May 2024).
  • Multimodal Integration: Vision-LLMs struggle more with relation inference and social state reasoning than with shallow perceptual or factual tasks, revealing the bottleneck in latent state modeling (Kong et al., 5 Jun 2025).
  • Compositional Social Reasoning: Multi-step belief updating and intention attribution across mixed cooperative/obstructive environments remain unsolved (e.g. marked drop from percept to belief/intention accuracy) (Bortoletto et al., 5 Sep 2025).

Research priorities include extending SESI to encompass:

  • Third- and higher-order ToM tasks,
  • Real-time and interactive multi-agent simulations,
  • Cross-cultural and multilingual scenarios,
  • Finer-grained partial-credit and continuous-outcome metrics,
  • Integration of explicit planning/inference modules into foundation models,
  • Diagnostic protocols for distinguishing genuine reasoning from surface statistical shortcuts,
  • Rapport-building, trust calibration, and adaptive group dynamics over multi-turn interactions (Wang et al., 20 May 2024, Huang et al., 27 Oct 2025).

SESI is thus both a rigorous scientific framework and an evolving experimental toolkit, vital for benchmarking, understanding, and advancing the development of artificial social intelligence.
