
Spatially Situated Social Intelligence Test (S3IT)

Updated 30 December 2025
  • S3IT is a suite of benchmarks that evaluates AI agents' social reasoning and spatial cognition through both 3D embodied seat-ordering and gridworld Theory of Mind tasks.
  • It leverages multi-objective optimization and POMDP frameworks to rigorously quantify agents’ performance across complex social-physical scenarios.
  • Evaluation metrics indicate that while agents may match human conflict response cues, they struggle with spatial grounding and sequential belief updating.

The Spatially Situated Social Intelligence Test (S$^{3}$IT) is a suite of benchmarks designed to evaluate how well artificial agents integrate social reasoning with spatial intelligence in physically and socially complex environments. S$^{3}$IT comprises two principal instantiations: a 3D embodied seat-ordering benchmark for embodied social intelligence (Sun et al., 23 Dec 2025), and a multi-agent gridworld question-answering framework targeting spatially situated Theory of Mind (ToM) reasoning (Bortoletto et al., 5 Sep 2025). Together, they provide methodologies for probing both social cognition and the handling of physical constraints in large language model (LLM) and vision-language model (VLM) agents.

1. Formal Task Definitions and Mathematical Frameworks

(A) Embodied Seat-Ordering as Multi-Objective Optimization

In the 3D seat-ordering S$^{3}$IT benchmark, the core task is to find an injective assignment $A: N \to S$ mapping $n$ non-player characters (NPCs), each with a set of preferences $P_i$, onto $m \geq n$ chairs in a simulated 3D environment (Sun et al., 23 Dec 2025). Preferences are organized in $C$ categories (embodied, social, conflict), and each item $p_{i,j|c}$ has an associated weight $w_{i,j|c} \in \{1,2,3\}$.

Preference satisfaction is formalized as:

$$g_{i,j|c}(A) = \begin{cases} 1 & \text{if } A \text{ satisfies } p_{i,j|c} \\ 0 & \text{otherwise} \end{cases}$$

The raw satisfaction score in category $c$, where $m_{i,c}$ is the number of preferences of NPC $i$ in category $c$, is:

$$s_c(A) = \frac{\sum_{i=1}^n \sum_{j=1}^{m_{i,c}} w_{i,j|c}\; g_{i,j|c}(A)}{\sum_{i=1}^n \sum_{j=1}^{m_{i,c}} w_{i,j|c}}$$

A nonlinear penalty mapping $F: [0,1] \to [0,1]$ penalizes partial satisfaction:

$$F(x) = -10.87 x^5 + 21.99 x^4 - 12.65 x^3 + 2.568 x^2 - 0.045 x$$

The global utility is:

$$U(A) = \sum_{c=1}^C W_c\, F(s_c(A)) \qquad \text{with} \quad W_c = \sum_{i,j} w_{i,j|c}$$

Assignment is subject to spatial and social constraints, including adjacency prohibitions for conflicting pairs and seating geometry. The optimal assignment $A^*$ maximizes $U(A)$ under injectivity and physical feasibility.
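The scoring pipeline above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's implementation: the `(category, weight, predicate)` layout for preferences, the function names, and the toy NPCs and chair indices are all assumptions introduced here. Only the polynomial coefficients and the form of $s_c$ and $U$ come from the text.

```python
from collections import defaultdict


def penalty(x: float) -> float:
    """Penalty polynomial F(x), with the coefficients quoted in the text."""
    return (-10.87 * x**5 + 21.99 * x**4 - 12.65 * x**3
            + 2.568 * x**2 - 0.045 * x)


def is_injective(assignment: dict) -> bool:
    """True if no two NPCs share a chair."""
    seats = list(assignment.values())
    return len(seats) == len(set(seats))


def utility(prefs, assignment) -> float:
    """U(A) = sum_c W_c * F(s_c(A)).

    prefs is a list of (category, weight, predicate) triples (hypothetical
    layout); predicate(assignment) -> bool plays the role of g_{i,j|c}(A).
    """
    satisfied = defaultdict(float)  # weighted satisfied mass per category
    total = defaultdict(float)      # W_c: total weight per category
    for category, weight, predicate in prefs:
        total[category] += weight
        if predicate(assignment):
            satisfied[category] += weight
    return sum(total[c] * penalty(satisfied[c] / total[c]) for c in total)


# Toy instance (assumed): two NPCs, chairs 0..2 in a row, neighbours differ by 1.
prefs = [
    ("embodied", 3, lambda A: A["alice"] == 0),                # specific chair
    ("social", 2, lambda A: abs(A["alice"] - A["bob"]) == 1),  # sit together
]
good = {"alice": 0, "bob": 1}
bad = {"alice": 2, "bob": 0}
print(is_injective(good), round(utility(prefs, good), 3))
print(is_injective(bad), round(utility(prefs, bad), 3))
```

Because $F(1) \approx 0.993$ with the quoted coefficients, a fully satisfied category contributes slightly less than its total weight $W_c$ in this sketch.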

(B) Multi-Agent Gridworld and Partially Observable Markov Decision Process (POMDP) Formalism

The ToM-SSI instantiation of S$^{3}$IT operationalizes multi-agent social cognition as a discrete gridworld POMDP (Bortoletto et al., 5 Sep 2025). Given up to $N=4$ agents in an $M \times M$ grid, the global state is $s \in \mathcal{S} \equiv X \times I$ with:

  • $X = (x_1, y_1, \ldots, x_N, y_N)$, agent positions,
  • $I = (I_{A_1}, \ldots, I_{A_N})$, knowledge sets per agent.

Actions per agent are movement along grid axes ($\{up, down, left, right\}$) or communication ($comm(i)$ for $i \in I_{A_j}$). Deterministic transitions update agent positions and share information with adjacent agents ($L_1$-distance $\leq 1$, including diagonals). Observations $o_j$ reveal the full position map and locally audible communications:

$$O_j(o_j \mid s) = 1 \iff o_j \text{ reports } X \text{ and precisely those communications from neighboring agents}$$

Belief updates are rational but non-probabilistic, assuming agents infer the most probable state compatible with their percepts and known rules.
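The transition rule can be illustrated with a small Python sketch. It is not the ToM-SSI code: the `State` class, the `step` function, and the action encoding are hypothetical. Three assumptions are made explicit: adjacency is read as the 8-neighbourhood (one cell away in any direction, diagonals included), moves that would leave the grid are treated as no-ops, and collisions between agents are not modeled.

```python
from dataclasses import dataclass

# Movement deltas; y grows downward (an assumption of this sketch).
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class State:
    positions: dict  # agent name -> (x, y)
    knowledge: dict  # agent name -> set of information items
    size: int        # the grid is size x size


def adjacent(p, q) -> bool:
    """Assumed reading of adjacency: 8-neighbourhood, diagonals included."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1])) <= 1


def step(state: State, agent: str, action: tuple) -> State:
    """Deterministic transition; action is ("move", direction) or ("comm", item)."""
    positions = dict(state.positions)
    knowledge = {a: set(k) for a, k in state.knowledge.items()}
    kind, arg = action
    if kind == "move":
        dx, dy = MOVES[arg]
        x, y = positions[agent]
        nx, ny = x + dx, y + dy
        if 0 <= nx < state.size and 0 <= ny < state.size:
            positions[agent] = (nx, ny)  # off-grid moves are ignored
    elif kind == "comm":
        if arg not in knowledge[agent]:
            raise ValueError("an agent can only communicate items it knows")
        for other, pos in positions.items():
            if other != agent and adjacent(positions[agent], pos):
                knowledge[other].add(arg)  # only neighbours hear the message
    else:
        raise ValueError(f"unknown action kind: {kind}")
    return State(positions, knowledge, state.size)


s0 = State(
    positions={"A": (0, 0), "B": (1, 1), "C": (3, 3)},
    knowledge={"A": {"i1"}, "B": set(), "C": set()},
    size=4,
)
s1 = step(s0, "A", ("comm", "i1"))
print(s1.knowledge)
```

In this toy run, B hears the message because it sits diagonally next to A, while C does not. Tracking that kind of asymmetric knowledge is what the belief questions probe.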

2. Scenario Generation and Task Variants

Procedurally Extensible 3D Scenes and Dialogue (Embodied Benchmark)

Scenarios are generated by procedurally sampling from a template set of 3D layouts ($T_1$–$T_5$) with varying table/room configurations and seating graphs (Sun et al., 23 Dec 2025). NPC selection leverages a predefined resident world with intricate social graphs and support for 1–5 embodied/social preferences and 0–2 interpersonal conflicts per NPC, drawn to match empirical frequency distributions. For each test instance, a “reverse-engineered” construction guarantees that ground-truth assignments satisfy all derived constraints.

NPC preference profiles are elicited by the test agent (“T-Agent”) via rule-based dialogue, directly querying needs (e.g., “Do you want to be near a window?”) and recording responses and meta-data. Dialogue is subsequently processed by an LLM summarizer, mapping free-form interaction to structured constraints.

Multimodal Gridworld Q&A (ToM-SSI Benchmark)

Each sample is composed of:

  • A rendered or ASCII grid,
  • A natural-language description of context, initial knowledge, agent attitudes, and event,
  • A multiple-choice question (with answer) about a target agent’s percept, belief, or intention.

Five scenario types are implemented: Cooperative Movement–Single Communication (CMSC), Cooperative Movement–Concurrent Communication (CMCC), Probabilistic Cooperative Communication (PCC), Obstructive Communication (OC), and Mixed Cooperative-Obstructive Communication (MC). Group sizes range from dyadic to tetradic interactions. All communication and knowledge update rules strictly adhere to the formal POMDP process.

3. Evaluation Pipelines and Metrics

Embodied Social Optimization

The evaluation in the 3D seat-ordering task utilizes:

  • Category-level satisfaction $s_c(A)$, mapped and weighted into the composite score $U(A)$ (range: 0–100),
  • Prioritization gap (PG): the difference in satisfaction fraction between strong ($w=3$) and weak ($w=1$) preferences:

$$PG = r_{high} - r_{low}$$

where $r_{high}$ and $r_{low}$ are the fractions of satisfied strong and weak preferences, respectively.
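As a worked illustration of the prioritization gap, the following Python sketch computes $PG$ from a list of `(weight, satisfied)` pairs. The input layout and the function name are assumptions of this sketch. Returning a rate of 0.0 when no preference of a given weight exists is also an assumption, since the source does not say how that case is handled.

```python
def prioritization_gap(outcomes) -> float:
    """PG = r_high - r_low for (weight, satisfied) pairs.

    Weight 3 counts as strong and weight 1 as weak; weight-2 items are ignored.
    """
    outcomes = list(outcomes)

    def rate(target_weight: int) -> float:
        hits = [ok for weight, ok in outcomes if weight == target_weight]
        # Assumption: an empty group contributes a rate of 0.0.
        return sum(hits) / len(hits) if hits else 0.0

    return rate(3) - rate(1)


# Two of three strong preferences met, one of two weak preferences met.
example = [(3, True), (3, True), (3, False), (1, True), (1, False), (2, True)]
print(prioritization_gap(example))
```

Here $r_{high} = 2/3$ and $r_{low} = 1/2$, so the gap is $1/6$.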

An iterative “generate-and-reflect” loop guides agent behavior: at each round, a seating proposal $A^{(t)}$ is made, a reflection report $R^{(t)}$ identifies unmet preferences, and the context is refined toward convergence or budget exhaustion.
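The control flow of such a loop can be sketched as follows. This is a schematic under stated assumptions, not the benchmark's agent code: `propose` and `reflect` are hypothetical callables standing in for the model call and the reflection step, and the toy candidates exist only to make the example runnable.

```python
def generate_and_reflect(propose, reflect, max_rounds: int = 3):
    """Alternate proposals A^(t) and reflection reports R^(t).

    propose(context) -> assignment, reflect(assignment) -> list of unmet
    preferences. Stops when the report is empty (convergence) or after
    max_rounds rounds (budget exhaustion).
    """
    context = []       # history of (assignment, report) pairs fed back in
    assignment = None
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        assignment = propose(context)
        report = reflect(assignment)
        if not report:
            break
        context.append((assignment, report))
    return assignment, rounds


# Toy stand-ins (assumed): the proposer walks through a fixed candidate list.
candidates = [{"alice": 1, "bob": 0}, {"alice": 0, "bob": 1}]


def toy_propose(context):
    return candidates[min(len(context), len(candidates) - 1)]


def toy_reflect(assignment):
    return [] if assignment["alice"] == 0 else ["alice wants chair 0"]


final, used = generate_and_reflect(toy_propose, toy_reflect)
print(final, used)
```

The toy proposer fails once, receives the report through `context`, and converges on the second round.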

Theory of Mind Reasoning

Accuracy is measured for Percept, Belief, Intention, and their conjunctions:

$$Acc_P = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat p_i = p_i], \quad Acc_B = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat b_i = b_i], \quad Acc_I = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat \ell_i = \ell_i]$$

$$Acc_{PB} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat p_i = p_i \wedge \hat b_i = b_i], \quad Acc_{PBI} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat p_i = p_i \wedge \hat b_i = b_i \wedge \hat \ell_i = \ell_i]$$

These metrics are applied to performance on 6,000 questions (balanced across tasks and question types), without a train/validation/test split.
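The individual and conjunctive accuracies can be computed as in the sketch below. The record layout is an assumption made for illustration: each record groups the percept, belief, and intention answers for one sample as `(predicted, gold)` pairs, so the conjunctions can be evaluated per sample.

```python
def tom_accuracies(records) -> dict:
    """Percept/Belief/Intention accuracies and their conjunctions.

    Each record maps "p", "b", "i" to a (predicted, gold) pair
    (hypothetical layout).
    """
    n = len(records)

    def acc(keys) -> float:
        # A record counts only if every listed answer matches its gold label.
        correct = sum(all(r[k][0] == r[k][1] for k in keys) for r in records)
        return correct / n

    return {
        "P": acc(("p",)),
        "B": acc(("b",)),
        "I": acc(("i",)),
        "PB": acc(("p", "b")),
        "PBI": acc(("p", "b", "i")),
    }


records = [
    {"p": ("A", "A"), "b": ("B", "B"), "i": ("C", "C")},  # all correct
    {"p": ("A", "A"), "b": ("B", "B"), "i": ("C", "D")},  # intention wrong
    {"p": ("A", "A"), "b": ("B", "C"), "i": ("C", "C")},  # belief wrong
    {"p": ("A", "B"), "b": ("B", "B"), "i": ("C", "C")},  # percept wrong
]
scores = tom_accuracies(records)
print(scores)
```

Each single-question accuracy in this toy set is 0.75, yet $Acc_{PBI}$ is only 0.25, which shows how the conjunctive metrics penalize inconsistent reasoning chains.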

4. Empirical Findings and Analysis

| Setting | Human Avg. | Best LLM/VLM Avg. | Embodied | Social | Conflict |
|---|---|---|---|---|---|
| 3D Seat-Ordering (Sun et al., 23 Dec 2025) | 84.7 | 47.8 (Gemini-2.5-pro) | 40.6 (Best) | 56.2 (Best) | 85.7 (Best) |
| POMDP/ToM-SSI (Bortoletto et al., 5 Sep 2025) | 73–85% (PBI) | <30% (PBI); o4-mini, Claude 3.5 | | | |

Humans achieve substantially higher PBI/conjunctive scores and balanced satisfaction across preference strengths. LLMs reach near-human conflict satisfaction when cues are explicit, but exhibit severe deficiencies in spatial and embodied constraint satisfaction as well as higher prioritization gaps. Ablation results confirm that providing ground-truth perception drastically reduces the performance gap, indicating spatial grounding as the limiting factor. Reflective iterations improve performance incrementally (3–6 points on average).

In gridworld, model performance sharply declines from Percept to Belief to Intention prediction. Models excel at straightforward adjacency detection, but substantially underperform in tracking hidden knowledge and predicting intention-based planning. CMCC tasks (concurrent communication and second-order inference) are especially challenging (≤5% PBI for models).

Common model failures include disregarding given initial knowledge (Llama-3.2), misencoding grid coordinates (GPT-4o), and overlooking cascading communications in mixed-motive tasks.

5. Limitations and Open Technical Challenges

Both S$^{3}$IT benchmarks are characterized by certain simplifications and limitations:

  • The 3D seat-ordering environment adopts discrete viewpoint sets; continuous 6-DoF exploration is not required.
  • All NPCs are cooperative; adversarial or noisy responses are not modeled.
  • The nonlinear penalty function $F(x)$ is user-designed rather than learned or calibrated.
  • The human baseline in (Sun et al., 23 Dec 2025) is small ($n=3$); test scenarios cover 70 hand-selected cases.
  • In the ToM-SSI gridworld, scenarios lack continuous time, rich perceptual streams, or advanced path planning; only basic cooperation/obstruction motives are modeled, with group size capped at four.

Open problems include:

  • Extending to richer modalities (video, 3D geometry, speech) and dynamic movement planning.
  • Incorporating adversarial social reasoning, memory-limited or noisy observational channels, larger team settings, and multi-objective competition.
  • Scaling up human participation and testbed range.
  • Generalizing task structure (e.g., beyond seats to collaborative object placement).

A plausible implication is that future research must address spatial representation and multi-agent belief inference as primary obstacles to the deployment of genuinely socially intelligent embodied agents.

6. Scientific Significance and Future Research Directions

S$^{3}$IT delivers the first rigorous, large-scale, multimodal benchmarks marrying spatially distributed social cognition with physical environmental constraints, thus addressing foundational gaps in both ToM and embodied AI literature. The frameworks’ explicit multi-objective scoring, procedural task synthesis, and automated evaluation pipelines enable controlled difficulty scaling and systematic ablation analysis.

Future research directions highlighted by these benchmarks include:

  • Interactive evaluation with RL agents trained end-to-end,
  • Realistic physical exploration (continuous environments, active perception),
  • Robust inference under noise and social deception,
  • Direct learning or calibration of satisfaction penalty functions,
  • Expansion to general embodied group collaboration domains beyond seat arrangement.

S$^{3}$IT thus constitutes a critical platform for advancing the frontier of embodied social intelligence in LLM- and VLM-driven agents, systematically exposing bottlenecks in current spatial grounding and ToM generalization capabilities (Sun et al., 23 Dec 2025; Bortoletto et al., 5 Sep 2025).
