Social Robot Navigation via Explainable Interactions

Updated 20 January 2026
  • The paper introduces a multimodal framework that integrates vision-language models, reinforcement learning, and topological planning to deliver real-time explainable robot navigation.
  • It employs specialized datasets and joint loss functions for fine-tuning, resulting in enhanced trust, reduced conflicts, and improved navigation efficiency.
  • The system generates saliency heatmaps and natural language explanations to ensure transparent, socially compliant interactions in diverse, dynamic environments.

Social Robot Navigation via Explainable Interactions (SNEI) unifies autonomous mobile robot navigation with real-time, human-interpretable explanations, leveraging advances in vision-language models (VLMs), reinforcement learning, and multimodal interaction. SNEI operationalizes robot transparency by integrating natural language explanations and visual rationales into the navigation stack, aiming to enhance human trust, predictability, and social compliance in diverse dynamic environments. Recent literature documents its formal architectures, mathematical grounding, user-centered validation, and real-world deployment across multiple paradigms (Sotomi et al., 7 Apr 2025, Kawabata et al., 15 Dec 2025, Girgin et al., 2024).

1. Core Architectural Principles

SNEI is structurally defined by its coupling of perception, explanation, and planning via a multimodal, three-stage pipeline (Sotomi et al., 7 Apr 2025):

  • Perception: The robot ingests sensory data—typically forward-facing RGB images (or combined RGB–LiDAR)—processed through a fine-tuned vision-language model (VLM; e.g., BLIP) to yield dense visual captions ($L$) and spatial saliency heatmaps ($H$) via Grad-CAM. Formally, for image $X$, the caption and heatmap are

L = f_\text{caption}(X), \qquad H_{i,j} = \operatorname{ReLU}\left(\sum_k \alpha_k A_k^{i,j}\right).

  • Multimodal Fusion & Explanation: The tuple $(X, H, L)$ is fused via feature extraction and spatial pooling, e.g., $\psi(X,H) = \operatorname{MLP}(\operatorname{concat}(\operatorname{Pool}(H), \operatorname{ExtractFeatures}(X)))$, and mapped to a concise explanation $E$ by an LLM $\Phi$:

E = \Phi(\psi(X, H), L).

  • Navigation Integration: The socially aware path planner (TEB/ORCA) triggers the explanation pipeline when social conflicts, encoded by $h(P, q_\text{human}^j) \leq 0$, are detected, and dynamically replans while streaming $E$ to the user.

Explanations are generated in real time, aligning the robot’s instantaneous “attention” (via $H$) and decision (via $L$) with concise, context-sensitive summaries (Sotomi et al., 7 Apr 2025).
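The Grad-CAM step of the perception stage can be sketched in NumPy. This is an illustrative reconstruction only: in the actual pipeline the channel weights $\alpha_k$ are derived from gradient pooling and the feature maps $A_k$ come from the VLM's visual encoder; here both are plain arrays.

```python
import numpy as np

def gradcam_heatmap(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Grad-CAM-style saliency: H_ij = ReLU(sum_k alpha_k * A_k[i, j]).

    activations: (K, H, W) feature maps A_k (placeholder for the VLM encoder output).
    weights:     (K,) channel importances alpha_k (gradient-pooled in Grad-CAM).
    """
    # Weighted sum over the channel axis, then ReLU as in the formula above.
    h = np.tensordot(weights, activations, axes=([0], [0]))
    h = np.maximum(h, 0.0)
    # Normalise to [0, 1] so the map can be overlaid on the input image.
    if h.max() > 0:
        h = h / h.max()
    return h
```

Upsampling the resulting low-resolution map to the input image size (omitted here) is what yields the heatmaps streamed to users.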

2. Model Training, Datasets, and Fine-Tuning

SNEI systems are trained on specialized datasets that capture visual, spatial, and social cues in human-robot encounters.

  • Navigation Explanation Dataset: Approximately $5\,000$ images from indoor spaces, each labeled with a human-written caption and an annotated saliency map.
  • Fine-Tuning Procedures:
    • BLIP (VLM): Pretrained on MSCOCO/Flickr30k; further trained for $10$ epochs with the cross-entropy caption loss

    \mathcal{L}_\text{caption} = -\sum_t \log p(y_t \mid y_{<t}, X).

    • Heatmap Branch: Mean squared error against human-annotated saliency maps,

    \mathcal{L}_\text{heatmap} = \| H - H_\text{gt} \|_2^2.

    • Joint Loss:

    \mathcal{L}_\text{total} = \mathcal{L}_\text{caption} + \lambda \mathcal{L}_\text{heatmap}, \quad \lambda = 0.5.

    • LLM ($\Phi$): Prompt-tuned on $2\,000$ joint representations using token-level cross-entropy.

  • SNEI Dataset (Broader Context): $2\,000$ annotated scenarios, $40\,000$ VQA pairs with even splits over perception, prediction, chain-of-thought, action, and explanation; includes scenario metadata (environment type, crowd density, etc.) (Payandeh et al., 2024).

Alternative architectures utilize mixture-of-experts models (e.g., SocialNav-MoE (Kawabata et al., 15 Dec 2025)) and reinforcement fine-tuning pipelines with semantic similarity rewards for efficient, human-compliant navigation.
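The fine-tuning objective above reduces to three small formulas, which can be made concrete in a minimal NumPy sketch (illustrative only, not the actual training code; the caption term assumes per-step probabilities of the ground-truth tokens are already available):

```python
import numpy as np

def caption_loss(probs: np.ndarray) -> float:
    """Cross-entropy over a caption: -sum_t log p(y_t | y_<t, X).
    probs: per-step probabilities assigned to the ground-truth tokens."""
    return float(-np.sum(np.log(probs)))

def heatmap_loss(h: np.ndarray, h_gt: np.ndarray) -> float:
    """Squared L2 error against the human-annotated saliency map."""
    return float(np.sum((h - h_gt) ** 2))

def total_loss(probs: np.ndarray, h: np.ndarray, h_gt: np.ndarray,
               lam: float = 0.5) -> float:
    """Joint objective L_total = L_caption + lambda * L_heatmap, lambda = 0.5."""
    return caption_loss(probs) + lam * heatmap_loss(h, h_gt)
```

The fixed weighting $\lambda = 0.5$ reflects the reported configuration; in practice it would be tuned alongside the fine-tuning schedule.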

3. Explainability Mechanisms and Interaction Modalities

SNEI explicitly integrates explainability at several levels:

  • Visual Explanations: Saliency heatmaps ($H$) spatially ground the robot’s focus, visually highlighting influential regions in the scene that drive decision-making.

  • Natural Language Explanations: Generated by LLMs, these summarize robot actions or trajectory changes ("I see a small gathering of people ahead engaged in discussion; let me reroute to give them space" (Sotomi et al., 7 Apr 2025)).

  • Token and Expert-activation Tracing: In MoE-based models, activated expert pathways and token-level cross-attention can be visualized for post-hoc interpretability (Kawabata et al., 15 Dec 2025).

  • Gesture and Dialogue Looping: Some frameworks incorporate bidirectional voice and gesture negotiation, where humans can guide robot choices through hand gestures interpreted by models such as MediaPipe; robots adjust behavior and provide verbal acknowledgments (Girgin et al., 2024).

  • Probabilistic Planning of Explanations: Explanation modalities, detail, duration, and scope are treated as stochastic variables in MDP (RDDL) planners to optimize satisfaction of diverse user preferences (Halilovic et al., 2024).

Real-time latency remains a limiting factor: end-to-end pipelines with remote LLMs currently report average latencies around $20\,\mathrm{s}$ per explanation on embedded hardware (Sotomi et al., 7 Apr 2025), motivating research into compact local models (Kawabata et al., 15 Dec 2025).

4. Evaluation Protocols and Empirical Results

SNEI is validated through a combination of user studies, behavioral metrics, and confusion-matrix-based analysis.

User Studies and Metrics:

  • Demographics: 30 participants, each observing four robot runs (with/without SNEI, manual/autonomous modes) (Sotomi et al., 7 Apr 2025).

  • Survey Items: Trust, clarity, understanding, perceived safety (5-point Likert scale).

  • Preference Score (PS):

PS = (U + 0.5\,N) / T \times 100

where $U$ is the number of users preferring explanations, $N$ the number neutral, and $T$ the total.
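The preference score is a one-line computation. In the sketch below, the split of 20 preferring and 6 neutral out of 30 participants is a hypothetical example, chosen only because it reproduces the reported $PS = 76.7\%$:

```python
def preference_score(preferring: int, neutral: int, total: int) -> float:
    """PS = (U + 0.5 * N) / T * 100, counting neutral users at half weight."""
    return (preferring + 0.5 * neutral) / total * 100.0

# Hypothetical split: 20 prefer explanations, 6 are neutral, 30 total.
ps = preference_score(20, 6, 30)  # -> 76.7 (rounded to one decimal)
```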

Results:

  • Navigation:

    • Slight decrease in trajectory length/time with SNEI; reduction in abrupt stops ($21 \rightarrow 18$) and in the number of detected conflicts (Sotomi et al., 7 Apr 2025).
  • Subjective Outcomes:
    • Trust: increased by $+16.7\%$.
    • Understanding: increased by $+23.3\%$.
    • Preference for real-time explanations: $PS = 76.7\%$ (vs. $50\%$ without).
  • Explainability Accuracy:
    • Accuracy: $82.14\%$; Precision: $80.4\%$; Recall: $84.6\%$ (on held-out explanations).
  • Qualitative Explanations: Users consistently rated explanations as helping them understand and predict robot choices.
  • Comparative Performance: SocialNav-MoE achieved a sentence-mover’s similarity (SMS) of $0.551$ vs. $0.376$ (GPT-4o) and $0.417$ (Claude), while running $8\times$ to $20\times$ faster on action inference (Kawabata et al., 15 Dec 2025).

5. Topological and Relational Explainability

Some SNEI pipelines define collision-free and deadlock-free "safety regions" using topological features of multi-robot trajectories (Toscano-Duran et al., 14 Feb 2025):

  • Topological Feature Extraction: Persistent entropy over Vietoris–Rips filtrations, forming a $4$-D feature vector per simulation.
  • Safety Regions ($S_\varepsilon$): Adjustable SVM classifiers with probabilistic ($\varepsilon$-error) guarantees; global rules learned via SkopeRules; local anchors specify high-fidelity boundaries (e.g., mean entropy $> 2.68$).
  • Rule Interpretability: Rule-based descriptions yield $84.6\%$ accuracy for collision avoidance and $89\%$ for deadlock avoidance.
  • Trajectory-based Visualizations: Relational reasoning frameworks construct dynamic pairwise and group-wise graphs/hypergraphs to expose “who influences whom” (e.g., explicit group avoidance by convex hull) (Li et al., 2024).

These structural explanations are inherently legible and facilitate post-hoc auditing and user acceptance.
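Persistent entropy, the topological summary underlying these safety regions, is straightforward to compute from a persistence diagram. The sketch below assumes finite birth–death intervals with strictly positive lifetimes; in the cited pipeline, entropies of this kind are aggregated into the $4$-D feature vector per simulation.

```python
import numpy as np

def persistent_entropy(intervals: np.ndarray) -> float:
    """Persistent entropy E = -sum_i p_i * log(p_i), with p_i = l_i / L,
    l_i = death_i - birth_i, and L = sum_i l_i.

    intervals: (n, 2) array of (birth, death) pairs from a Vietoris-Rips
    filtration. Assumes finite deaths and strictly positive lifetimes.
    """
    lifetimes = intervals[:, 1] - intervals[:, 0]
    p = lifetimes / lifetimes.sum()      # normalised lifetime distribution
    return float(-np.sum(p * np.log(p)))  # Shannon entropy of that distribution
```

Because it is a single scalar per diagram, persistent entropy is cheap to feed into the SVM-based safety-region classifiers described above.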

6. Limitations and Future Extensions

  • Latency and Hardware: High inference latency (remote LLMs) inhibits deployment in dense real-world crowds; efficient small-model variants mitigate this but may trade off some linguistic richness (Kawabata et al., 15 Dec 2025).
  • Generality: Most systems focus on static or semi-dynamic clusters; scalability to new environments, dynamic social norms, and agent types (e.g., bicyclists) remains a research target (Sotomi et al., 7 Apr 2025).
  • Dialogue Depth: Current dialogue policies are generally shallow (single- or two-turn exchanges); richer multi-turn dialogue systems and user adaptation mechanisms are under exploration (Wen et al., 2024).
  • User Modeling: Probabilistic planning frameworks offer principled adaptation to user explanation preferences but require accurate priors or online learning for effective performance in heterogeneous populations (Halilovic et al., 2024).

7. Representative Datasets and Benchmarks

The SNEI benchmark dataset (Payandeh et al., 2024) comprises 2,000 annotated still frames of human–robot interactions with 40,000 VQA pairs over five evenly represented semantic categories:

| Category         | VQA Pairs | Example                                     |
|------------------|-----------|---------------------------------------------|
| Perception       | 8,000     | "What is directly in front of the robot?"   |
| Prediction       | 8,000     | "What will the person in red do next?"      |
| Chain-of-Thought | 8,000     | "What reasoning should the robot perform?"  |
| Final Action     | 8,000     | "What should the robot do?"                 |
| Explanation      | 8,000     | "Why is that action appropriate?"           |

This resource anchors model training and benchmarking for both language-based and multimodal SNEI systems.


Social Robot Navigation via Explainable Interactions defines a rigorously evaluated, modular paradigm for transparent, socially compliant robot navigation, integrating VLM-based perception, saliency-driven rationalization, domain-adaptive user feedback, and formal, statistically guaranteed safety mechanisms (Sotomi et al., 7 Apr 2025, Kawabata et al., 15 Dec 2025, Toscano-Duran et al., 14 Feb 2025, Wen et al., 2024, Girgin et al., 2024, Halilovic et al., 2024, Payandeh et al., 2024).
