Social Robot Navigation via Explainable Interactions

Updated 20 January 2026
  • The paper introduces a multimodal framework that integrates vision-language models, reinforcement learning, and topological planning to deliver real-time explainable robot navigation.
  • It employs specialized datasets and joint loss functions for fine-tuning, resulting in enhanced trust, reduced conflicts, and improved navigation efficiency.
  • The system generates saliency heatmaps and natural language explanations to ensure transparent, socially compliant interactions in diverse, dynamic environments.

Social Robot Navigation via Explainable Interactions (SNEI) unifies autonomous mobile robot navigation with real-time, human-interpretable explanations, leveraging advances in vision-language models (VLMs), reinforcement learning, and multimodal interaction. SNEI operationalizes robot transparency by integrating natural language explanations and visual rationales into the navigation stack, aiming to enhance human trust, predictability, and social compliance in diverse dynamic environments. Recent literature documents its formal architectures, mathematical grounding, user-centered validation, and real-world deployment across multiple paradigms (Sotomi et al., 7 Apr 2025, Kawabata et al., 15 Dec 2025, Girgin et al., 2024).

1. Core Architectural Principles

SNEI is structurally defined by its coupling of perception, explanation, and planning via a multimodal, three-stage pipeline (Sotomi et al., 7 Apr 2025):

  • Perception: The robot ingests sensory data—typically forward-facing RGB images (or combined RGB–LiDAR)—processed through a fine-tuned vision-language model (VLM; e.g., BLIP) to yield dense visual captions ($L$) and spatial saliency heatmaps ($H$) via Grad-CAM. Formally, for image $X$, the caption and heatmap are

L = f_\text{caption}(X), \qquad H_{i,j} = \operatorname{ReLU}\left(\sum_k \alpha_k A_k^{i,j}\right).

  • Multimodal Fusion & Explanation: The tuple $(X, H, L)$ is fused via feature extraction and spatial pooling, e.g., $\psi(X,H) = \operatorname{MLP}(\operatorname{concat}(\operatorname{Pool}(H), \operatorname{ExtractFeatures}(X)))$, and mapped to a concise explanation $E$ by an LLM $\Phi$:

E = \Phi(\psi(X, H), L).

  • Navigation Integration: The socially aware path planner (TEB/ORCA) triggers the explanation pipeline when social conflicts, encoded by $h(P, q_\text{human}^j) \leq 0$, are detected, and dynamically replans while streaming $E$ to the user.

Explanations are generated in real time, aligning the robot’s instantaneous “attention” (via $H$) and decision (via $L$) with concise, context-sensitive summaries (Sotomi et al., 7 Apr 2025).
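The Grad-CAM step of the perception stage can be sketched in NumPy. This is an illustrative reconstruction only: in the actual pipeline the channel weights $\alpha_k$ are derived from gradient pooling and the feature maps $A_k$ come from the VLM's visual encoder; here both are plain arrays.

```python
import numpy as np

def gradcam_heatmap(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Grad-CAM-style saliency: H_ij = ReLU(sum_k alpha_k * A_k[i, j]).

    activations: (K, H, W) feature maps A_k (placeholder for the VLM encoder output).
    weights:     (K,) channel importances alpha_k (gradient-pooled in Grad-CAM).
    """
    # Weighted sum over the channel axis, then ReLU as in the formula above.
    h = np.tensordot(weights, activations, axes=([0], [0]))
    h = np.maximum(h, 0.0)
    # Normalise to [0, 1] so the map can be overlaid on the input image.
    if h.max() > 0:
        h = h / h.max()
    return h
```

Upsampling the resulting low-resolution map to the input image size (omitted here) is what yields the heatmaps streamed to users.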

2. Model Training, Datasets, and Fine-Tuning

SNEI systems are trained on specialized datasets that capture visual, spatial, and social cues in human-robot encounters.

  • Navigation Explanation Dataset: Approximately $5\,000$ images from indoor spaces, each labeled with a human-written caption and an annotated saliency map.
  • Fine-Tuning Procedures:
    • BLIP (VLM): Pretrained on MSCOCO/Flickr30k; further trained for $10$ epochs with the cross-entropy caption loss

    \mathcal{L}_\text{caption} = -\sum_t \log p(y_t \mid y_{<t}, X).

    • Heatmap Branch: Mean squared error against human-annotated saliency maps,

    \mathcal{L}_\text{heatmap} = \| H - H_\text{gt} \|_2^2.

    • Joint Loss:

    \mathcal{L}_\text{total} = \mathcal{L}_\text{caption} + \lambda \mathcal{L}_\text{heatmap}, \quad \lambda = 0.5.

    • LLM ($\Phi$): Prompt-tuned on $2\,000$ joint representations using token-level cross-entropy.

  • SNEI Dataset (Broader Context): $2\,000$ annotated scenarios, $40\,000$ VQA pairs with even splits over perception, prediction, chain-of-thought, action, and explanation; includes scenario metadata (environment type, crowd density, etc.) (Payandeh et al., 2024).

Alternative architectures utilize mixture-of-experts models (e.g., SocialNav-MoE (Kawabata et al., 15 Dec 2025)) and reinforcement fine-tuning pipelines with semantic similarity rewards for efficient, human-compliant navigation.
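The fine-tuning objective above reduces to three small formulas, which can be made concrete in a minimal NumPy sketch (illustrative only, not the actual training code; the caption term assumes per-step probabilities of the ground-truth tokens are already available):

```python
import numpy as np

def caption_loss(probs: np.ndarray) -> float:
    """Cross-entropy over a caption: -sum_t log p(y_t | y_<t, X).
    probs: per-step probabilities assigned to the ground-truth tokens."""
    return float(-np.sum(np.log(probs)))

def heatmap_loss(h: np.ndarray, h_gt: np.ndarray) -> float:
    """Squared L2 error against the human-annotated saliency map."""
    return float(np.sum((h - h_gt) ** 2))

def total_loss(probs: np.ndarray, h: np.ndarray, h_gt: np.ndarray,
               lam: float = 0.5) -> float:
    """Joint objective L_total = L_caption + lambda * L_heatmap, lambda = 0.5."""
    return caption_loss(probs) + lam * heatmap_loss(h, h_gt)
```

The fixed weighting $\lambda = 0.5$ reflects the reported configuration; in practice it would be tuned alongside the fine-tuning schedule.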

3. Explainability Mechanisms and Interaction Modalities

SNEI explicitly integrates explainability at several levels:

  • Visual Explanations: Saliency heatmaps ($H$) spatially ground the robot’s focus, visually highlighting influential regions in the scene that drive decision-making.

  • Natural Language Explanations: Generated by LLMs, these summarize robot actions or trajectory changes ("I see a small gathering of people ahead engaged in discussion; let me reroute to give them space" (Sotomi et al., 7 Apr 2025)).

  • Token and Expert-activation Tracing: In MoE-based models, activated expert pathways and token-level cross-attention can be visualized for post-hoc interpretability (Kawabata et al., 15 Dec 2025).

  • Gesture and Dialogue Looping: Some frameworks incorporate bidirectional voice and gesture negotiation, where humans can guide robot choices through hand gestures interpreted by models such as MediaPipe; robots adjust behavior and provide verbal acknowledgments (Girgin et al., 2024).

  • Probabilistic Planning of Explanations: Explanation modalities, detail, duration, and scope are treated as stochastic variables in MDP (RDDL) planners to optimize satisfaction of diverse user preferences (Halilovic et al., 2024).

Real-time latency remains a limiting factor: end-to-end pipelines with remote LLMs currently report average latencies around $20\,\mathrm{s}$ per explanation on embedded hardware (Sotomi et al., 7 Apr 2025), motivating research into compact local models (Kawabata et al., 15 Dec 2025).

4. Evaluation Protocols and Empirical Results

SNEI is validated through a combination of user studies, behavioral metrics, and confusion-matrix-based analysis.

User Studies and Metrics:

  • Demographics: 30 participants, each observing four robot runs (with/without SNEI, manual/autonomous modes) (Sotomi et al., 7 Apr 2025).

  • Survey Items: Trust, clarity, understanding, perceived safety (5-point Likert scale).

  • Preference Score (PS):

PS = (U + 0.5\,N) / T \times 100

where $U$ is the number of users preferring explanations, $N$ the number neutral, and $T$ the total.
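The preference score is a one-line computation. In the sketch below, the split of 20 preferring and 6 neutral out of 30 participants is a hypothetical example, chosen only because it reproduces the reported $PS = 76.7\%$:

```python
def preference_score(preferring: int, neutral: int, total: int) -> float:
    """PS = (U + 0.5 * N) / T * 100, counting neutral users at half weight."""
    return (preferring + 0.5 * neutral) / total * 100.0

# Hypothetical split: 20 prefer explanations, 6 are neutral, 30 total.
ps = preference_score(20, 6, 30)  # -> 76.7 (rounded to one decimal)
```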

Results:

  • Navigation:

    • Slight decrease in trajectory length/time with SNEI; reduction in abrupt stops ($21 \rightarrow 18$) and in the number of detected conflicts (Sotomi et al., 7 Apr 2025).
  • Subjective Outcomes:
    • Trust: increased by $+16.7\%$.
    • Understanding: increased by $+23.3\%$.
    • Preference for real-time explanations: $PS = 76.7\%$ (vs. $50\%$ without).
  • Explainability Accuracy:
    • Accuracy: $82.14\%$; Precision: $80.4\%$; Recall: $84.6\%$ (on held-out explanations).
  • Qualitative Explanations: Users consistently rated explanations as helping them understand and predict robot choices.
  • Comparative Performance: SocialNav-MoE achieved a sentence-mover’s similarity (SMS) of $0.551$ vs. $0.376$ (GPT-4o) and $0.417$ (Claude), while running $8\times$ to $20\times$ faster on action inference (Kawabata et al., 15 Dec 2025).

5. Topological and Relational Explainability

Some SNEI pipelines define collision-free and deadlock-free "safety regions" using topological features of multi-robot trajectories (Toscano-Duran et al., 14 Feb 2025):

  • Topological Feature Extraction: Persistent entropy over Vietoris–Rips filtrations, forming a $4$-D feature vector per simulation.
  • Safety Regions ($S_\varepsilon$): Adjustable SVM classifiers with probabilistic ($\varepsilon$-error) guarantees; global rules learned via SkopeRules; local anchors specify high-fidelity boundaries (e.g., mean entropy $> 2.68$).
  • Rule Interpretability: Rule-based descriptions yield $84.6\%$ accuracy for collision avoidance and $89\%$ for deadlock avoidance.
  • Trajectory-based Visualizations: Relational reasoning frameworks construct dynamic pairwise and group-wise graphs/hypergraphs to expose “who influences whom” (e.g., explicit group avoidance by convex hull) (Li et al., 2024).

These structural explanations are inherently legible and facilitate post-hoc auditing and user acceptance.
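Persistent entropy, the topological summary underlying these safety regions, is straightforward to compute from a persistence diagram. The sketch below assumes finite birth–death intervals with strictly positive lifetimes; in the cited pipeline, entropies of this kind are aggregated into the $4$-D feature vector per simulation.

```python
import numpy as np

def persistent_entropy(intervals: np.ndarray) -> float:
    """Persistent entropy E = -sum_i p_i * log(p_i), with p_i = l_i / L,
    l_i = death_i - birth_i, and L = sum_i l_i.

    intervals: (n, 2) array of (birth, death) pairs from a Vietoris-Rips
    filtration. Assumes finite deaths and strictly positive lifetimes.
    """
    lifetimes = intervals[:, 1] - intervals[:, 0]
    p = lifetimes / lifetimes.sum()      # normalised lifetime distribution
    return float(-np.sum(p * np.log(p)))  # Shannon entropy of that distribution
```

Because it is a single scalar per diagram, persistent entropy is cheap to feed into the SVM-based safety-region classifiers described above.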

6. Limitations and Future Extensions

  • Latency and Hardware: High inference latency (remote LLMs) inhibits deployment in dense real-world crowds; efficient small-model variants mitigate this but may trade off some linguistic richness (Kawabata et al., 15 Dec 2025).
  • Generality: Most systems focus on static or semi-dynamic clusters; scalability to new environments, dynamic social norms, and agent types (e.g., bicyclists) remains a research target (Sotomi et al., 7 Apr 2025).
  • Dialogue Depth: Current dialogue policies are generally shallow (single- or two-turn exchanges); richer multi-turn dialogue systems and user adaptation mechanisms are under exploration (Wen et al., 2024).
  • User Modeling: Probabilistic planning frameworks offer principled adaptation to user explanation preferences but require accurate priors or online learning for effective performance in heterogeneous populations (Halilovic et al., 2024).

7. Representative Datasets and Benchmarks

The SNEI benchmark dataset (Payandeh et al., 2024) comprises 2,000 annotated still frames of human–robot interactions with 40,000 VQA pairs over five evenly represented semantic categories:

| Category         | VQA Pairs | Example                                     |
|------------------|-----------|---------------------------------------------|
| Perception       | 8,000     | "What is directly in front of the robot?"   |
| Prediction       | 8,000     | "What will the person in red do next?"      |
| Chain-of-Thought | 8,000     | "What reasoning should the robot perform?"  |
| Final Action     | 8,000     | "What should the robot do?"                 |
| Explanation      | 8,000     | "Why is that action appropriate?"           |

This resource anchors model training and benchmarking for both language-based and multimodal SNEI systems.


Social Robot Navigation via Explainable Interactions defines a rigorously evaluated, modular paradigm for transparent, socially compliant robot navigation, integrating VLM-based perception, saliency-driven rationalization, domain-adaptive user feedback, and formal, statistically guaranteed safety mechanisms (Sotomi et al., 7 Apr 2025, Kawabata et al., 15 Dec 2025, Toscano-Duran et al., 14 Feb 2025, Wen et al., 2024, Girgin et al., 2024, Halilovic et al., 2024, Payandeh et al., 2024).
