Embodiment-aware functional reasoning in VLMs

Determine how to achieve reliable, inclusive embodiment-aware functional reasoning in Vision-Language Models across diverse agent profiles (Adult, Child, and Wheelchair user), so that performance differences across these embodiments do not persist even after tasks are decomposed into atomic actions in 3D scenes.

Background

The paper benchmarks multiple Vision-Language Models (VLMs) on geometry-verified feasibility judgments of tasks and atomic actions in 3D indoor scenes under explicit agent profiles (Adult, Child, Wheelchair user). While action-level decomposition improves many models' functional reasoning, substantial disparities remain across embodiments.
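
To make the evaluation setup concrete, below is a minimal sketch of how such a per-profile feasibility benchmark could be scored. All names here (query_vlm, the example record fields, PROFILES) are hypothetical stand-ins; the paper's actual prompts, scenes, and geometry-verification pipeline are not reproduced.

```python
from typing import Callable

PROFILES = ["Adult", "Child", "Wheelchair user"]

def profile_accuracy(
    query_vlm: Callable[[str, str, str], bool],  # (scene_desc, profile, action) -> predicted feasible?
    examples: list[dict],  # each: {"scene", "profile", "action", "gt_feasible"} with geometry-verified labels
) -> dict[str, float]:
    """Accuracy of feasibility judgments, broken out by agent profile."""
    correct = {p: 0 for p in PROFILES}
    total = {p: 0 for p in PROFILES}
    for ex in examples:
        pred = query_vlm(ex["scene"], ex["profile"], ex["action"])
        correct[ex["profile"]] += int(pred == ex["gt_feasible"])
        total[ex["profile"]] += 1
    return {p: correct[p] / total[p] for p in PROFILES if total[p]}
```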

This persistent gap motivates an open problem around embodiment-aware reasoning: designing or identifying approaches that yield consistent performance across different physical capabilities and constraints, ensuring models do not systematically underperform for certain agent profiles even when tasks are broken down into simpler, grounded steps.
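
One simple way to quantify the inclusivity requirement is a disparity metric over per-profile accuracies. The sketch below is illustrative only (the function name and the example numbers are assumptions, not results from the paper): a model can have an acceptable average accuracy while still failing the consistency criterion.

```python
def embodiment_gap(acc_by_profile: dict[str, float]) -> float:
    """Largest accuracy gap between any two agent profiles;
    0 would indicate perfectly consistent embodiment-aware reasoning."""
    vals = list(acc_by_profile.values())
    return max(vals) - min(vals)

# Hypothetical numbers: accurate for Adults, much weaker for Wheelchair users.
gap = embodiment_gap({"Adult": 0.82, "Child": 0.74, "Wheelchair user": 0.61})
print(f"embodiment gap: {gap:.2f}")  # 0.21
```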

References

Finally, inclusivity remains non-trivial across all systems, as performance differences across Adult, Child, and Wheelchair user profiles persist even after decomposition, highlighting that embodiment-aware functional reasoning remains an open challenge.

SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes (2603.29798 - Maillard et al., 31 Mar 2026), Section 4.3, Results Overview