
Embodied Capability Taxonomy

Updated 22 September 2025
  • Embodied capability taxonomy is a hierarchical framework that organizes and evaluates core perceptual, cognitive, motoric, and social abilities in embodied AI.
  • It employs modular benchmarks, unified scoring, and composite metrics to rigorously assess performance across diverse sensory and task domains.
  • The framework bridges theory and practice by guiding the design of scalable, interoperable agents with automated data generation and dynamic challenge calibration.

Embodied capability taxonomy is a hierarchical and multi-dimensional framework for organizing, specifying, evaluating, and standardizing the core perceptual, cognitive, motoric, and social abilities required by embodied AI systems to perceive, reason, and act within physical or simulated environments. This concept plays a foundational role in bridging theory and practice for the development of scalable, modular, generalizable, and interoperable embodied agents. Recent research advances have focused on formalizing such taxonomies to unify research objectives, enable rigorous model comparison, support automated benchmarking and evaluation, and guide the design of future embodied intelligence platforms (Ni et al., 18 Sep 2025).

1. Definitional Structure and Hierarchies

Contemporary embodied capability taxonomies, as implemented in large-scale evaluation platforms, are generally organized hierarchically across three levels (Ni et al., 18 Sep 2025):

  1. Perception – The ability to process raw sensory signals (visual, spatial, temporal) from the environment.
  2. Reasoning – The capacity to integrate sensory information with prior knowledge for understanding, inference, and prediction.
  3. Task Execution – The capacity for goal-directed behavior, complex planning, and organizational skills linking reasoning and motoric action.

Within this tripartite architecture, seven core embodied capabilities are established:

| Level          | Core Capability        | Fine-Grained Dimensions                                                                      |
|----------------|------------------------|----------------------------------------------------------------------------------------------|
| Perception     | Object Perception      | Type, Property, State, Count                                                                 |
| Perception     | Spatial Perception     | Relationship, Distance, Localization, Size                                                   |
| Perception     | Temporal Perception    | Description, Order                                                                           |
| Reasoning      | Embodied Knowledge     | General Knowledge, Affordance Prediction                                                     |
| Reasoning      | Embodied Reasoning     | Object, Spatial, Temporal, Knowledge, Task Reasoning                                         |
| Task Execution | Embodied Navigation    | Object, Location, Instruction Navigation                                                     |
| Task Execution | Embodied Task Planning | Basic, Visual Reference, Spatial Reference, Temporal Reference, Knowledge Reference Planning |

Each capability is further decomposed into concrete, measurable sub-dimensions, enabling granular benchmarking and diagnosis (Ni et al., 18 Sep 2025).
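The three-level structure lends itself to a simple nested mapping. The sketch below mirrors the table above (the data-structure choice and helper name are illustrative, not the platform's actual schema):

```python
# Three-level embodied capability taxonomy:
# level -> core capability -> fine-grained dimensions (mirrors the table above).
TAXONOMY = {
    "Perception": {
        "Object Perception": ["Type", "Property", "State", "Count"],
        "Spatial Perception": ["Relationship", "Distance", "Localization", "Size"],
        "Temporal Perception": ["Description", "Order"],
    },
    "Reasoning": {
        "Embodied Knowledge": ["General Knowledge", "Affordance Prediction"],
        "Embodied Reasoning": ["Object", "Spatial", "Temporal", "Knowledge", "Task Reasoning"],
    },
    "Task Execution": {
        "Embodied Navigation": ["Object", "Location", "Instruction Navigation"],
        "Embodied Task Planning": [
            "Basic", "Visual Reference", "Spatial Reference",
            "Temporal Reference", "Knowledge Reference Planning",
        ],
    },
}

def all_dimensions():
    """Flatten the taxonomy into (level, capability, dimension) triples for benchmarking."""
    return [
        (level, capability, dim)
        for level, capabilities in TAXONOMY.items()
        for capability, dims in capabilities.items()
        for dim in dims
    ]
```

Flattening into triples is one natural way to index benchmark instances by capability dimension, as the evaluation pipelines described below require.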

2. Evaluation, Benchmarking, and Standardization

Taxonomy-driven evaluation systems map embodied capabilities to standardized metrics and scenarios by integrating diverse benchmarks into a unified infrastructure (Ni et al., 18 Sep 2025, Gao et al., 11 Jun 2025). Standardization is ensured through:

  • Modular Benchmark Integration: Coordination of task evaluation across question answering, navigation, and task planning domains.
  • Uniform Input/Output Formats: Ensuring reproducibility and fair cross-domain comparisons.
  • Capability-Indexed Scoring: For a capability dimension m on benchmark n, the normalized score is S_m^n = (c_m^n / k_m^n) × 100, where c_m^n is the number of correct answers and k_m^n is the total number of instances.
  • Composite Metrics: Success Rate, SPL (Success weighted by Path Length), and customized reasoning or planning accuracies are used as primary evaluation axes.
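The normalized score and SPL can be computed directly. A minimal sketch (function names are illustrative; SPL here follows the standard definition of success weighted by the ratio of shortest to taken path length):

```python
def capability_score(correct: int, total: int) -> float:
    """Normalized score S_m^n = (c_m^n / k_m^n) * 100 for capability m on benchmark n."""
    if total <= 0:
        raise ValueError("benchmark must contain at least one instance")
    return 100.0 * correct / total

def spl(episodes) -> float:
    """Success weighted by Path Length, averaged over episodes of
    (success: bool, shortest_path: float, taken_path: float)."""
    if not episodes:
        return 0.0
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)  # detours reduce credit, capped at 1
    return total / len(episodes)
```

Normalizing every capability dimension onto the same 0-100 scale is what makes the capability-indexed leaderboard views comparable across heterogeneous benchmarks.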

Leaderboards using both benchmark and capability views allow holistic and task-specific model assessment. Findings from such evaluations highlight performance bottlenecks, reveal scaling law effects, and correlate core perceptual skills with downstream task generalization (Ni et al., 18 Sep 2025).

3. Automated Data Generation and Evolution

Critical to the maintenance and scale-up of taxonomical evaluation is automated data generation. Embodied Arena employs a two-stage, LLM-driven pipeline (Ni et al., 18 Sep 2025):

  • Scenario Generation: Hierarchical construction of simulation environments via floor planning, functional zoning, and randomized layout generation, ensuring real-world relevance and domain diversity.
  • Capability-Oriented Data Synthesis: Procedural generation of visual-instruction-answer triplets for each capability and dimension, leveraging simulation privileges for ground-truth extraction and supporting continuous data evolution.

Adaptive “difficulty ladders” and targeted generation driven by real-time model weaknesses enable ongoing challenge calibration and curriculum-based evaluation.
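A difficulty ladder of this kind can be sketched as a small control loop: target generation toward the weakest capability and step its difficulty rung up or down based on observed scores. The thresholds, rung range, and function signature below are assumptions for illustration, not Embodied Arena's actual interface:

```python
def next_generation_target(scores: dict, difficulty: dict,
                           promote_at: float = 80.0, demote_at: float = 40.0):
    """Pick the weakest capability (lowest normalized score) and adjust its
    difficulty rung (1..5): harder if the model is comfortable, easier if it
    is failing badly. Returns (capability, new_rung) to drive data synthesis."""
    weakest = min(scores, key=scores.get)
    rung = difficulty.get(weakest, 1)
    if scores[weakest] >= promote_at:
        rung = min(rung + 1, 5)   # comfortable: generate harder instances
    elif scores[weakest] < demote_at:
        rung = max(rung - 1, 1)   # failing: generate easier instances
    difficulty[weakest] = rung
    return weakest, rung
```

In a real pipeline the returned target would condition the capability-oriented synthesis stage, so that newly generated visual-instruction-answer triplets concentrate on the current frontier of model weakness.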

4. Integration Across Embodied Task Domains

Embodied capability taxonomy bridges traditionally siloed research communities. The taxonomy accommodates:

  • Object Perception and Reasoning: Including fine-grained recognition, spatial and temporal relationships, and object affordance understanding, crucial for Q&A and object-centric manipulation tasks (Ni et al., 18 Sep 2025).
  • Navigation: Generalist agents are benchmarked for free-form, multi-modal instruction following, integrating flexibility across PointNav, ObjectNav, Vision-Language Navigation, and composite settings (Gao et al., 11 Jun 2025, Zhang et al., 15 Sep 2025).
  • Planning and Execution: Hierarchically, from low-level primitives (atomic “move”, “grasp”) to parameterized skills and complex multi-step planning, especially relevant in industrial automation and compositional task domains (Pantano et al., 2022, Ni et al., 18 Sep 2025).
  • Memory and Social Intelligence: Long-term multimodal memory, episodic and semantic memory integration, and lifelong social interaction are newly incorporated capability axes (Yadav et al., 18 Jun 2025, Zhang et al., 30 Jun 2025).

The taxonomy flexibly supports specification, evaluation, and modular expansion across these domains.

5. Taxonomy-Driven Research Findings and Implications

Large-scale empirical evaluations using capability taxonomies yield several key findings (Ni et al., 18 Sep 2025):

  • Advanced reasoning performance is tightly coupled to fundamental perceptual skills—suggesting that investments in object and spatial perception robustly transfer to higher-level reasoning.
  • Specialized, task-tuned models outperform scaled generalist models on specific capability dimensions—even when large general models show dominance in overall averages.
  • Multi-task, cross-domain evaluation is necessary to avoid overfitting and exposes weaknesses in models trained solely on single benchmarks.

The taxonomy also serves as an ongoing roadmap: new capabilities (e.g., compositional memory, causal reasoning, adaptive planning) are incrementally incorporated as research progresses.

6. Technical Formalisms and Practical Frameworks

Embodied capability taxonomies are underpinned by formal mathematical and data structure frameworks:

  • Explicit Planning Models: Problem tuples defined as Φ = {S, A, s_ini, s_goal}, with solution trajectories ψ = [s_ini, a_0, …, s_t, a_t, …, a_T, s_goal], where each s_t incorporates multimodal observations (Francis et al., 2021, Francis et al., 2023).
  • Hierarchy of Tasks, Skills, Primitives: Abstract-to-concrete decomposition with symbolic parameterization, e.g., Pick_and_Place(Shaft, Housing) (Pantano et al., 2022).
  • Normalized Scoring and Benchmark-Agnostic APIs: Ensuring reproducibility and cross-capability transferability.
  • Resource-aware Sampling: Modern capability taxonomies integrate token budget–aware observation sampling and temporal-viewpoint encoding to maintain efficiency and broad applicability (Zhang et al., 15 Sep 2025).
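The explicit planning model can be instantiated concretely once the state and action spaces are discrete. The sketch below uses breadth-first search as a stand-in for whatever planner a given framework employs; the problem tuple Φ = {S, A, s_ini, s_goal} and trajectory ψ follow the formalism above, while the corridor example is purely illustrative:

```python
from collections import deque

def plan(states, actions, transition, s_ini, s_goal):
    """Solve the problem tuple Phi = {S, A, s_ini, s_goal} by breadth-first search,
    returning a trajectory psi = [s_ini, a_0, s_1, ..., a_T, s_goal], or None."""
    frontier = deque([[s_ini]])
    visited = {s_ini}
    while frontier:
        traj = frontier.popleft()
        state = traj[-1]
        if state == s_goal:
            return traj
        for a in actions:
            nxt = transition(state, a)
            if nxt in states and nxt not in visited:
                visited.add(nxt)
                frontier.append(traj + [a, nxt])
    return None

# Toy instantiation: a 1-D corridor, with parameterized "skills" as actions.
states = set(range(5))
actions = ["move_left", "move_right"]
step = lambda s, a: s + (1 if a == "move_right" else -1)
psi = plan(states, actions, step, s_ini=0, s_goal=3)
# psi alternates states and actions:
# [0, 'move_right', 1, 'move_right', 2, 'move_right', 3]
```

In embodied settings each state s_t would carry multimodal observations rather than an integer, and atomic actions would be replaced by parameterized skills such as Pick_and_Place(Shaft, Housing); the tuple structure and trajectory form are unchanged.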

Such frameworks support the instantiation and scaling of taxonomies in both simulated and real-world deployments.

7. Future Directions and Open Problems

Emerging taxonomies highlight several directions:

  • Incremental expansion to new capability dimensions: e.g., creative reasoning, open-ended dialogue, persistent memory, and theory-of-mind representation.
  • Dynamic curriculum evolution: Data and evaluation difficulty adapt in real time to the competency of advancing agents, maintaining assessment relevance at the frontier.
  • Integration of policy, ethics, and risk assessment: Taxonomies increasingly include axes for physical, informational, economic, and social risk capabilities (Perlo et al., 28 Aug 2025), with corresponding policy implications for safety and governance.
  • Universal action and perception primitives: Progress toward shared atomic behaviors and observation modalities enables rapid adaptation and transfer learning across robots and environments (Zheng et al., 17 Jan 2025).

Plausibly, the taxonomy itself will evolve iteratively in parallel with capability advances in embodied AI systems and as the field confronts richer, open-world, and multi-agent interaction settings.


Embodied capability taxonomy provides the organizing scaffold for coherent progress in embodied AI—specifying, benchmarking, diagnosing, and ultimately driving human-level intelligence in physically and socially embedded systems. Its hierarchical, modular, and extensible character ensures continued adaptability as the technical and scientific horizons of the field rapidly advance.
