EFM-10: Robotic Manipulation Benchmark
- EFM-10 is a real-world robotic manipulation benchmark that defines ten complex tasks requiring active perception under occlusion and uncertainty.
- It employs a bimanual active perception strategy where one arm provides eye-in-hand vision while the other carries out force-sensitive manipulation.
- Experimental results show that integrating view planning and force feedback significantly boosts task success rates in real-world scenarios.
The EFM-10 Benchmark (Exploratory and Focused Manipulation, 10 tasks) is a real-world, robot manipulation benchmark designed to evaluate and advance autonomous systems in complex, information-limited environments. It addresses the challenge of active perception under severe visual occlusion and contact uncertainty, formalizing the Exploratory and Focused Manipulation (EFM) problem and introducing both a novel manipulation task suite and a new bimanual active perception (BAP) strategy (He et al., 2 Feb 2026).
1. Problem Formalization: Exploratory and Focused Manipulation
EFM-10 centers on the autonomous completion of manipulation tasks where critical scene information is missing due to occlusion, hidden semantics, or need for precise contact. Unlike standard tabletop scenarios, EFM-10 explicitly challenges robots to actively gather task-relevant information through viewpoint manipulation and haptic exploration.
Formally, EFM tasks are cast as partially observable Markov decision processes (POMDPs). At each timestep t, the robot’s true state s_t (scene geometry, object identities, contact forces) is only partially observed via o_t, which aggregates camera views and sensor feedback. The robot selects actions a_t = (a_t^v, a_t^m), with a_t^v designated for the vision arm (controlling an eye-in-hand camera) and a_t^m for the manipulation arm (which may have force/torque sensing). The goal is to reduce uncertainty and reach the task’s success condition.
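Spelled out, this is the standard POMDP tuple; the rendering below uses conventional notation rather than symbols taken verbatim from the paper:

```latex
(\mathcal{S}, \mathcal{A}, T, \Omega, O, R), \qquad
a_t = (a_t^{v},\, a_t^{m}), \qquad
o_t \sim O(\,\cdot \mid s_t\,)
```

Here T is the transition kernel over states, Ω the observation space, and O the observation model mapping true states s_t to partial observations o_t.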
2. Task Design and Taxonomy within EFM-10
EFM-10 consists of ten tasks instantiated on a real JAKA K-1 bimanual system, systematically organized into four categories that reflect varied modes of exploration and focus:
| Category | Representative Tasks | Challenge Type |
|---|---|---|
| Semantically exploratory | Toy-Find, Toy-Match | Hidden object/semantic search |
| Visual occlusion | Cup-Hang, Cup-Place, Box-Push | Viewpoint planning under occlusion |
| Delicate/focused | Light-Plug, Bread-Brush, Nail-Knock | Precise, contact-rich manipulation |
| Complex (combined) | Cable-Match, Charger-Plug | Sequential exploration & precision |
Each task is defined to enforce either the need for discovering hidden scene properties or for executing fine-grained, compliance-sensitive motor operations under partial observability (He et al., 2 Feb 2026).
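A hypothetical encoding of this taxonomy in code can make the structure concrete; the task names follow the table above, while the dictionary layout and function name are illustrative, not from the benchmark's actual API.

```python
# Hypothetical encoding of the EFM-10 taxonomy; task names follow the
# benchmark table, while the dictionary layout itself is illustrative.
EFM10_TASKS = {
    "semantically_exploratory": ["Toy-Find", "Toy-Match"],
    "visual_occlusion": ["Cup-Hang", "Cup-Place", "Box-Push"],
    "delicate_focused": ["Light-Plug", "Bread-Brush", "Nail-Knock"],
    "complex_combined": ["Cable-Match", "Charger-Plug"],
}


def tasks_in(category: str) -> list:
    """Look up the tasks registered under one taxonomy category."""
    return EFM10_TASKS[category]
```

The four categories partition the ten tasks exactly, which is a useful invariant when iterating over the suite in evaluation scripts.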
3. Bimanual Active Perception (BAP) Strategy
Standard fixed or head-mounted camera approaches fail when arms or tools block the view. The BAP strategy exploits both arms: the non-operating “vision arm” provides an actively movable, eye-in-hand observation; the “manipulation arm” executes the main manipulation task and supplies force/torque readings for contact feedback. This configuration eliminates the need for custom high-DoF robot necks and leverages commodity bimanual platforms (He et al., 2 Feb 2026):
- Vision Arm (“eye-in-hand”): Moves a camera to capture optimal, informative viewpoints and maintain visibility over manipulated regions and end-effectors.
- Manipulation Arm: Conducts contact-rich operations (e.g., insertion, brushing) and streams high-rate 6-DoF F/T signals.
Observation and action spaces are constructed as follows:
- State s_t: object poses, joint configurations q_t, gripper states, true contact forces.
- Observation o_t: head camera image I_t^head, vision-arm wrist image I_t^wrist, manipulation-arm state, and wrist force/torque f_t.
- Action a_t = (a_t^v, a_t^m): Cartesian delta commands for each arm.
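These spaces can be sketched as lightweight containers; the class names and array shapes below are assumptions for illustration, not the benchmark's actual interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BAPObservation:
    """One timestep's observation o_t as described above (shapes illustrative)."""
    head_image: np.ndarray    # RGB frame from the fixed head camera
    wrist_image: np.ndarray   # RGB frame from the vision arm's eye-in-hand camera
    manip_state: np.ndarray   # manipulation-arm joint and gripper state
    wrist_force: np.ndarray   # 6-DoF force/torque at the manipulation wrist


@dataclass
class BAPAction:
    """Per-arm Cartesian delta commands a_t = (a_t^v, a_t^m)."""
    vision_delta: np.ndarray  # 6-DoF delta pose for the vision arm
    manip_delta: np.ndarray   # 6-DoF delta pose (plus gripper) for the manipulation arm
```

Separating the two arms' commands at the type level mirrors the BAP split between information gathering and contact-rich execution.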
4. Dataset, Experimental Protocol, and Policy Training
BAPData is the dataset underpinning EFM-10. It comprises 1,810 expert VR teleoperation demonstrations (≈13–25 s each), collected across all tasks using a JAKA K-1 robot with head and wrist RGB-D/RGB cameras and 6-DoF force/torque sensing.
Key dataset properties:
- Synchronized RGB images from head and two wrist cameras at 10 Hz
- End-effector pose, gripper states, and force/torque readings per frame
- Preprocessing: Downsampling, center-cropping, state/action normalization
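The preprocessing steps listed above can be sketched as follows; the crop size and the exact normalization scheme are assumptions, since the source does not specify them.

```python
import numpy as np


def preprocess_frame(img: np.ndarray, crop: int = 224) -> np.ndarray:
    """Center-crop an H x W x 3 frame to crop x crop and scale pixels to [0, 1].
    Mirrors the preprocessing listed above; the crop size is illustrative."""
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    patch = img[top:top + crop, left:left + crop]
    return patch.astype(np.float32) / 255.0


def normalize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Z-score normalization for states/actions, a common choice in BC pipelines."""
    return (x - mean) / (std + 1e-8)
```

Statistics for `normalize` would be computed once over the demonstration set and reused at inference time so that train and test inputs share the same scale.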
Policies are trained via behavioral cloning. The primary loss is the mean squared error between demonstration actions and predicted actions per observation. In some variants (e.g., GR-MG), a force-prediction auxiliary loss is incorporated to encourage use of the force input.
Single-task policies are trained per task; multi-task policies (e.g., GR-MG) are trained jointly.
Evaluation is conducted on 30 randomized trials per task (He et al., 2 Feb 2026).
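A minimal sketch of the training objective, assuming a plausible form in which the auxiliary force term is added with a scalar weight (the weight `lam` and the exact combination are assumptions, not taken from the paper):

```python
import numpy as np


def bc_loss(pred_action, demo_action, pred_force=None, true_force=None, lam=0.1):
    """Behavioral-cloning MSE over actions, optionally adding a force-prediction
    auxiliary term as in the GR-MG variant. The weighting lam is an assumption."""
    loss = float(np.mean((np.asarray(pred_action) - np.asarray(demo_action)) ** 2))
    if pred_force is not None and true_force is not None:
        loss += lam * float(np.mean((np.asarray(pred_force) - np.asarray(true_force)) ** 2))
    return loss
```

The auxiliary head is only active when force targets are available, so the same loss covers both the plain-BC and force-augmented variants.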
5. Core Experimental Findings
EFM-10 exposes the substantial benefits of BAP and active viewpoint selection:
- Active View Efficacy: Task success rates increase dramatically when the active view captures both manipulated area and the end-effector. For example, on Toy-Match, the success rate improves from 20% (no view) to 76.7% (area+effector).
- Policy Benchmarking: Multi-task visuomotor policies trained with BAPData’s active vision setup excel across all categories, but especially so on tasks demanding viewpoint planning and force-sensitive contact.
- Force Sensing: Incorporating force/torque streams and training with a force-prediction auxiliary head reduces excessive forces (down 29% peak in Light-Plug) and boosts completion rates (e.g., from 20.0% to 36.7% on Light-Plug; 76.7% to 90.0% on Bread-Brush).
- Error Analysis: Remaining failures are attributable to insufficient semantic grounding (for “find” tasks), suboptimal view angles (occlusion tasks), or fine-grained spatial misalignment (delicate insertions). This suggests that closing the gap requires integrating improved language grounding, more sophisticated viewpoint planning, and enhanced spatial attention mechanisms (He et al., 2 Feb 2026).
6. EFM-10 in the Context of Active Perception Research
EFM-10 is distinguished by the scale and diversity of its real-world, multi-stage, occlusion-rich tasks, and its emphasis on explicit information-seeking via robotic viewpoint control. The benchmark complements and extends prior bimanual active perception efforts such as ActiveUMI (Zeng et al., 2 Oct 2025), which utilized a third, head-mounted camera but did not structure tasks to systematically interrogate exploration versus precision operations, nor establish a unified taxonomy.
Benchmark evaluations further establish that active, task-driven perception is not simply a hardware configuration detail: it is a major determinant of manipulation success, generalization, and compliance in realistic settings. Comparative analysis shows that strategies relying solely on fixed or wrist-mounted cameras exhibit poor generalization and degraded performance on long-horizon, occlusion-intensive, or compliant manipulation tasks (Zeng et al., 2 Oct 2025).
7. Open Problems and Future Directions
EFM-10, along with BAP and BAPData, constitutes a new foundation for research in embodied active perception and manipulation. Persisting open problems highlighted in benchmark evaluations include:
- Policy-guided viewpoint selection: Formalizing and optimizing information gain (e.g., maximizing the expected reduction in belief uncertainty) for the vision arm remains an open challenge.
- Fusion of multi-modal feedback: Combining vision from active arm(s) and head, along with language input for instruction following.
- Memory and long-horizon search: Persistent memory architectures to improve object search and multi-stage exploration.
- Reinforcement learning and hierarchical planning: End-to-end optimization, potentially incorporating LLM-based planning to further close the EFM task loop.
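The information-gain objective in the first bullet is commonly formalized as expected entropy reduction of the belief over the task-relevant state; a conventional form (not taken verbatim from the paper) is:

```latex
a_t^{v\,*} \;=\; \arg\max_{a_t^{v}} \;
\mathbb{E}_{o_{t+1}}\!\left[\, H(b_t) - H(b_{t+1}) \,\right]
```

where b_t denotes the belief over s_t and H its entropy, so the vision arm is driven toward viewpoints expected to most reduce uncertainty.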
Continued development and community adoption of EFM-10 are projected to enable more robust, semantically guided, and generalizable bimanual manipulation systems, directly advancing the state-of-the-art in autonomous robot perception and control (He et al., 2 Feb 2026, Zeng et al., 2 Oct 2025).