EFM-10: Robotic Manipulation Benchmark
- EFM-10 is a real-world robotic manipulation benchmark that defines ten complex tasks requiring active perception under occlusion and uncertainty.
- It employs a bimanual active perception strategy where one arm provides eye-in-hand vision while the other carries out force-sensitive manipulation.
- Experimental results show that integrating view planning and force feedback significantly boosts task success rates in real-world scenarios.
The EFM-10 Benchmark (Exploratory and Focused Manipulation, 10 tasks) is a real-world, robot manipulation benchmark designed to evaluate and advance autonomous systems in complex, information-limited environments. It addresses the challenge of active perception under severe visual occlusion and contact uncertainty, formalizing the Exploratory and Focused Manipulation (EFM) problem and introducing both a novel manipulation task suite and a new bimanual active perception (BAP) strategy (He et al., 2 Feb 2026).
1. Problem Formalization: Exploratory and Focused Manipulation
EFM-10 centers on the autonomous completion of manipulation tasks where critical scene information is missing due to occlusion, hidden semantics, or need for precise contact. Unlike standard tabletop scenarios, EFM-10 explicitly challenges robots to actively gather task-relevant information through viewpoint manipulation and haptic exploration.
Formally, EFM tasks are cast as partially observable Markov decision processes (POMDPs). At each timestep t, the robot’s true state s_t (scene geometry, object identities, contact forces) is only partially observed via o_t, which aggregates camera views and sensor feedback. The robot selects actions a_t = (a_t^v, a_t^m), with a_t^v designated for the vision arm (controlling an eye-in-hand camera) and a_t^m for the manipulation arm (which may have force/torque sensing). The goal is to reduce uncertainty and reach the task’s success condition.
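Spelled out, this is the standard POMDP tuple; the rendering below uses conventional notation rather than symbols taken verbatim from the paper:

```latex
(\mathcal{S}, \mathcal{A}, T, \Omega, O, R), \qquad
a_t = (a_t^{v},\, a_t^{m}), \qquad
o_t \sim O(\,\cdot \mid s_t\,)
```

Here T is the transition kernel over states, Ω the observation space, and O the observation model mapping true states s_t to partial observations o_t.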
2. Task Design and Taxonomy within EFM-10
EFM-10 consists of ten tasks instantiated on a real JAKA K-1 bimanual system, systematically organized into four categories that reflect varied modes of exploration and focus:
| Category | Representative Tasks | Challenge Type |
|---|---|---|
| Semantically exploratory | Toy-Find, Toy-Match | Hidden object/semantic search |
| Visual occlusion | Cup-Hang, Cup-Place, Box-Push | Viewpoint planning under occlusion |
| Delicate/focused | Light-Plug, Bread-Brush, Nail-Knock | Precise, contact-rich manipulation |
| Complex (combined) | Cable-Match, Charger-Plug | Sequential exploration & precision |
Each task is defined to enforce either the need for discovering hidden scene properties or for executing fine-grained, compliance-sensitive motor operations under partial observability (He et al., 2 Feb 2026).
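A hypothetical encoding of this taxonomy in code can make the structure concrete; the task names follow the table above, while the dictionary layout and function name are illustrative, not from the benchmark's actual API.

```python
# Hypothetical encoding of the EFM-10 taxonomy; task names follow the
# benchmark table, while the dictionary layout itself is illustrative.
EFM10_TASKS = {
    "semantically_exploratory": ["Toy-Find", "Toy-Match"],
    "visual_occlusion": ["Cup-Hang", "Cup-Place", "Box-Push"],
    "delicate_focused": ["Light-Plug", "Bread-Brush", "Nail-Knock"],
    "complex_combined": ["Cable-Match", "Charger-Plug"],
}


def tasks_in(category: str) -> list:
    """Look up the tasks registered under one taxonomy category."""
    return EFM10_TASKS[category]
```

The four categories partition the ten tasks exactly, which is a useful invariant when iterating over the suite in evaluation scripts.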
3. Bimanual Active Perception (BAP) Strategy
Standard fixed or head-mounted camera approaches fail when arms or tools block the view. The BAP strategy exploits both arms: the non-operating “vision arm” provides an actively movable, eye-in-hand observation; the “manipulation arm” executes the main manipulation task and supplies force/torque readings for contact feedback. This configuration eliminates the need for custom high-DoF robot necks and leverages commodity bimanual platforms (He et al., 2 Feb 2026):
- Vision Arm (“eye-in-hand”): Moves a camera to capture optimal, informative viewpoints and maintain visibility over manipulated regions and end-effectors.
- Manipulation Arm: Conducts contact-rich operations (e.g., insertion, brushing) and streams high-rate 6-DoF F/T signals.
Observation and action spaces are constructed as follows:
- State s_t: object poses, joint configurations q_t, gripper states, true contact forces.
- Observation o_t: head camera image I_t^head, vision-arm wrist image I_t^wrist, manipulation-arm state, and wrist force/torque f_t.
- Action a_t = (a_t^v, a_t^m): Cartesian delta commands for each arm.
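These spaces can be sketched as lightweight containers; the class names and array shapes below are assumptions for illustration, not the benchmark's actual interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BAPObservation:
    """One timestep's observation o_t as described above (shapes illustrative)."""
    head_image: np.ndarray    # RGB frame from the fixed head camera
    wrist_image: np.ndarray   # RGB frame from the vision arm's eye-in-hand camera
    manip_state: np.ndarray   # manipulation-arm joint and gripper state
    wrist_force: np.ndarray   # 6-DoF force/torque at the manipulation wrist


@dataclass
class BAPAction:
    """Per-arm Cartesian delta commands a_t = (a_t^v, a_t^m)."""
    vision_delta: np.ndarray  # 6-DoF delta pose for the vision arm
    manip_delta: np.ndarray   # 6-DoF delta pose (plus gripper) for the manipulation arm
```

Separating the two arms' commands at the type level mirrors the BAP split between information gathering and contact-rich execution.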
4. Dataset, Experimental Protocol, and Policy Training
BAPData is the dataset underpinning EFM-10. It comprises 1,810 expert VR teleoperation demonstrations (≈13–25 s each), collected across all tasks using a JAKA K-1 robot with head and wrist RGB-D/RGB cameras and 6-DoF force/torque sensing.
Key dataset properties:
- Synchronized RGB images from head and two wrist cameras at 10 Hz
- End-effector pose, gripper states, and force/torque readings per frame
- Preprocessing: Downsampling, center-cropping, state/action normalization
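The preprocessing steps listed above can be sketched as follows; the crop size and the exact normalization scheme are assumptions, since the source does not specify them.

```python
import numpy as np


def preprocess_frame(img: np.ndarray, crop: int = 224) -> np.ndarray:
    """Center-crop an H x W x 3 frame to crop x crop and scale pixels to [0, 1].
    Mirrors the preprocessing listed above; the crop size is illustrative."""
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    patch = img[top:top + crop, left:left + crop]
    return patch.astype(np.float32) / 255.0


def normalize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Z-score normalization for states/actions, a common choice in BC pipelines."""
    return (x - mean) / (std + 1e-8)
```

Statistics for `normalize` would be computed once over the demonstration set and reused at inference time so that train and test inputs share the same scale.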
Policies are trained via behavioral cloning. The primary loss is the mean squared error between demonstration actions and predicted actions per observation. In some variants (e.g., GR-MG), a force-prediction auxiliary loss is incorporated to encourage use of the force input.
Single-task policies are trained per task; multi-task policies (e.g., GR-MG) are trained jointly.
Evaluation is conducted on 30 randomized trials per task (He et al., 2 Feb 2026).
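A minimal sketch of the training objective, assuming a plausible form in which the auxiliary force term is added with a scalar weight (the weight `lam` and the exact combination are assumptions, not taken from the paper):

```python
import numpy as np


def bc_loss(pred_action, demo_action, pred_force=None, true_force=None, lam=0.1):
    """Behavioral-cloning MSE over actions, optionally adding a force-prediction
    auxiliary term as in the GR-MG variant. The weighting lam is an assumption."""
    loss = float(np.mean((np.asarray(pred_action) - np.asarray(demo_action)) ** 2))
    if pred_force is not None and true_force is not None:
        loss += lam * float(np.mean((np.asarray(pred_force) - np.asarray(true_force)) ** 2))
    return loss
```

The auxiliary head is only active when force targets are available, so the same loss covers both the plain-BC and force-augmented variants.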
5. Core Experimental Findings
EFM-10 exposes the substantial benefits of BAP and active viewpoint selection:
- Active View Efficacy: Task success rates increase dramatically when the active view captures both manipulated area and the end-effector. For example, on Toy-Match, the success rate improves from 20% (no view) to 76.7% (area+effector).
- Policy Benchmarking: Multi-task visuomotor policies trained with BAPData’s active vision setup excel across all categories, but especially so on tasks demanding viewpoint planning and force-sensitive contact.
- Force Sensing: Incorporating force/torque streams and training with a force-prediction auxiliary head reduces excessive forces (down 29% peak in Light-Plug) and boosts completion rates (e.g., from 20.0% to 36.7% on Light-Plug; 76.7% to 90.0% on Bread-Brush).
- Error Analysis: Remaining failures are attributable to insufficient semantic grounding (for “find” tasks), suboptimal view angles (occlusion tasks), or fine-grained spatial misalignment (delicate insertions). This suggests that closing the gap requires integrating improved language grounding, more sophisticated viewpoint planning, and enhanced spatial attention mechanisms (He et al., 2 Feb 2026).
6. EFM-10 in the Context of Active Perception Research
EFM-10 is distinguished by the scale and diversity of its real-world, multi-stage, occlusion-rich tasks, and its emphasis on explicit information-seeking via robotic viewpoint control. The benchmark complements and extends prior bimanual active perception efforts such as ActiveUMI (Zeng et al., 2 Oct 2025), which utilized a third, head-mounted camera but did not structure tasks to systematically interrogate exploration versus precision operations, nor establish a unified taxonomy.
Benchmark evaluations further establish that active, task-driven perception is not simply a hardware configuration detail: it is a major determinant of manipulation success, generalization, and compliance in realistic settings. Comparative analysis shows that strategies relying solely on fixed or wrist-mounted cameras exhibit poor generalization and degraded performance on long-horizon, occlusion-intensive, or compliant manipulation tasks (Zeng et al., 2 Oct 2025).
7. Open Problems and Future Directions
EFM-10, along with BAP and BAPData, constitutes a new foundation for research in embodied active perception and manipulation. Persisting open problems highlighted in benchmark evaluations include:
- Policy-guided viewpoint selection: Formalizing and optimizing information gain (e.g., maximizing the expected reduction in belief uncertainty) for the vision arm remains an open challenge.
- Fusion of multi-modal feedback: Combining vision from active arm(s) and head, along with language input for instruction following.
- Memory and long-horizon search: Persistent memory architectures to improve object search and multi-stage exploration.
- Reinforcement learning and hierarchical planning: End-to-end optimization, potentially incorporating LLM-based planning to further close the EFM task loop.
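The information-gain objective in the first bullet is commonly formalized as expected entropy reduction of the belief over the task-relevant state; a conventional form (not taken verbatim from the paper) is:

```latex
a_t^{v\,*} \;=\; \arg\max_{a_t^{v}} \;
\mathbb{E}_{o_{t+1}}\!\left[\, H(b_t) - H(b_{t+1}) \,\right]
```

where b_t denotes the belief over s_t and H its entropy, so the vision arm is driven toward viewpoints expected to most reduce uncertainty.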
Continued development and community adoption of EFM-10 are projected to enable more robust, semantically guided, and generalizable bimanual manipulation systems, directly advancing the state-of-the-art in autonomous robot perception and control (He et al., 2 Feb 2026, Zeng et al., 2 Oct 2025).