
EFM-10: Robotic Manipulation Benchmark

Updated 9 February 2026
  • EFM-10 is a real-world robotic manipulation benchmark that defines complex tasks requiring active perception under occlusion and uncertainty.
  • It employs a bimanual active perception strategy in which one arm provides eye-in-hand vision while the other carries out force-sensitive manipulation.
  • Experimental results show that integrating view planning and force feedback significantly boosts task success rates in real-world scenarios.

The EFM-10 Benchmark (Exploratory and Focused Manipulation, 10 tasks) is a real-world, robot manipulation benchmark designed to evaluate and advance autonomous systems in complex, information-limited environments. It addresses the challenge of active perception under severe visual occlusion and contact uncertainty, formalizing the Exploratory and Focused Manipulation (EFM) problem and introducing both a novel manipulation task suite and a new bimanual active perception (BAP) strategy (He et al., 2 Feb 2026).

1. Problem Formalization: Exploratory and Focused Manipulation

EFM-10 centers on the autonomous completion of manipulation tasks where critical scene information is missing due to occlusion, hidden semantics, or need for precise contact. Unlike standard tabletop scenarios, EFM-10 explicitly challenges robots to actively gather task-relevant information through viewpoint manipulation and haptic exploration.

Formally, EFM tasks are cast as partially observable Markov decision processes (POMDPs). At each timestep $t$, the robot's true state $s_t$ (scene geometry, object identities, contact forces) is only partially observed via $o_t$, which aggregates camera views and sensor feedback. The robot selects actions $a_t = (a_t^v, a_t^m)$, with $a_t^v$ designated for the vision arm (controlling an eye-in-hand camera) and $a_t^m$ for the manipulation arm (which may have force/torque sensing). The goal is to reduce uncertainty and reach the task's success condition.
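The POMDP framing above can be sketched as a minimal interaction loop, where the environment hides the true state and the agent acts on partial, noisy observations. All names here (`ToyEnv`, `Action`, the policy) are illustrative assumptions, not the benchmark's actual API:

```python
# Minimal sketch of the EFM setting as a POMDP rollout.
# ToyEnv keeps the true state s_t hidden; the agent only sees o_t.
from dataclasses import dataclass
import numpy as np

@dataclass
class Action:
    a_v: np.ndarray  # a_t^v: Cartesian delta for the vision arm (camera)
    a_m: np.ndarray  # a_t^m: Cartesian delta for the manipulation arm

class ToyEnv:
    """Toy stand-in environment: hidden 3-D state, noisy partial observation."""
    def __init__(self):
        self._state = np.zeros(3)          # true state, never exposed directly

    def reset(self):
        self._state = np.zeros(3)
        return self._observe()

    def step(self, act: Action):
        self._state += act.a_m[:3]         # manipulation changes the hidden state
        done = np.linalg.norm(self._state - 1.0) < 0.1
        return self._observe(), done

    def _observe(self):
        # o_t: a noisy, partial view of the state (crudely simulating occlusion)
        return self._state + np.random.randn(3) * 0.01

def rollout(env, policy, horizon=100):
    """Run one episode; the policy conditions only on observations."""
    obs = env.reset()
    for _ in range(horizon):
        act = policy(obs)
        obs, done = env.step(act)
        if done:
            break
    return obs

# A trivial open-loop policy that ignores the camera arm entirely.
policy = lambda obs: Action(a_v=np.zeros(6), a_m=np.full(6, 0.05))
final_obs = rollout(ToyEnv(), policy)
```

In the real benchmark the policy would also command the vision arm (`a_v`) to select viewpoints; the toy policy above deliberately leaves that degree of freedom unused.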

2. Task Design and Taxonomy within EFM-10

EFM-10 consists of ten tasks instantiated on a real JAKA K-1 bimanual system, systematically organized into four categories that reflect varied modes of exploration and focus:

| Category | Representative Tasks | Challenge Type |
|---|---|---|
| Semantically exploratory | Toy-Find, Toy-Match | Hidden object/semantic search |
| Visual occlusion | Cup-Hang, Cup-Place, Box-Push | Viewpoint planning under occlusion |
| Delicate/focused | Light-Plug, Bread-Brush, Nail-Knock | Precise, contact-rich manipulation |
| Complex (combined) | Cable-Match, Charger-Plug | Sequential exploration & precision |

Each task is defined to enforce either the need for discovering hidden scene properties or for executing fine-grained, compliance-sensitive motor operations under partial observability (He et al., 2 Feb 2026).

3. Bimanual Active Perception (BAP) Strategy

Standard fixed or head-mounted camera approaches fail when arms or tools block the view. The BAP strategy exploits both arms: the non-operating “vision arm” provides an actively movable, eye-in-hand observation; the “manipulation arm” executes the main manipulation task and supplies force/torque readings for contact feedback. This configuration eliminates the need for custom high-DoF robot necks and leverages commodity bimanual platforms (He et al., 2 Feb 2026):

  • Vision Arm (“eye-in-hand”): Moves a camera to capture optimal, informative viewpoints and maintain visibility over manipulated regions and end-effectors.
  • Manipulation Arm: Conducts contact-rich operations (e.g., insertion, brushing) and streams high-rate 6-DoF F/T signals.

Observation and action spaces are constructed as follows:

  • State $s_t$: object poses, joint configurations $(q^v_t, q^m_t)$, gripper states, true contact forces.
  • Observation $o_t = (I^H_t, I^v_t, s^m_t, F^m_t)$: head camera image $I^H_t$, vision-arm wrist image $I^v_t$, manipulation-arm state $s^m_t$, and wrist force $F^m_t$.
  • Action $a_t = (a^v_t, a^m_t)$: Cartesian delta commands for each arm.
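The observation and action tuples above can be sketched as simple containers. Field names, image resolutions, and the 8-D manipulation-arm state are assumptions for illustration, not the benchmark's actual data schema:

```python
# Sketch of the BAP observation/action spaces (shapes are assumed).
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    I_head: np.ndarray   # I_t^H: head camera RGB image
    I_wrist: np.ndarray  # I_t^v: vision-arm wrist camera RGB image
    s_manip: np.ndarray  # s_t^m: manipulation-arm state (pose, gripper)
    F_manip: np.ndarray  # F_t^m: 6-DoF wrist force/torque reading

@dataclass
class BimanualAction:
    a_v: np.ndarray  # Cartesian delta command for the vision arm
    a_m: np.ndarray  # Cartesian delta command for the manipulation arm

obs = Observation(
    I_head=np.zeros((224, 224, 3), dtype=np.uint8),
    I_wrist=np.zeros((224, 224, 3), dtype=np.uint8),
    s_manip=np.zeros(8),   # e.g. 7-D pose + gripper width (assumed layout)
    F_manip=np.zeros(6),   # (Fx, Fy, Fz, Tx, Ty, Tz)
)
act = BimanualAction(a_v=np.zeros(6), a_m=np.zeros(6))
```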

4. Dataset, Experimental Protocol, and Policy Training

BAPData is the dataset underpinning EFM-10. It comprises 1,810 expert VR teleoperation demonstrations (≈13–25 s each), collected across all ten tasks using a JAKA K-1 robot with head and wrist RGB-D/RGB cameras and 6-DoF force/torque sensing.

Key dataset properties:

  • Synchronized RGB images from head and two wrist cameras at 10 Hz
  • End-effector pose, gripper states, and force/torque readings per frame
  • Preprocessing: Downsampling, center-cropping, state/action normalization
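The listed preprocessing steps can be sketched as below. The stride, crop size, and normalization epsilon are assumptions; the actual pipeline's statistics come from the dataset itself:

```python
# Hedged sketch of BAPData-style preprocessing: temporal downsampling,
# center-cropping, and state/action normalization (parameters assumed).
import numpy as np

def downsample(frames: np.ndarray, stride: int = 2) -> np.ndarray:
    """Keep every `stride`-th frame along the time axis."""
    return frames[::stride]

def center_crop(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Crop a (H, W, C) image to (size, size, C) about its center."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def normalize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Standardize states/actions with per-dimension dataset statistics."""
    return (x - mean) / (std + 1e-8)

# toy clip: 40 frames of 480x640 RGB
imgs = np.zeros((40, 480, 640, 3), dtype=np.uint8)
clip = downsample(imgs)       # -> 20 frames
crop = center_crop(clip[0])   # -> (224, 224, 3)
```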

Policies are trained via behavioral cloning. The primary loss is mean squared error between demonstration actions and predicted actions per observation. In some variants (e.g., GR-MG), a force-prediction auxiliary loss is incorporated to encourage utilization of force input:

$$L(\theta) = \mathbb{E}_{(o, a) \sim D}\, \| a - \pi_\theta(o) \|^2_2$$

$$L_{\mathrm{force}}(\theta) = \mathbb{E}_D \left[ \| \hat{F}^m_{t+1:t+K} - F^m_{t+1:t+K} \|^2_2 \right]$$

$$L_{\mathrm{total}} = L(\theta) + \lambda L_{\mathrm{force}}(\theta)$$
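The behavioral-cloning objective with the force-prediction auxiliary term can be written directly in NumPy. The arrays below are random stand-ins for network outputs and demonstration data, and the weight λ = 0.1 is an assumption:

```python
# NumPy sketch of the BC loss plus force-prediction auxiliary loss.
import numpy as np

def bc_loss(pred_actions, demo_actions):
    """L(theta): MSE between predicted and demonstrated actions."""
    return np.mean(np.sum((demo_actions - pred_actions) ** 2, axis=-1))

def force_loss(pred_forces, true_forces):
    """L_force: MSE on predicted future wrist forces F^m_{t+1:t+K}."""
    return np.mean(np.sum((pred_forces - true_forces) ** 2, axis=-1))

def total_loss(pred_a, demo_a, pred_F, true_F, lam=0.1):
    """L_total = L + lambda * L_force (lambda value assumed)."""
    return bc_loss(pred_a, demo_a) + lam * force_loss(pred_F, true_F)

# toy batch: 4 samples, 13-D actions, K=5 future force steps of 6 DoF
rng = np.random.default_rng(0)
pa, da = rng.normal(size=(4, 13)), rng.normal(size=(4, 13))
pF, tF = rng.normal(size=(4, 5, 6)), rng.normal(size=(4, 5, 6))
L = total_loss(pa, da, pF, tF)
```

In practice the same scalar would be backpropagated through the policy network; here it only illustrates how the two terms combine.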

Single-task policies are trained per task; multi-task policies (e.g., $\pi_0$, GR-MG) are trained jointly.

Evaluation is conducted on 30 randomized trials per task (He et al., 2 Feb 2026).

5. Core Experimental Findings

EFM-10 exposes the substantial benefits of BAP and active viewpoint selection:

  • Active View Efficacy: Task success rates increase dramatically when the active view captures both manipulated area and the end-effector. For example, on Toy-Match, the success rate improves from 20% (no view) to 76.7% (area+effector).
  • Policy Benchmarking: Multi-task visuomotor policies trained with BAPData’s active vision setup excel across all categories, but especially so on tasks demanding viewpoint planning and force-sensitive contact.
  • Force Sensing: Incorporating force/torque streams and training with a force-prediction auxiliary head reduces excessive forces (29% lower peak $F_z$ in Light-Plug) and boosts completion rates (e.g., from 20.0% to 36.7% on Light-Plug; 76.7% to 90.0% on Bread-Brush).
  • Error Analysis: Remaining failures are attributable to insufficient semantic grounding (for “find” tasks), suboptimal view angles (occlusion tasks), or fine-grained spatial misalignment (delicate insertions). This suggests that closing the gap requires integrating improved language grounding, more sophisticated viewpoint planning, and enhanced spatial attention mechanisms (He et al., 2 Feb 2026).

6. EFM-10 in the Context of Active Perception Research

EFM-10 is distinguished by the scale and diversity of its real-world, multi-stage, occlusion-rich tasks, and its emphasis on explicit information-seeking via robotic viewpoint control. The benchmark complements and extends prior bimanual active perception efforts such as ActiveUMI (Zeng et al., 2 Oct 2025), which utilized a third, head-mounted camera but did not structure tasks to systematically interrogate exploration versus precision operations, nor establish a unified taxonomy.

Benchmark evaluations further establish that active, task-driven perception is not simply a hardware configuration detail: it is a major determinant of manipulation success, generalization, and compliance in realistic settings. Comparative analysis shows that strategies relying solely on fixed or wrist-mounted cameras exhibit poor generalization and degraded performance on long-horizon, occlusion-intensive, or compliant manipulation tasks (Zeng et al., 2 Oct 2025).

7. Open Problems and Future Directions

EFM-10, along with BAP and BAPData, constitutes a new foundation for research in embodied active perception and manipulation. Persisting open problems highlighted in benchmark evaluations include:

  • Policy-guided viewpoint selection: Formalizing and optimizing information gain (e.g., maximizing $IG(o_t; s)$) for the vision arm remains an open challenge.
  • Fusion of multi-modal feedback: Combining vision from active arm(s) and head, along with language input for instruction following.
  • Memory and long-horizon search: Persistent memory architectures to improve object search and multi-stage exploration.
  • Reinforcement learning and hierarchical planning: End-to-end optimization, potentially incorporating LLM-based planning to further close the EFM task loop.
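The information-gain objective in the first bullet can be made concrete for a discrete belief: score each candidate viewpoint by the expected Shannon-entropy reduction its observations would produce. The two hand-built sensor models below (a discriminative view versus a fully occluded one) are illustrative assumptions, not part of the benchmark:

```python
# Sketch of viewpoint scoring by expected information gain over a
# discrete belief; sensor models and viewpoints are toy assumptions.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_info_gain(belief, likelihoods):
    """IG = H(belief) - E_o[H(belief | o)] for one candidate viewpoint.

    likelihoods[o, s] = p(observation o | hidden state s) at that viewpoint.
    """
    h_prior = entropy(belief)
    p_obs = likelihoods @ belief                   # marginal p(o)
    h_post = 0.0
    for o, po in enumerate(p_obs):
        if po > 0:
            post = likelihoods[o] * belief / po    # Bayes update on o
            h_post += po * entropy(post)
    return h_prior - h_post

belief = np.array([0.5, 0.5])                      # two hidden states
informative = np.array([[0.9, 0.1], [0.1, 0.9]])   # view that discriminates
occluded = np.array([[0.5, 0.5], [0.5, 0.5]])      # view that reveals nothing
best = max([informative, occluded],
           key=lambda L: expected_info_gain(belief, L))
```

A vision-arm planner would evaluate this score over reachable camera poses and move toward the maximizer; the occluded view correctly scores zero gain.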

Continued development and community adoption of EFM-10 are projected to enable more robust, semantically guided, and generalizable bimanual manipulation systems, directly advancing the state-of-the-art in autonomous robot perception and control (He et al., 2 Feb 2026, Zeng et al., 2 Oct 2025).
