Dynamic Object Manipulation Benchmark
- Dynamic Object Manipulation (DOM) is a benchmark that standardizes real-time evaluation of robotic manipulation using synchronized multi-modal data and rigorous protocols.
- It integrates large-scale synthetic and real-world datasets to assess closed-loop control, temporal anticipation, and dynamic adaptation under varied motion conditions.
- DOM advances Vision-Language-Action research by providing precise metrics and a structured taxonomy for challenges like fast object motion and sensory latency.
The Dynamic Object Manipulation (DOM) benchmark is a standardized evaluation framework for robotic systems tasked with manipulating moving objects under dynamic, real-time conditions. DOM specifies rigorous protocols, diverse datasets, and temporally aligned multi-modal streams designed to expose the challenges intrinsic to perception, anticipation, and continuous closed-loop control in dynamic manipulation scenarios (Xie et al., 29 Jan 2026). The benchmark enables comparative study and systematic advancement of Vision-Language-Action (VLA) policy architectures, with a well-defined set of metrics, sub-dimensions, and use-cases.
1. Benchmark Objectives and Task Taxonomy
DOM’s core purpose is to serve as a testbed for robotic policies requiring:
- Dynamic perception: Rapidly interpret per-frame object motion (6D pose and velocity) and evolving spatial relations.
- Temporal anticipation: Predict future object states to compensate for computational and sensing latency.
- Closed-loop control: Generate and execute continuous, low-latency action streams that remain responsive to changes in object dynamics.
DOM defines nine sub-dimensions along three axes—Interaction, Perception, and Generalization:
| Pillar | Sub-dimension | Operational Focus |
|---|---|---|
| Interaction | Closed-loop Reactivity (CR) | Track and catch varying-speed objects |
| Interaction | Dynamic Adaptation (DA) | Respond to sudden velocity/direction changes |
| Interaction | Long-horizon Sequencing (LS) | Coordinate repeated catch-and-place cycles |
| Perception | Visual Understanding (VU) | Recognize similar objects under motion |
| Perception | Spatial Reasoning (SR) | Infer object relations in clutter |
| Perception | Motion Perception (MP) | Estimate speed and trajectory in real time |
| Generalization | Visual Generalization (VG) | Transfer to novel shapes and textures |
| Generalization | Motion Generalization (MG) | Generalize to new velocity/friction regimes |
| Generalization | Disturbance Robustness (DR) | Operate under external perturbations |
This taxonomy enables systematic stress-testing across physically and semantically diverse manipulation scenarios (Xie et al., 29 Jan 2026).
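For bookkeeping per-dimension results, the taxonomy above maps naturally onto a small data structure. The sketch below is illustrative (names and structure are assumptions, not part of the DOM release):

```python
# Hypothetical encoding of DOM's three pillars and nine sub-dimensions,
# mirroring the table above; useful for tallying per-dimension scores.
DOM_TAXONOMY = {
    "Interaction": {
        "CR": "Closed-loop Reactivity",
        "DA": "Dynamic Adaptation",
        "LS": "Long-horizon Sequencing",
    },
    "Perception": {
        "VU": "Visual Understanding",
        "SR": "Spatial Reasoning",
        "MP": "Motion Perception",
    },
    "Generalization": {
        "VG": "Visual Generalization",
        "MG": "Motion Generalization",
        "DR": "Disturbance Robustness",
    },
}
```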
2. Data Collection Pipeline
DOM data is generated via large-scale simulation and teleoperation-free real-world setups, both orchestrated by a uniform four-stage state-machine controller:
- Synthetic Data (200K episodes):
  - Environments: 2.8K tabletop scenes (3D-FRONT), pruned for physical plausibility.
  - Objects: 206 everyday items (Objaverse), with randomized textures.
- Dynamics: Object speeds sampled from [0, 0.75] m/s, friction coefficients from [0.5, 1.5], multiple objects per scene.
- Sensors: Three RGB cameras (two external, one wrist-mounted), 25 FPS, 480×360, Azure Kinect intrinsics, randomized illumination parameters.
- Ground Truth: 6D pose and velocity at 25 Hz via Isaac Sim.
- Procedure: For each episode—(1) approach/predict object, (2) grasp and lift, (3) place, (4) reset.
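The four-stage episode procedure can be sketched as a simple state machine. This is a minimal illustration of the controller logic described above, not the authors' implementation; stage names and the completion signal are assumptions:

```python
# Minimal sketch of the four-stage state-machine controller that scripts
# each episode: (1) approach/predict, (2) grasp and lift, (3) place, (4) reset.
from enum import Enum, auto

class Stage(Enum):
    APPROACH = auto()  # predict object motion, move toward an intercept point
    GRASP = auto()     # close gripper and lift the object
    PLACE = auto()     # move to the goal location and release
    RESET = auto()     # return home; the next episode begins at APPROACH

TRANSITIONS = {
    Stage.APPROACH: Stage.GRASP,
    Stage.GRASP: Stage.PLACE,
    Stage.PLACE: Stage.RESET,
    Stage.RESET: Stage.APPROACH,
}

def step(stage: Stage, stage_done: bool) -> Stage:
    """Advance to the next stage once the current one reports completion."""
    return TRANSITIONS[stage] if stage_done else stage
```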
- Real-World Data (2 K episodes):
- Objects: 25 household items, both target and distractor.
- Sensors: Two Azure Kinect DK and one wrist-mounted RealSense D435i, all calibrated.
- State Estimation: EfficientTAM for segmentation, geometric triangulation for 3D centroids, temporal fit for velocities (~25 Hz state stream).
- Protocol: Human only initiates object motion; remainder is fully automated, matching simulated controller logic.
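The "temporal fit" velocity estimate in the real-world pipeline can be read as a least-squares line fit over a short window of triangulated 3D centroids, with the slope taken as velocity. The sketch below makes that reading explicit (window size and function names are assumptions):

```python
# Sketch of a temporal-fit velocity estimate: fit a line per axis to
# recent 3D centroid observations and take the slope as velocity (m/s).
import numpy as np

def fit_velocity(times: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """times: (N,) seconds; centroids: (N, 3) meters; returns (3,) m/s."""
    t = times - times.mean()          # center timestamps for conditioning
    dx = centroids - centroids.mean(axis=0)
    # Closed-form least-squares slope per axis: v = sum(t * dx) / sum(t^2)
    return (t[:, None] * dx).sum(axis=0) / (t ** 2).sum()
```

At the benchmark's ~25 Hz state stream, a window of a few frames suffices for smooth motion; longer windows trade responsiveness for noise rejection.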
This dual-source pipeline provides both diversity and fidelity, with real-world streams mirroring simulation for direct transfer and evaluation (Xie et al., 29 Jan 2026).
3. Dataset Structure and Annotations
Each DOM episode is annotated with temporally aligned, multi-modal streams:
- Multi-view RGB: Synchronized images from multiple perspectives at each timestep t. Images from adjacent timesteps (e.g., t and t−1) are concatenated for input.
- Natural Language Prompt: Instructional text (e.g., "Place the rolling soda can into the wooden box").
- Proprioceptive State: 32-dimensional vector encoding the robot end-effector pose.
- Action Stream: For each timestep t, a chunk A_t = (a_t, …, a_{t+H−1}), with each a_i a 32-dimensional action (6D pose delta + gripper). Each action is explicitly labeled with its execution timestep.
- Object State: Time-stamped 6D pose and velocity for all objects at 25 Hz.
Action/event alignment accounts for inference latency Δ: an action chunk produced at time t begins executing at t + Δ. DOM provides granular ground-truth trajectories for precise analysis of anticipation and realignment behaviors (Xie et al., 29 Jan 2026).
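The latency-aware labeling amounts to a small piece of timestamp arithmetic: a chunk inferred from the observation at step t only starts executing Δ steps later. A minimal sketch, assuming an integer-step latency delta and chunk horizon (both illustrative):

```python
# Sketch of latency-aware chunk alignment: an action chunk inferred from
# the observation at t_obs starts executing delta steps later, so each
# action carries an explicit execution timestep.
def execution_timesteps(t_obs: int, horizon: int, delta: int) -> list[int]:
    """Timesteps at which each action of a chunk inferred at t_obs runs."""
    start = t_obs + delta  # inference latency shifts the whole chunk
    return [start + k for k in range(horizon)]
```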
4. Evaluation Protocol and Metrics
DOM specifies quantitative metrics for assessing model performance under kinetic and computational constraints:
- Success Rate (SR): Fraction of episodes in which the task (e.g., catch and place) is completed within the time budget.
- Reaction Latency (RL): Time from object “event” (e.g., bounce/change) to first policy-corrective action.
- Trajectory/Temporal Anticipation Error (TAE): Mean deviation between predicted and ground-truth object states, e.g. TAE = (1/T) Σ_t ‖x̂_t − x_t‖, where x̂_t is the model's prediction and x_t the ground truth at timestep t.
- Path Length: Total end-effector path (meters).
- Completion Time: Elapsed time from object motion onset to episode termination.
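The trajectory-level metrics follow directly from their descriptions. The sketch below gives a plain reading of path length and anticipation error for per-step NumPy arrays (array shapes and function names are assumptions):

```python
# Plain-reading implementations of two DOM-style trajectory metrics.
import numpy as np

def path_length(ee_positions: np.ndarray) -> float:
    """Total end-effector path in meters; ee_positions: (T, 3)."""
    return float(np.linalg.norm(np.diff(ee_positions, axis=0), axis=1).sum())

def anticipation_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean error between predicted and ground-truth states."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```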
Benchmark splits include: training (synthetic subset), validation (held-out synthetic scenes/objects), synthetic test (1.8K episodes from unseen scenes), and real-world test (20 episodes per configuration, across both Franka and PiPER arms). Generalization is stress-tested on unseen objects (VG), novel dynamic regimes (MG), and perturbations (DR) (Xie et al., 29 Jan 2026).
5. Core Challenges and Research Use-Cases
DOM exposes foundational challenges in dynamic manipulation:
- Fast Object Motion: Object speeds up to 0.75 m/s, demanding a perception–action cycle under 50 ms.
- Temporal Misalignment: Nonzero inference delay (Δ) induces perception–execution synchronization errors, addressed via anticipation (e.g., Latent-aware Action Streaming).
- Continuous Control: Policies must support overlapping inference and execution (Continuous Inference), as opposed to conventional chunked, open-loop commands.
- Long-horizon Coherence: Robustness over sequential, drifting catch–place cycles.
- Multimodal Reasoning: Integration of language, visual flow, and proprioceptive feedback in real time.
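The need for overlapping inference and execution reduces to a timing constraint: to avoid control gaps, inference for the next chunk (taking Δ steps) must start no later than Δ steps before the current chunk runs out. A sketch of that arithmetic, with all names illustrative rather than taken from the DOM codebase:

```python
# Timing constraint behind Continuous Inference: the next chunk must
# arrive (trigger + delta) before the current chunk's actions run out.
def next_inference_time(chunk_start: int, horizon: int, delta: int) -> int:
    """Latest step to trigger inference so the next chunk arrives in time."""
    chunk_end = chunk_start + horizon  # first step with no action available
    return chunk_end - delta

def has_gap(chunk_start: int, horizon: int, delta: int, trigger: int) -> bool:
    """True if triggering inference at `trigger` leaves steps uncovered."""
    return trigger + delta > chunk_start + horizon
```

Chunked open-loop execution corresponds to always triggering after the chunk ends, which by this arithmetic guarantees a gap of Δ uncontrolled steps per chunk.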
The benchmark’s design allows researchers to:
- Evaluate VLA model response speed, perception, and adaptation.
- Investigate trade-offs between model size, inference frequency, and low-latency stability.
- Develop and analyze latency-aware algorithms with evaluation of temporal anticipation and realignment heuristics.
- Statistically assess cross-domain generalization performance (Xie et al., 29 Jan 2026).
6. Comparative Context and Extensions
The DOM benchmark fundamentally differs from prior dynamic or deformable manipulation datasets by combining:
- Large-scale (200K synthetic + 2K real episodes) annotated 6D dynamic data,
- Automated, protocol-matched synthetic and real-world pipelines,
- Structured, temporally granular action labeling for real-time closed-loop policy analysis.
Contrasted with deformable manipulation benchmarks such as DaXBench—which targets fluids, ropes, and cloth with differentiable simulators (Chen et al., 2022)—DOM focuses on dynamic rigid object scenarios, requiring precise perception, anticipation, and low-latency control, rather than model-based gradient optimization or high-DoF state estimation as in reduced-order rope/cloth manipulation (Lan et al., 23 May 2025). A plausible implication is that DOM enables rigorous evaluation of perception- and latency-limited policies for dynamic manipulation tasks not readily addressed by differentiable-physics-focused or quasi-static benchmarks.
By delineating modality, protocol, and control stream synchrony, DOM provides a reproducible, extensible foundation for the study of real-time dynamic manipulation under Vision-Language-Action paradigms (Xie et al., 29 Jan 2026).