AutoDex: Automated Dexterous Grasping Data

Updated 4 July 2026
  • AutoDex is an automated system that collects dexterous grasping data through a closed-loop process using dense multi-camera perception and collision-monitored robot motions.
  • The system integrates a modular candidate generator and active object reset to efficiently capture both successful and failed grasp trials, mitigating teleoperation biases.
  • Experimental results show a 4.8x throughput improvement over teleoperation and a 76% retrieval success rate versus 34% for simulation-only validation, emphasizing the importance of real-world validation.

AutoDex is an automated real-world system for dexterous grasping data collection that closes the collection loop without human intervention. For each candidate grasp from a replaceable generator, it localizes the object under severe hand-object occlusion with dense 20-camera perception, executes collision-monitored robot motions, labels lift-and-hold success or failure, and actively resets the object between trials to expose additional candidates across stable poses. The system is designed to produce a reusable database of physically labeled grasp trials, motivated by the observation that teleoperation is slow and operator-biased, while simulation-based generation is cheap and scalable but cannot certify contact validity. The reported collection comprises 3,593 grasp trials across Allegro and Inspire hands on 100 diverse objects, with synchronized multi-view observations and robot-state logs. For a matched 500-trajectory collection, AutoDex requires 10.3 h versus 49.4 h for teleoperation, and grasps retrieved from the AutoDex-validated database succeed 76% of the time versus 34% for simulation-only validation (Choi et al., 22 Jun 2026).

1. Conceptual scope and motivation

AutoDex addresses a data bottleneck in dexterous grasping: robust learning requires real-world data that records the physical outcomes of grasp attempts, but such data is difficult to obtain at scale. The central argument is that multi-finger grasps depend on subtle contact physics, including friction, compliance, finger-pad deformation, slip, and force distribution, so a grasp that is collision-free and kinematically feasible in a planner or simulator may still fail on hardware (Choi et al., 22 Jun 2026).

The system is explicitly positioned between two existing data-collection paradigms. Teleoperation provides ground-truth real-world outcomes, but it is described as slow, fatigue-limited, and biased by the operator’s preferences in approach direction and hand configuration. Simulation-only validation is cheaper and more scalable, but it cannot reliably certify real contact validity. The paper therefore frames AutoDex as a system-level bridge: generate many candidates algorithmically, then verify them on real hardware in a fully automated loop (Choi et al., 22 Jun 2026).

The resulting dataset is not merely a collection of successful grasps. AutoDex is designed to store both successes and failures, labeled from real execution, across diverse objects, poses, and scene constraints. This suggests that the system is intended not only for direct execution through retrieval, but also for downstream supervision in settings where negative examples and contact-informed failures matter.

2. End-to-end architecture and collection loop

AutoDex is organized as an end-to-end real-world loop with four major components: dense multi-camera perception for object pose estimation and tracking, candidate grasp execution with collision-monitored robot motions, outcome labeling via lift-and-hold success or failure, and active object reset to continue collection across stable poses (Choi et al., 22 Jun 2026).

At runtime, the system starts from an object model and a scene specification. A modular candidate generator produces candidate grasps, each represented as a wrist pose in the object frame plus a hand configuration. AutoDex then estimates the object’s current stable pose, transforms the object-relative candidates into the robot/world frame, filters them by feasibility, executes selected candidates on the physical robot hand, labels the outcome, stores the trial, and marks the candidate as attempted. If no untried feasible candidate remains for the current stable pose, the system resets the object to another stable pose and continues (Choi et al., 22 Jun 2026).
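The loop described above can be sketched in a few lines of Python. This is an illustrative outline only, not the authors' implementation: `Candidate`, `TrialDB`, `collect`, and the two callbacks are hypothetical names, a candidate is reduced to an integer identifier, and a physical reset is represented simply by switching the current stable-pose index.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for a wrist pose plus hand configuration."""
    cid: int
    stable_pose: int  # index of the stable pose this candidate is indexed under


@dataclass
class TrialDB:
    """Keeps every executed trial (success or failure) and the attempted set."""
    trials: List[Tuple[int, int, bool]] = field(default_factory=list)
    attempted: Set[int] = field(default_factory=set)

    def record(self, cand: Candidate, success: bool) -> None:
        self.trials.append((cand.stable_pose, cand.cid, success))
        self.attempted.add(cand.cid)


def collect(
    candidates: Dict[int, List[Candidate]],
    is_feasible: Callable[[Candidate], bool],
    execute_and_label: Callable[[Candidate], bool],
    start_pose: int,
) -> TrialDB:
    """Run trials until no stable pose has an untried feasible candidate."""
    db = TrialDB()
    pose = start_pose
    while True:
        untried = [c for c in candidates.get(pose, [])
                   if c.cid not in db.attempted and is_feasible(c)]
        if untried:
            cand = untried[0]
            db.record(cand, execute_and_label(cand))
            continue
        # Nothing left at this pose: pick another pose that still has work.
        remaining = [p for p, cs in candidates.items()
                     if any(c.cid not in db.attempted and is_feasible(c)
                            for c in cs)]
        if not remaining:
            return db
        pose = remaining[0]  # stands in for a physical reset of the object
```

Failures are recorded alongside successes, and infeasible candidates are never marked as attempted, which mirrors the description that only executed trials enter the database.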

The database is explicitly designed to be reusable. Once a trial is completed, AutoDex stores the physical execution record and marks the candidate as attempted for that stable pose. Downstream methods can then retrieve only successful grasps and screen them for feasibility in new scenes. This retrieval-oriented design is important: the paper does not present AutoDex primarily as a policy-learning architecture, but as an infrastructure for generating a physically validated repository that can later be queried.

3. Perception, synchronization, and execution safety

The perception subsystem uses 20 synchronized RGB cameras arranged around the scene. Each camera records at 30 FPS and 2048 × 1536 resolution, synchronized by sub-millisecond hardware triggers. Camera intrinsics are calibrated with a ChArUco board, and per-session extrinsics are recovered using COLMAP plus hand-eye calibration. The dense layout is motivated by the fact that the robot hand severely occludes the object during grasping, so redundant views improve pose estimation and tracking reliability (Choi et al., 22 Jun 2026).

For initial object pose estimation, the supplementary pipeline uses SAM3 for object masks, runs FoundPose on views with sufficient mask area, renders each pose hypothesis into all views, selects the one with the best cross-view silhouette–mask agreement, discards outliers that disagree with this consensus, and refines pose by silhouette optimization. During execution, multi-view tracking is run on the recorded stream using GoTrack so that success or failure can still be labeled when the hand occludes the object (Choi et al., 22 Jun 2026).
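The hypothesis-selection step can be illustrated with a small sketch. The source only says "cross-view silhouette–mask agreement"; using mean intersection-over-union as the agreement score is an assumption made here, as are the names `iou`, `select_hypothesis`, and the `render` callback. Masks are represented as sets of pixel coordinates to keep the example dependency-free.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Sequence, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets (1.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def select_hypothesis(
    hypotheses: Sequence[object],
    render: Callable[[object, Hashable], Mask],
    observed_masks: Dict[Hashable, Mask],
) -> Tuple[int, float]:
    """Return (index, score) of the hypothesis whose rendered silhouettes
    agree best, on average, with the observed masks across all views."""
    if not hypotheses or not observed_masks:
        raise ValueError("need at least one hypothesis and one view")
    best_idx, best_score = -1, -1.0
    for i, hyp in enumerate(hypotheses):
        scores = [iou(render(hyp, view), mask)
                  for view, mask in observed_masks.items()]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score
```

Scoring each hypothesis against every view, rather than only the view that produced it, is what lets a single-view estimate corrupted by occlusion lose to one that is consistent across cameras.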

The paper reports a strong dependence on camera density. Mean ADD-S between the 20-camera reference pose and a subset estimate drops from 14.3 mm at 2 cameras to 0.5 mm at 16 cameras, with major gains between 2 and 4 cameras and fewer catastrophic failures once 8 cameras are used. This is presented as evidence that dense multi-view sensing materially improves object-pose reliability during collection (Choi et al., 22 Jun 2026).
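For readers unfamiliar with the metric, ADD-S is commonly defined as the mean, over model points under one pose, of the distance to the closest model point under the other pose. The brute-force sketch below follows that standard definition; the function name `add_s` and the choice to pass already-transformed point lists are illustrative assumptions, and nothing here reflects how the paper computes the metric internally.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def add_s(points_ref: Sequence[Point], points_est: Sequence[Point]) -> float:
    """Mean closest-point distance from reference-pose points to
    estimated-pose points (same units as the inputs, e.g. mm)."""
    if not points_ref or not points_est:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for p in points_ref:
        # Closest-point matching makes the metric tolerant of object symmetry.
        total += min(math.dist(p, q) for q in points_est)
    return total / len(points_ref)
```

Because each point is matched to its nearest neighbour rather than to its own counterpart, a pose that differs from the reference only by an object symmetry scores lower than it would under the plain ADD metric.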

Execution is performed with a 6-DoF xArm equipped with either a 16-DoF Allegro Hand or a 6-DoF Inspire Hand. Robot state is logged at 120 Hz, including joint positions, velocities, torques, motor currents, and commanded targets from both arm and hand. A calibrated timestamp offset aligns robot-state streams with the camera exposures (Choi et al., 22 Jun 2026).
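One simple way to pair a 120 Hz robot-state stream with 30 FPS camera exposures is a nearest-timestamp lookup after applying the calibrated offset. The sketch below is an assumption about how such alignment might be done, not a description of the AutoDex code; in particular, the sign convention (offset is added to robot timestamps to express them in the camera clock) and the name `nearest_state_index` are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def nearest_state_index(
    state_times: Sequence[float],
    frame_time: float,
    offset: float = 0.0,
) -> int:
    """Index of the robot-state sample closest to a camera exposure.

    state_times must be sorted ascending (seconds, robot clock).
    Assumed convention: camera_time = robot_time + offset.
    """
    if not state_times:
        raise ValueError("state_times must be non-empty")
    target = frame_time - offset  # exposure time expressed in the robot clock
    i = bisect_left(state_times, target)
    if i == 0:
        return 0
    if i == len(state_times):
        return len(state_times) - 1
    before, after = state_times[i - 1], state_times[i]
    # Prefer the earlier sample on ties.
    return i if (after - target) < (target - before) else i - 1
```

At 120 Hz against 30 FPS there are four state samples per frame, so nearest-sample matching leaves at most about 4 ms of residual misalignment once the offset is correct.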

Because the collection loop is unattended, AutoDex incorporates learned collision monitoring rather than relying on the built-in xArm collision detector, which was found unreliable under persistent external loads from grasped objects and altered inertia from the dexterous hand. The residual-torque monitor is summarized as

$$\hat{\tau}_{\theta} = \mathrm{MLP}_{\theta}(q, \dot q), \qquad \tau_{\mathrm{res}} = \hat{\tau}_{\theta} - \tau_{\mathrm{motor}}.$$

In the detailed implementation, the monitor runs at 100 Hz, costs only about 30 μs on CPU, and is active only during contact-critical motions such as downward moves near the table. A contact is declared when the baseline-subtracted residual on proximal joints $J_1$ and $J_2$ exceeds 30 N·m for 50 samples, about 0.5 s; if triggered, motion aborts and the robot recovers by planning back to home while monotonically increasing end-effector height (Choi et al., 22 Jun 2026).
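The triggering rule can be sketched as a small streak counter. Several details are assumptions, since the summary does not specify them: the residual is compared by absolute value, exceeding the threshold on either joint counts, and the 50 samples must be consecutive. The learned torque model is not reproduced; predicted torques are simply passed in, and `ContactMonitor` is a hypothetical name.

```python
from typing import Sequence


class ContactMonitor:
    """Declares contact after `window` consecutive over-threshold samples."""

    def __init__(self, threshold_nm: float = 30.0, window: int = 50,
                 baseline: Sequence[float] = (0.0, 0.0)) -> None:
        self.threshold_nm = threshold_nm
        self.window = window
        self.baseline = tuple(baseline)  # per-joint residual baseline (N·m)
        self._run = 0  # length of the current over-threshold streak

    def update(self, predicted: Sequence[float],
               measured: Sequence[float]) -> bool:
        """Feed one sample of predicted and motor torques for J1 and J2.

        Returns True once the streak reaches `window` samples.
        """
        exceeded = False
        for tau_hat, tau_motor, base in zip(predicted, measured, self.baseline):
            residual = (tau_hat - tau_motor) - base  # tau_res minus baseline
            if abs(residual) > self.threshold_nm:  # assumption: magnitude test
                exceeded = True
        self._run = self._run + 1 if exceeded else 0
        return self._run >= self.window
```

With the default window of 50 samples at 100 Hz, a contact has to persist for 0.5 s before the monitor fires, which filters out brief torque spikes at the cost of a delayed reaction.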

4. Candidate generation, stable poses, and active reset

AutoDex uses BODex as the candidate generator. It is instantiated with three scene primitives (wall, shelf, and box), which model common tabletop constraints where some approach directions are blocked. For each object, the system defines a set of stable tabletop poses $\mathcal{P}$, and for each stable pose $P_i \in \mathcal{P}$, it aggregates candidates into a pose-indexed set $\mathcal{G}_{\mathrm{cand}}(P_i)$. At runtime, only candidates for the current stable pose are considered; once those are exhausted, the system resets the object to another pose with remaining candidates (Choi et al., 22 Jun 2026).

The supplementary description adds several details. Stable poses are canonicalized if the object has axial symmetry. Symmetric pose duplicates are reduced by checking whether the transformed symmetry axis is near vertical; if so, the pose is kept as one canonical pose. Otherwise, the system instantiates five yaw variants, $\{0, 72, 144, 216, 288\}^\circ$, described as an efficiency measure to increase scene coverage. Scene tightness is parameterized relative to the object: wall gap 2, 4, 6, 8 cm; shelf with the same gap to side walls, back wall, and ceiling; and box with only the top 5 or 8 cm above the rim exposed. BODex is run with escalating seed counts, $N \in \{200, 1000\}$, until at least five generator-valid candidates are found (Choi et al., 22 Jun 2026).
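The yaw enumeration and the seed-count escalation are simple enough to sketch directly. The helper names and the `run_generator` callback are hypothetical and do not correspond to any BODex API. Whether candidates from the smaller run are kept when escalating is not stated in the summary; this sketch assumes the larger run replaces the smaller one.

```python
from typing import Callable, List, Sequence


def yaw_variants(count: int = 5) -> List[float]:
    """Evenly spaced yaw angles in degrees: 0, 72, 144, 216, 288 for count=5."""
    return [k * 360.0 / count for k in range(count)]


def yaws_for_stable_pose(symmetry_axis_near_vertical: bool) -> List[float]:
    """One canonical pose if yaw is redundant, otherwise five yaw variants."""
    return [0.0] if symmetry_axis_near_vertical else yaw_variants()


def generate_with_escalation(
    run_generator: Callable[[int], Sequence[object]],
    seed_counts: Sequence[int] = (200, 1000),
    min_valid: int = 5,
) -> List[object]:
    """Call the generator with increasing seed counts until enough
    generator-valid candidates come back, or the counts are exhausted."""
    valid: List[object] = []
    for n in seed_counts:
        valid = list(run_generator(n))  # assumption: later run replaces earlier
        if len(valid) >= min_valid:
            break
    return valid
```

Escalating only when the cheaper run falls short keeps generation time low for easy scenes while still giving tightly constrained scenes a larger search budget.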

Before spending real robot time, candidates pass a simulation sanity pre-filter in MuJoCo. The hand executes pre-grasp, grasp, and squeeze in sequence, after which a weak external force of 0.1 N is applied along gravity for 50 simulation steps. A candidate is rejected if the object translates more than 5 cm. The paper stresses that this is not intended to validate physical success; it only removes obviously implausible candidates (Choi et al., 22 Jun 2026).
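The rejection rule itself reduces to a displacement check on the object position before and after the force is applied. The sketch below shows only that criterion; the simulation rollout is omitted, and `passes_prefilter` with its arguments is a hypothetical helper rather than part of the described pipeline.

```python
import math
from typing import Sequence


def passes_prefilter(
    start_xyz: Sequence[float],
    end_xyz: Sequence[float],
    max_translation_m: float = 0.05,
) -> bool:
    """Keep a candidate if the object moved at most 5 cm during the test.

    start_xyz / end_xyz: object position (metres) before and after the
    external force has been applied in simulation.
    """
    return math.dist(start_xyz, end_xyz) <= max_translation_m
```

The generous 5 cm tolerance is consistent with the stated purpose: the check is meant to discard grasps that clearly do not hold the object, not to predict real-world success.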

Approach planning uses a separately optimized open hand pose that maximizes clearance. The objective, whose formula is not recoverable in this copy, is defined over the set of collision spheres on each finger, the signed clearance to the nearest obstacle, and the candidate wrist pose. The robot approaches, closes in two stages, and then over-closes to a squeeze configuration that is clipped to joint limits. The squeeze formula and its hand-specific parameter values for Allegro and Inspire are likewise not recoverable here (Choi et al., 22 Jun 2026).

Active reset is a central subsystem rather than an auxiliary convenience. Since grasp candidates are indexed by stable pose, collection would quickly stall without a method for moving the object among its stable poses. Reset is therefore solved as a grasp planning problem: AutoDex combines the pickup constraints at the current stable pose and the placement constraints at the target pose into one collision scene. For flat objects, it uses a height-relaxed placement, allowing release from a height above the target pose, and adds virtual support pillars beneath the release zone during reset-grasp generation so that fingers do not intrude into the object’s descent region. Reset grasps that succeed once are cached for the corresponding stable-pose pair and reused later (Choi et al., 22 Jun 2026).

5. Outcome labeling, dataset structure, and experimental results

Outcome labeling is automatic. After execution, a grasp is labeled successful if the object remains at least 5 cm above its initial height throughout a 3 s hold phase; otherwise it is labeled a failure. The label is applied post hoc from the recorded trajectory so that the control loop remains real-time. Stored trial records include the executed plan, multi-view recordings, tracked object poses, camera calibration, timing metadata, scene information, and the final success/failure label (Choi et al., 22 Jun 2026).
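The labeling rule can be written as a check over the tracked object height during the hold window. This is a minimal sketch under stated assumptions: heights and timestamps come from the recorded tracking stream, the hold window start is known, and a trial with no tracked samples in the window is conservatively labeled a failure. The function and parameter names are hypothetical.

```python
from typing import Sequence


def label_lift_and_hold(
    times_s: Sequence[float],
    heights_m: Sequence[float],
    initial_height_m: float,
    hold_start_s: float,
    min_lift_m: float = 0.05,
    hold_duration_s: float = 3.0,
) -> bool:
    """True if the object stays at least 5 cm above its initial height
    for every tracked sample inside the 3 s hold window."""
    hold_end_s = hold_start_s + hold_duration_s
    window = [h for t, h in zip(times_s, heights_m)
              if hold_start_s <= t <= hold_end_s]
    if not window:
        return False  # assumption: no tracking data means no success label
    return all(h >= initial_height_m + min_lift_m for h in window)
```

Requiring the condition at every sample, rather than only at the end of the hold, means a grasp that slips and is momentarily re-caught is still labeled a failure.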

The workcell consists of a 6-DoF xArm, a swappable multi-finger hand (Allegro or Inspire), an LED-lit capture cell, 20 synchronized RGB cameras, and robot-state logging at 120 Hz. Calibration uses intrinsics via a sweeping ChArUco board, extrinsics via COLMAP bundle adjustment with mean reprojection error of 0.2–0.5 px, and hand-eye calibration with a ChArUco target mounted on the end effector. End-to-end alignment is checked by reprojecting robot and object meshes into all camera views (Choi et al., 22 Jun 2026).

The dataset includes 100 household objects, more than 80% sourced from IKEA, with the remainder from common household items. The objects span seven dominant material categories—plastic, metal, wood, silicone, paper, tape, and ceramic—and a wide weight range from under 50 g to over 500 g. The abstract reports 3,593 grasp trials across Allegro and Inspire hands on these 100 objects, with synchronized multi-view observations and robot-state logs (Choi et al., 22 Jun 2026).

The main quantitative results concern throughput and validation quality.

| Comparison | AutoDex | Baseline |
| --- | --- | --- |
| Matched 500-trajectory collection time | 10.3 h | 49.4 h (teleoperation) |
| Retrieval success, validated vs. model-screened database | 76% | 34% |
| Mean ADD-S with camera subsets | 0.5 mm (16 cameras) | 14.3 mm (2 cameras) |

In the matched 500-trajectory study, the 10.3 h versus 49.4 h comparison corresponds to a 4.8x throughput improvement. The paper attributes the gain not to faster individual robot executions, which are similar in duration between methods, but to the elimination of human idle time, fatigue, and supervision overhead (Choi et al., 22 Jun 2026).

For retrieval-based evaluation on 20 objects and 515 total trials, the AutoDex-validated database achieves 76% real-world success, while a model-screened database filtered only by simulation sanity checks achieves 34%. The improvement is reported across material categories, scene types, and weight bins, and is especially strong in cluttered scenes. The stated conclusion is that kinematic and collision feasibility alone are insufficient; physical validation is necessary to build a reliable dexterous grasp database (Choi et al., 22 Jun 2026).

The runtime analysis over 500 autonomous trials reports a mean grasp-execution loop duration of 48.2 s, partitioned into robot execution at 51.5% or 24.8 s, retract at 24.6% or 11.9 s, perception and planning at 16.1% or 7.8 s, and motion planning at 7.8% or 3.8 s. After the collision monitor was enabled, it halted the arm during downward approach in 2.7% of trials and during placement in 8.9% of trials; visible contacts accounted for 0.3% and 3.1% of all trials, respectively, with the remainder conservative aborts without visible contact (Choi et al., 22 Jun 2026).
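The reported figures can be cross-checked with a few lines of arithmetic. All numbers below are taken from the text; only the variable names are introduced here. The stage durations sum to 48.3 s rather than 48.2 s, and the recomputed shares differ from the reported ones by less than 0.1 percentage point, both of which are consistent with rounding in the source.

```python
# Throughput: matched 500-trajectory collection times (hours).
autodex_h, teleop_h = 10.3, 49.4
speedup = teleop_h / autodex_h  # about 4.8x

# Runtime split of the mean grasp-execution loop (seconds and reported %).
mean_loop_s = 48.2
stages_s = {
    "robot execution": 24.8,
    "retract": 11.9,
    "perception and planning": 7.8,
    "motion planning": 3.8,
}
reported_pct = {
    "robot execution": 51.5,
    "retract": 24.6,
    "perception and planning": 16.1,
    "motion planning": 7.8,
}
computed_pct = {k: 100.0 * v / mean_loop_s for k, v in stages_s.items()}
max_gap = max(abs(computed_pct[k] - reported_pct[k]) for k in stages_s)

# Collision-monitor halts: the share without visible contact is the remainder.
halts_pct = {"downward approach": 2.7, "placement": 8.9}
visible_pct = {"downward approach": 0.3, "placement": 3.1}
conservative_pct = {k: round(halts_pct[k] - visible_pct[k], 1)
                    for k in halts_pct}

print(f"speedup: {speedup:.2f}x, largest share gap: {max_gap:.2f} pp")
print(conservative_pct)
```

The remainder calculation puts conservative aborts at 2.4% of trials during downward approach and 5.8% during placement, so most monitor-triggered halts occurred without a visible contact.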

6. Position within dexterous manipulation research, limitations, and downstream use

AutoDex is a data-collection and physical-validation system rather than a skill-learning or policy-selection framework. In the broader dexterous manipulation literature represented here, AdaDexGrasp learns a library of grasping skills from a single human demonstration per skill and selects the most suitable one using a vision-language model, while EaDex uses a single RGB-D camera, MANO-based hand modeling, and dynamic demonstration annealing for cross-embodiment dexterous manipulation (Shi et al., 26 Mar 2025, Zhao et al., 2 Jun 2026). AutoDex differs in emphasis: it automates perception, execution, labeling, and reset in the real world so as to produce a reusable database of physically labeled trials (Choi et al., 22 Jun 2026).

The paper is explicit about current scope. AutoDex currently targets stable grasps for a fixed arm-hand setup; it does not target bimanual coordination, mobile manipulation, finger-rolling regrasps, or functional grasps such as tool use or handover. It also depends on the capability and biases of the upstream candidate generator, so grasps requiring dynamic finger motion during contact may be missed. Dense multi-view perception improves reliability, but the highest-throughput version still requires a relatively complex multi-camera workcell (Choi et al., 22 Jun 2026).

The downstream use case is retrieval-based in-the-wild execution. At deployment, the system observes the workspace with four RGB cameras, estimates object and obstacle geometry, retrieves successful grasps for the target object, transforms them into the scene frame, and filters them by collision-free arm planning. The first feasible grasp is then executed. A plausible implication is that AutoDex can serve as infrastructure for methods that prefer physically validated exemplars over purely simulated priors, especially when contact fidelity rather than policy generality is the primary requirement (Choi et al., 22 Jun 2026).
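The retrieval procedure can be sketched as a filter followed by a frame transform and a feasibility test. This is an illustrative outline under assumptions: poses are 4×4 homogeneous matrices stored as nested lists, records are plain dictionaries with hypothetical keys, and `is_plannable` stands in for collision-free arm planning, which is not implemented here.

```python
from typing import Callable, Dict, List, Optional, Sequence

Matrix = List[List[float]]


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 4x4 homogeneous transforms (a applied after b)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def retrieve_first_feasible(
    records: Sequence[Dict[str, object]],
    object_id: str,
    object_in_scene: Matrix,
    is_plannable: Callable[[Matrix], bool],
) -> Optional[Matrix]:
    """Return the scene-frame wrist pose of the first stored grasp that
    succeeded on hardware and passes the planning check, else None."""
    for rec in records:
        if rec["object_id"] != object_id or not rec["success"]:
            continue  # only physically validated successes are retrieved
        wrist_in_scene = compose(object_in_scene, rec["wrist_in_object"])
        if is_plannable(wrist_in_scene):
            return wrist_in_scene
    return None
```

Storing wrist poses in the object frame is what makes the database reusable: a single estimated object pose in the new scene is enough to place every stored grasp there.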
