AutoDex: An Automated Real-World System for Dexterous Grasping Data Collection

Published 22 Jun 2026 in cs.RO and cs.LG | (2606.23689v1)

Abstract: Learning robust dexterous grasping requires real-world data that records the physical outcomes of grasp attempts. Such data is hard to obtain at scale: teleoperation yields valid physical outcomes but is slow and operator-biased, while simulation-based generation is cheap and scalable but cannot certify contact validity. A natural solution is to generate candidate grasps and verify them on real hardware, but this scales only if the entire collection loop (perception, execution, labeling, and reset) runs without human intervention. We present AutoDex, an automated real-world data-collection system that closes this loop: for each candidate from a replaceable generator, it localizes the object under severe hand-object occlusion with dense 20-camera perception, executes collision-monitored robot motions, labels lift-and-hold success or failure, and actively resets the object between trials to expose additional candidates across stable poses. The result is a reusable database of physically labeled grasp trials that downstream systems can query by retrieval and feasibility filtering. Using AutoDex, we collect 3,593 grasp trials across Allegro and Inspire hands on 100 diverse objects, with synchronized multi-view observations and robot-state logs. For a matched 500-trajectory collection, AutoDex requires 10.3 h versus 49.4 h for teleoperation, yielding a 4.8x throughput improvement, and grasps retrieved from the AutoDex-validated database succeed 76% versus 34% for simulation-only validation. Code and data will be publicly released.

Authors (7)

Summary

The paper demonstrates an automated pipeline that achieves a 4.8× throughput gain over human teleoperation for dexterous grasp data collection.
The system integrates multi-view pose estimation, collision-safe grasp execution, and autonomous object reset for reliable, physically validated grasp trials.
Real-world experiments show a grasp success rate of 76% with physical validation versus 34% for simulation-only, highlighting the sim-to-real gap.

Problem Motivation and Context

The reliable synthesis of dexterous robotic grasps on diverse objects remains bottlenecked by the dearth of scalable, real-world datasets annotated with ground-truth physical outcomes. Multi-fingered dexterous grasping, in contrast to parallel-jaw approaches, depends critically on factors that are not robustly modeled or simulated: the intricacies of contact physics, slip, compliance, and unmodeled actuator or hand-object frictional effects. While simulated grasp generation and geometric feasibility checks enable large-scale candidate proposal, only execution on physical hardware reliably determines grasp validity. However, current real-world data acquisition methods, such as human teleoperation, are fundamentally limited in throughput and generalization, and they introduce operator bias. Thus, there is a critical need for an autonomous, end-to-end real-world data collection system that can cycle through perception, grasp execution, outcome labeling, and reset without human intervention.

System Architecture and Methodology

AutoDex is introduced as a fully automated pipeline designed to collect large-scale, physically annotated grasp data across diverse household objects and hardware setups. The system’s core comprises a 6-DoF xArm manipulator equipped with hot-swappable multi-fingered hands and a 20-camera synchronized, LED-lit workcell, enabling dense multi-view perception critical for robust pose estimation under severe object occlusion (Figure 1).

Figure 1: The multi-camera AutoDex workcell with 6-DoF arm, swappable hand, and examples of generated candidate grasps paired with real-world executions across varied scene constraints.

At each cycle, a set of grasp candidates is generated using a modular external generator (here, BODex), which leverages object models and scene constraints to propose kinematically and geometrically feasible grasp poses. These do not guarantee stability under contact dynamics and so require real-world validation. AutoDex’s loop consists of:

Robust Multi-View Pose Estimation: 20 high-resolution, synchronized RGB cameras estimate the object’s stable 6D pose, maintaining high accuracy even under occlusion (mean ADD-S < 1 mm with ≥ 8 cameras, Figure 2).
Feasibility Screening: Candidates are filtered for IK solvability and collision-free trajectories in the current scene.
Safe Grasp Execution: The robot executes candidate grasps with trajectory segments protected by a learned, residual-torque-based collision monitor, which is more robust than factory sensors in detecting unexpected contact loads and avoids spurious aborts.
Lift-and-Hold Validation: Grasp efficacy is decided by lifting the object 5 cm and holding for 3 s, tracked post hoc by high-redundancy multi-view pose tracking.
Autonomous Object Reset: Upon exhausting the candidate set for a pose, the robot reorients the object to a new stable pose (by active placement rather than passive dropping), enabling exhaustive and efficient exploration of candidate grasps across all object configurations (Figure 2).
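
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trial` record, the `collect` function, and the three callables (`is_feasible`, `execute_and_label`, `reset_to`) are hypothetical stand-ins for feasibility screening, monitored execution with lift-and-hold labeling, and active reset. Pose estimation is assumed to happen inside those callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Trial:
    """One executed grasp attempt and its physical outcome (hypothetical record)."""
    stable_pose: str
    grasp_id: str
    success: bool


def collect(
    candidates: Dict[str, Sequence[str]],            # stable pose -> candidate grasp ids
    start_pose: str,
    is_feasible: Callable[[str, str], bool],         # stand-in for IK + collision screening
    execute_and_label: Callable[[str, str], bool],   # stand-in for execution + lift-and-hold label
    reset_to: Callable[[str], bool],                 # stand-in for active reorientation
) -> List[Trial]:
    """Try every feasible candidate for a pose, then reset to a pose with candidates left."""
    trials: List[Trial] = []
    remaining = {pose: list(ids) for pose, ids in candidates.items()}
    pose = start_pose
    while True:
        # Exhaust the candidate set for the current stable pose.
        for grasp_id in remaining.pop(pose, []):
            if not is_feasible(pose, grasp_id):
                continue
            trials.append(Trial(pose, grasp_id, execute_and_label(pose, grasp_id)))
        # Pick any pose that still has untested candidates (a simple heuristic).
        next_pose = next((p for p, ids in remaining.items() if ids), None)
        # Stop when nothing is left or the reset fails (which would need a human).
        if next_pose is None or not reset_to(next_pose):
            break
        pose = next_pose
    return trials
```

Calling `collect` with a dictionary such as `{"upright": ["g1", "g2"], "side": ["g3"]}` and simple lambdas for the three callables returns one `Trial` per executed grasp, in execution order.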

Database Construction and Retrieval

The outcome is a database containing thousands of physically executed and labeled grasp trials (success/failure), with each record comprising synchronized robot and camera data, pose trajectories, and scene descriptors. Importantly, the system supports online retrieval: at deployment, given novel scene geometry and object pose estimated from a reduced set of cameras, successful database grasps are filtered for scene feasibility (IK, collision), and the first feasible candidate is directly executed without additional training or manual intervention (Figure 3).
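
The retrieval step described above (keep successful grasps, filter for scene feasibility, execute the first feasible one) could look roughly like the following. The `GraspRecord` fields and the `ik_solvable` / `collision_free` predicates are assumptions made for illustration; the actual database schema and planners are not specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class GraspRecord:
    """Hypothetical database row: one physically executed, labeled grasp."""
    object_id: str
    grasp_id: str
    success: bool


def retrieve_first_feasible(
    records: Iterable[GraspRecord],
    object_id: str,
    ik_solvable: Callable[[GraspRecord], bool],      # assumed IK check in the new scene
    collision_free: Callable[[GraspRecord], bool],   # assumed collision check in the new scene
) -> Optional[GraspRecord]:
    """Return the first physically validated grasp that is feasible in the current scene."""
    for rec in records:
        # Only grasps that succeeded on real hardware for this object are eligible.
        if rec.object_id != object_id or not rec.success:
            continue
        if ik_solvable(rec) and collision_free(rec):
            return rec
    return None  # nothing feasible; the caller decides what to do next
```

Returning the first feasible record mirrors the "first feasible candidate" wording above; ranking among several feasible grasps would be a separate design choice.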

Figure 3: Illustration of the AutoDex pipeline, from candidate execution and physical validation to deployment-time grasp retrieval.

Experimental Analyses

Throughput and Autonomous Operation

In matched experiments (500 trajectories), AutoDex requires just 10.3 hours to collect all trials, compared to 49.4 hours for human teleoperation, a 4.8× gain in effective throughput. This gain is achieved not by reducing per-execution time, but by eliminating all human idle time and enabling unattended, overnight operation cycles (Figure 4).
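
The 4.8× figure follows directly from the two reported totals. The short calculation below reproduces it; the per-trial wall-clock averages are derived here from those totals and are not numbers stated in the paper.

```python
TRIALS = 500          # matched collection size reported above
AUTODEX_HOURS = 10.3  # reported total for AutoDex
TELEOP_HOURS = 49.4   # reported total for teleoperation


def seconds_per_trial(total_hours: float, trials: int) -> float:
    """Average wall-clock seconds per trial, including any idle and reset time."""
    return total_hours * 3600.0 / trials


speedup = TELEOP_HOURS / AUTODEX_HOURS                    # about 4.80
autodex_s = seconds_per_trial(AUTODEX_HOURS, TRIALS)      # about 74 s per trial
teleop_s = seconds_per_trial(TELEOP_HOURS, TRIALS)        # about 356 s per trial
hours_saved = TELEOP_HOURS - AUTODEX_HOURS                # about 39.1 h
```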

Figure 4: Left—AutoDex throughput compared to teleoperation; Right—Success rate of AutoDex-validated grasp database versus model-screened (simulation-only) candidates.

Effect of Physical Validation

The study demonstrates that geometric and model-based feasibility checks are insufficient on their own. When deploying the AutoDex-validated database (grasps that succeeded in physical trials), the real-world grasp success rate is 76%. In contrast, grasps from a purely model-screened (simulation-validated) database succeed only 34% of the time, even after filtering for scene feasibility at deployment. The discrepancy is even more pronounced in cluttered scene configurations. This contradicts claims that simulation-only pipelines suffice for reliable database construction in dexterous grasping.

Reset Strategies and Unattended Collection

AutoDex’s reset module enables transitions between arbitrarily rare or difficult object pose pairs, maintaining nearly 100% reset success even for transitions with zero probability under passive drop. In contrast, strategies relying on reorient-and-drop or passive settlement fail routinely and eject objects from the workspace, which requires human intervention and thereby prevents fully unattended operation (Figure 2).

Figure 2: Left—Reset strategy comparison versus passive transition probability; Right—Pose self-consistency (mean ADD-S) as function of available camera count.

Multi-Camera Perception and System Robustness

By leveraging 20 synchronized RGB cameras, pose estimation is robust to occlusion, lighting, and object geometries. Pose self-consistency improves monotonically with the number of views, saturating at 8 or more cameras. Fatal pose errors are virtually eliminated with dense perception, which is critical for high-DoF hand operation without operator oversight.
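
For readers unfamiliar with the metric, ADD-S is commonly defined as the mean distance from each model point under one pose to its closest model point under the other pose, which makes it tolerant of object symmetries. The sketch below implements that standard definition in plain Python; how the paper selects the reference pose for its self-consistency measurement is not specified in this summary, so the two poses here are simply called `a` and `b`.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Rotation = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major


def transform(points: Sequence[Point], R: Rotation, t: Sequence[float]) -> List[Point]:
    """Apply the rigid transform x -> R x + t to every point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def add_s(points: Sequence[Point], R_a: Rotation, t_a: Sequence[float],
          R_b: Rotation, t_b: Sequence[float]) -> float:
    """Mean closest-point distance between the model under pose a and under pose b."""
    pts_a = transform(points, R_a, t_a)
    pts_b = transform(points, R_b, t_b)
    # Brute-force nearest neighbor; fine for a sketch, O(n^2) for n model points.
    return sum(min(math.dist(pa, pb) for pb in pts_b) for pa in pts_a) / len(pts_a)
```

With model points in meters, a pure 1 mm translation between the two poses of a sparse point set yields an ADD-S of 0.001, while a 180° rotation of a point set that is symmetric under that rotation yields 0.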

Practical and Theoretical Implications

AutoDex sets a new standard for scalable, real-world data collection in dexterous grasping, decoupling database size and diversity from human operator hours. The demonstrated gains in success rate via physical validation (76% versus 34% for model-only) reveal the persistent sim-to-real gap for high-DoF contact interactions, and highlight that real-world execution is essential for dataset reliability. The results also establish that autonomous reset, collision-safe execution, and dense perception are requisite components for scalable, unattended data-collection systems in robotics.

Practically, the AutoDex database enables direct retrieval-based in-the-wild deployment, reduces reliance on policy training, and improves the generalization of downstream learning and planning systems. Theoretically, the findings suggest that the persistent inaccuracies of frictional and contact modeling in simulation—particularly under high-dimensional, multi-contact scenarios—are unlikely to be fully bridged by current or near-term simulation advances alone.

Limitations and Future Work

Current limitations include a focus on single-arm, single-hand setups with stable, table-based object presentation. The system does not address mobile manipulation, bimanual tasks, dynamic in-contact re-grasping, or functional manipulation (tool use, handover). The requirement for a large, instrumented multi-camera cell limits immediate practical deployment, though future work may reduce these requirements. AutoDex also inherits the coverage constraints of the underlying candidate generator, missing grasps that require dynamic or repurposed finger motions.

Further advancements could address more agile resets and dynamic contact manipulation, extend the approach to highly cluttered and unstructured environments, and scale down the perception system requirements as robust single- and small-multi-view perception matures.

Conclusion

AutoDex demonstrates the first scalable, fully autonomous real-world pipeline for dexterous grasp dataset collection. It achieves significant throughput gains over teleoperation, produces richly annotated databases of physical grasp outcomes, and enables practical retrieval-based execution with high reliability. Critically, it shows that physical validation is essential for closing the sim-to-real gap in dexterous manipulation—a claim supported by strong empirical results and systematic evaluation across pose, scene, and object categories. The AutoDex pipeline, dataset, and codebase are poised to facilitate new advances in manipulation research, data-driven RL, and robust generalization for dexterous robots.

Explain it Like I'm 14

What this paper is about (in simple terms)

This paper is about teaching robot hands (with fingers, not just simple grippers) to pick up many different real-world objects reliably. The authors built a system called AutoDex that can test tons of “grasp ideas” automatically—without a person standing there controlling the robot—and record which ones work and which ones fail. The end result is a big, reusable collection of real robot attempts with clear success/failure labels that other robots can learn from.

What questions were the researchers trying to answer?

How can we collect lots of real, trustworthy data about multi-finger robot grasps without needing a human to guide every try?
Can we combine cheap computer-generated grasp ideas with real-world testing to find which grasps actually work on physical objects?
Will automating the whole loop—seeing the object, moving the robot, deciding success/failure, and resetting the object—be faster and more reliable than having a person teleoperate (remote-control) the robot?
Does using many cameras help the robot keep track of the object even when the hand blocks the view?

How did they do it?

Think of AutoDex as a self-running “grasp test lab” for a robot hand:

The robot: A 6-joint robot arm with a multi-finger hand (they tested two: the Allegro and Inspire hands).
The “eyes”: 20 synchronized cameras around a well-lit workspace.
The brain loop: A full cycle that runs without humans.

Here’s the cycle in everyday language:

See (Perception): The cameras figure out the object’s “6D pose” (where it is in 3D and which way it’s facing—like knowing a LEGO brick’s position and rotation). Because the hand can cover the object during a grasp, using many cameras helps keep track from other angles.
Plan and Move (Execution): A computer method first generates many candidate grasps (like possible ways to hold the object) in a simulator. AutoDex picks feasible ones (arm can reach them without bumping into things) and moves the robot hand to try them. A safety checker watches the arm’s “effort signals” to stop if there’s an unexpected bump.
Judge (Labeling): After the hand grabs the object, the robot lifts and holds it. If the object stays at least 5 cm up for 3 seconds, that attempt counts as a success. If it slips or drops, it’s a failure. The system records the robot’s motions, the camera views, and the success/failure label.
Reset (Set up the next try): To keep testing new grasps, the object often needs to be placed in a different resting position (a “stable pose”). The robot reorients and places the object itself—sometimes even releasing it slightly above the table so it lands correctly—so the next round can start immediately.
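
The "Judge" step can be written as a small rule over the tracked object height. The sketch below assumes the tracker provides time-sorted `(time_s, height_above_rest_m)` samples and treats the 5 cm and 3 s values from the text as thresholds; the function name and the handling of tracking noise are assumptions, since the exact rule is not given here.

```python
from typing import Iterable, Optional, Tuple


def lift_and_hold_success(
    track: Iterable[Tuple[float, float]],  # (time_s, height_above_rest_m), sorted by time
    min_lift_m: float = 0.05,              # 5 cm lift, from the text
    hold_s: float = 3.0,                   # 3 s hold, from the text
) -> bool:
    """True if the object stays at or above min_lift_m for a continuous hold_s seconds."""
    hold_start: Optional[float] = None
    for t, height in track:
        if height >= min_lift_m:
            if hold_start is None:
                hold_start = t            # the object just reached lift height
            if t - hold_start >= hold_s:
                return True               # held long enough: success
        else:
            hold_start = None             # slipped or dropped: restart the clock
    return False
```

A track that stays at 6 cm for three seconds counts as a success; if the object dips below 5 cm partway through, the hold timer restarts, so a late recovery that lasts less than 3 s is still a failure.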

Why not just use a simulator? Simulations are fast, but they can’t perfectly predict real-world contact: friction, tiny slips, squishy finger pads, and small force differences can make a grasp fail even if it looks good on a computer. AutoDex solves this by automatically testing candidates on real hardware.

What did they find?

It’s much faster than human teleoperation: In a matched test of 500 grasp trials, AutoDex finished in 10.3 hours. A human operator doing the same in the same setup took 49.4 hours. That’s about a 4.8× speed-up. The big win is not faster single moves; it’s removing human idle time so the robot can run unattended.
Real-world validation greatly improves grasp quality: When they collected a database of grasps that had actually been tested on the real robot and succeeded, those grasps worked 76% of the time in new real scenes. If they skipped the real-world testing step and only used candidates filtered by a simulator, success dropped to 34%. In short: testing on the real robot filters out “looks-good-in-sim-but-fails-in-reality” grasps.
Resetting the object matters: Simply dropping an object and hoping it lands in the right pose often doesn’t work. AutoDex’s active placement strategy can reliably switch the object to poses that passive dropping almost never reaches, and it avoids unsafe “throwing the object” failures.
More cameras = more reliable tracking: With only a couple of cameras, the system sometimes mis-estimates the object’s pose—especially when the hand blocks the view. As they add more cameras, the pose estimates get much more stable, which improves the whole process.
A large, reusable dataset: AutoDex collected 3,593 real grasp attempts (across two robot hands and 100 different household objects of many shapes and materials), each with synchronized multi-view videos and robot motion data, all labeled as success or failure. Later, a robot can simply “retrieve” successful grasps from this database for a new scene, check they’re reachable and collision-free, and try them, with no extra training needed.

Why this matters

Better training fuel for robot hands: Multi-finger grasping is hard because many small physical details affect success. AutoDex provides the kind of real, labeled data that learning methods need to get robust.
Scales up without burning people out: Since the whole loop runs by itself, labs or companies can collect many more trials in the same time, across more objects and scenes.
More reliable robots in the real world: By combining computer-generated ideas with real-world testing, AutoDex closes the gap between “works in simulation” and “works in your kitchen,” helping robots handle everyday objects more confidently.
Practical reuse: The resulting database can be used like a library of proven grasps. A robot in a new environment can fetch a successful grasp for a known object, check it fits the new scene, and execute it—saving time and avoiding lots of trial and error.

In short, AutoDex is a fully automated, camera-rich, safe, and scalable way to test and label real robot hand grasps. It collects better data faster than human teleoperation and turns those results into a grasp library that helps robots succeed more often in the real world.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what the paper leaves unresolved or insufficiently explored.

Dataset scale is inconsistently reported: the abstract gives 3,593 grasp trials, but the count appears as “,” in multiple places in the paper, which hinders reproducibility and benchmarking.
Label accuracy is not quantified: no analysis of false positives/negatives in the lift-and-hold success criterion (5 cm lift, 3 s hold) or cross-checking with manual verification.
Failure-mode taxonomy is absent: labels are binary (success/failure) without categorizing reasons (e.g., slip, insufficient friction, finger placement error, kinematic reach issues), limiting diagnostic value for learning.
Pose-tracking reliability during heavy occlusion is not measured: while dense cameras are used, there is no quantitative assessment of tracking failures under extreme hand–object occlusion or fast motions.
Camera-density generalization is unclear: the system depends on a 20-camera rig; systematic evaluation of performance with fewer cameras (e.g., 4–8) in the main workcell is limited to ADD-S vs. k and does not report grasp validation success rates or label error rates at lower k.
Illumination and background robustness are not addressed: data are collected in a controlled LED-lit cell; performance under varied lighting, backgrounds, or outdoor conditions remains unknown.
Object types are constrained: articulated, soft/deformable, flexible, transparent, reflective, or liquid-containing objects are not explicitly tested; methods to handle these remain open.
Scene diversity is limited: evaluation focuses on tabletop, wall, and clutter; tight, dynamic, or highly constrained environments (drawers, cabinets, pockets, tool racks) are not studied.
Upstream generator coverage and bias are unquantified: database contents depend on BODex; missing grasps (e.g., requiring dynamic finger motion, finger-rolling, in-hand manipulation) are acknowledged but not measured.
Cross-generator comparison is missing: how AutoDex outcomes vary with different synthesis methods (optimization vs. learned generative models) is not evaluated.
Dynamic/contact-rich strategies are out of scope: finger-rolling regrasps, compliant or tactile-based adjustments, functional grasps (tool use, pouring), and bimanual coordination are not supported.
Closed-loop control is absent: execution appears open-loop with pre-planned motions; the benefits of integrating online visual/tactile corrections during approach and grasp are unexplored.
Tactile sensing is not used: no fingertip tactile/force sensing to detect slip or contact quality; potential improvements from tactile integration are not investigated.
Arm-level collision monitoring may miss local contacts: residual torque monitoring focuses on shoulder joints (J1, J2); detection coverage for lateral/hand-level collisions and small contacts is unreported.
Collision monitor precision/recall is unmeasured: conservative aborts are reported, but missed collisions (false negatives), near-miss rates, and threshold sensitivity are not analyzed.
Reset strategy selection is heuristic: choosing next stable pose with remaining candidates is not optimized; scheduling to maximize throughput or coverage remains an open planning problem.
Reset robustness for large/heavy or fragile objects is unclear: success rates across mass, size, fragility, and tall/thin geometries are not reported; recovery from dropped or ejected objects relies on human intervention in baselines.
Placement height relaxation (virtual pillars) lacks formal safety guarantees: potential unintended contacts or pose perturbations post-release are not rigorously analyzed; trade-offs of release height h are not quantified beyond feasibility.
Perception pipeline’s dependence on RGB-only cues is a risk: transparent/reflective objects and textureless surfaces challenge silhouette refinement; the benefits of adding depth or NIR are not evaluated.
Object modeling for unknown items is not detailed: the paper assumes object models; procedures to create models on-the-fly (scanning/reconstruction, symmetries) and their impact on grasp validation are not provided.
Cross-hand transfer is not studied: how grasps validated with Allegro transfer to Inspire (and vice versa) is not analyzed; joint configuration mapping and differences in contact mechanics remain open.
Cross-robot/environment transfer is limited: in-the-wild execution uses four cameras with depth estimates; robustness to calibration drift, different kinematics, and workspace geometries is not quantified.
Database retrieval does not consider grasp diversity or ranking: no policies to select among multiple successful grasps (by stability margin, approach direction, contact region coverage, or task context).
Throughput scaling via parallelization is unaddressed: multi-workcell operation, automated object swapping, and maintenance overhead for sustained multi-day runs are not discussed.
Generalization to unseen objects is not covered: retrieval assumes known objects with validated grasps; strategies for novel-object adaptation (shape similarity, category-level grasps, on-robot rapid validation) are open.
Data annotations are limited: absence of contact point/patch, friction estimates, force distribution, or compliance metadata restricts usefulness for physics-aware learning.
Evaluation breadth is narrow: success is reported over 20 objects/515 trials; statistical confidence intervals, per-object breakdowns, and long-tail behavior analyses are missing.
Robustness to disturbances is not tested: performance under external perturbations (vibration, pushes, varying surface friction, wet/dirty surfaces) is unmeasured.
Labeling latency vs. online control trade-offs are not explored: success labeling is post hoc; potential performance gains from online tracking-driven termination or correction are unquantified.
Calibration stability over time is not assessed: sub-millimeter pose consistency is reported per session; drift across long runs, auto-recalibration, and self-check routines are not evaluated.
Safety under unexpected human/object intrusion is not characterized: automatic safeguards for unmodeled obstacles or human presence in the workcell are not detailed beyond residual torque monitoring.

Practical Applications

Immediate Applications

Below are concrete use cases that can be deployed now in controlled environments (e.g., benchtop workcells) using the paper’s system and dataset, along with sector links, enabling tools/workflows, and key assumptions.

[Robotics/Manufacturing] High-throughput validation of dexterous grasp candidates for irregular products
- What: Use AutoDex to automatically execute and label thousands of multi-finger grasp trials on SKU-specific parts, tools, or consumer items to curate a “known-good” grasp library per item.
- Tools/Workflow: 20-camera workcell; BODex candidate generation; AutoDex execution + lift-and-hold labeling; residual-torque safety; reset planner; export to a grasp database.
- Value: 4.8× faster than teleoperation; improves real-world success of retrieved grasps from 34% to 76%.
- Assumptions/Dependencies: Access to calibrated multi-camera rig and a 6-DoF arm with a dexterous hand (e.g., Allegro/Inspire); mesh models or sufficient visual features; tabletop scenes.
[E-commerce/Warehousing] Rapid SKU onboarding for dexterous pick-and-place
- What: For new SKUs (irregular packaging, soft containers), collect validated grasps and deploy a retrieval-based executor in a 4-camera production cell.
- Tools/Workflow: AutoDex database creation; “in-the-wild” retrieval pipeline (4-camera pose estimation, obstacle checks, motion planning).
- Value: Faster SKU ramp-up; higher first-pass pick success without training policies.
- Assumptions/Dependencies: Static or semi-static scenes; known object poses/stable states; candidate generator coverage of relevant approach directions.
[Quality Assurance/Gripper R&D] Benchmarking grippers and finger materials under real contact physics
- What: Compare finger-pad materials, hand kinematics, or controller variants by running controlled AutoDex batches across the same object set.
- Tools/Workflow: Swap hand/finger assemblies; reuse the automated loop and labeling; analyze success vs. material/weight categories.
- Value: Real contact outcomes (slip/compliance effects) that simulations miss; reproducible comparisons.
- Assumptions/Dependencies: Consistent calibration; identical execution and reset protocols across trials.
[Academia] Curriculum-ready dataset and replicable workcell for dexterous manipulation research
- What: Use the released code and multi-view, physically labeled dataset for training, benchmarking, and ablation studies (e.g., sim-to-real, pose tracking, planning).
- Tools/Workflow: Public AutoDex dataset (success/failure labels, 20-view RGB, robot states); baseline retrieval executor; research on perception density vs. reliability.
- Value: Reduces barrier to entry; enables robust evaluation beyond simulation-only metrics.
- Assumptions/Dependencies: Dataset license and object library access; compute/storage for multi-view video; lab safety policies.
[Software/Sim-to-Real] Sanity-filtering and calibration of simulated grasp generators
- What: Use AutoDex as a physical-validation backend for simulation pipelines, pruning candidates that are geometrically feasible but fail under real dynamics.
- Tools/Workflow: Batch candidate generation → AutoDex validation → update generator priors/weights; closed-loop “data engine.”
- Value: Empirically tightens sim-to-real gap; improves generator precision on real tasks.
- Assumptions/Dependencies: Integration adapters between generator and AutoDex; consistent object models across sim/real.
[Robotics Integrators] “Data-collection-as-a-service” offering for clients needing dexterous grasps
- What: Build and operate an AutoDex workcell; deliver object-specific validated grasp libraries, execution configs, and safety envelopes for client deployments.
- Tools/Workflow: Turnkey workcell; remote operation; standardized data exports (grasps, trajectories, labels).
- Value: Outsources complex data collection; shortens time-to-deploy for dexterous applications.
- Assumptions/Dependencies: Service viability depends on throughput, uptime, and part logistics; NDA/IP handling for client objects.
[Safety/Operations Policy in Labs] Practical safety monitoring for unattended robot data collection
- What: Adopt the learned residual-torque collision detector for downwards/contact-prone motions; mandate upward-only recovery trajectories on abort.
- Tools/Workflow: Train MLP nominal-torque model; enable during approach/placement; thresholds and sustained-window checks; logging.
- Value: Prevents damage in unattended runs; compatible with diverse end-effectors where OEM collision settings are unreliable.
- Assumptions/Dependencies: Per-assembly calibration and training data; periodic re-baselining for drift.
[Education/Makerspaces] Reduced-camera (8–12 cameras) variant for coursework and demos
- What: Deploy a reduced-camera rig (8–12 cameras) to approach the robustness of the 20-camera setup while cutting cost and complexity.
- Tools/Workflow: Same pose-estimation and silhouette refinement pipeline; monitor ADD-S error vs. camera count; selective occlusion-aware placements.
- Value: Accessible teaching and prototyping environment; still benefits from multi-view redundancy.
- Assumptions/Dependencies: Occlusion-sensitive objects may still require more views; careful calibration remains essential.
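
The residual-torque safety check mentioned in the lab-safety item above (a learned nominal-torque model plus thresholds and sustained-window checks) might be structured as follows. This is a sketch under assumptions: the nominal torques are taken as given (for example, the output of the MLP model), and the function name, threshold, and window length are hypothetical rather than values from the paper.

```python
from typing import Optional, Sequence


def first_collision_index(
    measured: Sequence[Sequence[float]],   # per-sample joint torques from the arm
    predicted: Sequence[Sequence[float]],  # per-sample nominal torques (assumed model output)
    threshold_nm: float,                   # hypothetical residual threshold
    window: int,                           # hypothetical number of consecutive samples
) -> Optional[int]:
    """Return the sample index at which a collision is declared, or None."""
    run = 0
    for i, (m, p) in enumerate(zip(measured, predicted)):
        # Residual = measured minus nominal torque; flag if any joint exceeds the threshold.
        exceeded = any(abs(mj - pj) > threshold_nm for mj, pj in zip(m, p))
        # Require the residual to stay high for `window` samples in a row,
        # so a single noisy spike does not abort the motion.
        run = run + 1 if exceeded else 0
        if run >= window:
            return i
    return None
```

The sustained-window requirement trades a few samples of detection delay for fewer spurious aborts; both parameters would need tuning per arm and end-effector assembly.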

Long-Term Applications

The following use cases require further research, scaling, or engineering beyond the current workcell constraints (e.g., fewer cameras, mobile platforms, broader tasks).

[Home/Service Robotics] Household manipulation with personalized, validated grasp libraries
- What: In-home robots continuously collect and validate grasps for user-owned objects, updating a private, personalized library for reliable daily assistance.
- Tools/Workflow: On-device AutoDex-like loop with sparse cameras (or RGB-D + tactile), periodic offline validation, retrieval-first execution; continual learning.
- Dependencies/Assumptions: Robust pose estimation with minimal sensors; safe on-device data collection near people; handling deformable/transparent items; privacy.
[Healthcare/Assistive Robotics] Reliable ADL (Activities of Daily Living) grasps for assistive hands
- What: Build validated libraries for utensils, medication bottles, grooming items; customize for user-specific handovers, orientations, and safety constraints.
- Tools/Workflow: Task-conditioned grasp validation (beyond lift-and-hold); human-in-the-loop preference constraints; tactile safety monitors.
- Dependencies/Assumptions: Clinical safety certification; compliant and fail-safe hardware; person-aware perception and control.
[Mobile Manipulation/Field Service] On-site validation for maintenance, inspection, and logistics
- What: Robots gather task-specific grasp data in situ (warehouses, factories, retail), gradually replacing lab-collected libraries with environment-specific validation.
- Tools/Workflow: Self-calibrating perception on sparse/moving sensors; environmental reconstruction; distributed grasp-data syncing; RaaS fleets.
- Dependencies/Assumptions: Robust tracking under variable lighting/occlusion; dynamic obstacle handling; policy and insurance frameworks for unattended operation.
[Advanced Dexterity] Functional grasps and non-prehensile maneuvers (tool use, handovers, finger-gaiting)
- What: Extend the loop to validate task success beyond lift-and-hold (e.g., pouring, turning knobs, inserting connectors), including dynamic regrasps.
- Tools/Workflow: Task-specific success metrics and sensors (force/torque, tactile), richer action spaces, multi-stage planning and labeling.
- Dependencies/Assumptions: High-fidelity sensing of contact events; generalized reset strategies; expanded candidate generators for dynamic actions.
[Bimanual/Collaborative Manipulation] Coordinated grasp libraries and resets for two arms/hands
- What: Validate cooperative grasps (stabilize with one hand, act with the other) and robust inter-arm transfer and placement resets.
- Tools/Workflow: Dual-arm planning/monitoring; multi-object pose tracking; synchronized residual-torque safety; shared grasp databases.
- Dependencies/Assumptions: Increased calibration complexity; collision-free coordination; safe failure modes.
[Perception-Light Systems] Camera-minimal (2–4 cameras) or vision+tactile systems achieving 20-camera reliability
- What: Replace dense multi-view with learned priors, tactile servoing, or active perception to eliminate catastrophic pose failures.
- Tools/Workflow: Tactile localization, contact-rich SLAM, object-specific priors; uncertainty-aware planning with active viewpoint control.
- Dependencies/Assumptions: Reliable tactile hardware; robust fusion of sparse vision and touch; on-line uncertainty estimates.
[Standardization/Policy] Reporting and safety standards for physically validated manipulation datasets
- What: Institutionalize best practices: publish physical outcome labels, success criteria, reset procedures, safety triggers, and pose-calibration metrics with datasets.
- Tools/Workflow: Community benchmarks; data cards detailing collection conditions (camera count, ADD-S, labeling rules); conformity tests.
- Dependencies/Assumptions: Community buy-in; publisher and funding-agency requirements; interoperability with ROS/industrial formats.
[Products and Services] Commercial offerings built on AutoDex-style capabilities
- What:
  - “AutoDex Kit”: packaged multi-camera workcell with software stack (perception, safety, reset).
  - “Validated Grasp Library” subscriptions for common object categories (kitchenware, tools, retail items).
  - “Reset Planner” and “Residual-Torque Monitor” as ROS 2 packages or OEM firmware plugins.
  - Cloud “Grasp Validation Service” with logistics for object shipment and data return.
- Dependencies/Assumptions: Hardware vendor partnerships; support for diverse hands/arms; SLAs for throughput and failure recovery; data/IP governance.
[AI Training Pipelines] Large-scale real-robot data engines for dexterous policy learning
- What: Use physically labeled grasp outcomes (success/failure trajectories) to train generalist manipulation policies, leveraging both imitation and offline RL.
- Tools/Workflow: Continuous AutoDex-style data harvesting; balanced sampling of successes/failures; policy evaluation with retrieval baselines.
- Dependencies/Assumptions: Scalable storage/compute; data quality control; coverage across objects, materials, and scenes; alignment between lift-and-hold and end-task objectives.
[Circular Economy/Recycling] Sorting and disassembly of irregular, mixed-material items
- What: Validate robust grasps for variable, damaged, or composite objects to improve throughput in sorting lines or disassembly stations.
- Tools/Workflow: On-line object library growth; retrieval with obstacle-aware planning; frequent reset between stable poses to expose new grasp affordances.
- Dependencies/Assumptions: Handling of dirt/damage-induced appearance change; occlusion-heavy scenes; unknown or approximate object models.
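The "balanced sampling of successes/failures" step listed under AI Training Pipelines can be illustrated with a minimal sketch. The trial schema (a dict with a boolean `success` field) and the function name `balanced_batch` are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import random
from collections import defaultdict


def balanced_batch(trials, batch_size, rng=None):
    """Draw a batch with (near-)equal numbers of successful and failed trials.

    `trials` is a list of dicts with a boolean "success" field
    (hypothetical schema). Sampling is with replacement, so the
    minority class is oversampled when it is small.
    """
    rng = rng or random.Random()

    # Group trials by their physical outcome label.
    by_label = defaultdict(list)
    for trial in trials:
        by_label[bool(trial["success"])].append(trial)
    if not by_label[True] or not by_label[False]:
        raise ValueError("need at least one success and one failure")

    # Split the batch as evenly as possible between the two labels.
    n_success = batch_size // 2
    n_failure = batch_size - n_success
    batch = rng.choices(by_label[True], k=n_success)
    batch += rng.choices(by_label[False], k=n_failure)
    rng.shuffle(batch)
    return batch
```

Oversampling with replacement is only one option; a weighted loss or per-object stratification may be preferable when failures are rare or concentrated on a few objects.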

These applications extend directly from the paper’s contributions: a fully automated real-world validation loop (perception → execution → labeling → reset), a physically labeled multi-view dataset, a robust safety monitor, and a retrieval-based deployment path. Feasibility hinges on matching sensing density and calibration quality to task difficulty, ensuring candidate generators cover relevant contact modes, and adopting safety and reset strategies that tolerate unattended operation.
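As a rough illustration of the perception → execution → labeling → reset loop, the sketch below wires four stages together as plain callables. Every name here (`collect`, `TrialRecord`, and the four stage functions) is hypothetical; the code shows only the control flow implied by the summary, not the AutoDex implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class TrialRecord:
    candidate_id: int
    success: Optional[bool]  # None when the trial was skipped or aborted
    note: str = ""


def collect(
    candidates: Iterable[Any],
    estimate_pose: Callable[[], Optional[Any]],
    execute: Callable[[Any, Any], bool],
    label: Callable[[], bool],
    reset: Callable[[], None],
) -> List[TrialRecord]:
    """Run one trial per candidate without human intervention (sketch)."""
    records: List[TrialRecord] = []
    for cid, candidate in enumerate(candidates):
        pose = estimate_pose()             # perception
        if pose is None:
            records.append(TrialRecord(cid, None, "pose estimation failed"))
        elif not execute(candidate, pose):  # execution; False means aborted
            records.append(TrialRecord(cid, None, "execution aborted"))
        else:
            records.append(TrialRecord(cid, label()))  # labeling
        reset()                            # reset before the next trial
    return records
```

Keeping aborted and skipped trials in the log, rather than dropping them, is a design choice of this sketch: it keeps throughput accounting honest and makes perception or safety failures visible in the data.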


Glossary

6D pose: A six-degree-of-freedom representation of an object’s position and orientation in 3D space (3D translation + 3D rotation). "estimates the object's initial 6D pose with dense 20-camera perception"
ADD-S: A 6D pose accuracy metric for symmetric objects (Average Distance of model points, symmetric variant): each model point under the ground-truth pose is matched to the closest model point under the estimated pose, and the distances are averaged. "Mean ADD-S decreases from 14.3 mm at $k=2$ to 0.5 mm at $k=16$"
Allegro Hand: A 16-DoF anthropomorphic robotic hand used for dexterous manipulation. "we use either a 16-DoF Allegro Hand or a 6-DoF Inspire Hand"
BODex: An optimization-based dexterous grasp synthesis method used to generate grasp candidates. "we use BODex~\cite{chen2024bodex} as the candidate generator"
bundle adjustment: A nonlinear optimization that jointly refines camera parameters and 3D structure for multi-view calibration. "Per-session extrinsics are recovered by global bundle adjustment (COLMAP)"
ChArUco board: A calibration target combining chessboard corners and ArUco markers for accurate camera intrinsic calibration. "Camera intrinsics are calibrated before mounting using a ChArUco board"
COLMAP: A structure-from-motion tool used for recovering camera poses and calibration in multi-view setups. "per-session extrinsics are recovered with COLMAP and hand-eye calibration"
cuRobo: A GPU-accelerated motion planning library for robots, used here to pre-screen safe trajectories. "All static and dynamic training trajectories are pre-screened with cuRobo~\cite{sundaralingam2023curobo}"
Depth Anything 3: A foundation model for estimating depth from images, used to reconstruct obstacle geometry. "reconstruct surrounding geometry from depth estimates using Depth Anything 3~\cite{dav3}"
DoF: Degrees of Freedom; the number of independent joint variables of a robot or hand. "a 6-DoF xArm"
domain randomization: A sim-to-real technique that randomizes simulation parameters to improve real-world robustness. "Domain randomization~\cite{tobin2017dr,openai2019rubikscube} can improve robustness"
end-effector: The robot’s tool at the tip of its kinematic chain (here, the robotic hand). "planning a trajectory back to a predefined home configuration while monotonically increasing the end-effector height"
FoundPose: An RGB-only foundation-feature 6D pose estimator used for initial object pose hypotheses. "FoundPose~\cite{ornek2024foundpose}, an RGB-only foundation-feature pose estimator"
GoTrack: A multi-view object pose tracker used to obtain 6D pose trajectories from recorded streams. "we run multi-view GoTrack~\cite{nguyen2025gotrack} on the recorded stream"
hand–eye calibration: Calibration that estimates the rigid transform between the camera frame and the robot frame, so that camera measurements can be expressed in robot coordinates. "and hand-eye calibration, yielding sub-millimeter multi-view pose self-consistency"
IK (Inverse Kinematics): Computing joint angles that achieve a desired end-effector pose. "IK solvability"
IoU: Intersection-over-Union, a mask overlap metric used to score pose hypotheses. "scored by the mean IoU between its rendered silhouette and the observed masks"
lift-and-hold: A physical success criterion requiring lifting an object and holding it for a specified duration. "labels lift-and-hold success or failure"
MLP: Multilayer Perceptron; a feed-forward neural network used here to predict nominal torques. "the monitor predicts the nominal free-space torque with an MLP and computes the residual:"
MuJoCo: A physics engine used to simulate and pre-filter grasp candidates. "we screen each candidate in MuJoCo~\cite{todorov2012mujoco}"
residual-torque monitor: A learned model that detects unexpected contacts by comparing predicted and measured joint torques. "We therefore use a learned residual-torque monitor trained on collision-free motions of the deployed arm--hand assembly."
SAM3: Segment Anything Model 3; a segmentation model used to predict object masks from RGB images. "we predict object masks with SAM3~\cite{sam3}"
servo mode: A control mode where joints continuously track commanded positions/velocities along a trajectory. "executes the screened trajectories in servo mode"
silhouette optimization: Refining object pose by aligning the projected model silhouette to observed masks. "then refined by silhouette optimization"
Sim-to-Real: The transfer of methods or models developed in simulation to real-world hardware and conditions. "Dexterous Grasping, Autonomous Data Collection, Sim-to-Real"
stable pose: A static equilibrium orientation of an object on a support surface. "Let $\mathcal{P}$ denote the set of stable tabletop poses of the object."
teleoperation: Human control of a robot to perform tasks, often via a remote interface. "Human teleoperation~\cite{liu2024realdex,wang2024dexcap,qin2023anyteleop} produces real contact outcomes"
xArm: A 6-DoF industrial robotic arm used as the manipulator in the workcell. "a 6-DoF xArm"
hand–object occlusion: Visual occlusion of the object by the robot hand during manipulation, complicating perception. "tracks the object during execution despite severe hand--object occlusion"
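
Two of the metrics defined above, ADD-S and mask IoU, are simple enough to write out. The following is a minimal pure-Python sketch for small inputs, using the standard definitions; the function names and data layouts (poses as `(rotation, translation)` pairs, masks as nested lists of 0/1) are assumptions for illustration and not the paper's evaluation code.

```python
import math


def transform(points, rotation, translation):
    """Apply a rigid transform (3x3 row-major rotation, 3-vector translation)."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_s(model_points, pose_est, pose_gt):
    """ADD-S: for each model point under the ground-truth pose, take the
    distance to the closest model point under the estimated pose, then average.
    """
    est = transform(model_points, *pose_est)
    gt = transform(model_points, *pose_gt)
    return sum(min(math.dist(g, e) for e in est) for g in gt) / len(gt)


def mask_iou(mask_a, mask_b):
    """Intersection-over-Union of two equally sized binary masks.

    Returns 0.0 when both masks are empty (a convention chosen for this sketch).
    """
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0
```

Because ADD-S uses closest-point matching, a pose that differs from the ground truth only by a symmetry of the object scores zero error, which is why it suits symmetric objects. The brute-force nearest-neighbor search is quadratic in the number of model points; a k-d tree would be the usual choice for dense models.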



