VFAS-Grasp: Visual Feedback for Robotic Grasping
- The paper introduces VFAS-Grasp, a closed-loop planner that fuses real-time visual feedback with uncertainty-aware adaptive sampling to improve grasp success rates.
- It employs vision-based motion field estimation and a composite scoring method that balances grasp quality, uncertainty, and positional deviations for smooth, effective control.
- Empirical evaluations show marked improvements over static planners, with up to 84% success in static scenes and 80% in dynamic settings; the work also identifies avenues for future enhancement.
VFAS-Grasp (Visual Feedback and Adaptive Sampling for Robotic Grasping) denotes a class of closed-loop robotic grasping planners that integrate real-time sensory feedback with iterative, uncertainty-aware sampling of grasp actions. VFAS-Grasp systems typically operate at high frequency (e.g., 20 Hz), adapt their search region and candidate population in response to recent grasp quality assessments, and use vision-based motion estimation to keep the grasp proposal temporally and spatially consistent in dynamic environments. This approach improves both static and dynamic grasp success rates, particularly in scenarios where small pose inaccuracies or delayed responses traditionally degrade performance (Piacenza et al., 2023).
1. Core Pipeline and Operational Principles
VFAS-Grasp systems initialize from an externally generated six-degree-of-freedom (6-DoF) grasp plan, such as one produced by Contact-GraspNet from a single-shot view of the scene. Subsequent control proceeds as an iterative closed loop, where at each cycle:
- A wrist-mounted RGB-D camera acquires a visual observation of the region proximate to the current 'seed' grasp.
- The point cloud is processed and cropped around this seed action.
- Candidate grasps are adaptively sampled by applying random translational and rotational perturbations to the seed pose.
- Each candidate is evaluated by a grasp-quality estimator (typically a neural network), which outputs both an expected grasp score and an uncertainty value derived from input noise injections.
- Candidates are scored by a metric combining estimated quality, uncertainty, and penalties for spatial and rotational deviation from the previous seed, promoting both robustness and temporal smoothness.
- The leading candidate is smoothed with a one-euro low-pass filter and becomes the new seed; if dynamic object motion is detected, the seed is additionally shifted using the estimate from the 3D motion vector field.
- The selected grasp pose is transmitted to a low-level robot controller, which drives the manipulator through a pre-grasp to final closure sequence.
This loop executes at 20 Hz on commodity GPU hardware, enabling real-time, responsive grasp adaptation in the presence of object or pose disturbances (Piacenza et al., 2023).
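The smoothing step in the loop above relies on a one-euro filter, an adaptive first-order low-pass filter whose cutoff rises with signal speed. The following is a minimal scalar sketch of the standard published algorithm, not the authors' implementation. The parameter defaults are placeholders, and applying one filter per translation axis (with rotations handled separately) is an assumption.

```python
import math


def smoothing_factor(cutoff_hz, dt):
    """Exponential-smoothing coefficient of a first-order low-pass filter."""
    tau = 1.0 / (2.0 * math.pi * cutoff_hz)
    return 1.0 / (1.0 + tau / dt)


class OneEuroFilter:
    """Scalar one-euro filter; parameter defaults are illustrative only."""

    def __init__(self, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # cutoff (Hz) used when the signal is still
        self.beta = beta              # how strongly speed raises the cutoff
        self.d_cutoff = d_cutoff      # cutoff (Hz) for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0

    def __call__(self, x, dt):
        # The first sample passes through unchanged and initialises the state.
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Low-pass the finite-difference derivative.
        dx = (x - self.x_prev) / dt
        a_d = smoothing_factor(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Faster motion -> higher cutoff -> less lag; slow motion -> less jitter.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = smoothing_factor(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev = x_hat
        self.dx_prev = dx_hat
        return x_hat
```

At 20 Hz the filter would be called with `dt = 0.05`, for example `fx = OneEuroFilter(beta=0.1); x_smooth = fx(x_raw, 0.05)` for each translation coordinate of the selected grasp.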
2. Uncertainty-Aware Adaptive Sampling
The adaptive sampling subsystem dynamically tunes both the size of the candidate region (translation radius $r$ and angular span $\theta$) and the number of samples $N$. The update rule depends on three quantities: $q_{t-1}$, the top candidate quality at the previous iteration; $\tau$, a fixed threshold (e.g., 0.5); and $\alpha$, an empirically selected scaling factor (e.g., 1.3). Nominal values typically correspond to a ±2 cm, ±5° region and 128 samples. While uncertainty does not directly modulate the region size or the sample count, it is critical in scoring and selection, penalizing candidates with high estimate variance (Piacenza et al., 2023).
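To make the roles of the threshold and scaling factor concrete, here is one plausible form of the update as a hedged sketch. The text gives only the ingredients (previous top quality, threshold 0.5, scaling factor 1.3, nominal ±2 cm, ±5°, 128 samples). The direction of the rule (widen the search when quality falls below the threshold, return to nominal otherwise) and the `max_growth` cap are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingParams:
    radius_cm: float   # translation radius of the search region
    angle_deg: float   # angular span of the search region
    num_samples: int   # number of candidate grasps per cycle


# Nominal values quoted in the text.
NOMINAL = SamplingParams(radius_cm=2.0, angle_deg=5.0, num_samples=128)


def update_sampling(params, top_quality, threshold=0.5, scale=1.3,
                    nominal=NOMINAL, max_growth=4.0):
    """Assumed rule: grow the region while quality is poor, else reset.

    `max_growth` is a hypothetical cap that keeps the region and the
    sample count bounded; it does not come from the source.
    """
    if top_quality >= threshold:
        return nominal
    return SamplingParams(
        radius_cm=min(params.radius_cm * scale,
                      nominal.radius_cm * max_growth),
        angle_deg=min(params.angle_deg * scale,
                      nominal.angle_deg * max_growth),
        num_samples=min(round(params.num_samples * scale),
                        round(nominal.num_samples * max_growth)),
    )
```

Under these assumptions, a single low-quality cycle starting from the nominal setting widens the region to about ±2.6 cm and ±6.5° with 166 samples, and a good cycle restores the nominal setting.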
3. Quality, Uncertainty, and Temporal Consistency Scoring
Each candidate grasp is evaluated using:
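- The mean estimated quality $\mu_q$,
- The spread $\sigma_q$ as a proxy for epistemic uncertainty (over multiple stochastic network passes),
- Translation and rotation deviations $\Delta t$ and $\Delta\theta$ from the previous seed.
The composite score is a weighted combination of these terms:
$$S = \mu_q - \lambda_\sigma \sigma_q - \lambda_t \Delta t - \lambda_\theta \Delta\theta$$
The translation weight $\lambda_t$ carries units of cm⁻¹ and the rotation weight $\lambda_\theta$ units of rad⁻¹, while the uncertainty weight $\lambda_\sigma$ is dimensionless. This scoring approach jointly optimizes for high expected quality, robustness to input noise, and smooth, incremental trajectory evolution, reducing discontinuities unfavorable for closed-loop execution (Piacenza et al., 2023).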
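The scoring described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the linear penalty form is inferred from the per-centimetre and per-radian units of the weights, the weight defaults are placeholders rather than the published settings, and `quality_fn` stands in for whatever grasp-quality network is used.

```python
import statistics


def quality_stats(quality_fn, grasp, noisy_inputs):
    """Mean and spread of predicted quality over noise-perturbed inputs.

    `quality_fn(grasp, observation)` is a hypothetical estimator interface.
    """
    scores = [quality_fn(grasp, obs) for obs in noisy_inputs]
    return statistics.fmean(scores), statistics.pstdev(scores)


def composite_score(mean_q, spread_q, d_trans_cm, d_rot_rad,
                    w_trans=0.1, w_rot=0.1, w_unc=1.0):
    """Quality minus penalties for uncertainty and deviation from the seed.

    Weight defaults are placeholders (w_trans in 1/cm, w_rot in 1/rad).
    """
    return (mean_q
            - w_unc * spread_q
            - w_trans * d_trans_cm
            - w_rot * d_rot_rad)


def select_best(candidates, **weights):
    """Pick the (mean_q, spread_q, d_trans_cm, d_rot_rad) tuple that scores highest."""
    return max(candidates, key=lambda c: composite_score(*c, **weights))
```

With the placeholder weights, a candidate with slightly lower mean quality but small spread and small deviation from the seed can outrank a nominally better but uncertain, distant one, which is the behaviour the temporal-consistency terms are meant to encourage.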
4. Vision-Based Motion Field Estimation for Dynamic Targeting
VFAS-Grasp employs transformer-based optical flow networks, such as GMFlow, to estimate a dense 3D motion vector field from consecutive RGB-D frames. The spatial mean of these vectors within a local spherical neighborhood (radius ≈5 cm) around the grasp approach point yields an averaged motion estimate $\bar{\mathbf{v}}$. The seed grasp center is shifted by the displacement implied by $\bar{\mathbf{v}}$, so that the planning region follows the slow, continuous object motion encountered in tasks like human-to-robot handover. Rotational motion is currently disregarded under the assumption of slow object rotation but could be integrated when relevant (Piacenza et al., 2023).
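The averaging and shifting step can be illustrated with a small sketch. It assumes the flow network's output has already been lifted to one 3D displacement vector per point, expressed in metres per frame in the same coordinate frame as the grasp. The function names are hypothetical.

```python
import math


def mean_motion(points, motions, center, radius=0.05):
    """Average the motion vectors of points within `radius` metres of `center`.

    `points` and `motions` are equal-length sequences of (x, y, z) tuples.
    Returns a zero vector when no point falls inside the neighbourhood.
    """
    selected = [m for p, m in zip(points, motions)
                if math.dist(p, center) <= radius]
    if not selected:
        return (0.0, 0.0, 0.0)
    n = len(selected)
    return tuple(sum(m[i] for m in selected) / n for i in range(3))


def shift_seed(center, motion):
    """Translate the seed grasp center by the averaged displacement."""
    return tuple(c + m for c, m in zip(center, motion))
```

Only the translation of the seed is updated here, matching the text's simplification that object rotation is slow enough to ignore.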
5. Empirical Evaluation and Ablation Results
VFAS-Grasp exhibits substantial improvement over static planners in both simulated and physical settings. Key results include:
- Static scene: Baseline (CGN alone) 60% success; VFAS-Grasp reaches 84%, with objects sensitive to pose errors (e.g., cans) seeing improvement from 27% to 93%.
- Dynamic objects: At higher speeds, adaptive sampling is crucial: success rises from 35% (no adaptation) to 80% (adaptive).
- Human-to-robot handover: 81.25% success within 20 seconds across multiple users and objects.
- Timing performance: The end-to-end loop runs at 20 Hz, i.e., a 50 ms cycle: sampling/evaluation takes 25 ms, flow estimation 5 ms, and the remaining 20 ms is split between filtering and actuation.
Ablation studies confirm that dynamic region/sample scaling and explicit motion estimation are important for success in higher-speed regimes. The use of explicit uncertainty penalization improves robustness in challenging settings by approximately 5% (Piacenza et al., 2023).
6. Limitations and Extensions
Some operational boundaries of VFAS-Grasp have been observed:
- Grasp "drift" may arise along continuous grasp manifolds, causing the system to gradually shift contact regions (e.g., from rim to handle on mugs).
- In cluttered or multi-object scenes, lack of semantic segmentation can lead to tracking or grasping the wrong object.
- The controller uses a simple Cartesian P-servo; high-speed, collision-aware, and kinematically feasible replanning remains a target for future improvement.
- Object speeds exceeding 10 cm/s may require predictive, model-based compensation strategies, such as incorporating Kalman filtering.
- Extensions proposed in related work include modeling variable force and friction cone constraints, augmenting reasoning components to estimate dynamic force profiles and tactile patterns, and integrating velocity-conditioned grasp generation modules (Piacenza et al., 2023; Zhang et al., 3 Dec 2025).
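The predictive compensation mentioned in the list above for faster objects could take the form of a constant-velocity Kalman filter. The sketch below is a generic textbook filter along a single axis, not something the source describes in detail. The noise parameters and the look-ahead horizon are hypothetical and would need tuning.

```python
class ConstantVelocityKF:
    """1D Kalman filter with state (position, velocity) and position measurements."""

    def __init__(self, pos=0.0, vel=0.0, accel_std=0.5, meas_std=0.005):
        self.pos, self.vel = pos, vel
        # Symmetric 2x2 covariance stored as three numbers.
        self.p00, self.p01, self.p11 = 1.0, 0.0, 1.0
        self.q = accel_std ** 2   # white-acceleration process noise (assumed)
        self.r = meas_std ** 2    # measurement noise variance (assumed)

    def predict(self, dt):
        """Propagate the state and covariance forward by dt seconds."""
        self.pos += self.vel * dt
        p00 = (self.p00 + dt * (2.0 * self.p01 + dt * self.p11)
               + self.q * dt ** 4 / 4.0)
        p01 = self.p01 + dt * self.p11 + self.q * dt ** 3 / 2.0
        p11 = self.p11 + self.q * dt ** 2
        self.p00, self.p01, self.p11 = p00, p01, p11

    def update(self, measured_pos):
        """Correct the state with a new position measurement."""
        s = self.p00 + self.r                 # innovation variance
        k0, k1 = self.p00 / s, self.p01 / s   # Kalman gain
        residual = measured_pos - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11

    def lookahead(self, horizon):
        """Predicted position `horizon` seconds ahead, without changing state."""
        return self.pos + self.vel * horizon
```

One filter per axis, fed with the tracked grasp center at each 50 ms cycle, would let the seed be placed where the object is expected to be after the control latency rather than where it was last observed.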
7. Context within Advanced Grasping Research
VFAS-Grasp directly addresses the limitations of open-loop and static 6-DoF grasp planning in real-world, dynamic settings. In contrast to approaches based on vision-language grounding and affordance semantics (e.g., OmniDexVLG (Zhang et al., 3 Dec 2025)), which focus on high-level intent and dexterous grasp diversity, VFAS-Grasp emphasizes closed-loop, real-time refinement under uncertainty and scene motion. The two paradigms are increasingly seen as complementary; for example, future VFAS systems may incorporate semantic priors and affordance constraints to improve object- and task-level robustness, as indicated by recent proposals coupling VFAS pipelines with affordance sampling and semantic reasoning modules (Zhang et al., 3 Dec 2025).
VFAS-Grasp represents a canonical approach for fusing sensory feedback, uncertainty quantification, and real-time optimization in robotic grasp control, and continues to inform the design of adaptive, robust, and generalizable robotic manipulation systems.