ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation (2509.11364v1)

Published 14 Sep 2025 in cs.RO

Abstract: Accurate 6-DoF object pose estimation and tracking are critical for reliable robotic manipulation. However, zero-shot methods often fail under viewpoint-induced ambiguities, and fixed-camera setups struggle when objects move or become self-occluded. To address these challenges, we propose an active pose estimation pipeline that combines a Vision-Language Model (VLM) with "robotic imagination" to dynamically detect and resolve ambiguities in real time. In an offline stage, we render a dense set of views of the CAD model, compute the FoundationPose entropy for each view, and construct a geometry-aware prompt that includes low-entropy (unambiguous) and high-entropy (ambiguous) examples. At runtime, the system: (1) queries the VLM on the live image for an ambiguity score; (2) if ambiguity is detected, imagines a discrete set of candidate camera poses by rendering virtual views, scores each based on a weighted combination of VLM ambiguity probability and FoundationPose entropy, and then moves the camera to the Next-Best-View (NBV) to obtain a disambiguated pose estimate. Furthermore, since moving objects may leave the camera's field of view, we introduce an active pose tracking module: a diffusion policy trained via imitation learning, which generates camera trajectories that preserve object visibility and minimize pose ambiguity. Experiments in simulation and the real world show that our approach significantly outperforms classical baselines.

Summary

  • The paper introduces a framework combining Vision-Language Models and robotic imagination to actively resolve 6D pose ambiguities in real-time scenarios.
  • It employs entropy-guided next-best-view selection and an equivariant diffusion policy for dynamic object pose tracking in cluttered environments.
  • Results show significant performance gains with up to 97.5% success in simulations and 90% success in integrated tasks like peg-in-hole assembly.

ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation

The paper "ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation" introduces an integrated framework for addressing the challenges of 6-DoF object pose estimation and tracking in the context of robotic manipulation. This framework is designed to dynamically leverage visual cues to resolve pose ambiguities and efficiently maintain object visibility during motion.

Active Pose Estimation

The active pose estimation framework employs a Vision-Language Model (VLM) alongside "robotic imagination" to detect and resolve ambiguities in real-time scenarios. The pipeline consists of offline and online stages:

  • Offline Stage: The system generates a comprehensive set of rendered views from the CAD model, computes the FoundationPose entropy for each view, and constructs a geometry-aware prompt that combines both ambiguous and unambiguous examples.
  • Online Stage: The algorithm queries the VLM on the live image for an ambiguity score. Upon detecting ambiguity, the system imagines a discrete set of candidate camera poses through virtual rendering, scoring each with a weighted combination of the VLM ambiguity probability and the FoundationPose entropy. This scoring guides the camera to the Next-Best-View (NBV) for disambiguation (Figure 1; a sketch of the scoring step follows below).

    Figure 1: Pipeline of Active Pose Estimation. (1) Offline: The system computes pose entropy from rendered views and constructs a geometry-aware prompt using ambiguous and unambiguous examples. (2) Online: It assesses ambiguity from the live image via a VLM and selects the NBV through rendering and entropy-guided scoring.
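
To make the view-scoring step concrete, the following is a minimal Python sketch of entropy-guided NBV selection under stated assumptions: the helpers render_view, score_poses, and vlm_ambiguity_prob, as well as the mixing weight lam and the entropy normalization, are hypothetical stand-ins for the paper's components, not its actual API.

    import numpy as np

    def pose_entropy(pose_scores):
        """Shannon entropy over pose-hypothesis scores for one view.

        `pose_scores` is assumed to be an array of unnormalized scores
        that FoundationPose assigns to its pose hypotheses; a flat
        distribution (high entropy) signals an ambiguous viewpoint.
        """
        z = np.asarray(pose_scores, dtype=np.float64)
        p = np.exp(z - z.max())          # softmax with numerical stability
        p /= p.sum()
        return float(-(p * np.log(p + 1e-12)).sum())

    def select_next_best_view(candidate_cams, render_view, score_poses,
                              vlm_ambiguity_prob, lam=0.5):
        """Return the candidate camera pose with the lowest ambiguity score.

        For each imagined camera pose: render a virtual view of the CAD
        model, query the VLM for an ambiguity probability in [0, 1],
        compute the normalized pose entropy, and blend the two with
        weight `lam` (one plausible form of the weighted combination).
        """
        best_cam, best_score = None, float("inf")
        for cam in candidate_cams:
            img = render_view(cam)                  # "robotic imagination"
            scores = score_poses(img)               # pose-hypothesis scores
            h = pose_entropy(scores) / max(np.log(len(scores)), 1e-12)
            a = vlm_ambiguity_prob(img)
            s = lam * a + (1.0 - lam) * h           # combined ambiguity
            if s < best_score:
                best_cam, best_score = cam, s
        return best_cam

The camera would then be driven to the returned pose and the estimator re-run from the new viewpoint; normalizing the entropy to [0, 1] so it is commensurate with the VLM probability is an assumption of this sketch.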

Active Pose Tracking

The active pose tracking module capitalizes on an equivariant diffusion policy, trained through imitation learning, to preserve object visibility and minimize pose ambiguity:

  • Encoding and Denoising: The module encodes the current observation and applies denoising over K reverse-diffusion steps to yield continuous SE(3) poses, then executes the resulting poses in a receding-horizon loop to maintain optimal camera positioning.
  • Trajectory Generation: It dynamically generates smooth camera trajectories that preserve the visibility of critical object features, which is essential for maintaining pose accuracy during dynamic manipulation tasks (Figure 2; see the sketch after the caption below).

    Figure 2: Pipeline of Active Pose Tracking. Encode the current observation, denoise over K reverse-diffusion steps to obtain continuous SE(3) poses, then execute the last k poses in a receding-horizon loop.
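
A minimal sketch of this receding-horizon loop follows, assuming a policy.denoise method that wraps the K-step reverse-diffusion sampler and a simple camera interface; these names and the default horizon and k_exec values are hypothetical, not taken from the paper's code.

    def active_tracking_loop(policy, camera, encode_obs, horizon=16, k_exec=4):
        """Receding-horizon execution of a diffusion policy for camera control.

        Each iteration: encode the current observation, denoise a full
        `horizon`-length trajectory of continuous SE(3) camera poses over
        K reverse-diffusion steps (inside `policy.denoise`), then execute
        only the last `k_exec` poses before re-planning, as in Figure 2.
        """
        while True:
            obs = encode_obs(camera.capture())     # current RGB(-D) frame
            trajectory = policy.denoise(obs, length=horizon)  # noise -> SE(3) poses
            for pose in trajectory[-k_exec:]:      # execute the tail only
                camera.move_to(pose)               # keep the object in view

Executing only a short tail of each sampled trajectory before re-planning lets the policy react to object motion while the diffusion sampler keeps successive trajectories smooth.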

Implementation and Results

The proposed ActivePose framework has been evaluated and validated in both simulation and real-world environments, demonstrating superior performance over classical baselines in various challenging scenarios:

  • Pose Estimation: The method significantly outperforms fixed-view and random-NBV baselines, reaching success rates of up to 97.5% on random placements and 95.0% on high-entropy placements across simulation and real-world settings.
  • Pose Tracking: The active tracking component remains effective under long-range linear motion, circular rotation, temporary occlusion, and random spatial movement, with success rates ranging from 52.5% to 91.3% and marked gains in maintaining object visibility and reducing ambiguity.
  • Integrated Tasks: In the context of peg-in-hole assembly, ActivePose delivers a consistent 90% success rate, highlighting its efficacy in maintaining accurate pose estimation and tracking over the entire manipulation process.

Conclusion

The ActivePose framework introduces a novel approach to the inherent challenges of 6D pose estimation and tracking by integrating next-best-view selection with dynamic pose tracking. It advances the capability of robotic manipulation systems to operate effectively in cluttered and dynamically changing environments, facilitating robust object interaction. Future work will explore image-based feature representations to improve robustness and sensitivity to pose variations.

This work contributes significantly to the field of robotic perception, addressing the traditional bottlenecks of pose ambiguity and tracking instability. By actively and intelligently adjusting viewpoints, ActivePose sets a new standard for autonomous manipulation tasks requiring high precision and adaptability.
