PPF-Tracker: Articulated SE(3) Pose Tracking
- PPF-Tracker is a category-level articulated object pose tracking framework operating in SE(3), utilizing dynamic keyframes and point-pair features.
- It integrates quasi-canonicalization, SE(3)-invariant learning, and tangent space voting to achieve robust tracking under complex kinematic conditions.
- Its design supports real-time applications in robotics and augmented reality through efficient drift management and Gauss–Newton kinematic refinement.
PPF-Tracker is a category-level articulated object pose tracking framework operating in the SE(3) Lie group space, specifically designed to address the challenging problem of multi-part object pose tracking under complex, real-world kinematic conditions. Leveraging quasi-canonicalization and point-pair feature representations, PPF-Tracker integrates SE(3)-invariant learning, pose voting on tangent spaces, and explicit part-joint kinematic constraints. Its full pipeline delivers robust tracking for articulated structures in robotics, augmented reality, and embodied intelligence scenarios.
1. Quasi-Canonicalization on SE(3) Manifolds
PPF-Tracker defines a systematic quasi-canonicalization procedure for articulated objects comprising rigid parts. At each frame t, the part-wise point clouds Pₜᵏ and predicted poses Tₜᵏ of each part k are processed in reference to dynamic keyframes. Frames are partitioned into segments indexed by i, where each segment runs between successive keyframes.
A keyframe inverse Kᵢᵏ = (T_{tᵢ}ᵏ)⁻¹ is constructed for each part k, with tᵢ marking the segment start. Canonicalization transforms incoming clouds within segment i via

P̄ₜᵏ = Kᵢᵏ · Pₜᵏ,

where Kᵢᵏ is formed from the previously estimated pose in practice. The relative pose within the segment is expressed as

T̄ₜᵏ = Kᵢᵏ · Tₜᵏ,

with absolute pose recovered by accumulation:

Tₜᵏ = (Kᵢᵏ)⁻¹ · T̄ₜᵏ.
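The canonicalization step can be sketched with homogeneous 4×4 transforms. This is a minimal numpy illustration (function names are mine, not the paper's): a cloud posed by some Tₜᵏ is mapped back to canonical coordinates by the keyframe inverse.

```python
import numpy as np

def transform_cloud(points, T):
    """Apply a 4x4 homogeneous transform to an (N, 3) point cloud."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    return (homog @ T.T)[:, :3]

def canonicalize(points, T_keyframe):
    """Quasi-canonicalize a cloud with the keyframe inverse K = T_keyframe^{-1}."""
    return transform_cloud(points, np.linalg.inv(T_keyframe))

# Toy check: a cloud posed by T is recovered in canonical coordinates by K.
rng = np.random.default_rng(0)
P_canon = rng.standard_normal((100, 3))
c, s = np.cos(0.3), np.sin(0.3)
T = np.eye(4)
T[:3, :3] = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # rotation about z
T[:3, 3] = [0.5, -0.2, 1.0]                               # translation
P_obs = transform_cloud(P_canon, T)
assert np.allclose(canonicalize(P_obs, T), P_canon)
```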
Dynamic Keyframe Selection (DKS) centralizes drift management: after each prediction, the energy

ℰₜ = d_CD(P̄ₜᵏ, P_cᵏ) + d_HD(P̄ₜᵏ, P_cᵏ)

is computed, where d_CD and d_HD are the Chamfer and Hausdorff distances between the canonicalized cloud and the keyframe reference P_cᵏ. A new keyframe is triggered when ℰₜ < φ for a fixed threshold φ. This mechanism regulates frame-reference updates to minimize drift and enhance motion adaptation.
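The DKS trigger can be sketched as follows. This is an illustrative numpy version (the additive combination of the two distances and all names are my assumptions, not the paper's exact formulation):

```python
import numpy as np

def chamfer(A, B):
    """Symmetric Chamfer distance between (N,3) and (M,3) clouds."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two clouds."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def needs_new_keyframe(P_canon, P_ref, phi):
    """DKS trigger: energy = Chamfer + Hausdorff, compared to threshold phi."""
    energy = chamfer(P_canon, P_ref) + hausdorff(P_canon, P_ref)
    return energy < phi, energy

# A perfectly aligned cloud has zero energy and triggers a keyframe.
P = np.random.default_rng(1).standard_normal((64, 3))
trigger, e = needs_new_keyframe(P, P, phi=0.1)
assert trigger and e == 0.0
```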
2. Point-Pair Feature Representation for Articulated Objects
PPF-Tracker utilizes rigidity-invariant point-pair features. For points pᵢ, pⱼ with normals nᵢ, nⱼ, the directional vector is

v_{ij} = pⱼ − pᵢ,

and the canonical 4-D PPF encoding is

PPF_{ij} = ( ‖v_{ij}‖, ∠(nᵢ, v_{ij}), ∠(nⱼ, v_{ij}), ∠(nᵢ, nⱼ) ),

which is invariant under any rigid transformation T ∈ SE(3).
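A minimal numpy sketch of the 4-D feature, with a check of its rigid invariance (rotation shown; translation cancels in pⱼ − pᵢ by construction):

```python
import numpy as np

def ppf(p_i, n_i, p_j, n_j):
    """Canonical 4-D point-pair feature: distance plus three angles."""
    d = p_j - p_i
    dist = np.linalg.norm(d)
    d_hat = d / dist
    ang = lambda a, b: np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return np.array([dist, ang(n_i, d_hat), ang(n_j, d_hat), ang(n_i, n_j)])

# Invariance check: rotate points and normals by the same R.
c, s = np.cos(0.8), np.sin(0.8)
R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
p_i, p_j = np.array([0.1, 0.2, 0.3]), np.array([0.9, -0.4, 0.5])
n_i, n_j = np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])
f_before = ppf(p_i, n_i, p_j, n_j)
f_after = ppf(R @ p_i, R @ n_i, R @ p_j, R @ n_j)
assert np.allclose(f_before, f_after)
```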
A learned pair-wise weighting based on the normal angle ∠(nᵢ, nⱼ) is introduced; biasing against nearly parallel pairs enhances voting contrast in the subsequent network heads. A set of N point pairs, each with its weighted PPF and optionally its joint coordinates, is propagated through a PointNet++ backbone that captures the relevant geometric relationships.
3. SE(3)-Tangent Pose Voting with Explicit Parameterization
Following feature extraction, the network predicts five pose parameters per part k via parallel heads:
- Translation votes: (μ_{ij}, ν_{ij})
- Orientation votes: (α_{ij}, β_{ij})
- Scale regressor: γ

Let cᵏ denote the canonical part center, {aᵣᵏ} the canonical axes, and v_{ij} as above. The translation parameters

μ_{ij} = (cᵏ − pᵢ) · v̂_{ij},   ν_{ij} = ‖(cᵏ − pᵢ) − μ_{ij} v̂_{ij}‖

describe circles of possible part centers around the pair direction. The orientation parameters (α_{ij}, β_{ij}), the angles between the canonical axes and v̂_{ij} and nᵢ respectively, vote for the canonical rotation.
Each PPF casts soft votes, via a small MLP, into discretized translation-parameter histograms and orientation histograms over Fibonacci-sphere bins. Histogram maxima are extracted for the continuous estimates (μ̂, ν̂, α̂, β̂), and the scale γ is regressed with an MSE loss.
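The orientation-voting side can be sketched as follows: Fibonacci-sphere bin construction plus hard-assignment vote accumulation and maximum extraction (an illustrative simplification of the soft-voting MLP; names and bin count are mine):

```python
import numpy as np

def fibonacci_sphere(n):
    """n approximately uniform direction bins on the unit sphere."""
    k = np.arange(n) + 0.5
    polar = np.arccos(1.0 - 2.0 * k / n)        # polar angle
    azimuth = np.pi * (1.0 + 5.0 ** 0.5) * k    # golden-angle azimuth
    return np.stack([np.sin(polar) * np.cos(azimuth),
                     np.sin(polar) * np.sin(azimuth),
                     np.cos(polar)], axis=-1)

def vote_and_decode(directions, weights, n_bins=256):
    """Accumulate weighted votes into sphere bins; return the maximum bin."""
    bins = fibonacci_sphere(n_bins)
    idx = np.argmax(directions @ bins.T, axis=1)  # nearest-bin assignment
    hist = np.bincount(idx, weights=weights, minlength=n_bins)
    return bins[np.argmax(hist)]

# Votes clustered around +z decode to a bin near +z.
rng = np.random.default_rng(0)
dirs = np.array([0.0, 0.0, 1.0]) + 0.05 * rng.standard_normal((500, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
est = vote_and_decode(dirs, np.ones(500))
assert est[2] > 0.95
```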
From the decoded votes, an element ξₜᵏ ∈ se(3) is constructed; the analytical mappings follow Eade (2013). Pose updates are performed in tangent space,

ξₜᵏ ← ξ_{t-1}ᵏ + Δξₜᵏ,   Tₜᵏ = exp(ξₜᵏ),

with the exponential map ensuring rotation-matrix orthogonality.
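The se(3) exponential map has a standard closed form (following Eade 2013); a self-contained numpy version, checking that the resulting rotation block is orthogonal:

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from se(3) (xi = [rho, w]) to a 4x4 SE(3) matrix."""
    rho, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        R, V = np.eye(3) + W, np.eye(3)   # first-order fallback
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta ** 2
        C = (1.0 - A) / theta ** 2
        R = np.eye(3) + A * W + B * W @ W        # Rodrigues formula
        V = np.eye(3) + B * W + C * W @ W        # left Jacobian
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T

T = se3_exp(np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6]))
R = T[:3, :3]
assert np.allclose(R @ R.T, np.eye(3))      # orthogonality
assert np.isclose(np.linalg.det(R), 1.0)    # proper rotation
```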
4. Kinematic Constraints and Joint-Axis Optimization
The framework incorporates kinematic-constraint refinement for articulated joints. For joints interconnecting parts, revolute joints rotate about an axis and prismatic joints slide along it; joint j is characterized by a reference point qⱼ and a direction uⱼ.
Two energy terms define the optimization:
- Geometric alignment per part k: ℰ_geoᵏ, measuring the distance (e.g. Chamfer) between the transformed canonical cloud Tₜᵏ · P_cᵏ and the observed cloud Pₜᵏ.
- Kinematic coupling per joint j: ℰ_kinʲ, penalizing disagreement between the joint reference point qⱼ and axis uⱼ as transformed by the two connected parts, with the axis and translation constraints depending on joint type (revolute vs. prismatic).

The total objective,

ℰ_comp = Σₖ ℰ_geoᵏ + λ Σⱼ ℰ_kinʲ,

is minimized, typically via Gauss–Newton, to yield refined pose estimates (Tₜᵏ)_optim. This step enforces consistency of joint articulation across parts and frames.
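A generic Gauss–Newton loop of the kind used for such refinement can be sketched in numpy. The revolute-joint residual below (recovering the joint angle that aligns a child cloud rotated about a known axis) and all names are hypothetical illustrations, not the paper's implementation:

```python
import numpy as np

def gauss_newton(residual_fn, x0, iters=50, eps=1e-6):
    """Gauss-Newton with a finite-difference Jacobian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual_fn(x)
        J = np.stack([(residual_fn(x + eps * e) - r) / eps
                      for e in np.eye(len(x))], axis=-1)
        dx = np.linalg.lstsq(J, -r, rcond=None)[0]   # solve normal equations
        x = x + dx
        if np.linalg.norm(dx) < 1e-10:
            break
    return x

def rot_about_axis(u, theta):
    """Rodrigues rotation about unit axis u by angle theta."""
    u = u / np.linalg.norm(u)
    K = np.array([[0, -u[2], u[1]], [u[2], 0, -u[0]], [-u[1], u[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

# Toy revolute joint: axis u through point q; recover the true angle 0.7 rad.
u, q = np.array([0.0, 0.0, 1.0]), np.zeros(3)
P = np.random.default_rng(1).standard_normal((50, 3))
P_obs = (P - q) @ rot_about_axis(u, 0.7).T + q
res = lambda th: ((P - q) @ rot_about_axis(u, th[0]).T + q - P_obs).ravel()
theta_hat = gauss_newton(res, np.array([0.0]))
assert abs(theta_hat[0] - 0.7) < 1e-3
```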
5. Pipeline Overview and Implementation Pseudocode
The PPF-Tracker process operates as a stream on input clouds and initial poses. The following pseudocode details the core steps:
```
Input:  frame stream {P₀ᵏ, P₁ᵏ, …} and initial poses T₀ᵏ
Output: refined poses {Tₜᵏ} and scales {sₜᵏ}

Initialize keyframe index i = 0, K₀ᵏ = I
For t = 1 … T:
    If t begins a new segment at keyframe i:
        Set Kᵏ = (T_tᵏ)⁻¹, reset canonical clouds P_cᵏ
    Canonicalize: P̄ₜᵏ ← Kᵏ · Pₜᵏ
    Sample N point pairs {pᵢ, pⱼ} from P̄ₜᵏ
    Compute (v_{ij}, PPF_{ij}) for each pair
    Run PointNet++ → features
    Predict histograms for (μ, ν) and (α, β); regress γ
    Decode votes → Δξₜᵏ ∈ se(3)
    Update ξₜᵏ ← ξ_{t-1}ᵏ + Δξₜᵏ
    Exponential map → coarse Tₜᵏ = exp(ξₜᵏ)
    Kinematic refinement → (Tₜᵏ)_optim
    Compute energy ℰₜ; if ℰₜ < φ: i ← i + 1 (new keyframe)
    Output refined (Tₜᵏ)_optim, sₜᵏ = γₜᵏ
```
6. Network Architecture, Loss Functions, and Training Protocols
PPF-Tracker deploys a PointNet++ backbone for feature learning over weighted point-pair features. Four heads operate in parallel:
- Translation: predicts softmax histograms over discretized bins for (μ, ν)
- Orientation: predicts softmax histograms over Fibonacci-sphere bins for (α, β)
- Scale: direct regression for γ
- Mask: optional part segmentation via binary prediction
Loss functions are constructed as follows:
- Translation and orientation: KL-divergence on softmax voting outputs
- Scale: Mean squared error (MSE)
- Mask: Binary cross-entropy (BCE)
The final loss combines all components as a weighted sum of the KL-divergence voting terms, the MSE scale term, and the BCE mask term. Training is conducted for 200 epochs using the Adam optimizer, with the learning rate decayed by a factor of 0.1 every 10 epochs and input clouds downsampled to 3072 points. Inference runs per frame on RTX 4090-class hardware at rates suitable for real-time robotic or AR scenarios.
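A minimal numpy sketch of the combined objective, assuming equal default weights (the weighting scheme and all function names here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl_div(p_target, logits):
    """KL(p_target || softmax(logits)) for one vote histogram."""
    q = np.clip(softmax(logits), 1e-12, 1.0)
    p = np.clip(p_target, 1e-12, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def bce(y, p):
    """Binary cross-entropy for the optional mask head."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def total_loss(vote_targets, vote_logits, s_gt, s_pred, m_gt, m_pred,
               w_vote=1.0, w_scale=1.0, w_mask=1.0):
    """Hypothetical weighted sum: KL voting + MSE scale + BCE mask."""
    l_vote = sum(kl_div(t, l) for t, l in zip(vote_targets, vote_logits))
    l_scale = float(np.mean((np.asarray(s_gt) - np.asarray(s_pred)) ** 2))
    return w_vote * l_vote + w_scale * l_scale + w_mask * bce(np.asarray(m_gt),
                                                              np.asarray(m_pred))

# Perfect predictions give a (numerically) near-zero loss.
logits = np.zeros(8)
loss = total_loss([softmax(logits)], [logits], 1.0, 1.0,
                  np.array([1.0, 0.0]), np.array([1.0, 0.0]))
assert loss < 1e-5
```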
A plausible implication is that PPF-Tracker's dynamic keyframe mechanism can adapt to unpredictable motion patterns and maintain low drift even in long sequences.
7. Applications and Implementation Considerations
PPF-Tracker is applicable to pose tracking in multi-part robotic manipulators, articulated AR objects, and category-level scene understanding, wherever rigid part motion is constrained by physically plausible kinematic joints. The framework supports extension to broader categories given annotation of joint axes.
Resource requirements are compatible with real-time deployment given modern GPUs, and the modular pipeline with explicit keyframing and refinement facilitates integration with higher-level control, mapping, or semantic segmentation subsystems.
Its empirical generalization across synthetic and real-world scenarios suggests strong domain robustness. For full implementation details, code and pretrained models are available at https://github.com/mengxh20/PPFTracker. Lie-group background follows Eade (2013).
Below is a concise summary of design choices:
| Component | Key Method | Implementation |
|---|---|---|
| Feature Backbone | PointNet++ | (v_{ij},PPF_{ij}) |
| Voting | Softmax + MLP heads | Histograms, MSE |
| Kinematic Refinement | Gauss–Newton | \mathcal E_{\rm comp} |
| Keyframe Policy | Dynamic, energy-based | Chamfer, Hausdorff |
This synthesis represents the current canonical implementation and research status of PPF-Tracker for articulated pose tracking in SE(3).