PPF-Tracker: Articulated SE(3) Pose Tracking

Updated 15 November 2025

PPF-Tracker is a category-level articulated object pose tracking framework operating in SE(3), utilizing dynamic keyframes and point-pair features.
It integrates quasi-canonicalization, SE(3)-invariant learning, and tangent space voting to achieve robust tracking under complex kinematic conditions.
Its design supports real-time applications in robotics and augmented reality through efficient drift management and Gauss–Newton kinematic refinement.

PPF-Tracker is a category-level articulated object pose tracking framework operating in the SE(3) Lie group space, specifically designed to address the challenging problem of multi-part object pose tracking under complex, real-world kinematic conditions. Leveraging quasi-canonicalization and point-pair feature representations, PPF-Tracker integrates SE(3)-invariant learning, pose voting on tangent spaces, and explicit part-joint kinematic constraints. Its full pipeline delivers robust tracking for articulated structures in robotics, augmented reality, and embodied intelligence scenarios.

1. Quasi-Canonicalization on SE(3) Manifolds

PPF-Tracker defines a systematic quasi-canonicalization procedure for articulated objects comprising $K$ rigid parts. At each frame $t$, part-wise point clouds $\mathcal P_t^k$ and predicted poses $T_t^k\in\mathrm{SE}(3)$ are processed with reference to dynamic keyframes. Frames are partitioned into segments indexed by $i$, where each segment runs between successive keyframes.

A keyframe inverse is constructed for each part: $\mathcal K_i^k = (T_{n}^k)^{-1}$, with $n$ marking the segment start. Canonicalization transforms incoming clouds within segment $i$ via

$\bar{\mathcal P}_t^k = \mathcal K_i^k(\mathcal P_t^k) = (T_{n}^k)^{-1}T_t^k(\mathcal P_t^k),\quad t\in[n,[i+1])$

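where $[i+1]$ denotes the first frame of segment $i+1$ (so that $n=[i]$). In practice, $\bar{\mathcal P}_t^k\approx \Delta T_t^k\cdot \mathcal P_c^k$, with $\mathcal P_c^k$ the canonical cloud of part $k$ and the poses taken from previous estimates. The relative pose is expressed as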

$\Delta T_t^{k(*)} = T_t^k (T_{n}^k)^{-1} \in \mathrm{SE}(3)$

with absolute pose accumulation: $T_t^k = \Bigl( \prod_{i:\,[i]\le t}\Delta T_{[i]}^k \Bigr) T_0^k.$
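The canonicalization and relative-pose relations above can be checked numerically. The following is a minimal sketch, assuming 4x4 homogeneous matrices for poses and the left-multiplied relative pose implied by the accumulation formula; the helper names (`make_pose`, `apply_pose`, `canonicalize`) are illustrative and not taken from the PPF-Tracker code.

```python
import numpy as np

def make_pose(angle_z, translation):
    """Build a 4x4 homogeneous pose from a rotation about z (radians) and a translation."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def apply_pose(T, points):
    """Apply a 4x4 pose to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]

def canonicalize(T_key, points_t):
    """Quasi-canonicalize an observed cloud with the keyframe inverse K = T_key^{-1}."""
    return apply_pose(np.linalg.inv(T_key), points_t)

rng = np.random.default_rng(0)
P_c = rng.normal(size=(100, 3))           # canonical cloud of one part (synthetic)
T_n = make_pose(0.30, [0.10, 0.00, 0.20])  # pose at the keyframe (segment start)
T_t = make_pose(0.50, [0.15, 0.05, 0.20])  # pose at the current frame
P_t = apply_pose(T_t, P_c)                 # observed cloud at frame t

# After canonicalization only the motion since the keyframe remains.
P_bar = canonicalize(T_n, P_t)
residual_motion = np.linalg.inv(T_n) @ T_t

# Left-multiplied relative pose: composing it with the keyframe pose recovers T_t.
delta_T = T_t @ np.linalg.inv(T_n)
T_t_recovered = delta_T @ T_n
```

In this toy setting the canonicalized cloud equals the canonical cloud moved by the small residual motion, which is what makes the per-segment prediction problem easier than regressing the absolute pose.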

Dynamic Keyframe Selection (DKS) centralizes drift management: after each prediction, energy is computed as

$\mathfrak E_t = \frac{1}{|\mathcal P|}(D_C(\hat{\mathcal P}_t,\mathcal P_t) + D_H(\hat{\mathcal P}_t,\mathcal P_t))$

where $D_C$ and $D_H$ are the Chamfer and Hausdorff distances between the predicted cloud $\hat{\mathcal P}_t$ and the observed cloud $\mathcal P_t$. A new keyframe is triggered if $\mathfrak E_t<\phi$, with typical threshold $\phi=0.01$. This mechanism regulates frame reference updates to minimize drift and enhance motion adaptation.
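The energy test can be sketched with brute-force distance computations. This is an illustrative sketch only: it assumes a summed, symmetric Chamfer distance (the text does not say whether the Chamfer term is summed or averaged) and the symmetric Hausdorff distance; the function names are hypothetical.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between every point of A (N, 3) and every point of B (M, 3)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def chamfer(A, B):
    """Symmetric Chamfer distance, summed over points (assumed convention)."""
    D = pairwise_dist(A, B)
    return D.min(axis=1).sum() + D.min(axis=0).sum()

def hausdorff(A, B):
    """Symmetric Hausdorff distance: the worst nearest-neighbour distance."""
    D = pairwise_dist(A, B)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def keyframe_energy(P_hat, P):
    """Energy E_t = (D_C + D_H) / |P| between predicted and observed clouds."""
    return (chamfer(P_hat, P) + hausdorff(P_hat, P)) / len(P)

def should_add_keyframe(energy, phi=0.01):
    """Trigger rule from the text: a new keyframe is selected when E_t < phi."""
    return energy < phi

P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
P_hat = P + np.array([0.0, 0.0, 0.1])   # prediction offset by 0.1 along z
energy = keyframe_energy(P_hat, P)
```

For large clouds the quadratic distance matrix becomes expensive; a KD-tree nearest-neighbour query would be the usual replacement.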

2. Point-Pair Feature Representation for Articulated Objects

PPF-Tracker uses point-pair features that are invariant to rigid motion. For points $\mathbf{p}_i,\mathbf{p}_j\in\mathcal P$ with normals $\mathbf{n}_i,\mathbf{n}_j$, the directional vector is

$\mathbf{d} = \frac{\mathbf{p}_j-\mathbf{p}_i}{\|\mathbf{p}_j-\mathbf{p}_i\|}$

and the canonical 4-D PPF encoding is

$\mathrm{PPF}(\mathbf{p}_i,\mathbf{n}_i,\mathbf{p}_j,\mathbf{n}_j) = \begin{pmatrix} \|\mathbf{p}_j-\mathbf{p}_i\| \\ \arccos(\mathbf{n}_i\cdot\mathbf{d}) \\ \arccos(\mathbf{n}_j\cdot\mathbf{d}) \\ \arccos(\mathbf{n}_i\cdot\mathbf{n}_j) \end{pmatrix}$

which is invariant under any rigid transformation $(R,t)\in\mathrm{SE}(3)$ .
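The invariance claim is easy to verify numerically. The sketch below computes the 4-D feature for one point pair and recomputes it after an arbitrary rigid motion; the function names `angle_between` and `ppf` are illustrative, and the test rotation and translation are arbitrary values chosen for the example.

```python
import numpy as np

def angle_between(a, b):
    """Angle in radians between two unit vectors, clipped for numerical safety."""
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

def ppf(p_i, n_i, p_j, n_j):
    """4-D point-pair feature: distance and three angles."""
    diff = p_j - p_i
    dist = np.linalg.norm(diff)
    d = diff / dist
    return np.array([
        dist,
        angle_between(n_i, d),
        angle_between(n_j, d),
        angle_between(n_i, n_j),
    ])

p_i, n_i = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
p_j, n_j = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
feat = ppf(p_i, n_i, p_j, n_j)

# Rigid motion: points are rotated and translated, normals are only rotated.
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.3, -1.0, 2.0])
feat_moved = ppf(R @ p_i + t, R @ n_i, R @ p_j + t, R @ n_j)
```

Because the feature depends only on a distance and on dot products of vectors that rotate together, `feat` and `feat_moved` agree up to floating-point error.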

A learned pair-wise weighting, based on the normal angle $\theta_{ij}=\arccos(\mathbf{n}_i\cdot\mathbf{n}_j)$, is introduced:

$v_{ij} = 1 - \lambda|\cos\theta_{ij}|,\quad \lambda=0.5.$

Biasing against nearly parallel pairs enhances voting contrast in subsequent network heads. A set of $N$ point pairs, each with its weighted PPF and optionally the raw coordinates and normals $\{\mathbf{p}_i,\mathbf{p}_j,\mathbf{n}_i,\mathbf{n}_j\}$, is propagated through a PointNet++ backbone that captures the relevant geometric relationships.
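As a small worked example of the weighting formula, the sketch below evaluates $v_{ij}$ for perpendicular and parallel normals. Since $\cos\theta_{ij}=\mathbf{n}_i\cdot\mathbf{n}_j$ for unit normals, no explicit `arccos` is needed; the function name `pair_weight` is illustrative.

```python
import numpy as np

def pair_weight(n_i, n_j, lam=0.5):
    """Pair weight v_ij = 1 - lam * |cos(theta_ij)| for unit normals n_i, n_j."""
    return 1.0 - lam * abs(float(np.dot(n_i, n_j)))

z_axis = np.array([0.0, 0.0, 1.0])
x_axis = np.array([1.0, 0.0, 0.0])

w_perpendicular = pair_weight(z_axis, x_axis)   # normals at 90 degrees
w_parallel = pair_weight(z_axis, z_axis)        # identical normals
w_antiparallel = pair_weight(z_axis, -z_axis)   # opposite normals
```

Perpendicular normals keep full weight, while parallel and anti-parallel normals are both down-weighted to $1-\lambda$.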

3. SE(3)-Tangent Pose Voting with Explicit Parameterization

Following feature extraction, the network predicts five quantities per part $k$, grouped into translation, orientation, and scale outputs:

Translation votes: $\mu^k, \nu^k$
Orientation votes: $\alpha^k, \beta^k$
Scale regressor: $\gamma^k$

Let $\mathbf{o}$ denote the canonical part center, $\{\mathbf{e}_1,\mathbf{e}_2,\mathbf{e}_3\}$ the axes, and $\mathbf{d}$ as above. The translation parameters

$\mu^k = (\mathbf{p}_i-\mathbf{o})\cdot \mathbf{d},\qquad \nu^k = \|(\mathbf{p}_i-\mathbf{o})-\mu^k\mathbf{d}\|_2$

describe circles of possible part centers. The orientation parameters

$\alpha^k = \mathbf{e}_1\cdot \mathbf{d},\qquad \beta^k = \mathbf{e}_2\cdot \mathbf{d}$

vote for canonical rotation.
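The geometry behind the translation votes can be made concrete: given $\mathbf{p}_i$, $\mathbf{d}$, and a pair $(\mu,\nu)$, the admissible centers form a circle of radius $\nu$ around the line through $\mathbf{p}_i$ along $\mathbf{d}$. The sketch below is a hedged illustration with hypothetical function names; the sampling of the circle into 32 candidates is an arbitrary choice for the example.

```python
import numpy as np

def translation_params(p_i, d, o):
    """(mu, nu): signed offset of p_i - o along d, and its distance from the line along d."""
    r = p_i - o
    mu = float(np.dot(r, d))
    nu = float(np.linalg.norm(r - mu * d))
    return mu, nu

def orientation_params(e1, e2, d):
    """(alpha, beta): projections of the pair direction onto the first two part axes."""
    return float(np.dot(e1, d)), float(np.dot(e2, d))

def center_candidates(p_i, d, mu, nu, n_samples=32):
    """Sample the circle of part centers consistent with (mu, nu)."""
    helper = np.array([1.0, 0.0, 0.0]) if abs(d[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(d, helper)
    u /= np.linalg.norm(u)
    w = np.cross(d, u)                      # (u, w) span the plane perpendicular to d
    phi = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    ring = np.cos(phi)[:, None] * u + np.sin(phi)[:, None] * w
    return p_i - mu * d + nu * ring

o = np.zeros(3)                             # canonical part center
p_i = np.array([1.0, 2.0, 0.5])
d = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
mu, nu = translation_params(p_i, d, o)
candidates = center_candidates(p_i, d, mu, nu)
alpha, beta = orientation_params(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), d)
```

A single pair therefore cannot fix the center on its own; accumulating the circles from many pairs is what produces a peak at the true center.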

Each PPF casts soft votes, via a small MLP, into discretized translation ($B_t$ bins) and orientation ($B_r$ Fibonacci-sphere bins) histograms. Maxima are extracted to give continuous estimates $\hat\mu^k,\hat\nu^k,\hat\alpha^k,\hat\beta^k$, and the scale $\gamma^k$ is regressed with an MSE loss.
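To illustrate orientation binning, the sketch below generates Fibonacci-sphere bin centers and accumulates weighted votes. It is a simplification under stated assumptions: votes are assigned to the nearest bin (hard assignment) rather than produced as soft MLP outputs, and the bin count of 256 is an arbitrary stand-in for $B_r$.

```python
import numpy as np

def fibonacci_sphere(n_bins):
    """Approximately uniform unit vectors on the sphere via the golden-angle spiral."""
    i = np.arange(n_bins)
    z = 1.0 - (2.0 * i + 1.0) / n_bins
    radius = np.sqrt(1.0 - z**2)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))      # golden angle increments
    return np.stack([radius * np.cos(phi), radius * np.sin(phi), z], axis=1)

def vote_direction(directions, weights, bins):
    """Accumulate weighted votes in the nearest bin and return the winning bin center."""
    nearest = np.argmax(directions @ bins.T, axis=1)
    hist = np.bincount(nearest, weights=weights, minlength=len(bins))
    return bins[np.argmax(hist)], hist

bins = fibonacci_sphere(256)
target = np.array([0.0, 0.6, 0.8])
# Eight consistent votes and three outliers pointing the opposite way.
directions = np.vstack([np.tile(target, (8, 1)), np.tile(-target, (3, 1))])
weights = np.ones(len(directions))
estimate, hist = vote_direction(directions, weights, bins)
```

The histogram maximum ignores the minority of outlier votes, which is the robustness property that voting schemes rely on.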

From $\{\hat\mu,\hat\nu,\hat\alpha,\hat\beta\}$, an element $\Delta\xi^k\in\mathfrak{se}(3)$ is constructed:

$\Delta\xi^k = \begin{bmatrix} [\hat\omega^k]_\times & \hat v^k \\ 0 & 0 \end{bmatrix},\quad \|\hat\omega^k\|<\pi$

where the analytical mappings follow Eade (2013). Pose updates are performed in tangent space:

$\xi_t^k = \xi_{t-1}^k + \Delta\xi_t^k,\qquad \hat T_t^k = \exp(\xi_t^k)$

with the exponential map ensuring that the rotation matrix is orthogonal.
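The exponential map has a standard closed form (Rodrigues' formula for the rotation plus a matrix $V$ acting on the translational part). The following is a minimal sketch of that closed form, assuming the twist is stored as a rotational part `omega` and a translational part `v`; the function names and the numeric values are illustrative, not taken from the PPF-Tracker code.

```python
import numpy as np

def hat(omega):
    """Skew-symmetric matrix [omega]_x, so that hat(omega) @ x equals cross(omega, x)."""
    wx, wy, wz = omega
    return np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])

def se3_exp(omega, v):
    """Closed-form exponential of the twist (omega, v), returned as a 4x4 pose."""
    theta = np.linalg.norm(omega)
    W = hat(omega)
    if theta < 1e-6:
        # Taylor expansions avoid dividing by a vanishing angle.
        a = 1.0 - theta**2 / 6.0
        b = 0.5 - theta**2 / 24.0
        c = 1.0 / 6.0 - theta**2 / 120.0
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
    R = np.eye(3) + a * W + b * (W @ W)
    V = np.eye(3) + b * W + c * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ np.asarray(v, dtype=float)
    return T

# Additive tangent-space update followed by the exponential map.
xi_prev = np.array([0.0, 0.0, 0.4, 0.5, 0.0, 0.0])    # (omega, v) ordering is an assumption
delta_xi = np.array([0.0, 0.0, 0.1, 0.1, 0.2, 0.0])
xi = xi_prev + delta_xi
T_hat = se3_exp(xi[:3], xi[3:])
```

Whatever values the vote decoder produces, the rotation block of `T_hat` is orthogonal with unit determinant by construction, so no re-orthogonalization step is needed.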

4. Kinematic Constraints and Joint-Axis Optimization

The framework incorporates kinematic-constraint refinement for articulated joints. For $J$ joints interconnecting $K$ parts, revolute joints rotate about an axis $\ell_j$, while prismatic joints slide along it. Joint $j$ is characterized by a reference point $\mathbf{q}^j$ and a direction $\mathbf{u}^j$.

Two energy terms define the optimization:

Geometric alignment per part:

$\mathcal E_{\rm geo} = \sum_{k=1}^K \| (T^k)^{-1}\mathcal P_t^k - \mathcal P_c^k \|_F^2$

Kinematic coupling per joint:

$\mathcal E_{\rm kin} = \sum_{j=1}^{J-1} \|T^j(q^j)' - T^{j+1}(q^j)'\|_2^2$

with axis and translation constraints depending on joint type.

The total objective,

$\mathcal E_{\rm comp} = \mathcal E_{\rm geo} + \lambda_{\rm kin}\mathcal E_{\rm kin}$

is minimized, typically via Gauss–Newton, to yield refined pose estimates $(\hat T_t^k)_{\rm optim}$. This step enforces consistency of joint articulation across parts and frames.
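The objective itself can be evaluated in a few lines; the sketch below does only that and omits the Gauss–Newton solve. It rests on several assumptions that the text does not settle: observed and canonical clouds are in point-wise correspondence, each joint's reference point is expressed in a canonical space shared by the two adjacent parts, consecutive parts are the ones coupled, and $\lambda_{\rm kin}=1$ is a placeholder value. All names are hypothetical.

```python
import numpy as np

def rot_z(angle):
    """3x3 rotation about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose(R, t):
    """Assemble a 4x4 homogeneous pose."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def apply_pose(T, points):
    """Apply a 4x4 pose to a point (3,) or a cloud (N, 3)."""
    return points @ T[:3, :3].T + T[:3, 3]

def geometric_energy(poses, observed, canonical):
    """Sum over parts of || T^-1 P_t - P_c ||_F^2 (assumes known correspondences)."""
    return sum(
        np.sum((apply_pose(np.linalg.inv(T), P_t) - P_c) ** 2)
        for T, P_t, P_c in zip(poses, observed, canonical)
    )

def kinematic_energy(poses, joint_points):
    """Sum over joints of || T^j q^j - T^{j+1} q^j ||^2 for consecutive parts."""
    return sum(
        np.sum((apply_pose(poses[j], q) - apply_pose(poses[j + 1], q)) ** 2)
        for j, q in enumerate(joint_points)
    )

def composite_energy(poses, observed, canonical, joint_points, lam_kin=1.0):
    return geometric_energy(poses, observed, canonical) + lam_kin * kinematic_energy(poses, joint_points)

rng = np.random.default_rng(1)
canonical = [rng.normal(size=(50, 3)), rng.normal(size=(50, 3))]
q = np.array([0.5, 0.0, 0.0])                       # revolute joint pivot
T1 = pose(rot_z(0.2), [0.1, 0.0, 0.0])
R_joint = rot_z(0.7)
T2 = T1 @ pose(R_joint, q - R_joint @ q)            # part 2 rotates about the pivot
poses = [T1, T2]
observed = [apply_pose(T, P) for T, P in zip(poses, canonical)]

E_consistent = composite_energy(poses, observed, canonical, [q])

T2_bad = T2.copy()
T2_bad[:3, 3] += [0.05, 0.0, 0.0]                   # pull part 2 off the joint
E_kin_bad = kinematic_energy([T1, T2_bad], [q])
```

With consistent poses both terms vanish; displacing one part makes the kinematic term grow with the squared gap at the joint, which is the signal a Gauss–Newton step would reduce.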

5. Pipeline Overview and Implementation Pseudocode

The PPF-Tracker process operates as a stream on input clouds and initial poses. The following pseudocode details the core steps:

Input:   Frame stream {P₀ᵏ, P₁ᵏ, …} and initial poses T₀ᵏ.
Output:  Refined poses {Tₜᵏ} and scales {sₜᵏ}.
Initialize keyframe index i = 0, K₀ᵏ = I.
For t = 1 … T:
    If t begins a new segment at keyframe i:
        Set Kᵏ = (T_nᵏ)⁻¹ for the segment start n; reset canonical clouds P_cᵏ.
    Canonicalize: P̄ₜᵏ ← Kᵏ·Pₜᵏ.
    Sample N point pairs {p_i, p_j} from P̄ₜᵏ.
    Compute (v_{ij}, PPF_{ij}) for each pair.
    Run PointNet++ → features.
    Predict histograms for (μ, ν) and (α, β), and a regression for γ.
    Decode votes → Δξₜᵏ in se(3).
    Update ξₜᵏ ← ξ_{t-1}ᵏ + Δξₜᵏ.
    Exponential map → coarse Tₜᵏ = exp(ξₜᵏ).
    Kinematic refinement → (Tₜᵏ)_{optim}.
    Compute energy ℰₜ; if ℰₜ < φ: i ← i+1 (new keyframe).
    Output refined (Tₜᵏ)_{optim}, sₜᵏ = γₜᵏ.

This single-stream design supports online operation and naturally accommodates dynamic keyframe selection and kinematic refinement.

6. Network Architecture, Loss Functions, and Training Protocols

PPF-Tracker deploys a PointNet++ backbone for feature learning over weighted point-pair features. Four heads operate in parallel:

Translation: Predicts softmax histograms for $\mu, \nu$ with $B_t$ bins
Orientation: Predicts softmax histogram over $B_r$ bins for $\alpha, \beta$
Scale: Regression for $\gamma$
Mask: Optional part segmentation via binary prediction

Loss functions are constructed as follows:

Translation and orientation: KL-divergence on softmax voting outputs
Scale: Mean squared error (MSE)
Mask: Binary cross-entropy (BCE)

The final loss combines all components:

$\mathcal L = 0.3\,\mathcal L_{\rm trans} + 0.3\,\mathcal L_{\rm orient} + 0.2\,\mathcal L_{\rm scale} + 0.2\,\mathcal L_{\rm mask}$

Training is conducted for 200 epochs using the Adam optimizer with an initial learning rate of $10^{-3}$, decayed by a factor of 0.1 every 10 epochs, with input clouds downsampled to 3072 points. Inference is performed per frame with a runtime of $\approx0.07\,\text{s/frame}$ (roughly 14 frames per second) on RTX 4090-class hardware, which indicates suitability for real-time robotic or AR scenarios.
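The loss weighting and the step schedule can be written down directly. The sketch below is framework-agnostic NumPy rather than training code; the direction of the KL divergence (target relative to prediction) is an assumption, and the function names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) over histogram bins, averaged over the batch."""
    t = np.clip(target, eps, 1.0)
    p = np.clip(predicted, eps, 1.0)
    return float(np.mean(np.sum(target * (np.log(t) - np.log(p)), axis=-1)))

def total_loss(l_trans, l_orient, l_scale, l_mask):
    """Weighted sum with the coefficients quoted in the text."""
    return 0.3 * l_trans + 0.3 * l_orient + 0.2 * l_scale + 0.2 * l_mask

def step_lr(epoch, base_lr=1e-3, gamma=0.1, step=10):
    """Learning rate after step decay: multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

target_hist = np.array([[0.0, 0.1, 0.8, 0.1]])      # soft target over 4 bins
predicted_hist = softmax(np.array([[0.0, 1.0, 3.0, 1.0]]))
l_trans = kl_divergence(target_hist, predicted_hist)
loss = total_loss(l_trans, l_orient=0.0, l_scale=0.0, l_mask=0.0)
```

Because the four coefficients sum to one, the total loss stays on the same scale as its individual terms.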

A plausible implication is that PPF-Tracker's dynamic keyframe mechanism can adapt to unpredictable motion patterns and maintain low drift even in long sequences.

7. Applications and Implementation Considerations

PPF-Tracker is applicable to pose tracking in multi-part robotic manipulators, articulated AR objects, and category-level scene understanding, wherever rigid part motion is constrained by physically plausible kinematic joints. The framework supports extension to broader categories given annotation of joint axes.

Resource requirements are compatible with real-time deployment given modern GPUs, and the modular pipeline with explicit keyframing and refinement facilitates integration with higher-level control, mapping, or semantic segmentation subsystems.

Its empirical generalization across synthetic and real-world scenarios suggests strong domain robustness. For full implementation details, the code and pretrained models are available at https://github.com/mengxh20/PPFTracker. Lie group background follows Eade (2013).

Below is a concise summary of design choices:

Component	Key Method	Implementation
Feature Backbone	PointNet++	$(v_{ij},\mathrm{PPF}_{ij})$
Voting	Softmax + MLP heads	Histograms, MSE
Kinematic Refinement	Gauss–Newton	$\mathcal E_{\rm comp}$
Keyframe Policy	Dynamic, energy-based	Chamfer, Hausdorff

This synthesis represents the current canonical implementation and research status of PPF-Tracker for articulated pose tracking in SE(3).
