
PPF-Tracker: Articulated SE(3) Pose Tracking

Updated 15 November 2025
  • PPF-Tracker is a category-level articulated object pose tracking framework operating in SE(3), utilizing dynamic keyframes and point-pair features.
  • It integrates quasi-canonicalization, SE(3)-invariant learning, and tangent space voting to achieve robust tracking under complex kinematic conditions.
  • Its design supports real-time applications in robotics and augmented reality through efficient drift management and Gauss–Newton kinematic refinement.

PPF-Tracker is a category-level articulated object pose tracking framework operating in the SE(3) Lie group space, specifically designed to address the challenging problem of multi-part object pose tracking under complex, real-world kinematic conditions. Leveraging quasi-canonicalization and point-pair feature representations, PPF-Tracker integrates SE(3)-invariant learning, pose voting on tangent spaces, and explicit part-joint kinematic constraints. Its full pipeline delivers robust tracking for articulated structures in robotics, augmented reality, and embodied intelligence scenarios.

1. Quasi-Canonicalization on SE(3) Manifolds

PPF-Tracker defines a systematic quasi-canonicalization procedure for articulated objects comprising $K$ rigid parts. At each frame $t$, part-wise point clouds $\mathcal P_t^k$ and predicted poses $T_t^k\in\mathrm{SE}(3)$ are processed with reference to dynamic keyframes. Frames are partitioned into segments indexed by $i$, where each segment runs between successive keyframes.

A keyframe inverse is constructed for each part: $\mathcal K_i^k = (T_{n}^k)^{-1}$, with $n$ marking the segment start. Canonicalization transforms incoming clouds within segment $i$ via

$$\bar{\mathcal P}_t^k = \mathcal K_i^k(\mathcal P_t^k) = (T_{n}^k)^{-1}T_t^k(\mathcal P_t^k),\quad t\in[n,[i+1])$$

where, in practice, $\bar{\mathcal P}_t^k\approx \Delta T_t^k\cdot \mathcal P_c^k$ using previous estimates. The relative pose with respect to the keyframe is expressed as

$$\Delta T_t^{k(*)} = T_t^k (T_{n}^k)^{-1} \in \mathrm{SE}(3)$$

with absolute pose accumulation:

$$T_t^k = \Bigl( \prod_{i:\,[i]\le t}\Delta T_{[i]}^k \Bigr) T_0^k.$$
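The keyframe inverse, relative pose, and accumulation step can be illustrated with a minimal sketch. It assumes the relative pose is $T_t^k (T_n^k)^{-1}$, which is the form implied by left-multiplying increments in the accumulation formula. The helper names (`rot_z`, `matmul`, `inverse`, `apply`) and the example poses are hypothetical and not taken from the PPF-Tracker code.

```python
import math

def rot_z(theta, t):
    """4x4 rigid transform: rotation by theta about z plus translation t."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, t[0]],
            [s, c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def inverse(T):
    """Inverse of a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][k] * T[k][3] for k in range(3)) for i in range(3)]
    return [Rt[0] + [t[0]], Rt[1] + [t[1]], Rt[2] + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def apply(T, p):
    """Apply a rigid transform to a 3-D point."""
    return [sum(T[i][k] * p[k] for k in range(3)) + T[i][3] for i in range(3)]

T_n = rot_z(0.3, [1.0, 0.0, 0.5])    # pose at the keyframe (segment start)
T_t = rot_z(0.8, [1.2, -0.1, 0.5])   # pose at the current frame

K = inverse(T_n)                     # keyframe inverse K_i^k
delta = matmul(T_t, K)               # relative pose, assumed T_t (T_n)^-1
recovered = matmul(delta, T_n)       # accumulation: T_t = delta * T_n

# A point observed at the keyframe pose maps back to the canonical frame.
p_canonical = [0.2, 0.1, 0.0]
p_observed = apply(T_n, p_canonical)
p_bar = apply(K, p_observed)
```

Under these assumptions, `recovered` matches `T_t` and `p_bar` matches `p_canonical` up to floating-point rounding.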

Dynamic Keyframe Selection (DKS) handles drift management: after each prediction, an energy is computed as

$$\mathfrak E_t = \frac{1}{|\mathcal P|}\bigl(D_C(\hat{\mathcal P}_t,\mathcal P_t) + D_H(\hat{\mathcal P}_t,\mathcal P_t)\bigr)$$

where $D_C$ and $D_H$ are the Chamfer and Hausdorff distances between the predicted cloud $\hat{\mathcal P}_t$ and the observed cloud $\mathcal P_t$. A new keyframe is triggered if $\mathfrak E_t<\phi$, with typical threshold $\phi=0.01$. This mechanism regulates frame reference updates to minimize drift and enhance motion adaptation.
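As a rough illustration of the energy test, the sketch below uses brute-force nearest-neighbour search. The exact Chamfer convention is an assumption (here, the symmetric sum of nearest-neighbour distances), as are the function names and the toy clouds; the source does not specify them.

```python
import math

def nearest(p, cloud):
    """Distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)

def chamfer(a, b):
    """Symmetric sum of nearest-neighbour distances (assumed convention)."""
    return sum(nearest(p, b) for p in a) + sum(nearest(q, a) for q in b)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbour distance."""
    return max(max(nearest(p, b) for p in a), max(nearest(q, a) for q in b))

def keyframe_energy(predicted, observed):
    """Energy: (D_C + D_H) / |P|, with |P| taken as the observed cloud size."""
    return (chamfer(predicted, observed) + hausdorff(predicted, observed)) / len(observed)

def is_new_keyframe(predicted, observed, phi=0.01):
    """Trigger rule from the text: a new keyframe when the energy is below phi."""
    return keyframe_energy(predicted, observed) < phi

observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
good = [(x + 0.001, y, z) for x, y, z in observed]   # accurate prediction
bad = [(x + 0.1, y, z) for x, y, z in observed]      # drifted prediction
```

With these toy clouds, `is_new_keyframe(good, observed)` is true and `is_new_keyframe(bad, observed)` is false. A real implementation would replace the quadratic search with a k-d tree or a GPU kernel.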

2. Point-Pair Feature Representation for Articulated Objects

PPF-Tracker uses point-pair features that are invariant to rigid transformations. For points $\mathbf{p}_i,\mathbf{p}_j\in\mathcal P$ with normals $\mathbf{n}_i,\mathbf{n}_j$, the directional vector is

$$\mathbf{d} = \frac{\mathbf{p}_j-\mathbf{p}_i}{\|\mathbf{p}_j-\mathbf{p}_i\|}$$

and the canonical 4-D PPF encoding is

$$\mathrm{PPF}(\mathbf{p}_i,\mathbf{n}_i,\mathbf{p}_j,\mathbf{n}_j) = \begin{pmatrix} \|\mathbf{p}_j-\mathbf{p}_i\| \\ \arccos(\mathbf{n}_i\cdot\mathbf{d}) \\ \arccos(\mathbf{n}_j\cdot\mathbf{d}) \\ \arccos(\mathbf{n}_i\cdot\mathbf{n}_j) \end{pmatrix}$$

which is invariant under any rigid transformation $(R,t)\in\mathrm{SE}(3)$.
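The invariance claim can be checked numerically. The following sketch computes the 4-D feature for one pair and again after a rigid motion; for brevity the motion is restricted to a rotation about the z-axis plus a translation. All helper names and sample values are illustrative assumptions.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def angle(a, b):
    """Angle between unit vectors; clamp guards acos against rounding."""
    return math.acos(max(-1.0, min(1.0, dot(a, b))))

def ppf(p_i, n_i, p_j, n_j):
    """4-D point-pair feature: distance plus three angles."""
    diff = sub(p_j, p_i)
    dist = math.sqrt(dot(diff, diff))
    d = [x / dist for x in diff]
    return (dist, angle(n_i, d), angle(n_j, d), angle(n_i, n_j))

def rotate_z(v, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]

def rigid(p, theta, t):
    """Rotate a point about z, then translate it."""
    return [a + b for a, b in zip(rotate_z(p, theta), t)]

p_i, n_i = [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]
p_j, n_j = [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]
before = ppf(p_i, n_i, p_j, n_j)

theta, t = 0.7, [0.3, -1.2, 2.0]
# Points are rotated and translated; normals are only rotated.
after = ppf(rigid(p_i, theta, t), rotate_z(n_i, theta),
            rigid(p_j, theta, t), rotate_z(n_j, theta))
```

Here `before` and `after` agree up to rounding, because the feature depends only on relative distances and angles.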

A learned pair-wise weighting, based on the normal angle $\theta_{ij}=\arccos(\mathbf{n}_i\cdot\mathbf{n}_j)$, is introduced:

$$v_{ij} = 1 - \lambda|\cos\theta_{ij}|,\quad \lambda=0.5$$

Biasing against nearly parallel pairs enhances voting contrast in subsequent network heads. A set of $N$ point pairs, each with its weighted PPF and optionally its joint coordinates $\{\mathbf{p}_i,\mathbf{p}_j,\mathbf{n}_i,\mathbf{n}_j\}$, is propagated through a PointNet++ backbone that captures the relevant geometric relationships.
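A small worked example of the pair weight, assuming unit-length normals so that $\cos\theta_{ij}$ is simply their dot product. The function name `pair_weight` is hypothetical.

```python
def pair_weight(n_i, n_j, lam=0.5):
    """v_ij = 1 - lam * |cos(theta_ij)| for unit normals n_i, n_j."""
    cos_theta = sum(a * b for a, b in zip(n_i, n_j))
    return 1.0 - lam * abs(cos_theta)

parallel = pair_weight([0.0, 0.0, 1.0], [0.0, 0.0, 1.0])        # 0.5
antiparallel = pair_weight([0.0, 0.0, 1.0], [0.0, 0.0, -1.0])   # 0.5
perpendicular = pair_weight([0.0, 0.0, 1.0], [1.0, 0.0, 0.0])   # 1.0
```

With $\lambda=0.5$ the weight ranges from 0.5 for parallel or antiparallel normals to 1.0 for perpendicular normals, so pairs that carry more orientation information count for more.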

3. SE(3)-Tangent Pose Voting with Explicit Parameterization

Following feature extraction, the network predicts five quantities per part $k$:

  • Translation votes: $\mu^k, \nu^k$
  • Orientation votes: $\alpha^k, \beta^k$
  • Scale regressor: $\gamma^k$

Let $\mathbf{o}$ denote the canonical part center, $\{\mathbf{e}_1,\mathbf{e}_2,\mathbf{e}_3\}$ the axes, and $\mathbf{d}$ as above. The translation parameters

$$\mu^k = (\mathbf{p}_i-\mathbf{o})\cdot \mathbf{d},\qquad \nu^k = \|(\mathbf{p}_i-\mathbf{o})-\mu^k\mathbf{d}\|_2$$

describe circles of possible part centers. The orientation parameters

$$\alpha^k = \mathbf{e}_1\cdot \mathbf{d},\qquad \beta^k = \mathbf{e}_2\cdot \mathbf{d}$$

vote for canonical rotation.
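To make the geometry of these parameters concrete, the sketch below computes them for one point pair with a known center and axes, then checks that the center lies on a circle of radius $\nu$ around the line through $\mathbf{p}_i$ with direction $\mathbf{d}$. The function name `vote_targets` and all sample values are assumptions for illustration.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vote_targets(p_i, p_j, o, e1, e2):
    """Return (mu, nu, alpha, beta, d) for one point pair."""
    diff = sub(p_j, p_i)
    length = math.sqrt(dot(diff, diff))
    d = [x / length for x in diff]
    rel = sub(p_i, o)
    mu = dot(rel, d)                               # offset along d
    perp = [r - mu * x for r, x in zip(rel, d)]    # component orthogonal to d
    nu = math.sqrt(dot(perp, perp))                # circle radius
    return mu, nu, dot(e1, d), dot(e2, d), d

o = [0.0, 0.0, 0.0]                                # canonical part center
e1, e2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]          # first two canonical axes
p_i, p_j = [1.0, 0.0, 1.0], [4.0, 4.0, 1.0]

mu, nu, alpha, beta, d = vote_targets(p_i, p_j, o, e1, e2)

# Circle of candidate centers: centered at p_i - mu*d, in the plane normal to d.
foot = [p - mu * x for p, x in zip(p_i, d)]
radius_to_center = math.sqrt(dot(sub(foot, o), sub(foot, o)))
```

For this pair `mu` is 0.6, `alpha` is 0.6, `beta` is 0.8, and `radius_to_center` equals `nu`, which is why a single pair constrains the center only up to a circle and many pairs must vote.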

Each PPF casts soft votes, via a small MLP, into discretized translation ($B_t$ bins) and orientation ($B_r$ Fibonacci sphere bins) histograms. Maxima are extracted for continuous estimates $\hat\mu^k,\hat\nu^k,\hat\alpha^k,\hat\beta^k$, and the scale $\gamma^k$ is regressed with an MSE loss.
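The orientation histogram can be sketched with the standard Fibonacci (golden-angle) sphere construction. In this illustration the vote weights are hand-picked stand-ins for the MLP outputs, and the bin count of 64 and all function names are assumptions rather than values from the source.

```python
import math

def fibonacci_sphere(n):
    """n roughly uniform unit directions from the golden-angle spiral."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        phi = k * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points

def nearest_bin(direction, bins):
    """Index of the bin direction with the largest dot product."""
    return max(range(len(bins)),
               key=lambda b: sum(x * y for x, y in zip(direction, bins[b])))

def accumulate(votes, n_bins):
    """Sum soft vote weights per bin; votes are (bin_index, weight) pairs."""
    hist = [0.0] * n_bins
    for b, w in votes:
        hist[b] += w
    return hist

bins = fibonacci_sphere(64)
# Two confident votes for +z and one weaker outlier vote for -z.
raw_votes = [((0.0, 0.0, 1.0), 0.9), ((0.0, 0.0, 1.0), 0.8), ((0.0, 0.0, -1.0), 0.4)]
votes = [(nearest_bin(direction, bins), w) for direction, w in raw_votes]
hist = accumulate(votes, len(bins))
winner = max(range(len(hist)), key=hist.__getitem__)
```

The winning bin is the one closest to +z, showing how the histogram maximum suppresses the outlier vote.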

From $\{\hat\mu,\hat\nu,\hat\alpha,\hat\beta\}$, an element $\Delta\xi^k\in\mathfrak{se}(3)$ is constructed:

$$\Delta\xi^k = \begin{bmatrix} [\hat\omega^k]_\times & \hat v^k \\ 0 & 0 \end{bmatrix},\quad \|\hat\omega^k\|<\pi$$

where analytical mappings follow Eade (2013). Pose updates are performed in tangent space:

$$\xi_t^k = \xi_{t-1}^k + \Delta\xi_t^k,\qquad \hat T_t^k = \exp(\xi_t^k)$$

with the exponential map ensuring rotation matrix orthogonality.
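A minimal sketch of the tangent-space update and the closed-form exponential map on SE(3), using the Rodrigues-type formulas commonly given in Lie group references such as Eade (2013). The function names and the numeric twists are illustrative assumptions; how PPF-Tracker decodes votes into $\hat\omega^k$ and $\hat v^k$ is not reproduced here.

```python
import math

def hat(w):
    """Skew-symmetric matrix [w]_x of a 3-vector."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]

def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def se3_exp(omega, v):
    """Map a twist (omega, v) to a rotation matrix R and translation t."""
    theta = math.sqrt(sum(x * x for x in omega))
    K = hat(omega)
    K2 = matmul3(K, K)
    if theta < 1e-8:
        a, b, c = 1.0, 0.5, 1.0 / 6.0            # small-angle limits
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][k] * v[k] for k in range(3)) for i in range(3)]
    return R, t

# Additive update in the tangent space, then map back to the group.
xi_prev = ([0.0, 0.0, 0.2], [0.1, 0.0, 0.0])
delta_xi = ([0.0, 0.0, 0.1], [0.05, 0.0, 0.0])
xi = ([p + q for p, q in zip(xi_prev[0], delta_xi[0])],
      [p + q for p, q in zip(xi_prev[1], delta_xi[1])])
R, t = se3_exp(*xi)

# Reference case: quarter turn about z with unit forward velocity.
R90, t90 = se3_exp([0.0, 0.0, math.pi / 2], [1.0, 0.0, 0.0])
```

Because `R` is built from the closed-form series, it is orthogonal by construction, which is the property the text attributes to the exponential map.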

4. Kinematic Constraints and Joint-Axis Optimization

The framework incorporates kinematic-constraint refinement for articulated joints. For $J$ joints interconnecting $K$ parts, revolute joints rotate about the axis $\ell_j$, whereas prismatic joints slide along it. Joint $j$ is characterized by a reference point $\mathbf{q}^j$ and a direction $\mathbf{u}^j$.

Two energy terms define the optimization:

  1. Geometric alignment per part:

$$\mathcal E_{\rm geo} = \sum_{k=1}^K \| (T^k)^{-1}\mathcal P_t^k - \mathcal P_c^k \|_F^2$$

  2. Kinematic coupling per joint:

$$\mathcal E_{\rm kin} = \sum_{j=1}^{J-1} \|T^j(q^j)' - T^{j+1}(q^j)'\|_2^2$$

with axis and translation constraints depending on joint type.

The total objective,

$$\mathcal E_{\rm comp} = \mathcal E_{\rm geo} + \lambda_{\rm kin}\mathcal E_{\rm kin}$$

is minimized, typically via Gauss–Newton, to yield refined pose estimates $(\hat T_t^k)_{\rm optim}$. This step enforces consistency of joint articulation across parts and frames.
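The Gauss–Newton step can be illustrated on a deliberately reduced problem: a single planar revolute joint, where the child part's pose is parameterized by one joint angle about the reference point $\mathbf{q}$. This enforces the joint constraint through the parameterization rather than through the $\lambda_{\rm kin}$ penalty, so it is a simplification of the objective above and not the authors' solver. All names and data are hypothetical.

```python
import math

def rot2(theta, p):
    """Rotate a 2-D vector by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def gauss_newton_joint_angle(canonical, observed, q, theta0=0.0, iters=10):
    """Fit the joint angle minimizing sum ||R(theta)(p_c - q) + q - p_obs||^2."""
    theta = theta0
    for _ in range(iters):
        jtj, jtr = 0.0, 0.0
        for pc, po in zip(canonical, observed):
            a = (pc[0] - q[0], pc[1] - q[1])
            ra = rot2(theta, a)
            r = (ra[0] + q[0] - po[0], ra[1] + q[1] - po[1])   # residual
            jac = (-ra[1], ra[0])                              # d(residual)/d(theta)
            jtj += jac[0] * jac[0] + jac[1] * jac[1]
            jtr += jac[0] * r[0] + jac[1] * r[1]
        theta -= jtr / jtj                                     # Gauss-Newton update
    return theta

q = (1.0, 0.0)                                    # joint reference point
canonical = [(2.0, 0.0), (2.0, 1.0), (3.0, 0.5)]  # child part in canonical pose
true_theta = 0.5
observed = []
for pc in canonical:
    ra = rot2(true_theta, (pc[0] - q[0], pc[1] - q[1]))
    observed.append((ra[0] + q[0], ra[1] + q[1]))

theta_hat = gauss_newton_joint_angle(canonical, observed, q)
```

On this noise-free toy data the estimate converges to the true angle within a few iterations. The full problem replaces the scalar angle with SE(3) increments per part and adds the kinematic term to the normal equations.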

5. Pipeline Overview and Implementation Pseudocode

The PPF-Tracker process operates as a stream on input clouds and initial poses. The following pseudocode details the core steps:

Input:   Frame stream {P₀ᵏ, P₁ᵏ, …} and initial poses T₀ᵏ.
Output:  Refined poses {Tₜᵏ} and scales {sₜᵏ}.
Initialize keyframe index i = 0, K₀ᵏ = I.
For t = 1 … T:
    If t begins a new segment at keyframe i:
        Set Kᵏ = (Tₜᵏ)⁻¹, reset canonical clouds P_cᵏ.
    Canonicalize:  P̄ₜᵏ ← Kᵏ·Pₜᵏ.
    Sample N point pairs {p_i, p_j} from P̄ₜᵏ.
    Compute (v_{ij}, PPF_{ij}) for each pair.
    Run PointNet++ → features.
    Predict histograms for (μ, ν), (α, β) and regression for γ.
    Decode votes → Δξₜᵏ in se(3).
    Update ξₜᵏ ← ξ_{t-1}ᵏ + Δξₜᵏ.
    Exponential map → coarse Tₜᵏ = exp(ξₜᵏ).
    Kinematic refinement → (Tₜᵏ)_{optim}.
    Compute energy ℰₜ; if ℰₜ < φ: i ← i+1 (new keyframe).
    Output refined (Tₜᵏ)_{optim}, sₜᵏ = γₜᵏ.
This single-stream design supports online operation and naturally accommodates dynamic keyframe selection and kinematic refinement.

6. Network Architecture, Loss Functions, and Training Protocols

PPF-Tracker uses a PointNet++ backbone for feature learning over weighted point-pair features. Four heads operate in parallel:

  • Translation: Predicts softmax histograms for $\mu, \nu$ with $B_t$ bins
  • Orientation: Predicts a softmax histogram over $B_r$ bins for $\alpha$ and $\beta$
  • Scale: Regression for $\gamma$
  • Mask: Optional part segmentation via binary prediction

Loss functions are constructed as follows:

  • Translation and orientation: KL-divergence on softmax voting outputs
  • Scale: Mean squared error (MSE)
  • Mask: Binary cross-entropy (BCE)

The final loss combines all components:

$$\mathcal L = 0.3\,\mathcal L_{\rm trans} + 0.3\,\mathcal L_{\rm orient} + 0.2\,\mathcal L_{\rm scale} + 0.2\,\mathcal L_{\rm mask}$$

Training runs for 200 epochs using the Adam optimizer with an initial learning rate of $10^{-3}$, decayed by a factor of 0.1 every 10 epochs, with input clouds downsampled to 3072 points. Inference is performed per frame with a runtime of $\approx 0.07\,\text{s/frame}$ (roughly 14 frames per second) on RTX 4090-class hardware, indicating suitability for real-time robotic or AR scenarios.
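The loss weighting and step-decay schedule can be written out as a short sketch. The KL direction (target relative to prediction), the epsilon smoothing, and all function names are assumptions; in a PyTorch training loop the schedule would typically be handled by a step scheduler instead.

```python
import math

LOSS_WEIGHTS = {"trans": 0.3, "orient": 0.3, "scale": 0.2, "mask": 0.2}

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete histograms."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0.0)

def total_loss(l_trans, l_orient, l_scale, l_mask):
    """Weighted sum of the four per-head losses."""
    return (LOSS_WEIGHTS["trans"] * l_trans
            + LOSS_WEIGHTS["orient"] * l_orient
            + LOSS_WEIGHTS["scale"] * l_scale
            + LOSS_WEIGHTS["mask"] * l_mask)

def learning_rate(epoch, base_lr=1e-3, gamma=0.1, step=10):
    """Step decay: multiply the rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

target = [0.1, 0.7, 0.2]       # soft ground-truth vote histogram
predicted = [0.2, 0.5, 0.3]    # softmax output of a voting head
l_trans = kl_divergence(target, predicted)
loss = total_loss(l_trans, l_trans, 0.04, 0.3)
schedule = [learning_rate(e) for e in (0, 9, 10, 25)]
```

The schedule evaluates to approximately 1e-3, 1e-3, 1e-4, and 1e-5 for epochs 0, 9, 10, and 25.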

A plausible implication is that PPF-Tracker's dynamic keyframe mechanism can adapt to unpredictable motion patterns and maintain low drift even in long sequences.

7. Applications and Implementation Considerations

PPF-Tracker is applicable to pose tracking in multi-part robotic manipulators, articulated AR objects, and category-level scene understanding, wherever rigid part motion is constrained by physically plausible kinematic joints. The framework supports extension to broader categories given annotation of joint axes.

Resource requirements are compatible with real-time deployment given modern GPUs, and the modular pipeline with explicit keyframing and refinement facilitates integration with higher-level control, mapping, or semantic segmentation subsystems.

Its empirical generalization across synthetic and real-world scenarios suggests strong domain robustness. For full implementation details, the code and pretrained models are available at https://github.com/mengxh20/PPFTracker. Lie group background follows Eade (2013).

Below is a concise summary of design choices:

| Component | Key Method | Implementation |
|---|---|---|
| Feature Backbone | PointNet++ | $(v_{ij}, \mathrm{PPF}_{ij})$ |
| Voting | Softmax + MLP heads | Histograms, MSE |
| Kinematic Refinement | Gauss–Newton | $\mathcal E_{\rm comp}$ |
| Keyframe Policy | Dynamic, energy-based | Chamfer, Hausdorff |

This synthesis represents the current canonical implementation and research status of PPF-Tracker for articulated pose tracking in SE(3).
