Jointformer: 3D Pose & VOS Transformers
- The paper introduces a Transformer-based approach for monocular 3D human pose estimation, replacing hand-crafted joint graphs with learned self-attention and error refinement modules.
- A dedicated error prediction module and lightweight MLP refinement block boost accuracy, reducing MPJPE by 1–3 mm compared to previous methods.
- The second Jointformer model applies joint feature correspondence and compressed memory tokens to achieve state-of-the-art performance in video object segmentation.
Jointformer refers to two independently developed Transformer-based architectures: one for monocular 3D human pose estimation from a single image (Lutz et al., 2022), and another for video object segmentation (VOS) with integrated feature matching and memory modeling (Zhang et al., 2023). Despite their distinct domains, both systems leverage joint modeling within the Transformer framework, in the sense of capturing complex interdependencies, whether between human body joints or between frames and objects in a video sequence. The following article presents each variant in detail.
1. Monocular 3D Human Pose Estimation: Jointformer Architecture
The original Jointformer addresses lifting a single 2D human skeleton (predicted keypoints) to a 3D pose using a Transformer encoder with specialized mechanisms for error prediction and refinement (Lutz et al., 2022). The approach replaces previous graph convolutional network (GCN) methods, which depend on manually defined joint relationships, with a generalized self-attention mechanism that learns these dependencies implicitly from data.
Key architectural steps:
- Tokenization: Each 2D joint, given by its normalized image coordinates, is projected to a fixed-dimensional token by a learned affine transform plus a joint-type embedding.
- Stacked Transformer Encoder: A stack of identical Transformer-encoder layers operates on the full set of joint tokens. Inside each layer, tokens are projected to queries, keys, and values, and multi-head self-attention computes higher-order dependencies among joints. Residual connections and position-wise feedforward sublayers are used throughout.
- Intermediate Supervision: Auxiliary 3D-pose regression heads are attached to selected encoder depths (e.g., layers 4 and 8), enforcing useful intermediate feature learning through per-joint L2 errors.
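The three steps above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the joint count, token width, layer count, and the choice of supervising every layer are placeholder assumptions.

```python
import torch
import torch.nn as nn

class PoseTokenizer(nn.Module):
    """Tokenization step: each 2D joint (x, y) becomes a token via a
    learned affine map plus a joint-type embedding (dims illustrative)."""
    def __init__(self, num_joints=17, dim=64):
        super().__init__()
        self.proj = nn.Linear(2, dim)                   # learned affine transform
        self.joint_emb = nn.Embedding(num_joints, dim)  # joint-type embedding

    def forward(self, joints_2d):                       # (B, J, 2)
        ids = torch.arange(joints_2d.size(1), device=joints_2d.device)
        return self.proj(joints_2d) + self.joint_emb(ids)  # (B, J, dim)

class JointEncoder(nn.Module):
    """Stacked Transformer encoder over joint tokens, with an auxiliary
    3D-regression head applied at every layer for intermediate supervision
    (the paper attaches heads only at selected depths)."""
    def __init__(self, num_joints=17, dim=64, layers=4, heads=4):
        super().__init__()
        self.tokenize = PoseTokenizer(num_joints, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(layers)])
        self.aux_head = nn.Linear(dim, 3)   # per-joint 3D regression

    def forward(self, joints_2d):
        x = self.tokenize(joints_2d)
        aux = []
        for blk in self.blocks:
            x = blk(x)
            aux.append(self.aux_head(x))    # intermediate 3D estimates
        return x, aux
```

The auxiliary outputs would each receive a per-joint L2 loss against the ground-truth 3D pose during training.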
2. Error Prediction and Refinement Modules
Jointformer augments standard 2D→3D lifting with two mechanisms:
- Error Prediction: Alongside the main regression head, each joint token is passed through an MLP that predicts that joint's own 3D reconstruction error; the prediction is supervised with an L1 loss against the ground-truth error.
- Refinement Block: Each final joint token is concatenated with its predicted error and passed through another lightweight MLP to produce a 3D offset that is added to the initial prediction, yielding the refined estimate. This module typically adds fewer than 20k parameters.
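A minimal sketch of the two heads, with illustrative (not the paper's) hidden sizes:

```python
import torch
import torch.nn as nn

class ErrorRefinement(nn.Module):
    """Sketch of the error-prediction and refinement heads (dims assumed).
    Each joint token yields a predicted 3D error; the token and its
    predicted error are concatenated and mapped to an offset that is
    added to the initial 3D estimate."""
    def __init__(self, dim=64, hidden=32):
        super().__init__()
        self.error_head = nn.Sequential(      # predicts per-joint 3D error
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))
        self.refine = nn.Sequential(          # produces a 3D offset
            nn.Linear(dim + 3, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, tokens, pose_3d):       # (B, J, dim), (B, J, 3)
        err = self.error_head(tokens)         # L1-supervised against true error
        offset = self.refine(torch.cat([tokens, err], dim=-1))
        return pose_3d + offset, err          # refined pose, predicted error
```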
Ablation studies demonstrate that both intermediate supervision and error-driven refinement yield statistically significant MPJPE reductions. Removing residual connections or error-prediction components degrades performance by 1–3 mm.
3. Training and Evaluation Protocols
Datasets: Jointformer is trained primarily on Human3.6M under Protocol 1 (subjects S1, S5, S6, S7, and S8 for training; S9 and S11 for testing) and validated on MPI-INF-3DHP with the standard splits.
Optimization: Training employs AdamW with weight decay and a cosine-annealed learning rate with warm restarts (every 40 epochs), for 120 total epochs. Each batch comprises 256 poses spread over 4 GPUs, with inputs augmented by horizontal flipping and small scale/rotation perturbations. Input 2D poses are normalized to zero mean and unit variance before tokenization.
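The recipe maps directly onto PyTorch's built-in optimizer and scheduler. The weight-decay and learning-rate values below are placeholders, since the text does not restate them, and the linear layer merely stands in for the full model:

```python
import torch

model = torch.nn.Linear(2, 3)   # stand-in for the Jointformer model
# AdamW; lr and weight_decay are placeholder values, not the paper's.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Cosine annealing with warm restarts, restarting every 40 epochs.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=40)

for epoch in range(120):        # 120 total epochs
    # ... one pass over batches of 256 poses would go here ...
    sched.step()
```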
Performance: The full model, with dropout 0.1 after each block, achieves 43.1 mm MPJPE on Human3.6M under Protocol 1, surpassing previous single-frame Transformer-based approaches and matching some video-based models.
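MPJPE, the metric reported throughout, is the mean Euclidean distance between predicted and ground-truth joint positions:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints, in the input units (typically mm).
    pred, gt: arrays of shape (..., J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```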
4. Video Object Segmentation: Jointformer for Joint Feature–Correspondence–Memory Modeling
A distinct JointFormer model was later introduced for VOS, integrating feature extraction, correspondence matching, and compressed memory modeling within a unified frame-wise Transformer (Zhang et al., 2023).
Core pipeline:
- Input Preparation: At each timestep, the current frame and a set of reference pairs (frames plus object masks) are split into patches and linearly embedded into separate token sets for the current frame and the references.
- Compressed Memory Token: A single learned token per object accumulates temporally aggregated information, enforcing instance-level memory with constant size.
- Joint Transformer Backbone: Stacked Joint Blocks operate on the concatenated current, reference, and memory tokens via multi-head attention. The key novelty is flexible masking of keys/values for each query group, permitting rich feature co-evolution, pixel-wise matching, and online memory updating.
After the final Joint Block, the current-frame tokens and the memory token are combined via cross-attention, and upsampling convolutional decoders produce per-pixel object segmentation masks.
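The per-group key/value masking can be illustrated with a hand-built boolean attention mask. This is our own construction for illustration, not the authors' code; the routing rules encoded here follow the attention behavior described in the next section:

```python
import torch

def joint_attention_mask(n_cur, n_ref, n_mem=1):
    """Boolean attention mask over tokens ordered [current | reference |
    memory]; True means the query row may attend to the key column."""
    n = n_cur + n_ref + n_mem
    mask = torch.zeros(n, n, dtype=torch.bool)
    cur = slice(0, n_cur)
    ref = slice(n_cur, n_cur + n_ref)
    mem = slice(n_cur + n_ref, n)
    mask[cur, cur] = True   # current -> current (feature co-evolution)
    mask[cur, ref] = True   # current -> reference (dense correspondence)
    mask[ref, ref] = True   # reference -> reference only (no contamination)
    mask[mem, ref] = True   # memory -> reference (memory update)
    return mask
```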
5. Joint Block Functionality and Memory Update Dynamics
Each Joint Block contains LayerNorm, multi-head attention, and an MLP, all with residual connections. The main query subsets attend as follows:
- Current tokens attend to both themselves and reference tokens, facilitating dense correspondence and fine-grained feature transfer.
- Reference tokens attend to only their own group (prevents contamination from unconstrained current pixels).
- Memory token attends to reference tokens or up-to-date context from the prior frame's decoder tokens, enabling continual updating.
During online inference, the memory token is recurrently passed forward and updated, driving consistent object representation across long video sequences.
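The recurrent update can be illustrated with a toy cross-attention module; dimensions, head count, and the single-block design are our own simplification of the mechanism described above:

```python
import torch
import torch.nn as nn

class MemoryUpdate(nn.Module):
    """Toy sketch of the recurrent memory-token update: the single memory
    token cross-attends to the current frame's tokens and is carried
    forward to the next timestep, so memory size stays constant."""
    def __init__(self, dim=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory, frame_tokens):      # (B, 1, dim), (B, N, dim)
        upd, _ = self.attn(memory, frame_tokens, frame_tokens)
        return self.norm(memory + upd)            # residual update

updater = MemoryUpdate()
mem = torch.zeros(1, 1, 32)                       # one token per object
for t in range(5):                                # five-frame toy "video"
    frame = torch.randn(1, 16, 32)                # 16 patch tokens per frame
    mem = updater(mem, frame)                     # memory stays (1, 1, 32)
```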
6. Objective Functions, Performance Benchmarks, and Ablations
Loss Function: The final segmentation logits are trained with a combined binary cross-entropy (BCE) and Dice loss.
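Assuming equal weighting of the two terms (the exact coefficients are not restated above), the combined loss can be written as:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1e-6):
    """Combined BCE + Dice loss on segmentation logits; equal weighting
    is an assumption here. target is a binary mask in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return bce + dice
```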
Benchmarks and Results:
- DAVIS 2017 (val / test-dev): 89.7% / 87.6%
- YouTube-VOS 2018/2019 (val): 87.0%
These results, obtained without massive extra pretraining, constitute 2–3 point gains over previous SOTA models.
Ablation studies illustrate the importance of (a) all-layer joint modeling for detail preservation (~2% benefit), (b) the memory-token design for instance distinctiveness, and (c) controlled attention routing to avoid spurious background leakage. MAE pretraining on ImageNet improves downstream VOS performance by 5–6% over random initialization.
7. Comparative Analysis and Impact
Both Jointformer architectures demonstrate the efficacy of joint modeling via Transformer self-attention for spatial and temporal relationships, beyond 2D-to-3D lifting or segmentation alone. In 3D pose estimation, error prediction and refinement outperform vanilla Transformer and GCN architectures (Lutz et al., 2022). In VOS, joint attention at all layers and the compressed memory token outperform RNN-based and pure matching-based models (Zhang et al., 2023). A plausible implication is that Transformer-based joint modeling with custom token routing and auxiliary heads can generalize to other structure-aware perception tasks requiring dense or instance-level reasoning.