Jointformer: 3D Pose & VOS Transformers
- The paper introduces a Transformer-based approach for monocular 3D human pose estimation, replacing hand-crafted joint graphs with learned self-attention and error refinement modules.
- A dedicated error prediction module and lightweight MLP refinement block boost accuracy, reducing MPJPE by 1–3 mm compared to previous methods.
- The second Jointformer model applies joint feature correspondence and compressed memory tokens to achieve state-of-the-art performance in video object segmentation.
Jointformer refers to two independently developed Transformer-based architectures: one for monocular 3D human pose estimation from a single image (Lutz et al., 2022), and another for video object segmentation (VOS) with integrated feature matching and memory modeling (Zhang et al., 2023). Despite their distinct domains, both systems leverage joint modeling within the Transformer framework, in the sense of capturing complex interdependencies, whether between human body joints or between frames and objects in a video sequence. The following article presents each variant in detail.
1. Monocular 3D Human Pose Estimation: Jointformer Architecture
The original Jointformer addresses lifting a single 2D human skeleton (predicted keypoints) to a 3D pose using a Transformer encoder with specialized mechanisms for error prediction and refinement (Lutz et al., 2022). The approach replaces previous graph convolutional network (GCN) methods, which depend on manually defined joint relationships, with a generalized self-attention mechanism that learns these dependencies implicitly from data.
Key architectural steps:
- Tokenization: Each 2D joint, given by its normalized image coordinates, is projected to a fixed-dimensional token by a learned affine transform plus a joint-type embedding.
- Stacked Transformer Encoder: A stack of identical Transformer-encoder layers operates on the full set of joint tokens. Inside each layer, tokens are projected to queries, keys, and values, and multi-head self-attention computes higher-order dependencies among joints. Residual connections and position-wise feedforward sublayers are used throughout.
- Intermediate Supervision: Auxiliary 3D-pose regression heads are attached to selected encoder depths (e.g., layers 4 and 8), enforcing useful intermediate feature learning through per-joint L2 errors.
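The three steps above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the joint count, token width, layer count, and the choice of supervising every layer are placeholder assumptions.

```python
import torch
import torch.nn as nn

class PoseTokenizer(nn.Module):
    """Tokenization step: each 2D joint (x, y) becomes a token via a
    learned affine map plus a joint-type embedding (dims illustrative)."""
    def __init__(self, num_joints=17, dim=64):
        super().__init__()
        self.proj = nn.Linear(2, dim)                   # learned affine transform
        self.joint_emb = nn.Embedding(num_joints, dim)  # joint-type embedding

    def forward(self, joints_2d):                       # (B, J, 2)
        ids = torch.arange(joints_2d.size(1), device=joints_2d.device)
        return self.proj(joints_2d) + self.joint_emb(ids)  # (B, J, dim)

class JointEncoder(nn.Module):
    """Stacked Transformer encoder over joint tokens, with an auxiliary
    3D-regression head applied at every layer for intermediate supervision
    (the paper attaches heads only at selected depths)."""
    def __init__(self, num_joints=17, dim=64, layers=4, heads=4):
        super().__init__()
        self.tokenize = PoseTokenizer(num_joints, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(layers)])
        self.aux_head = nn.Linear(dim, 3)   # per-joint 3D regression

    def forward(self, joints_2d):
        x = self.tokenize(joints_2d)
        aux = []
        for blk in self.blocks:
            x = blk(x)
            aux.append(self.aux_head(x))    # intermediate 3D estimates
        return x, aux
```

The auxiliary outputs would each receive a per-joint L2 loss against the ground-truth 3D pose during training.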
2. Error Prediction and Refinement Modules
Jointformer augments standard 2D→3D lifting with two mechanisms:
- Error Prediction: Alongside the main regression head, each joint token is passed through an MLP that predicts that joint's own 3D reconstruction error; the prediction is supervised with an L1 loss against the ground-truth error.
- Refinement Block: Each final joint token is concatenated with its predicted error and passed through another lightweight MLP to produce a 3D offset that is added to the initial prediction, yielding the refined estimate. This module typically adds fewer than 20k parameters.
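A minimal sketch of the two heads, with illustrative (not the paper's) hidden sizes:

```python
import torch
import torch.nn as nn

class ErrorRefinement(nn.Module):
    """Sketch of the error-prediction and refinement heads (dims assumed).
    Each joint token yields a predicted 3D error; the token and its
    predicted error are concatenated and mapped to an offset that is
    added to the initial 3D estimate."""
    def __init__(self, dim=64, hidden=32):
        super().__init__()
        self.error_head = nn.Sequential(      # predicts per-joint 3D error
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))
        self.refine = nn.Sequential(          # produces a 3D offset
            nn.Linear(dim + 3, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, tokens, pose_3d):       # (B, J, dim), (B, J, 3)
        err = self.error_head(tokens)         # L1-supervised against true error
        offset = self.refine(torch.cat([tokens, err], dim=-1))
        return pose_3d + offset, err          # refined pose, predicted error
```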
Ablation studies demonstrate that both intermediate supervision and error-driven refinement yield statistically significant MPJPE reductions. Removing residual connections or error-prediction components degrades performance by 1–3 mm.
3. Training and Evaluation Protocols
Datasets: Jointformer is trained primarily on Human3.6M under Protocol 1 (subjects S1, S5, S6, S7, and S8 for training; S9 and S11 for testing) and validated on MPI-INF-3DHP with the standard splits.
Optimization: Training employs AdamW with weight decay and a cosine-annealed learning rate with warm restarts (every 40 epochs), for 120 total epochs. Each batch comprises 256 poses spread over 4 GPUs, with inputs augmented by horizontal flipping and small scale/rotation perturbations. Input 2D poses are normalized to zero mean and unit variance before tokenization.
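The recipe maps directly onto PyTorch's built-in optimizer and scheduler. The weight-decay and learning-rate values below are placeholders, since the text does not restate them, and the linear layer merely stands in for the full model:

```python
import torch

model = torch.nn.Linear(2, 3)   # stand-in for the Jointformer model
# AdamW; lr and weight_decay are placeholder values, not the paper's.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Cosine annealing with warm restarts, restarting every 40 epochs.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=40)

for epoch in range(120):        # 120 total epochs
    # ... one pass over batches of 256 poses would go here ...
    sched.step()
```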
Performance: The full model, with dropout 0.1 after each block, achieves 43.1 mm MPJPE on Human3.6M under Protocol 1, surpassing previous single-frame Transformer-based approaches and matching some video-based models.
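MPJPE, the metric reported throughout, is the mean Euclidean distance between predicted and ground-truth joint positions:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints, in the input units (typically mm).
    pred, gt: arrays of shape (..., J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```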
4. Video Object Segmentation: Jointformer for Joint Feature–Correspondence–Memory Modeling
A distinct JointFormer model was later introduced for VOS, integrating feature extraction, correspondence matching, and compressed memory modeling within a unified frame-wise Transformer (Zhang et al., 2023).
Core pipeline:
- Input Preparation: At each timestep, the current frame and a set of reference pairs (frames plus object masks) are split into patches and linearly embedded into separate token sets for the current frame and the references.
- Compressed Memory Token: A single learned token per object accumulates temporally aggregated information, enforcing instance-level memory with constant size.
- Joint Transformer Backbone: Stacked Joint Blocks operate on the concatenated current, reference, and memory tokens via multi-head attention. The key novelty is flexible masking of keys/values for each query group, permitting rich feature co-evolution, pixel-wise matching, and online memory updating.
After the final Joint Block, the current-frame tokens and the memory token are combined via cross-attention, and upsampling convolutional decoders produce per-pixel object segmentation masks.
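The per-group key/value masking can be illustrated with a hand-built boolean attention mask. This is our own construction for illustration, not the authors' code; the routing rules encoded here follow the attention behavior described in the next section:

```python
import torch

def joint_attention_mask(n_cur, n_ref, n_mem=1):
    """Boolean attention mask over tokens ordered [current | reference |
    memory]; True means the query row may attend to the key column."""
    n = n_cur + n_ref + n_mem
    mask = torch.zeros(n, n, dtype=torch.bool)
    cur = slice(0, n_cur)
    ref = slice(n_cur, n_cur + n_ref)
    mem = slice(n_cur + n_ref, n)
    mask[cur, cur] = True   # current -> current (feature co-evolution)
    mask[cur, ref] = True   # current -> reference (dense correspondence)
    mask[ref, ref] = True   # reference -> reference only (no contamination)
    mask[mem, ref] = True   # memory -> reference (memory update)
    return mask
```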
5. Joint Block Functionality and Memory Update Dynamics
Each Joint Block contains LayerNorm, multi-head attention, and an MLP, all with residual connections. The main query subsets attend as follows:
- Current tokens attend to both themselves and reference tokens, facilitating dense correspondence and fine-grained feature transfer.
- Reference tokens attend to only their own group (prevents contamination from unconstrained current pixels).
- Memory token attends to reference tokens or up-to-date context from the prior frame's decoder tokens, enabling continual updating.
During online inference, the memory token is recurrently passed forward and updated, driving consistent object representation across long video sequences.
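The recurrent update can be illustrated with a toy cross-attention module; dimensions, head count, and the single-block design are our own simplification of the mechanism described above:

```python
import torch
import torch.nn as nn

class MemoryUpdate(nn.Module):
    """Toy sketch of the recurrent memory-token update: the single memory
    token cross-attends to the current frame's tokens and is carried
    forward to the next timestep, so memory size stays constant."""
    def __init__(self, dim=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory, frame_tokens):      # (B, 1, dim), (B, N, dim)
        upd, _ = self.attn(memory, frame_tokens, frame_tokens)
        return self.norm(memory + upd)            # residual update

updater = MemoryUpdate()
mem = torch.zeros(1, 1, 32)                       # one token per object
for t in range(5):                                # five-frame toy "video"
    frame = torch.randn(1, 16, 32)                # 16 patch tokens per frame
    mem = updater(mem, frame)                     # memory stays (1, 1, 32)
```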
6. Objective Functions, Performance Benchmarks, and Ablations
Loss Function: The final segmentation logits are trained with a combined binary cross-entropy (BCE) and Dice loss.
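Assuming equal weighting of the two terms (the exact coefficients are not restated above), the combined loss can be written as:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1e-6):
    """Combined BCE + Dice loss on segmentation logits; equal weighting
    is an assumption here. target is a binary mask in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return bce + dice
```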
Benchmarks and Results:
- DAVIS 2017 (val / test-dev): 89.7% / 87.6%
- YouTube-VOS 2018/2019 (val): 87.0%
These results, obtained without massive extra pretraining, constitute 2–3 point gains over previous SOTA models.
Ablation studies illustrate the importance of (a) all-layer joint modeling for detail preservation (~2% benefit), (b) the memory-token design for instance distinctiveness, and (c) controlled attention routing to avoid spurious background leakage. MAE pretraining on ImageNet improves downstream VOS performance by 5–6% over random initialization.
7. Comparative Analysis and Impact
Both Jointformer architectures demonstrate the efficacy of joint modeling via Transformer self-attention for spatial and temporal relationships, beyond 2D-to-3D lifting or segmentation alone. In 3D pose estimation, error prediction and refinement outperform vanilla Transformer and GCN architectures (Lutz et al., 2022). In VOS, joint attention at all layers and the compressed memory token outperform RNN-based and pure matching-based models (Zhang et al., 2023). A plausible implication is that Transformer-based joint modeling with custom token routing and auxiliary heads can generalize to other structure-aware perception tasks requiring dense or instance-level reasoning.