End-to-End Multi-Person 2D Pose Estimation

Updated 15 April 2026

End-to-end multi-person pose estimation (MPPE) is a unified approach that directly infers the number of people, their spatial extents, and their anatomical keypoints from an image in a single differentiable pass.
It covers a family of architectures rather than a single model: recurrent ConvLSTM extraction, explicit box detection, centroid-guided grouping, and transformer-based set prediction.
These systems optimize multi-task losses (e.g., heatmap MSE, embedding losses, and set losses aligned by Hungarian matching) and report state-of-the-art accuracy with efficient inference, even in crowded scenes.

End-to-end multi-person 2D pose estimation (MPPE) refers to the family of methods that, given a single image (or a sequence, in the case of video), directly and simultaneously infer the number of people, their spatial extents, and the precise 2D locations of their anatomical keypoints—typically body joints—without reliance on non-differentiable post-processing, heuristic grouping, or explicit person detection as a separate stage. Modern approaches unify keypoint detection and instance association into a single trainable model such that all parameters, losses, and prediction dependencies are co-optimized in a fully differentiable computation graph. This paradigm has subsumed classic top-down and bottom-up separation, yielding a spectrum of architectures that exploit CNNs, RNNs, and, more recently, transformer-based set prediction mechanisms.

1. Architectural Taxonomy

End-to-end MPPE systems employ various architectural principles to jointly localize and associate person keypoints:

Recurrent Pose Extraction: This two-stage pipeline builds on the Stacked Hourglass CNN and associative embeddings of Newell et al.: the CNN produces dense joint heatmaps and associative embedding features, which are then consumed by a recurrent ConvLSTM module. At each step, the ConvLSTM predicts disjoint joint heatmaps for a single person together with a learned stopping confidence, enabling the network to sequentially “extract” person-wise poses without explicit clustering (Briq et al., 2019).
Explicit Box Detection: ED-Pose introduces cascaded explicit box detection, formulating both global person localization and per-joint regression as parameterized axis-aligned box predictions within a unified transformer decoder. The architecture alternates between human detection and keypoint box regression, using L1 and OKS-based losses to supervise both stages in an end-to-end differentiable manner (Yang et al., 2023).
Centroid-Guided Grouping: Bottom-up approaches, such as the DS-Hourglass model, predict per-joint heatmaps, centroid heatmaps, and offset fields from each joint to its parent (often a person centroid). A lightweight greedy assignment maps joints to centroids, assembling full poses with minimal overhead and constant memory footprint, preserving end-to-end trainability up to the grouping stage (Ou et al., 2020).
Transformer-based Set Prediction: Modern transformer models (e.g., Group Pose, POET, DETRPose) treat the entire scene as a direct set prediction problem. These models are built around a CNN or multi-scale encoder whose output tokens seed a transformer decoder tasked with predicting, for each of N queries, the full set of keypoints for a hypothesized person. Bipartite Hungarian matching and global set-based losses ensure that instance partitioning and keypoint regression are jointly optimized (Stoffl et al., 2021, Liu et al., 2023, Janampa et al., 16 Jun 2025).
End-to-End Video Pose Estimation: PAVE-Net demonstrates an extension to the video domain, utilizing a pose-aware spatiotemporal transformer. The model directly aggregates evidence for pose queries across neighboring frames and integrates joint-level refinement, achieving fully end-to-end operation and state-of-the-art accuracy on multi-person video pose benchmarks (Yu et al., 17 Nov 2025).
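The centroid-guided grouping step in the list above can be illustrated with a minimal sketch. It assumes that peak extraction has already produced joint candidates, each carrying a predicted offset toward its person centroid, plus a list of centroid detections. The function name `group_by_centroid`, the tuple layout, and the `max_dist` threshold are illustrative assumptions, not the DS-Hourglass implementation.

```python
import math

def group_by_centroid(joints, centroids, max_dist=20.0):
    """Greedily assign joint candidates to person centroids.

    joints:    list of (joint_type, x, y, dx, dy, score); (dx, dy) is the
               predicted offset from the joint to its person centroid.
    centroids: list of (cx, cy) centroid detections.
    Returns one dict per centroid mapping joint_type -> (x, y, score).
    """
    poses = [dict() for _ in centroids]
    # Visit the most confident candidates first so they claim slots first.
    for jtype, x, y, dx, dy, score in sorted(joints, key=lambda j: -j[5]):
        vx, vy = x + dx, y + dy  # location this joint "votes" for
        best, best_d = None, max_dist
        for i, (cx, cy) in enumerate(centroids):
            d = math.hypot(vx - cx, vy - cy)
            # Skip persons that already own a joint of this type.
            if d <= best_d and jtype not in poses[i]:
                best, best_d = i, d
        if best is not None:
            poses[best][jtype] = (x, y, score)
    return poses

# Hypothetical toy input: two people and one spurious detection.
centroids = [(50.0, 50.0), (150.0, 50.0)]
joints = [
    ("head", 50.0, 20.0, 0.0, 30.0, 0.9),    # votes for (50, 50)
    ("head", 152.0, 22.0, -1.0, 29.0, 0.8),  # votes for (151, 51)
    ("head", 300.0, 300.0, 0.0, 0.0, 0.7),   # far from every centroid
]
poses = group_by_centroid(joints, centroids)
```

The assignment cost is linear in the number of joint candidates times the number of centroids, and the spurious candidate is dropped because its vote lands farther than `max_dist` from every centroid.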

2. Loss Functions and Optimization

End-to-end MPPE systems leverage composite multi-task losses designed to supervise joint localization, instance association, and, in some cases, auxiliary cues:

Heatmap-Based Detection: Mean squared error (MSE) between predicted and ground-truth Gaussian heatmaps is standard for dense, per-joint detection (Briq et al., 2019, Ou et al., 2020, Li et al., 2019).
Embedding Losses for Association: A pull term draws the embeddings of joints belonging to the same person toward each other, while a push term penalizes embeddings of different persons that are too similar (Briq et al., 2019).
Bipartite Matching for Set Prediction: Hungarian assignment is used to align predicted person-wise pose sets with ground-truth instances, with composite costs involving L1/OKS losses on joint coordinates and cross-entropy on presence/no-object classification (Stoffl et al., 2021, Yang et al., 2023, Liu et al., 2023, Janampa et al., 16 Jun 2025).
Association Losses: Some methods (e.g., YOLO-Pose) regress spatial offsets and optimize an object keypoint similarity (OKS) loss, so that the training objective matches the evaluation metric (Maji et al., 2022).
Recurrent/Stopping Loss: RNN-based extractors include a binary cross-entropy loss on stopping confidence, ensuring the network predicts as many poses as there are persons in the scene (Briq et al., 2019).
Joint Visibility Losses: Transformers such as POET and DETRPose include joint-specific visibility terms to improve robustness in occlusion (Stoffl et al., 2021, Janampa et al., 16 Jun 2025).
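The heatmap, embedding, and stopping terms listed above can be written out in a few lines. The following is a minimal pure-Python sketch, assuming scalar embedding tags and unnormalized Gaussian targets. The function names, the `sigma` defaults, and the normalization of the pull term are illustrative choices, not the exact formulation of any cited paper.

```python
import math

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Ground-truth heatmap: an unnormalized Gaussian centered on the joint."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def heatmap_mse(pred, target):
    """Mean squared error over all pixels of one heatmap."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2 for pr, tr in zip(pred, target)
               for p, t in zip(pr, tr)) / n

def pull_push_loss(tags_per_person, sigma=1.0):
    """Grouping loss on scalar tags, one list of joint tags per person.

    pull: mean squared deviation of each tag from its own person's mean tag.
    push: Gaussian penalty that approaches 1 when two persons' mean tags
          coincide and decays toward 0 as they separate.
    """
    means = [sum(t) / len(t) for t in tags_per_person]
    pull = sum((v - m) ** 2 for t, m in zip(tags_per_person, means) for v in t)
    pull /= sum(len(t) for t in tags_per_person)
    pairs = [(a, b) for i, a in enumerate(means) for b in means[i + 1:]]
    push = (sum(math.exp(-(a - b) ** 2 / (2 * sigma ** 2)) for a, b in pairs)
            / len(pairs)) if pairs else 0.0
    return pull, push

def stopping_bce(p_stop, should_stop):
    """Binary cross-entropy on a recurrent extractor's stopping confidence."""
    eps = 1e-7
    p = min(max(p_stop, eps), 1 - eps)
    return -math.log(p) if should_stop else -math.log(1 - p)

# Hypothetical toy values.
target = gaussian_heatmap(8, 8, cx=3, cy=4)
blurry = [[0.5 * v for v in row] for row in target]
loss_perfect = heatmap_mse(target, target)
loss_blurry = heatmap_mse(blurry, target)
pull_good, push_good = pull_push_loss([[1.0, 1.0], [5.0, 5.0]])  # well separated
pull_bad, push_bad = pull_push_loss([[1.0, 1.2], [1.1, 1.3]])    # entangled tags
```

In the well-separated case both terms are close to zero. In the entangled case the push term is close to one, which is the signal that drives the embeddings of different persons apart.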

3. Grouping and Association Strategies

Association of detected keypoints into coherent person-level poses is a central challenge. End-to-end MPPE replaces heuristic grouping with learned association, or reduces it to a lightweight assignment step:

Recurrent Extraction: Associations are implicit in the sequence of extracted poses by the ConvLSTM, which conditions on previous predictions and the current spatial context (Briq et al., 2019).
Keypoint and Instance Queries: Query-based transformers such as Group Pose and DETRPose assign each query block the role of either a specific joint or an instance, leveraging grouped self-attention to restrict interaction within and across instances and joint types (Liu et al., 2023, Janampa et al., 16 Jun 2025).
Direct Regression with Assignment: Anchor-free methods like DirectPose regress K keypoints at each location and use non-maximum suppression on composite bounding regions, avoiding explicit posthoc grouping (Tian et al., 2019).
Box or Centroid-based Schemes: Some architectures regress boxes for persons and keypoints, aligning outputs using nearest centroid or Hungarian assignment; for example, ED-Pose's explicit box regression jointly optimizes for person and joint localization (Yang et al., 2023, Ou et al., 2020).
Set Prediction: POET and related DETR-style models construct the final set of person-wise poses by direct set regression, with the matching process ensuring unique assignments per person (Stoffl et al., 2021, Janampa et al., 16 Jun 2025).
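The one-to-one matching that underlies the set-prediction schemes above can be sketched as follows. This is a minimal illustration under stated assumptions: the cost combines a mean L1 keypoint distance with the negated person probability, the weights `w_kpt` and `w_cls` are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same optimum but only scales to a handful of instances). Practical implementations typically call a dedicated solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def match_cost(pred, gt, w_kpt=1.0, w_cls=1.0):
    """Cost of pairing one prediction with one ground-truth pose.

    pred: {"score": person probability, "kpts": [(x, y), ...]}
    gt:   [(x, y), ...] with the same number of keypoints.
    """
    l1 = sum(abs(px - gx) + abs(py - gy)
             for (px, py), (gx, gy) in zip(pred["kpts"], gt)) / len(gt)
    return w_kpt * l1 - w_cls * pred["score"]

def bipartite_match(preds, gts):
    """Return {gt_index: pred_index} minimizing the total matching cost.

    Brute-force stand-in for Hungarian matching.
    Assumes len(preds) >= len(gts).
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {g: p for g, p in enumerate(best_perm)}

# Hypothetical toy scene: three queries, two annotated people
# (coordinates normalized to [0, 1]).
preds = [
    {"score": 0.9, "kpts": [(0.52, 0.50), (0.50, 0.70)]},
    {"score": 0.2, "kpts": [(0.90, 0.90), (0.90, 0.95)]},
    {"score": 0.8, "kpts": [(0.20, 0.30), (0.21, 0.50)]},
]
gts = [
    [(0.20, 0.30), (0.20, 0.50)],
    [(0.50, 0.50), (0.50, 0.70)],
]
assignment = bipartite_match(preds, gts)
unmatched = sorted(set(range(len(preds))) - set(assignment.values()))
```

Each ground-truth person receives exactly one query, and the leftover queries (here, index 1) are supervised toward the no-object class, which is what discourages duplicate detections.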

4. Training and Evaluation Protocols

End-to-end MPPE systems employ rigorous, dataset-driven training regimens:

Data Augmentation: Random cropping, scaling, flipping, and mosaic augmentation are widespread to improve generalization across scales and poses (Briq et al., 2019, Yang et al., 2023, Maji et al., 2022, Janampa et al., 16 Jun 2025).
Batch Size and Memory: Memory constraints can limit RNN unrolling length or batch size (e.g., ConvLSTM models train with a batch size of 1 and unroll up to 6 steps on MS COCO due to GPU memory limits) (Briq et al., 2019).
Multi-Scale Testing: Single and multi-scale inference—such as averaging output heatmaps at several image scales—is common for benchmarking; single-shot regressors typically avoid this (Briq et al., 2019, Ou et al., 2020, Maji et al., 2022).
Optimization Algorithms: Adam, AdamW, and SGD (with learning-rate schedules and decays) are standard; self-attention models often employ longer training schedules (up to 250 epochs) and per-module learning rates (Liu et al., 2023, Stoffl et al., 2021, Janampa et al., 16 Jun 2025).
Evaluation Metrics: Average Precision (AP) at multiple OKS thresholds (e.g., AP, AP^50, AP^75) and average recall (AR) are the principal metrics, with modern methods optimizing directly for OKS (Yang et al., 2023, Maji et al., 2022, Janampa et al., 16 Jun 2025). Video models report mAP on benchmarks such as PoseTrack2017 (Yu et al., 17 Nov 2025).
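Since AP is defined over OKS thresholds, it helps to see how OKS is computed for a single pose pair. The sketch below follows the standard COCO-style definition: a per-keypoint Gaussian of the squared distance, scaled by the object area and a per-keypoint constant, averaged over labeled keypoints. The `kappas` values and the toy coordinates are illustrative assumptions, not the official COCO constants.

```python
import math

def oks(pred, gt, vis, area, kappas):
    """Object keypoint similarity between a predicted and a ground-truth pose.

    pred, gt: lists of (x, y) in pixels.
    vis:      per-keypoint visibility flags; only keypoints with v > 0 count.
    area:     ground-truth object area (the squared object scale).
    kappas:   per-keypoint falloff constants k_i.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, vis, kappas):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2.0 * area * k ** 2))
            count += 1
    return total / count if count else 0.0

# Hypothetical toy pose with three keypoints, the last one unlabeled.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(10.0, 10.0), (23.0, 24.0), (0.0, 0.0)]
vis = [2, 1, 0]
kappas = [0.05, 0.1, 0.1]  # illustrative values only
score = oks(pred, gt, vis, area=2500.0, kappas=kappas)

# A prediction counts as a true positive at threshold t when OKS >= t.
is_tp = {t: score >= t for t in (0.5, 0.75, 0.9)}
```

In this example the unlabeled third keypoint is ignored, so the large error on it does not affect the score. The prediction passes the 0.5 and 0.75 thresholds but fails the 0.9 threshold.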

5. Quantitative Performance and Comparison

Recent end-to-end MPPE methods report strong performance and favorable speed/accuracy trade-offs:

Method	Backbone	COCO AP	CrowdPose AP	Speed (FPS or latency)	Reference
ED-Pose	RN50/Swin-L	71.6/75.8	69.9/76.6	50 ms/image	(Yang et al., 2023)
Group Pose	RN50/Swin-L	72.0/74.8	74.1	68.6 FPS (480×800)	(Liu et al., 2023)
DETRPose-L	HGNetv2-B4	71.2	73.3	32.5 ms/image	(Janampa et al., 16 Jun 2025)
YOLO-Pose	YOLOv5l6	69.4	—	Constant w.r.t. person count	(Maji et al., 2022)
POET	RN50	53.6	—	33 FPS (512² @ batch 1)	(Stoffl et al., 2021)
MultiPoseNet	RN101	69.6	—	23 FPS	(Kocabas et al., 2018)
Simple Pose	Stacked HG	68.1	—	38.5 FPS (GPU)	(Li et al., 2019)
PAVE-Net	HRNet-W48	—	—	153 ms/video frame	(Yu et al., 17 Nov 2025)

The gains are most visible in dense, occluded, or crowded scenes (see the CrowdPose column), where transformer set-prediction models group joints robustly and reach state-of-the-art accuracy without the compute overhead of per-person cropping or sliding windows.
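The table above mixes throughput (FPS) and latency (ms) figures. A small conversion puts them on one scale, under the simplifying assumption that every FPS figure was measured at batch size 1, so that latency in milliseconds equals 1000 divided by FPS. The table states the batch size only for POET, and the numbers come from different hardware and input resolutions, so the result is a rough orientation rather than a ranking.

```python
def fps_to_ms(fps):
    """Per-image latency in milliseconds, assuming batch size 1."""
    return 1000.0 / fps

def ms_to_fps(ms):
    """Throughput in frames per second, assuming batch size 1."""
    return 1000.0 / ms

# Speed figures copied from the table, in their reported unit.
reported = {
    "ED-Pose": ("ms", 50.0),
    "Group Pose": ("fps", 68.6),
    "DETRPose-L": ("ms", 32.5),
    "POET": ("fps", 33.0),
    "MultiPoseNet": ("fps", 23.0),
    "Simple Pose": ("fps", 38.5),
    "PAVE-Net": ("ms", 153.0),
}

latency_ms = {
    name: value if unit == "ms" else fps_to_ms(value)
    for name, (unit, value) in reported.items()
}

for name, ms in sorted(latency_ms.items(), key=lambda kv: kv[1]):
    print(f"{name:<13} {ms:7.1f} ms  ({ms_to_fps(ms):5.1f} FPS)")
```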

6. Strengths, Limitations, and Open Directions

Strengths:

Unified, fully-differentiable learning of keypoint localization and person association (Briq et al., 2019, Yang et al., 2023).
Elimination of heuristic grouping, non-maximum suppression, and external detectors in set-prediction models (Yang et al., 2023, Liu et al., 2023, Stoffl et al., 2021).
Inference runtime is constant with respect to person count in most transformer-based and single-shot models (Maji et al., 2022, Janampa et al., 16 Jun 2025).
State-of-the-art accuracy in real-time or near-real-time regimes (Janampa et al., 16 Jun 2025, Liu et al., 2023).

Limitations:

Memory and compute constraints in recurrent (ConvLSTM) or attention-based systems can restrict batch size and sequence length during training (Briq et al., 2019).
Performance on highly occluded or fine-scale joints can lag explicit high-resolution/heatmap-based systems (Yang et al., 2023, Maji et al., 2022).
Direct set prediction approaches require careful design of losses and matching to avoid duplicate or missing detections (Stoffl et al., 2021, Janampa et al., 16 Jun 2025).
Most approaches focus on single-frame estimation; explicit cross-frame tracking in video remains an active area (Yu et al., 17 Nov 2025).

Open directions highlighted include integration of spatial/temporal attention modules, lightweight refinement heads for challenging joints, hybrid representations (box and mask), exploration of memory-efficient backbones for larger batch or sequence sizes, and extension to combined detection-tracking or 3D pose estimation (Briq et al., 2019, Yang et al., 2023, Janampa et al., 16 Jun 2025, Yu et al., 17 Nov 2025).

7. Historical Perspective and Impact

End-to-end MPPE methods have redefined the canonical pose estimation pipeline, transcending the two-stage detect-then-pose paradigm. Early approaches such as DeepCut demonstrated the potential of joint subset partitioning with integer programming, albeit with prohibitive computational cost (Pishchulin et al., 2015). The advent of associative embedding, centroid-based grouping, and supervised recurrent inference models signaled progress toward differentiable, jointly trainable networks (Briq et al., 2019, Ou et al., 2020, Tian et al., 2019). The transformer revolution introduced a direct set prediction perspective, now dominant in the field due to its effectiveness in resolving the ambiguities of joint-to-person association and its scalability to video and whole-body keypoints (Stoffl et al., 2021, Yang et al., 2023, Liu et al., 2023, Janampa et al., 16 Jun 2025, Yu et al., 17 Nov 2025). This paradigm shift has accelerated both the practical deployment and the research frontier of large-scale, real-time, and highly accurate multi-person 2D pose estimation.