Multi-Human Pose Encoder (MPE)
- MPE is a multi-human pose conditioning module that processes each person’s 2D skeletal keypoints through a shared convolutional network and aggregates them via additive pooling.
- In the CovOG ablation, removing MPE lowers SSIM and PSNR and worsens FVD (307.35 vs. 323.96 on the full test set), indicating that MPE improves visual quality and consistency in multi-person video generation.
- By independently encoding variable numbers of poses and fusing them without fixed padding, MPE robustly handles dynamic multi-person scenes.
Multi-Human Pose Encoder (MPE) denotes a pose-conditioning module designed to encode multiple human body poses into a unified representation for generative or predictive models operating on scenes with more than one person. In the literature provided here, the term appears explicitly in CovOG, a baseline for multi-human talking video generation introduced together with the Multi-human Interactive Talking (MIT) dataset. There, MPE is defined as the component that handles varying numbers of speakers by isolating each person’s pose, encoding each pose independently with a shared convolutional network, and aggregating the resulting embeddings into a single multi-human pose representation used to condition a diffusion backbone (Zhu et al., 5 Aug 2025). More broadly, the idea of a multi-human pose encoder also encompasses adjacent formulations in which multiple people are represented jointly as structured latent queries, person-wise tokens, or hierarchical assignments, even when the term “MPE” is not used explicitly (Yu et al., 17 Nov 2025, Jian et al., 11 Apr 2025).
1. Definition and problem setting
In CovOG, the Multi-Human Pose Encoder is introduced for multi-human talking video generation, where the task is to synthesize a realistic conversational video of 2–4 people conditioned on reference images, body poses, and speech audio (Zhu et al., 5 Aug 2025). The central problem is that standard single-human pose encoders assume one pose input per frame and a fixed conditioning structure, whereas conversational scenes involve multiple people, variable speaker count, changing spatial layouts, and simultaneous speaking and listening roles. MPE is the mechanism proposed to encode each person’s pose separately, remain robust when the number of people changes, and fuse all people into one conditioning signal for the diffusion generator (Zhu et al., 5 Aug 2025).
Within CovOG, MPE is described as the Pose Guider / Pose Adaptor for multi-person scenes. The paper states that “CovOG integrates two key modules: the Multi-Human Pose Encoder (i.e., Pose Guider/Adaptor) and the Interactive Audio Driver (IAD),” and that “the multi-human pose embedding is incorporated into the multi-frame latent noise as pose control before being fed into DenoisingNet” (Zhu et al., 5 Aug 2025). MPE is thus not an auxiliary analysis tool; it is a core conditioning interface for both motion/body layout generation and identity-consistent rendering.
The immediate input domain of MPE in CovOG is a set of individual human poses isolated using instance masks. The pose annotations are 2D skeletal keypoints extracted by Sapiens-2B in COCO133 format, of which a 59-keypoint subset is retained, covering head, body, arms, legs, and hands, with only 3 keypoints kept for the head (Zhu et al., 5 Aug 2025). The paper does not explicitly specify whether the internal representation is raw coordinates or a rendered pose map, but it does state that MPE applies a shared convolutional network to each human pose, which suggests an image-like pose condition rather than a coordinate-only vector. On that reading, the encoder is designed to preserve spatial structure rather than collapse pose immediately into a low-dimensional descriptor.
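Because the paper leaves the pose rendering unspecified, the following is only a minimal sketch of one plausible reading: each isolated person is rasterized onto their own single-channel map before any convolutional encoding. The helper `render_pose_map`, the Gaussian blobs, the toy canvas size, and the three-keypoint skeletons are all assumptions made for illustration, not details from CovOG.

```python
import numpy as np

def render_pose_map(keypoints, height, width, sigma=2.0):
    """Rasterize the (x, y) keypoints of ONE person into a single-channel map.

    Hypothetical helper: the paper does not specify its pose rendering.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pose_map = np.zeros((height, width))
    for x, y in keypoints:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        pose_map = np.maximum(pose_map, blob)  # keep the strongest response per pixel
    return pose_map

# Two people, each rendered on a separate canvas so the poses stay isolated.
person_a = [(8, 10), (8, 16), (5, 22)]      # toy keypoints, not the 59-point layout
person_b = [(40, 12), (40, 18), (43, 24)]
maps = [render_pose_map(kps, height=32, width=48) for kps in (person_a, person_b)]
```

Each entry of `maps` keeps the spatial layout of one person only, which is the kind of image-like input a shared convolutional network could consume.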
2. Architecture in CovOG
CovOG is built on AnimateAnyone, which already contains a ReferenceNet, a Pose Guider, and a diffusion DenoisingNet. MPE extends this architecture by serving as the multi-human version of the pose-conditioning pathway (Zhu et al., 5 Aug 2025). It is used in two distinct places inherited from AnimateAnyone: first, in the Pose Guider path to provide pose control to the denoising process; second, in the Pose Adaptor path, where the reference pose is input to obtain a pose embedding that is fused with the latent representation of the corresponding reference image and fed into ReferenceNet (Zhu et al., 5 Aug 2025).
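As a rough illustration of the Pose Guider path, the sketch below shows a pose embedding being combined with multi-frame latent noise. The source only says the embedding is "incorporated into" the latent noise; the elementwise addition used here is an assumption, and every tensor size other than the 15 frames is a toy value.

```python
import numpy as np

rng = np.random.default_rng(4)
# 15 frames as reported in the source; channel and spatial sizes are toy values.
frames, channels, height, width = 15, 4, 8, 12

latent_noise = rng.normal(size=(frames, channels, height, width))
# Stand-in for the multi-human pose embedding produced by MPE.
pose_embedding = rng.normal(size=(frames, channels, height, width))

# Assumed fusion rule: elementwise addition before the denoiser sees the latent.
conditioned_latent = latent_noise + pose_embedding
```

Under this assumption the pose embedding must share the latent's frame, channel, and spatial layout, which is consistent with the paper describing the per-person features as temporally indexed and spatially structured.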
The architectural principle is person-wise factorization followed by scene-level aggregation. Each isolated human pose $p_i$ is passed through a shared convolutional encoder $\mathcal{E}$, producing a per-person feature tensor:
$$f_i = \mathcal{E}(p_i), \qquad i = 1, \dots, N.$$
The paper gives each $f_i$ a shape with temporal and spatial axes, indicating a temporally indexed, spatially structured feature tensor (Zhu et al., 5 Aug 2025). The final multi-human representation is then obtained by additive pooling:
$$F = \sum_{i=1}^{N} f_i.$$
This sum is the defining MPE operation in the paper (Zhu et al., 5 Aug 2025).
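The encode-then-sum pattern described above can be sketched in a few lines. This is not the CovOG implementation: `shared_encoder` is a hypothetical stand-in consisting of a single 3x3 convolution with ReLU, since the real network's depth and widths are not given, and the inputs are random toy maps.

```python
import numpy as np

def shared_encoder(pose_map, kernel):
    """Toy stand-in for the shared convolutional network.

    One 3x3 'same' cross-correlation followed by ReLU (an assumption; the
    paper does not describe the real layers).
    """
    h, w = pose_map.shape
    padded = np.pad(pose_map, 1)  # zero padding keeps the output size at (h, w)
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return np.maximum(out, 0.0)

def multi_human_pose_encoder(pose_maps, kernel):
    """Encode each person with the SAME weights, then sum the per-person features."""
    features = [shared_encoder(p, kernel) for p in pose_maps]
    return np.sum(features, axis=0)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
two_people = [rng.random((16, 24)) for _ in range(2)]
four_people = [rng.random((16, 24)) for _ in range(4)]

fused_2 = multi_human_pose_encoder(two_people, kernel)
fused_4 = multi_human_pose_encoder(four_people, kernel)
```

Both `fused_2` and `fused_4` have the shape of a single person's feature map, so the downstream consumer never sees the person count.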
Several consequences follow directly from this formulation. First, the encoder is permutation-invariant with respect to person ordering. Second, it supports variable cardinality without masking-and-padding a fixed maximum number of persons. Third, it postpones inter-person fusion until after each person has already been encoded independently. The paper does not mention attention across people, graph neural networks, transformers over person tokens, or learned weighted pooling; the aggregation is explicitly a plain sum (Zhu et al., 5 Aug 2025). This suggests that MPE is intended as a simple multi-person extension of single-person pose conditioning rather than a relational reasoning module.
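The first two consequences can be checked numerically. The sketch below compares summation with a hypothetical fixed-slot alternative that gives each person a padded channel; the slot design, `MAX_PEOPLE`, and the random features are illustrative assumptions, not something the paper evaluates.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for already-encoded per-person features (3 people, 8x8 maps).
feats = [rng.random((8, 8)) for _ in range(3)]
reordered = [feats[2], feats[0], feats[1]]

summed = np.sum(feats, axis=0)
summed_reordered = np.sum(reordered, axis=0)

# Hypothetical alternative: one channel slot per person, zero-padded to MAX_PEOPLE.
MAX_PEOPLE = 4

def concat_padded(person_feats):
    slots = np.zeros((MAX_PEOPLE, 8, 8))
    slots[:len(person_feats)] = np.stack(person_feats)
    return slots

concat = concat_padded(feats)
concat_reordered = concat_padded(reordered)

sum_is_order_free = np.allclose(summed, summed_reordered)
concat_is_order_free = np.allclose(concat, concat_reordered)
```

Here `sum_is_order_free` is true while `concat_is_order_free` is false, and only the slot-based variant needs a maximum person count and padding.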
The paper provides only limited implementation details specific to MPE. Explicitly stated are the use of a shared convolutional network and instance masks, a training resolution of 640 × 384 with 15 frames and batch size 4, training on 4 NVIDIA A6000 GPUs, and initialization from Moore-AnimateAnyone weights. It also states that the Pose Adaptor is used in stage 1 training and then frozen in stage 2 (Zhu et al., 5 Aug 2025). The exact depth, kernel sizes, channel widths, and temporal operator inside the shared encoder are not specified.
3. Mathematical formulation and representational properties
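The mathematical formulation of MPE in CovOG is minimal but explicit. For a multi-human pose input
$$P = \{p_1, \dots, p_N\},$$
where $p_i$ is the isolated pose of person $i$ and $N$ is the number of people, the encoder computes per-human embeddings with a shared network $\mathcal{E}$ and aggregates them by summation:
$$f_i = \mathcal{E}(p_i), \qquad F = \sum_{i=1}^{N} f_i.$$
This formulation makes MPE a set encoder over people, with the set operation realized by additive pooling (Zhu et al., 5 Aug 2025).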
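The set-encoder reading can be made explicit with two short identities. The notation is assumed here for illustration: $\mathcal{E}$ is the shared encoder, $p_i$ the isolated pose of person $i$, $N$ the number of people, and $F_N$ the aggregated representation for $N$ people.

```latex
% Permutation invariance: for any permutation \pi of \{1, \dots, N\},
\sum_{i=1}^{N} \mathcal{E}\bigl(p_{\pi(i)}\bigr) \;=\; \sum_{i=1}^{N} \mathcal{E}(p_i) \;=\; F_N .

% Variable cardinality: adding one more person only appends one term,
F_{N+1} \;=\; F_N + \mathcal{E}(p_{N+1}) .
```

Both follow from the commutativity and associativity of addition, and neither depends on the internal structure of $\mathcal{E}$.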
The representational implications are specific. Because each per-person embedding is spatially structured, the sum preserves the collective arrangement of all people in a scene at the feature-map level. This suggests that the aggregated tensor can encode where each person is and what pose each person has, while remaining compatible with the latent spatiotemporal structure of the downstream diffusion model. At the same time, no explicit pairwise interaction term, graph edge, or cross-person attention mechanism appears in the formulation (Zhu et al., 5 Aug 2025). The paper therefore supports the interpretation that MPE captures interaction only implicitly through the superposition of person-specific spatial embeddings.
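A toy example may help show what superposition preserves. The sketch uses hand-placed single-pixel "joints" as stand-ins for per-person features; real encoder outputs would be richer, so this only illustrates the arithmetic of summing spatial maps and is not a claim about CovOG's learned features.

```python
import numpy as np

def blob(height, width, row, col):
    """Single-pixel marker standing in for one joint's feature response."""
    m = np.zeros((height, width))
    m[row, col] = 1.0
    return m

left_person = blob(4, 8, row=1, col=1) + blob(4, 8, row=2, col=1)    # head + hip, left side
right_person = blob(4, 8, row=1, col=6) + blob(4, 8, row=2, col=6)   # head + hip, right side
fused = left_person + right_person

# Both people's locations survive the sum.
occupied_columns = sorted(set(np.nonzero(fused)[1].tolist()))

# A different grouping of the same joints yields the same summed map, so in this
# toy setting the sum alone carries no explicit record of who owns which joint.
mixed_a = blob(4, 8, 1, 1) + blob(4, 8, 2, 6)
mixed_b = blob(4, 8, 1, 6) + blob(4, 8, 2, 1)
same_sum = np.array_equal(fused, mixed_a + mixed_b)
```

In this simplified setting spatial layout is retained, while person identity and any relation between people are only implicit in where the responses land.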
The representation is also explicitly variable-cardinality. Rather than concatenating a fixed number of person-specific channels, the encoder uses the same shared network for every person and aggregates by a commutative operator. This allows the same module to handle scenes with 2, 3, or 4 speakers (Zhu et al., 5 Aug 2025). A plausible implication is that such a design avoids overfitting to a fixed speaker count, though the paper does not formalize this as a theoretical claim.
The paper contrasts this with AnimateAnyone, stating that the baseline “struggles with multi-person scenarios, as its encoder jointly drives all subjects, while CovOG’s MPE models each person independently and aggregates their effects” (Zhu et al., 5 Aug 2025). That contrast defines the essential conceptual distinction: a single joint pose field for all people is treated as ambiguous, whereas person isolation before encoding is intended to reduce confusion in crowded pose conditions.
4. Empirical evidence in CovOG
The empirical support for MPE in CovOG comes primarily from ablation studies on the multi-human talking video generation task (Zhu et al., 5 Aug 2025). The quantitative ablation in Table 1 compares full CovOG to CovOG without MPE.
| Setting | Full CovOG | CovOG without MPE |
|---|---|---|
| Two Human SSIM | 0.62 | 0.60 |
| Two Human PSNR | 19.16 | 18.88 |
| Two Human FVD | 306.01 | 317.41 |
| Multiple Human SSIM | 0.66 | 0.65 |
| Multiple Human PSNR | 20.21 | 20.00 |
| Multiple Human FVD | 308.68 | 330.50 |
| All Test SSIM | 0.64 | 0.63 |
| All Test PSNR | 19.69 | 19.44 |
| All Test FVD | 307.35 | 323.96 |
Removing MPE lowers SSIM and PSNR and worsens FVD, with the FVD degradation particularly pronounced in multi-human settings (Zhu et al., 5 Aug 2025). The paper states that “The absence of MPE results in the most significant decline, as torso control—essential for multi-person pose generation—heavily impacts visual quality,” and that “Character and background consistency degrade without MPE” (Zhu et al., 5 Aug 2025).
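To make "particularly pronounced" concrete, the relative FVD increase can be computed directly from the Table 1 values. The snippet below only does arithmetic on the reported numbers; the dictionary layout and variable names are illustrative.

```python
# (full CovOG, CovOG without MPE) FVD values copied from Table 1; lower is better.
fvd = {
    "Two Human": (306.01, 317.41),
    "Multiple Human": (308.68, 330.50),
    "All Test": (307.35, 323.96),
}

# Percentage by which FVD rises when MPE is removed, per test split.
relative_increase = {
    split: round(100.0 * (without - full) / full, 2)
    for split, (full, without) in fvd.items()
}

for split, pct in relative_increase.items():
    print(f"{split}: FVD rises by {pct}% without MPE")
```

The increase is roughly 3.7% for two-human clips, 7.1% for multiple-human clips, and 5.4% over the full test set, so the multiple-human split degrades about twice as much in relative terms as the two-human split.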
The user study ablation also favors MPE. Full CovOG receives Character consistency 2.93, Background consistency 4.11, AV alignment 3.22, and Visual quality 3.34, whereas CovOG w/o MPE receives 2.64, 3.55, 2.79, and 2.5, respectively (Zhu et al., 5 Aug 2025). These results tie MPE to better character consistency, background consistency, and overall visual quality.
The evidence, however, is narrow in scope. The paper compares MPE against the absence of MPE and against AnimateAnyone’s joint multi-subject conditioning, but does not compare the additive pooling design against stronger alternatives such as attention-based set encoders, graph relational encoders, or slot-based person encoders (Zhu et al., 5 Aug 2025). This suggests that the empirical case supports the utility of multi-person isolation plus aggregation, but does not establish the optimality of summation as the aggregation rule.
5. Relation to adjacent multi-human encoding paradigms
Although the term “Multi-Human Pose Encoder” is explicit in CovOG, closely related representational ideas appear in other recent work. In PAVE-Net, multi-person video pose estimation is formulated as direct set prediction over a clip using a fixed set of learned pose queries. The combination of the Spatial Encoder and Spatiotemporal Pose Decoder functions as a joint multi-human latent representation, even though it is not named MPE (Yu et al., 17 Nov 2025). There, each person is represented by a latent pose query token, and the decoder updates these queries through self-attention and pose-aware cross-frame attention. This is a fundamentally different design from CovOG’s independent-then-sum formulation: PAVE-Net’s multi-human representation is interaction-capable at the latent-token level, whereas CovOG’s MPE aggregates independent person embeddings only after separate encoding (Yu et al., 17 Nov 2025).
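The difference between the two designs can be illustrated with a toy experiment: change one person's token and check whether another person's representation reacts. The code below is a generic single-head self-attention without learned projections, not PAVE-Net's decoder, and `independent_encoding` is a hypothetical per-person map standing in for encode-then-sum designs.

```python
import numpy as np

def self_attention(tokens):
    """Single-head self-attention without learned projections (toy version)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

def independent_encoding(tokens):
    """Per-person map applied to each token separately (no cross-person terms)."""
    return np.tanh(tokens)

rng = np.random.default_rng(2)
people = rng.normal(size=(3, 4))   # 3 person tokens, 4 dimensions each
moved = people.copy()
moved[2] += 1.0                    # only person 2 changes pose

# Does person 0's representation react to person 2's change?
attn_reacts = not np.array_equal(self_attention(people)[0], self_attention(moved)[0])
indep_reacts = not np.array_equal(independent_encoding(people)[0],
                                  independent_encoding(moved)[0])
```

With attention, person 0's output depends on person 2's token; with independent encoding it does not, so any cross-person effect can only arise after aggregation.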
A second adjacent design appears in EMO-X, where multiple humans are represented by selected body tokens, expanded into human queries, and refined by the Scan-based Global-Local Decoder (SGLD) (Jian et al., 11 Apr 2025). The resulting query sequence is organized by person and, within each person, by primary joint, so EMO-X carries a structured latent representation indexed over both persons and primary joints per person (Jian et al., 11 Apr 2025). Its Global Mamba Block performs cross-person contextual encoding, while the Local Mamba Block performs skeleton-aware within-person refinement. Relative to CovOG’s MPE, this is a more explicitly relational and tokenized multi-human pose representation.
Earlier work on multi-person pose estimation also contributes relevant structural perspectives. “Multi-Person Pose Estimation via Column Generation” proposes a hierarchical representation in which a multi-human pose is composed of global poses and local assignments, explicitly separating person-level canonical joints from same-part ambiguity clusters (Wang et al., 2017). “Efficient Multi-Person Pose Estimation with Provable Guarantees” formulates person hypotheses as structured sets of detections within a minimum-weight set packing problem (Wang et al., 2017). These methods are not neural encoders, but they show that multi-human pose can be represented as a structured composition of person-wise entities under global consistency constraints. This suggests a broader interpretation of MPE as any module that maps a multi-person scene into an internally coherent person-structured pose representation.
A common misconception is that any module processing multiple human poses jointly necessarily models interpersonal interaction explicitly. The CovOG formulation does not support that conclusion. Its MPE models each person independently and aggregates their embeddings additively; no pairwise social relation, turn-taking graph, or cross-person attention appears in the method description (Zhu et al., 5 Aug 2025). By contrast, methods such as PAVE-Net and EMO-X do contain explicit multi-entity interaction mechanisms at the token level (Yu et al., 17 Nov 2025, Jian et al., 11 Apr 2025). The distinction is important: multi-human encoding does not imply relational reasoning unless the architecture contains explicit interaction operators.
6. Limitations, interpretation, and future directions
The main limitations of MPE as presented in CovOG are architectural simplicity and the absence of explicit relational modeling. The paper does not identify MPE-specific failure cases in isolation, but it notes broader multi-human talking generation challenges such as side-face speech alignment, identity consistency under large rotations, and rapid switching between speaking and listening (Zhu et al., 5 Aug 2025). Since MPE only handles the pose side, these limitations suggest that richer conversational modeling requires coordination with audio-driven modules such as IAD rather than pose encoding alone.
Another limitation is specification granularity. The internal architecture of the shared encoder, including layer counts, channel widths, temporal operators, and the exact rendering of the 59-keypoint pose signal, is not given in the available text (Zhu et al., 5 Aug 2025). This constrains faithful reimplementation from the paper alone and makes the released code particularly important. A plausible implication is that the conceptual contribution of MPE is clearer than its standalone architectural novelty.
From a broader research perspective, several possible directions follow from comparison with adjacent work. One direction is to replace additive pooling with learned set aggregation, such as attention-based or graph-based fusion. Another is to preserve person-specific tokens throughout the generative backbone, as in query-based systems like PAVE-Net or EMO-X, rather than collapsing all persons into a single summed tensor (Yu et al., 17 Nov 2025, Jian et al., 11 Apr 2025). A third direction is to incorporate visibility- or occlusion-aware completion before aggregation, analogous to explicit occlusion reasoning in bottom-up multi-person 3D pose estimation (Liu et al., 2022). This suggests that future MPE variants may evolve from simple multi-person conditioning modules into richer scene-level human-structure encoders.
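As a sketch of the first direction, additive pooling could be swapped for a softmax-weighted sum over people. This is a hypothetical variant, not something proposed or evaluated in CovOG; `attention_pool` and its `query` vector (which would be learned in practice and is random here) are assumptions for illustration.

```python
import numpy as np

def sum_pool(feats):
    """Plain additive pooling over the person axis."""
    return feats.sum(axis=0)

def attention_pool(feats, query):
    """Softmax-weighted sum over people; `query` is a hypothetical learned vector."""
    flat = feats.reshape(feats.shape[0], -1)            # (N, D)
    scores = flat @ query / np.sqrt(flat.shape[1])      # one score per person
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return np.tensordot(weights, feats, axes=1)         # same shape as one feature map

rng = np.random.default_rng(3)
feats = rng.random((3, 6, 6))      # 3 people, 6x6 feature maps
query = rng.normal(size=36)

pooled = attention_pool(feats, query)
pooled_shuffled = attention_pool(feats[[2, 0, 1]], query)
```

Because each person's weight depends only on that person's own features and the shared query, this variant stays permutation-invariant and accepts any number of people, while letting the model emphasize some people over others.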
In the current literature considered here, MPE is best characterized as a set-style multi-person pose encoder whose defining principle is independent per-person encoding plus permutation-invariant aggregation. In CovOG, that design provides a practical extension of single-person pose conditioning to conversational scenes with 2–4 people, improves visual quality and consistency over a no-MPE ablation, and remains compatible with AnimateAnyone-style diffusion conditioning (Zhu et al., 5 Aug 2025). At the same time, related research indicates that multi-human pose encoding can also be realized through person queries, structured token sets, or hierarchical optimization-based representations, suggesting that MPE is less a single architecture than a family of techniques for representing multiple articulated humans within one computational model (Yu et al., 17 Nov 2025, Jian et al., 11 Apr 2025, Wang et al., 2017).