NoPo-Avatar: Pose-Free 3D Human Reconstruction

Updated 3 July 2026

The paper introduces a feed-forward method that reconstructs a 3D human avatar in a canonical T-pose, effectively sidestepping noisy pose inputs at test time.
It fuses a canonical template branch with image-based detail branches, ensuring both sharp detail and plausible inpainting in sparse-view scenarios.
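The approach uses a two-branch transformer with per-primitive LBS weights and Gaussian splatting. On THuman2.0, XHuman, and HuGe100K it outperforms pose-dependent baselines when test-time poses must be estimated, and is comparable to them when ground-truth poses are available.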

NoPo-Avatar is a feed-forward, one-stage method for reconstructing an animatable 3D human avatar from one or a few images alone, without camera or body-pose estimates at test time. Its central design is to recover shape and appearance in a canonical T-pose space, predict per-primitive linear blend skinning (LBS) weights, and defer articulation to a post-hoc LBS step followed by Gaussian splatting. This formulation is motivated by the observation that pose-dependent reconstruction degrades significantly if pose estimates are noisy, and it is evaluated on THuman2.0, XHuman, and HuGe100K, where it outperforms existing baselines in practical settings without ground-truth poses and delivers comparable results in lab settings with ground-truth poses (Wen et al., 20 Nov 2025).

1. Problem setting and conceptual scope

Most existing generalizable avatar methods assume accurate test-time camera and SMPL-X poses in order to locate correspondences across views and gather pose-aligned features. NoPo-Avatar is defined against that assumption: it reconstructs avatars solely from images, without any pose input, and thereby removes a major failure mode associated with noisy off-the-shelf estimators. The method is described as avoiding the “garbage-in, garbage-out” problem of noisy pose inputs while still supporting arbitrary novel-view and novel-pose synthesis (Wen et al., 20 Nov 2025).

The formulation is organized around two modules. The reconstruction module

$\mathrm{Recon}(\{I_n, M_n\}_{n=1}^{N}) \to G$

consumes $N$ images $I_n \in \mathbb{R}^{H \times W \times 3}$ and masks $M_n \in \{0,1\}^{H \times W}$ and outputs a canonical Gaussian-based avatar $G$. The rendering module

$\mathrm{Render}(G; E, K, P) \to (I, M)$

warps $G$ under a target pose $P$ and cameras $(E, K)$ via LBS, then applies Gaussian splatting to yield novel-view, novel-pose images and masks. “Without human poses” therefore refers specifically to the reconstruction input at test time, not to the later animation stage, where the target pose remains an explicit control variable.

A common misconception is to treat pose-free reconstruction and pose-free animation as the same requirement. In NoPo-Avatar they are separated. Reconstruction is performed from images and masks alone, whereas animation is obtained later by applying LBS in canonical space. That factorization is the method’s defining technical choice.

2. Canonical avatar representation

NoPo-Avatar represents the avatar as the union of two splatter-image branches:

$G = G^T \cup G^I.$

The template branch $G^T$ encodes a deformable canonical template, specifically an average SMPL-X T-pose plus learned residuals. The image branches $G^I$, one per input image, predict one Gaussian primitive per foreground pixel in each input image (Wen et al., 20 Nov 2025).
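The per-pixel prediction and the union of branches can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's code: the function names `gather_foreground_primitives` and `union_branches`, the 14-channel parameter layout, and the toy sizes are all hypothetical.

```python
import numpy as np

def gather_foreground_primitives(param_map: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep one parameter vector per foreground pixel.

    param_map: (H, W, C) per-pixel Gaussian parameters predicted for one view.
    mask:      (H, W) binary foreground mask M_n.
    Returns an array of shape (num_foreground_pixels, C), in row-major pixel order.
    """
    if param_map.shape[:2] != mask.shape:
        raise ValueError("param_map and mask must share H and W")
    return param_map[mask.astype(bool)]

def union_branches(template: np.ndarray, per_view: list) -> np.ndarray:
    """G = G^T ∪ G^I: stack template primitives and all image-branch primitives."""
    return np.concatenate([template, *per_view], axis=0)

# Toy example: a 4x4 view with 3 foreground pixels.
# Assumed channel layout (hypothetical): 3 mean + 3 scale + 4 quaternion
# + 1 opacity + 3 color = 14 channels; LBS weights are omitted for brevity.
rng = np.random.default_rng(0)
C = 14
param_map = rng.normal(size=(4, 4, C))
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1, 1] = mask[1, 2] = mask[2, 2] = 1

view_prims = gather_foreground_primitives(param_map, mask)
template_prims = rng.normal(size=(10, C))  # hypothetical template with 10 primitives
avatar = union_branches(template_prims, [view_prims])
print(view_prims.shape, avatar.shape)
```

The point of the sketch is that the image branches contribute a variable number of primitives (one per masked pixel), while the template contributes a fixed set regardless of what the inputs show.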

Each primitive

$g = (\mu, s, q, \alpha, c, w)$

contains a mean $\mu$, scale $s$, quaternion rotation $q$, opacity $\alpha$, spherical-harmonics color $c$, and LBS weights $w$, with one weight per SMPL-X bone. At test time, articulation is performed by posing the canonical primitives with LBS,

$G_P = \mathrm{LBS}(G; P),$

followed by splatting the posed primitives under the target camera,

$(I, M) = \mathrm{Splat}(G_P; E, K).$
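The articulation step can be sketched in NumPy as follows. This is a minimal illustration of linear blend skinning applied to Gaussian primitives, not the authors' implementation: the function name `articulate`, the array shapes, and the choice to transform covariances by the blended linear part are assumptions.

```python
import numpy as np

def articulate(means, covariances, lbs_weights, bone_transforms):
    """Pose canonical Gaussians with linear blend skinning.

    means:           (P, 3) canonical (T-pose) Gaussian centers.
    covariances:     (P, 3, 3) canonical covariances.
    lbs_weights:     (P, J) per-primitive weights; each row sums to 1.
    bone_transforms: (J, 4, 4) canonical-to-posed transform of each bone.
    """
    # Blend the J bone transforms per primitive: T_p = sum_j w_pj * T_j.
    blended = np.einsum("pj,jab->pab", lbs_weights, bone_transforms)
    linear, offset = blended[:, :3, :3], blended[:, :3, 3]
    posed_means = np.einsum("pab,pb->pa", linear, means) + offset
    # Assumption: covariances follow the blended linear part, A @ Sigma @ A^T.
    posed_covs = linear @ covariances @ np.transpose(linear, (0, 2, 1))
    return posed_means, posed_covs

# Two bones: bone 0 stays fixed, bone 1 is translated by +1 along y.
identity = np.eye(4)
lift = np.eye(4)
lift[1, 3] = 1.0
bones = np.stack([identity, lift])

means = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
covs = np.tile(0.01 * np.eye(3), (2, 1, 1))
weights = np.array([[1.0, 0.0],   # fully bound to bone 0
                    [0.5, 0.5]])  # shared equally between both bones

posed_means, posed_covs = articulate(means, covs, weights, bones)
print(posed_means)
```

In this toy case the first primitive does not move, and the second moves halfway along the translation of bone 1, which is the blending behavior that per-primitive weights are meant to provide.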

The two-branch decomposition is not incidental. The template branch provides a canonical human-shape prior and supports plausible inpainting in unseen regions, while the image branches preserve visible subject-specific detail. In the reported dual-branch ablation, the template-only variant yields coarse inpainting but misses details; the image-only variant reconstructs only visible regions; the combined representation gives both sharp detail and plausible inpainting. That ablation is central to understanding why NoPo-Avatar is both generalizable and animatable.

3. Encoder–decoder architecture and objective function

The reconstruction network is a two-branch transformer. The template encoder is a learnable embedding $F^T$, constant across subjects and used to inject a canonical human-shape prior. The shared image encoder is ViT-based and is applied independently to each input, producing per-view features $F^I_n$. A stack of decoder blocks then performs cross-attention, where each branch feature, $F^T$ or $F^I_n$, cross-attends to all other branches’ features, propagating information between template and views and also between views (Wen et al., 20 Nov 2025).

Prediction is performed by DPT-based heads. The template head regresses residuals over the SMPL-X UV rasterization to produce the template primitives $G^T$, and the image heads directly regress one primitive for each masked pixel of each input image. The model is trained end-to-end with a weighted sum of five terms,

$\mathcal{L} = \mathcal{L}_{\mathrm{mse}} + \lambda_{\mathrm{perc}} \mathcal{L}_{\mathrm{perc}} + \lambda_{\mathrm{3D}} \mathcal{L}_{\mathrm{3D}} + \lambda_{\mathrm{proj}} \mathcal{L}_{\mathrm{proj}} + \lambda_{\mathrm{lbs}} \mathcal{L}_{\mathrm{lbs}}.$

The components are:

$\mathcal{L}_{\mathrm{mse}}$, photometric MSE.
$\mathcal{L}_{\mathrm{perc}}$, a perceptual loss.
$\mathcal{L}_{\mathrm{3D}}$, encouraging the two branches to agree in 3D.
$\mathcal{L}_{\mathrm{proj}}$, a projection loss that enforces each image-branch Gaussian to explain its input view photometrically and to project its mean back to its pixel location.
$\mathcal{L}_{\mathrm{lbs}}$, an LBS-weight loss, where pseudo weights are rasterized from the SMPL-X training mesh.

The paper reports fixed values for the loss weights (Wen et al., 20 Nov 2025).
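The reprojection part of the projection loss, and the way the terms combine into one objective, can be sketched as follows. This is a hedged illustration: it assumes a pinhole camera, a squared pixel distance, and Gaussian means that have already been moved into the input view's camera frame (which at training time requires the training pose and camera). The function names, term names, and weight values are hypothetical and are not the paper's.

```python
import numpy as np

def project_pinhole(points_cam, K):
    """Project camera-frame points (P, 3) with z > 0 to pixel coordinates (P, 2)."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_term(means_cam, source_pixels, K):
    """Mean squared distance between projected means and their source pixels."""
    diff = project_pinhole(means_cam, K) - source_pixels
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def weighted_total(terms, weights):
    """Weighted sum of named loss terms; every term needs a weight."""
    return float(sum(weights[name] * value for name, value in terms.items()))

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
means_cam = np.array([[0.0, 0.0, 2.0],
                      [0.2, -0.1, 2.0]])
pixels = np.array([[50.0, 50.0],
                   [60.0, 45.0]])  # pixels each Gaussian was predicted from

aligned = reprojection_term(means_cam, pixels, K)
shifted = reprojection_term(means_cam, pixels + np.array([3.0, 4.0]), K)

# Placeholder weights for illustration only; the paper's values are not reproduced here.
total = weighted_total({"mse": 1.0, "proj": shifted}, {"mse": 1.0, "proj": 0.5})
print(aligned, shifted, total)
```

The reprojection term is zero exactly when every image-branch mean lands on the pixel it was regressed from, which ties each per-pixel Gaussian to its own input view.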

The ablations directly support the role of these losses. Without the projection loss, image branches collapse and the model renders only a blurry template. Without the LBS-weight loss, primitives fail to align in T-pose and LBS weights degenerate. These results identify the model not simply as a Gaussian-splat regressor, but as a carefully constrained canonicalization-and-skinning system.

4. Training regime and inference behavior

The training sets are THuman2.0, THuman2.1, and HuGe100K. THuman2.0 contains 426 subjects with 64 multiview renders. THuman2.1 contains thousands of subjects. HuGe100K contains 100K+ diffusion-rendered avatars. The model is trained with one or a few input images plus masks (Wen et al., 20 Nov 2025).

Training is progressive in resolution: two lower-resolution stages of 300K iterations each, followed by 50K iterations at full resolution, for which the paper reports two possible sizes. The optimizer and learning-rate schedule are described as the same as in NoPoSplat. Batch size is 4 on THuman and up to 16 on HuGe100K. Training uses NVIDIA L40S/H200 GPUs, and total training takes approximately 12 days.
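The three-stage schedule can be expressed as a simple iteration-to-stage lookup. The stage lengths (300K, 300K, 50K) come from the text; the stage labels `low`, `mid`, and `full` are placeholders, and the helper `stage_at` is a hypothetical name, not part of any released code.

```python
from bisect import bisect_right
from itertools import accumulate

# (label, number of iterations) per stage; labels are placeholders.
STAGES = [("low", 300_000), ("mid", 300_000), ("full", 50_000)]

# Cumulative end of each stage: [300000, 600000, 650000].
BOUNDARIES = list(accumulate(length for _, length in STAGES))

def stage_at(iteration: int) -> str:
    """Return the label of the training stage that a 0-based iteration falls in."""
    if not 0 <= iteration < BOUNDARIES[-1]:
        raise ValueError(f"iteration must be in [0, {BOUNDARIES[-1]})")
    return STAGES[bisect_right(BOUNDARIES, iteration)][0]

for it in (0, 299_999, 300_000, 600_000, 649_999):
    print(it, stage_at(it))
```

Under these stage lengths the full schedule spans 650K iterations, with only the final 50K run at full resolution.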

At inference, the reconstruction stage remains feed-forward. NoPo-Avatar does not require camera or body-pose estimates as test-time reconstruction input. For animation, however, the user supplies a desired SMPL-X shape $\beta$ and pose $\theta$, or chooses any new pose, after which the system computes

$G_{\beta,\theta} = \mathrm{LBS}(G; \beta, \theta)$

and renders through Gaussian splatting. The paper also reports an approximate runtime for a fixed number of input images (Wen et al., 20 Nov 2025).

This reconstruction-then-articulation separation is the method’s primary operational property. By predicting all Gaussians in T-pose and learning per-primitive LBS weights, the model factors pose-invariant shape and appearance from pose-dependent deformation. During inference, no image-branch Gaussian “knows” the input camera or pose; articulation is applied afterward as a geometric operation.

5. Quantitative evaluation, ablations, and failure modes

The reported evaluation uses PSNR (higher is better), LPIPS*, and FID (lower is better for both). On THuman2.0 novel-view synthesis with 3 inputs and predicted test-time poses, LIFe-GoM achieves 19.70 PSNR, 146.19 LPIPS*, and 63.34 FID, whereas NoPo-Avatar achieves 22.49 PSNR, 105.45 LPIPS*, and 42.19 FID. With ground-truth poses on the same benchmark, LIFe-GoM achieves 24.65, 110.82, and 51.27, while NoPo-Avatar, which takes no pose input, remains at 22.49, 105.45, and 42.19. On single-view HuGe100K, IDOL reports 20.89, 111.68, and 16.91; LHM-1B reports 17.48, 129.63, and 25.65; NoPo-Avatar reports 23.15, 90.63, and 15.56. On cross-domain XHuman novel-pose synthesis, where an asterisk on a method name marks test-time pose optimization, LIFe-GoM* reports 23.98, 116.42, and 42.64, while NoPo-Avatar* reports 24.79, 103.09, and 34.68 (Wen et al., 20 Nov 2025).

Setting	Baseline(s): PSNR / LPIPS* / FID	NoPo-Avatar: PSNR / LPIPS* / FID
THuman2.0 novel view, 3 inputs, predicted poses	LIFe-GoM: 19.70 / 146.19 / 63.34	22.49 / 105.45 / 42.19
THuman2.0 novel view, 3 inputs, ground-truth poses	LIFe-GoM: 24.65 / 110.82 / 51.27	22.49 / 105.45 / 42.19
HuGe100K single-view	IDOL: 20.89 / 111.68 / 16.91; LHM-1B: 17.48 / 129.63 / 25.65	23.15 / 90.63 / 15.56
XHuman cross-domain novel-pose, test-time pose optimization	LIFe-GoM*: 23.98 / 116.42 / 42.64	24.79 / 103.09 / 34.68

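The practical interpretation supplied by the paper is explicit: because reconstruction takes no pose input, NoPo-Avatar is unaffected by pose noise, which is why its THuman2.0 numbers are identical in the predicted-pose and ground-truth-pose settings. With ground-truth poses, its results are presented as comparable in lab settings rather than uniformly dominant (LIFe-GoM has the higher PSNR there, while NoPo-Avatar has the lower LPIPS* and FID), which matters for a balanced reading of the method’s claims.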
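For readers less familiar with the first metric, PSNR is a logarithmic transform of mean squared error. The sketch below uses the standard definition for images scaled to a known peak value; the function name `psnr` and the assumption of a `[0, 1]` intensity range are illustrative, and the paper's exact evaluation protocol (for example masking or cropping) is not reproduced here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)  # constant error of 0.1 gives an MSE of 0.01
print(psnr(pred, target))
```

Because the scale is logarithmic, a gap such as 22.49 versus 19.70 dB (2.79 dB) would correspond to roughly 1.9 times lower MSE if both scores were computed from a single aggregate MSE; averaging per-image PSNR values makes that conversion only approximate.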

The qualitative and ablation findings are equally specific. Template-only reconstruction yields coarse inpainting but misses details; image-only reconstruction covers only visible regions; the combined reconstruction provides sharp detail and plausible inpainting. Without the projection loss, image branches collapse and the model renders only a blurry template. Without the LBS-weight loss, primitives fail to align in T-pose and LBS weights degenerate. The reported failure modes are that hands and expressions remain blurry when heavily occluded, large unseen areas can be over-smoothed, and the diffusion-rendered HuGe100K data sometimes introduces multiview inconsistency, leading to semi-transparent artifacts.

6. Position within avatar research

NoPo-Avatar sits in a broader line of work on generalizable, animatable avatar reconstruction, but it makes a narrower claim than some neighboring systems: it removes dependence on test-time human poses for reconstruction, not the need for a target pose when performing animation. That distinction separates it from earlier pose-conditioned pipelines such as "Neural Image-based Avatars: Generalizable Radiance Fields for Human Avatar Modeling" (Kwon et al., 2023). NIA takes a small set of calibrated RGB images of a new subject in a reference pose, with foreground masks and a fitted SMPL model for each view, and combines an implicit-body NeRF representation with an image-based rendering branch via a hybrid appearance blending module. In that sense, NIA addresses sparse-view generalization and pose transfer, but it retains explicit body-model conditioning at inference.

A different neighboring direction is "Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior" (Guo et al., 3 Mar 2025). Vid2Avatar-Pro creates photorealistic and animatable 3D human avatars from monocular in-the-wild videos by learning a universal prior model from a large corpus of multi-view clothed human performance capture data, representing geometry and appearance with expressive 3D Gaussians in canonical space, and fine-tuning via inverse rendering. Its emphasis is photorealistic video-based personalization supported by a learned universal prior, whereas NoPo-Avatar emphasizes feed-forward reconstruction from one or a few images without test-time pose input.

There is also a distinct single-image art-avatar line represented by "AniArtAvatar: Animatable 3D Art Avatar from a Single Image" (Li, 2024). AniArtAvatar uses a view-conditioned 2D diffusion model to synthesize multi-view images from a single art portrait with a neutral expression, reconstructs a static avatar using an SDF-based neural surface, extracts and projects landmarks into 3D, transfers motion with 3DMM landmarks, and animates head and torso via cages. In a separate adaptation built from AniArtAvatar, the label “NoPo-Avatar” is used for a one-shot, single-image pipeline for creating an animatable 3D art avatar with no extra pose supervision. That usage is conceptually related but methodologically distinct from the human-avatar paper NoPo-Avatar, whose published contribution is a canonical Gaussian-splat framework for sparse-input human reconstruction without test-time pose estimates.

Taken together, these comparisons place NoPo-Avatar at the intersection of canonical-space avatar modeling, Gaussian splatting, and pose-robust generalization. Its distinguishing contribution is not merely sparse-input reconstruction, but sparse-input reconstruction that explicitly removes human-pose dependence at test time while retaining post-hoc animation through learned LBS weights.