
Dream, Lift, Animate: 3D Avatar Reconstruction

Updated 3 July 2026
  • The paper introduces a three-stage pipeline (Dream, Lift, Animate) to reconstruct animatable 3D human avatars from one RGB image using generative diffusion and unstructured Gaussian representations.
  • It unifies pose-conditioned multi-view synthesis with transformer-based latent structuring for state-of-the-art photometric accuracy and robust real-time rendering.
  • Experimental results on ActorsHQ and 4D-Dress datasets show superior LPIPS, PSNR, and SSIM metrics compared to earlier single-image reconstruction methods.

Dream, Lift, Animate (DLA) is an end-to-end differentiable pipeline designed for reconstructing high-fidelity, animatable 3D human avatars from a single RGB image. Leveraging multi-view generative diffusion, unstructured 3D Gaussian representations, and a pose-aware UV-space mapping, DLA bridges the gap between unstructured generative models and animation-ready avatars with state-of-the-art perceptual and photometric accuracy. The method incorporates pose conditioning via SMPL-X parameters, unifies multi-view synthesis and 3D lifting, and utilizes a transformer-based latent structuring aligned to UV manifolds, providing robust animation support and efficient, real-time rendering (Bühler et al., 21 Jul 2025).

1. Pipeline Stages: Dream, Lift, Animate

DLA organizes avatar reconstruction into three sequential stages:

  1. Dream (Multi-view Generation):
    • Starting from a single input image $I_1$, SMPL-X parameters $\Theta_i$ are estimated, enabling the rendering of 2D skeletal control maps from $V$ virtual cameras.
    • A pretrained video diffusion model (e.g., UniAnimate or ControlNet) generates novel, plausible views $\{I_2^n, \ldots, I_V^n\}$, inferring appearance and geometry for occluded regions.
    • This process introduces view-to-view inconsistencies due to the inherent ambiguity in view hallucination.
  2. Lift (Unstructured Gaussian Reconstruction and Latent Encoding):
    • The $V$ views are processed by a U-Net-based reconstruction model $\mathcal{G}$, which outputs a dense set of pixel-aligned 3D Gaussians $G_k^p = (\mu_k, \Sigma_k, \alpha_k, c_k)$ in the input pose space, where $\Sigma_k$ encodes anisotropic covariance.
    • Gaussians are merged, filtered by opacity, and subsampled (using farthest-point sampling) to yield $P$ unstructured Gaussians with embeddings $X \in \mathbb{R}^{P \times C_p}$.
    • A transformer encoder performs cross-attention between Gaussian features and a UV-aligned query grid, yielding a structured latent avatar code aligned to the SMPL-X UV manifold.
  3. Animate (UV-space Gaussian Decoding and Deformation):
    • The latent code enters a Gaussian Parameter Decoder (GPD), a spatially-adaptive CNN whose conditioning includes a UV segmentation map.
    • The GPD outputs a canonical map (aligned to a neutral pose) and an offset map (encoding pose- and view-dependent corrections), both laid out in UV space.
    • Sampling these maps at UV-surface locations yields structured Gaussians in tangent space.
    • These Gaussians are skinned to a target pose and camera using linear blend skinning, producing world-space Gaussians that are rendered by Gaussian Splatting to form the final image.
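
The opacity filtering and farthest-point sampling step of the Lift stage can be illustrated with a minimal NumPy sketch. The function name, the `min_opacity` threshold, and the random choice of the first point are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def filter_and_subsample(means, opacities, num_keep, min_opacity=0.05, seed=0):
    """Drop low-opacity Gaussians, then keep `num_keep` via farthest-point sampling.

    means:     (N, 3) Gaussian centers merged across views
    opacities: (N,)   per-Gaussian opacity in [0, 1]
    Returns indices into the original arrays.
    NOTE: `min_opacity` is an illustrative threshold, not a value from the paper.
    """
    kept = np.flatnonzero(opacities >= min_opacity)
    pts = means[kept]
    if len(pts) <= num_keep:
        return kept

    rng = np.random.default_rng(seed)
    chosen = np.empty(num_keep, dtype=np.int64)
    chosen[0] = rng.integers(len(pts))  # arbitrary starting point
    # Distance from every point to its nearest already-chosen point.
    dist = np.linalg.norm(pts - pts[chosen[0]], axis=1)
    for i in range(1, num_keep):
        chosen[i] = int(np.argmax(dist))  # farthest from the current selection
        dist = np.minimum(dist, np.linalg.norm(pts - pts[chosen[i]], axis=1))
    return kept[chosen]
```

Farthest-point sampling spreads the retained Gaussians evenly over the surface, which is why it is a common choice when a fixed-size point set must cover the whole body.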

2. Mathematical Formulation of 3D Gaussian Lifting

Each Gaussian primitive is described as a 3D anisotropic Gaussian with mean $\mu_k$, covariance $\Sigma_k$, opacity $\alpha_k$, and color $c_k$. In detail:

  • The covariance $\Sigma_k$ is factorized as $\Sigma_k = R_k S_k S_k^\top R_k^\top$, where $R_k$ is a rotation and $S_k$ encodes scale.
  • The U-Net-based reconstructor $\mathcal{G}$ ingests the pose-normalized, multi-view features and outputs per-pixel variables:
    • Local offsets (position), rotation, scale, opacity $\alpha_k$, and color $c_k$.
  • All outputs are merged across views and, after opacity thresholding and subsampling, yield a final unstructured point cloud.
  • Before UV alignment, a Gaussian's mean is expressed relative to the global pose origin.
  • The feature embedding includes U-Net intermediate activations concatenated with Gaussian parameters.
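
As an illustration of how a covariance can be assembled from a rotation and a scale, the sketch below uses the form $\Sigma = R S S^\top R^\top$ that is common in Gaussian splatting. The quaternion and log-scale parameterization is an assumption borrowed from typical splatting code, not something the source states for DLA.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def build_covariance(quat, log_scale):
    """Return Sigma = R S S^T R^T for one Gaussian.

    ASSUMPTION: rotation is given as a quaternion and scale as per-axis
    log-scales; exponentiating keeps the scales strictly positive.
    """
    R = quat_to_rotmat(np.asarray(quat, dtype=np.float64))
    S = np.diag(np.exp(np.asarray(log_scale, dtype=np.float64)))
    M = R @ S
    return M @ M.T  # symmetric positive definite by construction
```

Building the covariance this way guarantees a valid (symmetric, positive-definite) matrix for any network output, which a directly predicted 3×3 matrix would not.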

3. Transformer-based Latent Structuring and Training Objectives

Structured latent features are constructed as follows:

  • Gaussian features $X$ are projected into the transformer as keys $K$ and values $\tilde{V}$ in the cross-attention mechanism.
  • The SMPL-X UV grid is rasterized into a 3D position map; a sinusoidal positional encoding maps it to queries $Q$.
  • Cross-attention is formulated as scaled dot-product attention, with $d$ the key dimension:

$\mathrm{Attn}(Q, K, \tilde{V}) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) \tilde{V}$

  • The resulting attended features are reshaped into a UV-grid-aligned tensor.
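
The structuring step can be sketched as single-head cross-attention in NumPy, with queries derived from a UV position map and keys and values derived from the unstructured Gaussian features. Everything here is a simplified assumption: random matrices stand in for learned projections, the octave-frequency encoding is one common sinusoidal variant, and the function names and `d_model` are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(pos, num_freqs=4):
    """Encode (..., 3) positions with sin/cos at octave frequencies."""
    freqs = 2.0 ** np.arange(num_freqs)                 # (F,)
    angles = pos[..., None] * freqs                     # (..., 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*pos.shape[:-1], -1)             # (..., 6F)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)             # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def uv_cross_attention(uv_positions, gaussian_feats, d_model=32, seed=0):
    """uv_positions: (H, W, 3); gaussian_feats: (P, C).

    Returns the UV-aligned latent (H, W, d_model) and the attention
    weights (H*W, P). ASSUMPTION: random projections replace learned ones.
    """
    rng = np.random.default_rng(seed)
    H, W, _ = uv_positions.shape
    q_in = sinusoidal_encoding(uv_positions).reshape(H * W, -1)
    c_q, c_g = q_in.shape[1], gaussian_feats.shape[1]
    Wq = rng.normal(size=(c_q, d_model)) / np.sqrt(c_q)
    Wk = rng.normal(size=(c_g, d_model)) / np.sqrt(c_g)
    Wv = rng.normal(size=(c_g, d_model)) / np.sqrt(c_g)
    Q, K, vals = q_in @ Wq, gaussian_feats @ Wk, gaussian_feats @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)
    return (weights @ vals).reshape(H, W, d_model), weights
```

Because the queries come from a fixed UV grid, the output always has the same spatial layout regardless of how many Gaussians were lifted or in what order they arrive.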

Supervision is end-to-end, with the following losses:

  • The Gaussian Parameter Decoder is trained with a combined objective whose terms are:
    • a photometric loss,
    • a mask loss,
    • a masked perceptual loss,
    • a PatchGAN least-squares adversarial loss,
    • a regularization term,
    • a penalty on excessive offsets.
  • The unstructured Gaussian reconstructor $\mathcal{G}$ is trained using VGG and mask losses in pose space.
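
For readers unfamiliar with the least-squares adversarial term, the sketch below shows the common 0/1-target form of the least-squares GAN objective applied to patch-wise discriminator scores. The target values, the 0.5 factor, and the function names are assumptions taken from the usual formulation; the source does not give DLA's exact variant or loss weights.

```python
import numpy as np

def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake scores to 0.

    d_real, d_fake: arrays of per-patch discriminator outputs (any shape).
    """
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_generator_loss(d_fake):
    """Least-squares generator loss: push scores on rendered images toward 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

A PatchGAN discriminator outputs a grid of scores rather than a single scalar, so averaging over the array penalizes each local patch independently.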

4. Rendering, Real-Time Constraints, and Editing

  • The Gaussian Parameter Decoder (GPD) produces two branches:
    • The canonical branch generates the canonical map: Gaussian parameters anchored to the avatar's neutral pose in UV space.
    • The offset branch creates the offset map, encoding pose/view corrections via rasterized normal maps, Plücker ray maps, and vertex offsets.
  • For a new pose/camera, linear blend skinning transforms SMPL-X joints to compute per-vertex transformations; these are applied to the canonical parameters to yield world-space Gaussian parameters.
  • Rendering employs the continuous Gaussian Splatting algorithm to produce the final image from the world-space Gaussians.
  • For fixed avatars, the canonical map can be cached; only the offset map is recomputed per frame, enabling 512×512 frame rendering at 33 FPS on an NVIDIA RTX 5880 with full animation and Gaussian splatting.
  • The UV-aligned latent supports direct, part-aware edits and identity or clothing morphing by interpolating or modifying latent patches.
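
Linear blend skinning of Gaussians can be sketched as follows: per-joint rigid transforms are blended with skinning weights into one transform per Gaussian, which is then applied to the mean and covariance. The array layout and function names are hypothetical, and in practice the weights and joint transforms would come from the SMPL-X body model rather than being supplied by hand.

```python
import numpy as np

def lbs_transform_points(points, skin_weights, joint_transforms):
    """Apply linear blend skinning to Gaussian means.

    points:           (N, 3) canonical positions
    skin_weights:     (N, J) non-negative, each row summing to 1
    joint_transforms: (J, 4, 4) homogeneous joint transforms
    Returns posed points (N, 3) and the blended 3x3 linear parts (N, 3, 3).
    """
    # Weighted sum of joint transforms gives one 4x4 matrix per point.
    blended = np.einsum("nj,jab->nab", skin_weights, joint_transforms)
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3], blended[:, :3, :3]

def transform_covariances(covs, linear_parts):
    """Map each covariance Sigma to A Sigma A^T using the blended linear part A."""
    return np.einsum("nab,nbc,ndc->nad", linear_parts, covs, linear_parts)
```

Transforming the covariance along with the mean keeps each Gaussian's orientation attached to the body surface as the pose changes.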

5. Quantitative Results and Comparisons

DLA demonstrates state-of-the-art performance on ActorsHQ and 4D-Dress datasets, evaluated using PSNR (photometric), SSIM (structural), and LPIPS (perceptual similarity):

Dataset | Method | Novel views (LPIPS ↓ / PSNR ↑ / SSIM ↑) | Novel poses (LPIPS ↓ / PSNR ↑ / SSIM ↑)
ActorsHQ | DLA | 0.0580 / 25.58 / 0.9279 | 0.0471 / 26.41 / 0.9351
ActorsHQ | IDOL (baseline) | 0.0696 / 24.48 / 0.9261 | 0.0940 / 22.80 / 0.9194
4D-Dress | DLA | 0.0594 / 24.95 / 0.9294 | not reported
4D-Dress | IDOL (baseline) | 0.0904 / 23.04 / 0.9226 | not reported

Reported metrics establish superiority over prior single-image methods, including DreamGaussian, SiTH, SIFU, and the concurrent animatable Gaussian approach IDOL (Bühler et al., 21 Jul 2025).
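
For reference, PSNR is computed from the mean squared error between a rendered and a ground-truth image. The sketch below assumes images scaled to [0, 1] (the `max_val` parameter); the evaluation protocol behind the reported numbers, such as masking or cropping, is not described in the source and is not reproduced here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target.

    ASSUMPTION: both images share the same shape and a [0, max_val] range.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.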

6. Limitations and Potential Advancements

DLA currently exhibits several limitations:

  • Color leakage occurs when body parts are in close proximity (e.g., hand near torso).
  • The pipeline cannot fully correct major inconsistencies inherited from hallucinated multi-views in the "Dream" stage.
  • The method is susceptible to slight identity drift in facial details due to the limited high-resolution facial data in training sets.

Anticipated future directions include:

  • Expanding training on large-scale, in-the-wild monocular datasets to enhance robustness.
  • Integrating higher-resolution facial priors to improve identity preservation.
  • Refining the generative diffusion-based "Dream" stage to mitigate view inconsistency before 3D lifting (Bühler et al., 21 Jul 2025).
References (1)
