Dream, Lift, Animate: 3D Avatar Reconstruction

Updated 3 July 2026

The paper introduces a three-stage pipeline (Dream, Lift, Animate) to reconstruct animatable 3D human avatars from one RGB image using generative diffusion and unstructured Gaussian representations.
It unifies pose-conditioned multi-view synthesis with transformer-based latent structuring for state-of-the-art photometric accuracy and robust real-time rendering.
Experimental results on ActorsHQ and 4D-Dress datasets show superior LPIPS, PSNR, and SSIM metrics compared to earlier single-image reconstruction methods.

Dream, Lift, Animate (DLA) is an end-to-end differentiable pipeline designed for reconstructing high-fidelity, animatable 3D human avatars from a single RGB image. Leveraging multi-view generative diffusion, unstructured 3D Gaussian representations, and a pose-aware UV-space mapping, DLA bridges the gap between unstructured generative models and animation-ready avatars with state-of-the-art perceptual and photometric accuracy. The method incorporates pose conditioning via SMPL-X parameters, unifies multi-view synthesis and 3D lifting, and utilizes a transformer-based latent structuring aligned to UV manifolds, providing robust animation support and efficient, real-time rendering (Bühler et al., 21 Jul 2025).

1. Pipeline Stages: Dream, Lift, Animate

DLA organizes avatar reconstruction into three sequential stages:

Dream (Multi-view Generation):
- Starting from a single input image $I_1$, SMPL-X parameters $\Theta_i$ are estimated, enabling the rendering of 2D skeletal control maps from $V$ virtual cameras.
- A pretrained video diffusion model (e.g., UniAnimate or ControlNet) generates novel, plausible views $\{I_2^n, \ldots, I_V^n\}$, inferring appearance and geometry for occluded regions.
- This process introduces view-to-view inconsistencies due to the inherent ambiguity in view hallucination.
Lift (Unstructured Gaussian Reconstruction and Latent Encoding):
- The $V$ views are processed by a U-Net-based reconstruction model $\mathcal{G}$, which outputs a dense set of pixel-aligned 3D Gaussians $G_k^p = (\mu_k, \Sigma_k, \alpha_k, c_k)$ in the input pose space, where $\mu_k$ is the mean, $\Sigma_k$ the anisotropic covariance, $\alpha_k$ the opacity, and $c_k$ the color.
- Gaussians are merged, filtered by opacity, and subsampled (using farthest-point sampling) to yield $P$ unstructured Gaussians with embeddings $X \in \mathbb{R}^{P \times C_p}$.
- A transformer encoder $\mathcal{E}$ performs cross-attention between Gaussian features and a UV-aligned query grid, yielding a structured latent avatar code $Z$ aligned to the SMPL-X UV manifold.
Animate (UV-space Gaussian Decoding and Deformation):
- The latent code $Z$ enters a Gaussian Parameter Decoder (GPD), a spatially-adaptive CNN whose conditioning includes a UV segmentation map.
- The GPD outputs a canonical map $M_c$ (aligned to a neutral pose) and an offset map $M_o$ (encoding pose- and view-dependent corrections), both defined in UV space.
- Sampling these maps at $N$ UV-surface locations yields structured Gaussians $G_k^s$ in tangent space.
- These Gaussians are skinned to a target pose $\Theta_t$ and camera $\pi_t$ using linear blend skinning, producing world-space Gaussians that are rendered with Gaussian Splatting to form the final image $\hat{I}$.
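
The opacity filtering and farthest-point subsampling used in the Lift stage can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the greedy formulation, and the threshold value `0.05` are assumptions made for the example.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, num_samples: int, start: int = 0) -> np.ndarray:
    """Greedily pick indices of points that are mutually far apart."""
    n = points.shape[0]
    if not 0 < num_samples <= n:
        raise ValueError("num_samples must be in [1, len(points)]")
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = start
    # Squared distance from each point to its nearest already-selected point.
    min_sq_dist = np.sum((points - points[start]) ** 2, axis=1)
    for i in range(1, num_samples):
        nxt = int(np.argmax(min_sq_dist))  # farthest from the current selection
        selected[i] = nxt
        sq_dist = np.sum((points - points[nxt]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, sq_dist)
    return selected


def filter_and_subsample(means: np.ndarray, opacities: np.ndarray,
                         num_keep: int, opacity_threshold: float = 0.05) -> np.ndarray:
    """Drop near-transparent Gaussians, then subsample the rest to num_keep.

    The threshold is a hypothetical value chosen for illustration.
    """
    keep = np.flatnonzero(opacities > opacity_threshold)
    if keep.size <= num_keep:
        return keep
    local = farthest_point_sampling(means[keep], num_keep)
    return keep[local]
```

The returned indices can be used to gather the matching rows of the Gaussian parameters and of the embedding matrix $X$, so that all per-Gaussian arrays stay aligned.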

2. Mathematical Formulation of 3D Gaussian Lifting

Each Gaussian primitive is described as a 3D anisotropic Gaussian $G_k = (\mu_k, \Sigma_k, \alpha_k, c_k)$. In detail:

The covariance $\Sigma_k$ is factorized as $\Sigma_k = R_k S_k S_k^\top R_k^\top$, where $R_k$ is a rotation and $S_k$ encodes scale.
The U-Net-based reconstructor $\mathcal{G}$ ingests the pose-normalized, multi-view features and outputs per-pixel variables:
- Local offsets $\Delta\mu_k$ (position), $\Delta r_k$ (rotation), $\Delta s_k$ (scale), opacity $\alpha_k$, and color $c_k$.
All outputs are merged across views and, after opacity thresholding and subsampling, yield a final unstructured point cloud.
Before UV alignment, a Gaussian's mean $\mu_k$ is expressed relative to the global pose origin.
The feature embedding $X$ includes U-Net intermediate activations concatenated with Gaussian parameters.
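
To make the covariance factorization concrete, the sketch below builds a covariance matrix from a rotation and per-axis scales. It assumes the rotation is stored as a unit quaternion in `(w, x, y, z)` order and the scale as three positive numbers, as is common in Gaussian Splatting code; the source does not state the parameterization, and the function names are illustrative.

```python
import numpy as np


def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def covariance(q, scale) -> np.ndarray:
    """Build Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(np.asarray(scale, dtype=float))
    M = R @ S
    return M @ M.T
```

Factorizing the covariance this way means an optimizer can update rotation and scale freely without ever producing an invalid (non-PSD) covariance.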

3. Transformer-based Latent Structuring and Training Objectives

Structured latent features $Z$ are constructed as follows:

Gaussian features $X$ are projected into the transformer as keys $\mathbf{K}$ and values $\mathbf{V}$ (distinct from the view count $V$) in the cross-attention mechanism.
The SMPL-X UV grid is rasterized into a 3D position map $P_{uv}$; a sinusoidal positional encoding maps it to queries $\mathbf{Q}$.
Cross-attention is formulated as:

$\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\mathbf{Q}\mathbf{K}^\top / \sqrt{d}\right)\mathbf{V}$

where $d$ is the key dimension. The resulting attended features are reshaped into a UV-grid-aligned tensor $Z$.
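
The structuring step can be sketched as a single-head cross-attention in NumPy. This is an illustration under stated assumptions rather than the paper's architecture: the single head, the power-of-two frequency bands, the projection matrices `W_q`, `W_k`, `W_v`, and all function names are hypothetical.

```python
import numpy as np


def sinusoidal_encoding(positions: np.ndarray, num_freqs: int) -> np.ndarray:
    """Encode (N, 3) positions into (N, 6 * num_freqs) sin/cos features."""
    freqs = 2.0 ** np.arange(num_freqs)                      # assumed frequency bands
    angles = positions[:, :, None] * freqs[None, None, :]    # (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(positions.shape[0], -1)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, feats, W_q, W_k, W_v):
    """Queries come from the UV grid; keys and values come from Gaussian features."""
    Q, K, V = queries @ W_q, feats @ W_k, feats @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (num_queries, num_gaussians)
    return weights @ V, weights


def structure_latent(position_map, feats, W_q, W_k, W_v, num_freqs=4):
    """Map unstructured Gaussian features onto an (H, W, C) UV-aligned grid."""
    H, W, _ = position_map.shape
    queries = sinusoidal_encoding(position_map.reshape(-1, 3), num_freqs)
    out, _ = cross_attention(queries, feats, W_q, W_k, W_v)
    return out.reshape(H, W, -1)
```

Because every UV texel issues its own query, the output has a fixed grid layout regardless of how many unstructured Gaussians are supplied as keys and values.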

Supervision is end-to-end, with the following losses:

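The Gaussian Parameter Decoder is trained via

$\mathcal{L}_{\text{GPD}} = \lambda_{\text{photo}} \mathcal{L}_{\text{photo}} + \lambda_{\text{mask}} \mathcal{L}_{\text{mask}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{reg}} \mathcal{L}_{\text{reg}} + \lambda_{\text{off}} \mathcal{L}_{\text{off}}$

where the $\lambda$ terms are scalar weights and
- $\mathcal{L}_{\text{photo}}$ is a pixel-wise photometric loss,
- $\mathcal{L}_{\text{mask}}$ is a pixel-wise mask loss,
- $\mathcal{L}_{\text{perc}}$ is a masked perceptual loss,
- $\mathcal{L}_{\text{adv}}$ is a PatchGAN least-squares adversarial loss,
- $\mathcal{L}_{\text{reg}}$ is a regularization term,
- $\mathcal{L}_{\text{off}}$ penalizes excessive offsets.

The unstructured Gaussian reconstructor $\mathcal{G}$ is trained using VGG and mask losses in pose space.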
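
A few of these terms can be sketched in NumPy to show how a weighted objective of this kind is assembled. The sketch makes several assumptions that the source does not confirm: an L1 norm for the masked photometric term, the standard least-squares GAN formulation for the adversarial term, and hypothetical names and weights throughout. The perceptual term is omitted because it requires a pretrained network.

```python
import numpy as np


def masked_l1(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute error over foreground pixels; pred/target are (H, W, C), mask is (H, W)."""
    m = mask[..., None]
    denom = max(float(np.sum(m)) * pred.shape[-1], 1.0)  # avoid division by zero
    return float(np.sum(np.abs(pred - target) * m) / denom)


def lsgan_generator_loss(d_fake: np.ndarray) -> float:
    """Least-squares GAN loss for the generator: push D(fake) toward 1."""
    return float(0.5 * np.mean((d_fake - 1.0) ** 2))


def lsgan_discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Least-squares GAN loss for the discriminator: D(real) -> 1, D(fake) -> 0."""
    return float(0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2))


def total_loss(terms: dict, weights: dict) -> float:
    """Weighted sum of named loss terms; the weights are hypothetical hyperparameters."""
    return sum(weights[name] * value for name, value in terms.items())
```

With a PatchGAN discriminator, `d_real` and `d_fake` would be grids of per-patch scores rather than single scalars; the mean over the grid handles both cases.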

4. Rendering, Real-Time Constraints, and Editing

The Gaussian Parameter Decoder (GPD) produces two branches:
- The canonical branch generates the canonical map $M_c$: Gaussian parameters anchored to the avatar's neutral pose in UV space.
- The offset branch creates the offset map $M_o$, encoding pose/view corrections via rasterized normal maps, Plücker ray maps, and vertex offsets.
For a new pose/camera, linear blend skinning transforms SMPL-X joints to compute per-vertex transformations $T_v$; these are applied to the canonical parameters to yield world-space Gaussian parameters.
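
The skinning step can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: it assumes 4×4 homogeneous joint transforms and per-point skinning weights that sum to one, and it transforms covariances with the blended linear part $A$ via $A \Sigma A^\top$, which is the standard rule for pushing a Gaussian through an affine map. All names are illustrative.

```python
import numpy as np


def skin_points(points: np.ndarray, weights: np.ndarray, joint_transforms: np.ndarray):
    """Linear blend skinning.

    points: (N, 3), weights: (N, J) with rows summing to 1,
    joint_transforms: (J, 4, 4) homogeneous matrices.
    Returns posed points (N, 3) and the blended 3x3 linear parts (N, 3, 3).
    """
    # Per-point transform = weighted sum of the joint transforms.
    blended = np.einsum("nj,jab->nab", weights, joint_transforms)
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3], blended[:, :3, :3]


def skin_covariances(covs: np.ndarray, linear: np.ndarray) -> np.ndarray:
    """Transform (N, 3, 3) covariances by the per-point linear maps: A Sigma A^T."""
    return linear @ covs @ np.swapaxes(linear, -1, -2)
```

Applying the same blended transform to both the mean and the covariance keeps each Gaussian's orientation consistent with the surface it is attached to as the body moves.
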
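Rendering employs the continuous Gaussian Splatting algorithm, where the final image at pixel $p$ is

$\hat{I}(p) = \sum_{k} c_k \, \alpha_k' \prod_{j<k} \left(1 - \alpha_j'\right)$

with Gaussians sorted front to back and $\alpha_k'$ denoting the opacity $\alpha_k$ weighted by the projected 2D Gaussian footprint evaluated at $p$.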
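
Front-to-back alpha compositing for a single pixel can be sketched as below. This is a simplified NumPy illustration of the standard Gaussian Splatting blend, assuming the Gaussians have already been projected to 2D; a real renderer performs this per tile on the GPU, and the function names here are illustrative.

```python
import numpy as np


def footprint_alpha(opacity: float, mean2d: np.ndarray, cov2d: np.ndarray, pixel: np.ndarray) -> float:
    """Opacity of one projected Gaussian at a pixel: alpha * exp(-0.5 * d^T Sigma^-1 d)."""
    d = pixel - mean2d
    return float(opacity * np.exp(-0.5 * d @ np.linalg.solve(cov2d, d)))


def composite_pixel(colors: np.ndarray, alphas: np.ndarray, depths: np.ndarray):
    """Blend (K, 3) colors with (K,) effective alphas, nearest Gaussian first.

    Returns the pixel color and the accumulated opacity (coverage).
    """
    order = np.argsort(depths)  # front to back
    c, a = colors[order], alphas[order]
    # Transmittance before each Gaussian: product of (1 - alpha) of those in front.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - a)[:-1]])
    blend_weights = a * transmittance
    return blend_weights @ c, 1.0 - float(np.prod(1.0 - a))
```

For example, a half-transparent red Gaussian in front of a half-transparent green one contributes weights 0.5 and 0.25 respectively, leaving 25% of the background visible.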

For fixed avatars, the canonical map $M_c$ can be cached; only the offset map $M_o$ is recomputed per frame, enabling 512×512 frame rendering at 33 FPS on an NVIDIA RTX 5880 with full animation and Gaussian splatting.
The UV-aligned latent $Z$ supports direct, part-aware edits and identity or clothing morphing by interpolating or modifying latent patches.

5. Quantitative Results and Comparisons

DLA demonstrates state-of-the-art performance on ActorsHQ and 4D-Dress datasets, evaluated using PSNR (photometric), SSIM (structural), and LPIPS (perceptual similarity):

Dataset	Setting	Method	LPIPS ↓	PSNR ↑	SSIM ↑
ActorsHQ	Novel views	DLA	0.0580	25.58	0.9279
ActorsHQ	Novel views	IDOL	0.0696	24.48	0.9261
ActorsHQ	Novel poses	DLA	0.0471	26.41	0.9351
ActorsHQ	Novel poses	IDOL	0.0940	22.80	0.9194
4D-Dress	Novel views	DLA	0.0594	24.95	0.9294
4D-Dress	Novel views	IDOL	0.0904	23.04	0.9226

According to the reported metrics, DLA outperforms prior single-image methods, including DreamGaussian, SiTH, SIFU, and the concurrent animatable Gaussian approach IDOL (Bühler et al., 21 Jul 2025).
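
For reference, PSNR follows directly from the mean squared error between a rendered image and its ground truth. The sketch below uses the standard definition; the function name and the assumption that images are scaled to `[0, 1]` are illustrative, and the source does not specify details such as whether the evaluation is masked.

```python
import numpy as np


def psnr(pred, target, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Because the scale is logarithmic, a gap of about 1.1 dB for a single image pair corresponds to roughly a 1.29× difference in mean squared error; averaged dataset scores do not translate this exactly.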

6. Limitations and Potential Advancements

DLA currently exhibits several limitations:

- Color leakage occurs when body parts are in close proximity (e.g., hand near torso).
- The pipeline cannot fully correct major inconsistencies inherited from hallucinated multi-views in the "Dream" stage.
- The method is susceptible to slight identity drift in facial details due to the limited high-resolution facial data in training sets.

Anticipated future directions include:

- Expanding training on large-scale, in-the-wild monocular datasets to enhance robustness.
- Integrating higher-resolution facial priors to improve identity preservation.
- Refining the generative diffusion-based "Dream" stage to mitigate view inconsistency before 3D lifting (Bühler et al., 21 Jul 2025).

References (1)

Bühler et al. Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars (2025)
