PALM-Net: Physically Based Hand Avatar Reconstruction

Updated 14 November 2025
  • The paper's primary contribution is an integrated framework that combines inverse rendering, continuous implicit 3D surface parameterization, and Disney-style BRDFs to reconstruct personalized hand avatars.
  • It leverages a diverse dataset of 13,000 hand scans and 90,000 multi-view images from 263 subjects to capture rich intra- and inter-subject variability in geometry, skin tone, and appearance.
  • The method integrates differentiable volume rendering with specialized neural networks to deliver robust novel view synthesis and improved performance on metrics such as PSNR, SSIM, and LPIPS.

PALM-Net is a multi-subject, physically based hand prior, built on inverse rendering, that enables the construction of personalized, relightable 3D hand avatars from single RGB images. Developed as the baseline method in the PALM dataset initiative, PALM-Net is trained on 13,000 high-quality hand scans and 90,000 multi-view images from 263 subjects, capturing rich intra- and inter-subject variability in geometry, skin tone, and appearance. Its primary contribution is an integrated framework that leverages continuous implicit 3D surface parameterization, spatially varying reflectance, and a global scene illumination model, optimized end-to-end via differentiable volume rendering and physically based Disney-style BRDFs.

1. Problem Formulation and Objectives

PALM-Net addresses the task of image-based hand avatar personalization under the following problem setting: the input is a single RGB image $I$ of a right hand, captured in an arbitrary pose and under complex, largely unknown illumination. The method assumes access to a predicted 3D hand pose $\theta$ and shape $\beta$ for the frame, typically provided by off-the-shelf 3D keypoint/pose estimators, such as those used in InterHand2.6M.

The outputs of PALM-Net are:

  • A relightable, personalized 3D hand avatar, comprising:
    • Geometry: a continuous implicit surface representation (SDF-to-opacity field) supporting articulation to arbitrary hand poses $\theta'$,
    • Material properties: spatially varying albedo $\alpha(x)$, roughness $r(x)$, and metallicity $m(x)$,
    • Scene illumination: an explicit environment map $L_{\text{env}}(d)$, parameterized by Spherical Gaussians for novel-view relighting.
  • Novel renderings of the avatar under new poses and/or environment maps.

This enables application to single-image hand avatar personalization, including relighting and free-form gesture animation.
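
To make the output structure concrete, the following is a minimal sketch of how such an avatar could be stored as a container object; the class and field names (`HandAvatar`, `env_sg_params`, etc.) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HandAvatar:
    """Illustrative container for the quantities PALM-Net recovers (names are assumptions)."""
    geometry_code: np.ndarray    # phi: subject-specific latent geometry code (~32-D)
    appearance_code: np.ndarray  # psi: subject-specific latent appearance code (~32-D)
    mano_shape: np.ndarray       # beta in R^10, MANO shape PCA coefficients
    env_sg_params: np.ndarray    # (G=8, 7): per-lobe axis (3), sharpness (1), RGB amplitude (3)

    def describe(self) -> str:
        return (f"avatar with {self.env_sg_params.shape[0]} SG lobes and a "
                f"{self.geometry_code.shape[0]}-D geometry code")
```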

2. Hand Geometry and Material Parameterization

2.1 Canonicalization via MANO and SNARF

Hand geometry is represented using the MANO parametric model, $\Theta = \{\theta, \beta, p\}$, where $\theta \in \mathbb{R}^{45}$ denotes joint-angle pose, $\beta \in \mathbb{R}^{10}$ are the shape PCA coefficients, and $p \in \mathbb{R}^6$ encodes the global rigid transformation. SNARF [Chen et al. ICCV'21] is utilized to invert linear blend skinning, mapping each deformed point $x_d$ to a common canonical space point $x_c$ by solving

$$x_c = \arg\min_{x} \Big\lVert \sum_{i=1}^{n_b} w_i(x)\, B_i\, x - x_d \Big\rVert_2^2,$$

where $w_i(x)$ are skinning weights and $B_i$ are bone transforms.
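
As a rough illustration of this canonicalization step, the sketch below solves the inverse-skinning objective for a single deformed point by gradient descent over the canonical candidate; the skinning-weight function and bone transforms are placeholder inputs, and SNARF itself uses a more efficient iterative root-finding scheme rather than this naive optimization.

```python
import torch

def canonicalize_point(x_d, bone_transforms, skinning_weights_fn, n_steps=100, lr=1e-2):
    """Map a deformed point x_d (3,) to canonical space by minimizing
    || sum_i w_i(x) B_i x - x_d ||^2 over candidate canonical points x.
    bone_transforms: (n_b, 4, 4) homogeneous bone transforms B_i.
    skinning_weights_fn: x (3,) -> (n_b,) canonical skinning weights (placeholder)."""
    x = x_d.clone().requires_grad_(True)        # initialize at the deformed point
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        x_h = torch.cat([x, x.new_ones(1)])     # homogeneous coordinates (4,)
        w = skinning_weights_fn(x)              # (n_b,)
        blended = torch.einsum("b,bij,j->i", w, bone_transforms, x_h)[:3]
        loss = ((blended - x_d) ** 2).sum()
        loss.backward()
        opt.step()
    return x.detach()
```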

2.2 Implicit Geometry Network

A continuous, canonical surface representation is learned via an SDF-to-opacity MLP $f_g$:

$$f_g: (x_c, \theta, \beta, \phi) \mapsto (\sigma_t(x_c), z(x_c))$$

Here, $x_c$ is the SNARF-canonicalized point, $\phi \in \mathbb{R}^{d_s}$ is a learned subject-specific latent geometry code ($d_s \approx 32$), $\sigma_t(x_c)$ is the local volume density (opacity, via a Laplace-CDF mapping), and $z(x_c)$ is a learned geometry feature vector forwarded to the appearance sub-networks.
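
A minimal sketch of such a geometry network follows, omitting the hash-grid encoding for brevity and using a VolSDF-style Laplace-CDF conversion from signed distance to density; layer sizes, activations, and the feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GeometryNet(nn.Module):
    """Maps a canonical point plus conditioning to (density, SDF, geometry feature).
    The Laplace CDF turns the signed distance into a volume density sigma_t."""
    def __init__(self, cond_dim=32 + 45 + 10, feat_dim=32, hidden=256, beta=0.01):
        super().__init__()
        self.beta = beta                          # Laplace scale (learnable in practice)
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1 + feat_dim),      # SDF value + geometry feature z
        )

    def forward(self, x_c, cond):
        out = self.mlp(torch.cat([x_c, cond], dim=-1))
        sdf, z = out[..., :1], out[..., 1:]
        # Laplace CDF of the negative SDF: high density inside, near zero far outside.
        alpha = 1.0 / self.beta
        sigma_t = alpha * torch.where(
            sdf > 0, 0.5 * torch.exp(-sdf / self.beta),
            1.0 - 0.5 * torch.exp(sdf / self.beta))
        return sigma_t, sdf, z
```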

2.3 Neural Radiance and Material Fields

The radiance field network $f_{rf}$ models view-dependent surface emission given geometry, normal, and an appearance code $\psi$:

$$f_{rf}: (x_c, z, \mathrm{ref}(d, n), n, \theta, \psi) \rightarrow L(x_c, d)$$

  • Surface normal $n = \nabla_x \mathrm{SDF}(x_c)$,
  • Reflected direction $\mathrm{ref}(d, n) = d - 2(d \cdot n)\,n$,
  • Appearance code $\psi \in \mathbb{R}^{d_s}$.

The material property network $f_m$ outputs

$$f_m: (x_c, z, \theta, \psi) \rightarrow (\alpha(x_c), r(x_c), m(x_c)),$$

where $\alpha$ is the spatially varying, per-channel albedo, $r$ the scalar roughness, and $m$ the metallicity.

For both geometry and appearance, each subject $i$ receives distinct latent codes $\phi_i$ (shape) and $\psi_i$ (appearance), learned end-to-end.
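
The sketch below illustrates, under assumed layer sizes and output activations, how the reflected view direction can be computed and how a material head might map the geometry feature and latent codes to bounded albedo/roughness/metallic values; this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def reflect(d, n):
    """Reflect view direction d about surface normal n: ref(d, n) = d - 2 (d.n) n."""
    return d - 2.0 * (d * n).sum(dim=-1, keepdim=True) * n

class MaterialNet(nn.Module):
    """f_m: (x_c, z, theta, psi) -> (albedo, roughness, metallic), each bounded to [0, 1]."""
    def __init__(self, feat_dim=32, pose_dim=45, code_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim + pose_dim + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 1 + 1),   # RGB albedo, roughness, metallic
        )

    def forward(self, x_c, z, theta, psi):
        out = torch.sigmoid(self.mlp(torch.cat([x_c, z, theta, psi], dim=-1)))
        albedo, roughness, metallic = out[..., :3], out[..., 3:4], out[..., 4:5]
        return albedo, roughness, metallic
```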

3. Network Architecture

Each network ($f_g$, $f_{rf}$, $f_m$) is implemented as an MLP, leveraging hash-grid positional encodings [Müller et al. TOG'22] of $x_c$. Typical configurations use 4–6 layers with 128–256 channels each. Conditioning is achieved by concatenating the hash-encoded $x_c$, subject latent codes, MANO pose/shape embeddings, and, where appropriate, geometry features and view/normal directions. The overall connectivity is as follows:

| Network | Inputs | Outputs |
|---|---|---|
| $f_g$ | $x_c$ (hash-encoded), $\phi$, MANO $\theta, \beta$ | $\sigma_t(x_c)$, $z(x_c)$ |
| $f_{rf}$ | $x_c$, $z$, $n$, $\mathrm{ref}(d,n)$, $\theta$, $\psi$ | $L(x_c, d)$ (RGB radiance) |
| $f_m$ | $x_c$, $z$, $\theta$, $\psi$ | $\alpha$, $r$, $m$ |

This architecture enables the model to capture both global inter-subject variation and personalized detail through subject-specific codes while maintaining parameter efficiency via shared weights across all subjects.
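
One common way to realize this "shared weights plus per-subject codes" design is an embedding table indexed by subject ID whose rows are concatenated to every per-point query; the sketch below shows that pattern under assumed dimensions and is not taken from the paper's code.

```python
import torch
import torch.nn as nn

class SubjectConditioning(nn.Module):
    """Per-subject geometry/appearance codes shared across all sub-networks."""
    def __init__(self, num_subjects=263, code_dim=32):
        super().__init__()
        # Initialized at zero and optimized jointly with the shared network weights.
        self.phi = nn.Embedding(num_subjects, code_dim)  # geometry codes
        self.psi = nn.Embedding(num_subjects, code_dim)  # appearance codes
        nn.init.zeros_(self.phi.weight)
        nn.init.zeros_(self.psi.weight)

    def forward(self, subject_ids, point_features):
        """Concatenate the subject codes onto per-point features of shape (B, N, F)."""
        n_points = point_features.shape[1]
        phi = self.phi(subject_ids)[:, None, :].expand(-1, n_points, -1)
        psi = self.psi(subject_ids)[:, None, :].expand(-1, n_points, -1)
        return torch.cat([point_features, phi, psi], dim=-1)
```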

4. Physically Based Inverse Rendering

4.1 Differentiable Volume Rendering

PALM-Net utilizes NeRF-style differentiable volume rendering:

$$C_{rf}(r) = \int_{t_n}^{t_f} T(t_n, t)\,\sigma_t(r(t))\,L(r(t), d)\,dt,$$

where the transmittance $T(t_n, t) = \exp\!\left(-\int_{t_n}^{t} \sigma_t(r(s))\,ds\right)$ encodes accumulated opacity. In practical implementation, this integral is approximated by standard quadrature.
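
A minimal quadrature sketch for this integral, following the standard NeRF discretization (alpha compositing of per-sample densities and radiances); variable names are generic rather than from PALM-Net.

```python
import torch

def volume_render(sigma, radiance, t_vals):
    """Quadrature approximation of C(r) = ∫ T(t) sigma(t) L(t) dt.
    sigma:    (N_rays, N_samples)      per-sample density
    radiance: (N_rays, N_samples, 3)   per-sample RGB radiance
    t_vals:   (N_rays, N_samples)      sample depths along each ray."""
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                        # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)              # accumulated transmittance
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = alpha * trans                                         # w_i = T_i * alpha_i
    color = (weights[..., None] * radiance).sum(dim=-2)             # (N_rays, 3)
    return color, weights
```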

4.2 Physically Based Rendering with the Disney BRDF

The radiance $L$ in the above is replaced, during training, by a physically based integral using the Disney BRDF:

$$C_{pbr}(r) \approx \sum_{i=1}^{M} w^{(i)}\,\mathrm{BRDF}\!\left(d, \bar d^{(i)}, \alpha, r, m, n\right)\,\frac{L_i\!\left(r(\bar t^{(i)}), \bar d^{(i)}\right)}{\mathrm{pdf}(\bar d^{(i)})},$$

where the weights $w^{(i)}$ are computed from opacity, $\mathrm{BRDF}$ applies the Disney model, $L_i$ incorporates outgoing and environment radiance, and $\bar d^{(i)}$ are sampled incoming directions. The environment map is modeled as a sum of $G = 8$ learnable Spherical Gaussians. All integrals are differentiable, allowing backpropagation through the illumination, geometry, and material networks.
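
To illustrate the illumination model, the sketch below evaluates an environment map represented by $G$ learnable Spherical Gaussian lobes, $L_{\text{env}}(d) = \sum_k a_k \exp\!\left(\lambda_k (d \cdot \mu_k - 1)\right)$; the parameterization details are assumptions consistent with the usual SG formulation rather than the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGEnvironment(nn.Module):
    """Environment illumination as a sum of G Spherical Gaussian lobes."""
    def __init__(self, num_lobes=8):
        super().__init__()
        self.axes = nn.Parameter(torch.randn(num_lobes, 3))        # lobe directions mu_k (unnormalized)
        self.log_sharpness = nn.Parameter(torch.zeros(num_lobes))  # lambda_k = exp(.) > 0
        self.amplitudes = nn.Parameter(torch.ones(num_lobes, 3))   # RGB amplitudes a_k

    def forward(self, dirs):
        """dirs: (..., 3) unit query directions -> (..., 3) incoming radiance."""
        mu = F.normalize(self.axes, dim=-1)                        # (G, 3)
        lam = torch.exp(self.log_sharpness)                        # (G,)
        cos = torch.einsum("...i,gi->...g", dirs, mu)              # (..., G)
        lobes = torch.exp(lam * (cos - 1.0))                       # (..., G)
        return torch.einsum("...g,gc->...c", lobes, self.amplitudes.abs())
```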

5. Loss Functions and Training Objectives

Supervision is provided by registered 3dMD scan renderings, including ground-truth color $\hat C(r)$, normals $\hat{\mathcal N}(r)$, and segmentation $\hat{\mathcal S}(r)$. The loss is a weighted sum:

$$\mathcal{L} = \mathcal{L}_{rf} + \lambda_{pbr}\mathcal{L}_{pbr} + \lambda_{seg}\mathcal{L}_{seg} + \lambda_{normal}\mathcal{L}_{normal} + \lambda_{eik}\mathcal{L}_{eik} + \lambda_{LAP}\mathcal{L}_{LAP} + \lambda_{latent}\mathcal{L}_{latent} + \lambda_{LPIPS}\mathcal{L}_{LPIPS}$$

  • $\mathcal{L}_{rf}$, $\mathcal{L}_{pbr}$: Photometric $\ell_1$ losses on rendered color.
  • $\mathcal{L}_{normal}$: $\ell_2$ supervision on predicted vs. scan normals.
  • $\mathcal{L}_{seg}$: Foreground binary cross-entropy loss.
  • $\mathcal{L}_{eik}$: Eikonal regularizer $(\|\nabla_x \mathrm{SDF}(x)\|_2 - 1)^2$, enforcing SDF validity.
  • $\mathcal{L}_{LAP}$: Laplacian smoothness penalty to encourage local surface consistency.
  • $\mathcal{L}_{latent}$: Regularization of subject latent code norms.
  • $\mathcal{L}_{LPIPS}$: Patch-based perceptual loss for high-frequency appearance details.

The $\lambda$ coefficients are determined empirically; $\lambda_{seg}$ is decayed progressively during training.
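
As a concrete illustration of the regularizers, the following sketch computes the Eikonal term and assembles a weighted total loss; the default weight values are placeholders, not the paper's settings.

```python
import torch

def eikonal_loss(sdf_fn, points):
    """Penalize deviation of ||∇ SDF|| from 1 at sampled points of shape (N, 3)."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_fn(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

def total_loss(terms, weights=None):
    """Weighted sum of named loss terms, e.g. {'rf': ..., 'pbr': ..., 'eik': ...}.
    Unspecified weights default to 1.0 (placeholder; tuned empirically in practice)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())
```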

6. Training Protocol and Dataset

Training uses the PALM dataset, which provides 13,000 hand scans from 263 subjects, with 8–12 synchronized high-resolution (2448×2048) RGB images per instance, supporting multi-view supervision. A single multi-subject model is trained across all subjects. Latent codes (ϕi\phi_i, ψi\psi_i) are initialized at zero and learned jointly with model weights. The global scene environment is parameterized by G=8G=8 Spherical Gaussians. Optimization uses Adam (learning rate 51045 \cdot 10^{-4}) with exponential decay every 50k iterations, across approximately 200k total iterations. Each mini-batch samples roughly 1024 rays from randomly chosen images/cameras, ensuring subject and pose diversity. Data augmentations include random gamma, color jitter, and patch cropping for perceptual loss.
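
A schematic of the optimizer setup described above (Adam at $5 \cdot 10^{-4}$ with step-wise exponential decay every 50k iterations); the decay factor and the interfaces of `model` and `latent_codes` are placeholder assumptions.

```python
import torch

def build_optimizer(model, latent_codes, lr=5e-4, decay_every=50_000, gamma=0.5):
    """Jointly optimize shared network weights and per-subject latent codes.
    gamma (the decay factor) is a placeholder; the paper only states exponential decay."""
    params = list(model.parameters()) + list(latent_codes.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=decay_every, gamma=gamma)
    return optimizer, scheduler

# Typical per-iteration usage (~200k iterations, ~1024 rays per mini-batch):
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```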

7. Evaluation, Results, and Ablation Analysis

Single-image personalization is evaluated by fitting subject codes and scene environment parameters to a target image (fixing network weights). Performance is reported in peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and LPIPS:

| Setting | Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| InterHand2.6M | Handy | 7.50 | 0.69 | 0.24 |
| InterHand2.6M | HARP | 9.89 | 0.78 | 0.16 |
| InterHand2.6M | UHM | 10.08 | 0.76 | 0.19 |
| InterHand2.6M | Ours (PALM-Net) | 12.01 | 0.84 | 0.15 |
| HARP relit | Handy | 12.48 | 0.76 | 0.32 |
| HARP relit | HARP | 11.93 | 0.69 | 0.37 |
| HARP relit | UHM | 12.30 | 0.74 | 0.31 |
| HARP relit | Ours (PALM-Net) | 13.39 | 0.78 | 0.35 |

Qualitative results indicate faithful geometry, albedo estimation, and accurate relighting under both real-world and synthetic variations.
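
The personalization protocol (shared network weights frozen, subject codes and environment parameters fitted to the target image) could be sketched as follows; the rendering call and the simple photometric loss are placeholders standing in for the full pipeline.

```python
import torch

def personalize(model, env, target_image, render_fn, n_iters=500, lr=1e-2):
    """Fit per-subject latent codes and the SG environment to a single RGB image.
    Model weights stay frozen; only (phi, psi) and env parameters are optimized.
    render_fn(model, env, phi, psi) -> predicted image (placeholder interface)."""
    for p in model.parameters():
        p.requires_grad_(False)                   # freeze the shared multi-subject prior
    phi = torch.zeros(32, requires_grad=True)     # geometry code
    psi = torch.zeros(32, requires_grad=True)     # appearance code
    opt = torch.optim.Adam([phi, psi, *env.parameters()], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        pred = render_fn(model, env, phi, psi)
        loss = (pred - target_image).abs().mean() # photometric L1 (perceptual terms omitted)
        loss.backward()
        opt.step()
    return phi.detach(), psi.detach(), env
```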

Ablation studies assess the impact of normal supervision and adaptive environment fitting:

  • Training with normal supervision (3dMD ground-truth normals) reduces "pepper" and "floater" artifacts and improves fidelity (e.g., PSNR improves from 11.97 without normals to 12.01 with normals).
  • Optimizing environment Spherical Gaussian parameters during personalization (as opposed to holding them fixed) leads to better fit and reconstruction accuracy (Table 6).

8. Significance and Applications

PALM-Net establishes a strong multi-subject hand prior that synthesizes geometry, reflectance, and illumination in a unified, physically based neural rendering framework. This supports a range of applications such as creation of high-quality, relightable hand avatars from single images, forensic hand modeling, gesture analysis, and personalization of hand models for virtual/augmented reality. The method demonstrates robust generalization across real and synthetic conditions, validated through both numerical and qualitative results. The PALM dataset, together with PALM-Net, addresses prior limitations in subject diversity and physical accuracy in hand avatar reconstruction, providing a scalable resource for further research in photorealistic hand modeling and human digital twin approaches.
