PALM-Net: Physically Based Hand Avatar Reconstruction
- The paper's primary contribution is an integrated framework that combines inverse rendering, continuous implicit 3D surface parameterization, and Disney-style BRDFs to achieve personalized hand avatars.
- It leverages a diverse dataset of 13,000 hand scans and 90,000 multi-view images from 263 subjects to capture rich intra- and inter-subject variability in geometry, skin tone, and appearance.
- The method integrates differentiable volume rendering with specialized neural networks to deliver robust novel view synthesis and improved performance on metrics such as PSNR, SSIM, and LPIPS.
PALM-Net is a multi-subject, physically based hand prior, learned via inverse rendering, that enables the construction of personalized, relightable 3D hand avatars from single RGB images. Developed as the baseline method in the PALM dataset initiative, PALM-Net is trained on 13,000 high-quality hand scans and 90,000 multi-view images from 263 subjects, capturing rich intra- and inter-subject variability in geometry, skin tone, and appearance. Its primary contribution is an integrated framework that combines continuous implicit 3D surface parameterization, spatially varying reflectance, and a global scene illumination model, optimized end-to-end via differentiable volume rendering and physically based Disney-style BRDFs.
1. Problem Formulation and Objectives
PALM-Net addresses the task of image-based hand avatar personalization under the following problem setting: the input is a single RGB image of a right hand, captured in an arbitrary pose and under complex, largely unknown illumination. The method assumes access to a predicted 3D hand pose and shape for the frame, typically provided by off-the-shelf 3D keypoint/pose estimators, such as those used in InterHand2.6M.
The outputs of PALM-Net are:
- A relightable, personalized 3D hand avatar, comprising:
- Geometry: a continuous implicit surface representation (SDF-to-opacity field) supporting articulation to arbitrary hand poses,
- Material properties: spatially varying albedo, roughness, and metallicity,
- Scene illumination: an explicit environment map, parameterized by Spherical Gaussians for novel-view relighting.
- Novel renderings of the avatar under new poses and/or environment maps.
This enables application to single-image hand avatar personalization, including relighting and free-form gesture animation.
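As a rough illustration of this output interface, the predicted quantities can be grouped in a small container; the sketch below is purely illustrative (field names, shapes, and the relighting helper are assumptions, not the released API).

```python
# Illustrative container for PALM-Net's per-subject outputs (names/shapes assumed).
from dataclasses import dataclass
import numpy as np

@dataclass
class HandAvatar:
    z_geo: np.ndarray    # subject latent code conditioning the implicit SDF geometry
    z_app: np.ndarray    # subject latent code conditioning albedo/roughness/metallicity
    sg_env: np.ndarray   # (K, 7) Spherical Gaussians: axis (3) + sharpness (1) + RGB amplitude (3)

    def relit(self, new_sg_env: np.ndarray) -> "HandAvatar":
        """Relighting amounts to swapping the environment map while keeping the subject codes."""
        return HandAvatar(self.z_geo, self.z_app, new_sg_env)
```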
2. Hand Geometry and Material Parameterization
2.1 Canonicalization via MANO and SNARF
Hand geometry is represented using the MANO parametric model $M(\theta, \beta, \tau)$, where $\theta$ denotes the joint-angle pose, $\beta$ the shape PCA coefficients, and $\tau$ the global rigid transformation. SNARF [Chen et al. ICCV'21] is used to invert linear blend skinning, mapping each deformed point $x_d$ to a common canonical space by solving
$$x_d = \sum_i w_i(x_c)\, B_i\, x_c$$
for the canonical point $x_c$, where $w_i$ are skinning weights and $B_i$ bone transforms.
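The canonicalization step can be sketched as a simple iterative inversion of linear blend skinning. SNARF itself uses Broyden's method with multiple initializations, so the following is only a didactic approximation with assumed helper names.

```python
# Didactic sketch of SNARF-style canonicalization (not the official SNARF code):
# given a deformed-space point x_d and per-bone 4x4 transforms B_i, find the canonical
# point x_c such that forward linear blend skinning maps it back to x_d.
# `skin_weights` stands in for a learned weight field or nearest-MANO-vertex lookup.
import numpy as np

def lbs_forward(x_c, weights, bone_transforms):
    """Forward LBS: x_d = sum_i w_i(x_c) * (B_i @ [x_c, 1])."""
    x_h = np.append(x_c, 1.0)                                  # homogeneous coordinates
    blended = sum(w * (B @ x_h) for w, B in zip(weights, bone_transforms))
    return blended[:3]

def canonicalize(x_d, bone_transforms, skin_weights, n_iters=50, step=0.5):
    """Approximate the canonical correspondence by damped fixed-point iteration."""
    x_c = np.array(x_d, dtype=float)                           # initialize in deformed space
    for _ in range(n_iters):
        w = skin_weights(x_c)
        residual = lbs_forward(x_c, w, bone_transforms) - x_d
        x_c = x_c - step * residual                            # push the residual toward zero
    return x_c
```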
2.2 Implicit Geometry Network
A continuous, canonical surface representation is learned via an SDF-to-opacity MLP $f_{\mathrm{geo}}$:
$$f_{\mathrm{geo}}(x_c, z_{\mathrm{geo}}) = (s, \mathbf{f}), \qquad \sigma = \alpha\, \Psi_\beta(-s).$$
Here, $x_c$ is the SNARF-canonicalized point, $z_{\mathrm{geo}}$ is a learned subject-specific latent geometry code, $s$ is the signed distance mapped to the local volume density (opacity) $\sigma$ via the Laplace-CDF mapping $\Psi_\beta$, and $\mathbf{f}$ is a learned geometry feature vector forwarded to the appearance sub-networks.
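The SDF-to-opacity conversion mentioned above is commonly implemented with a Laplace-CDF mapping (as in VolSDF); a minimal sketch follows, where the amplitude and scale hyperparameters are assumptions rather than the paper's values.

```python
# Minimal sketch of a Laplace-CDF SDF-to-density mapping (VolSDF-style); the exact
# constants used by PALM-Net are not restated here, so alpha and beta are illustrative.
import torch

def sdf_to_density(sdf: torch.Tensor, alpha: float = 100.0, beta: float = 0.01) -> torch.Tensor:
    """sigma(x) = alpha * Psi_beta(-sdf(x)), where Psi_beta is the CDF of a
    zero-mean Laplace distribution with scale beta."""
    s = -sdf
    cdf = torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),
        1.0 - 0.5 * torch.exp(-s / beta),
    )
    return alpha * cdf
```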
2.3 Neural Radiance and Material Fields
The radiance field network $f_{\mathrm{rad}}$ models view-dependent surface emission given the geometry feature, surface normal, reflected direction, and an appearance code:
- Surface normal $\mathbf{n} = \nabla_x s / \lVert \nabla_x s \rVert$, computed from the SDF gradient,
- Reflected direction $\omega_r$, obtained by mirroring the viewing direction about $\mathbf{n}$,
- Appearance code $z_{\mathrm{app}}$, a learned subject-specific latent (see the sketch after this list).
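A minimal sketch of how the normal and reflected direction can be derived from the implicit geometry follows; `sdf_net` is a placeholder for the geometry MLP, not a released function.

```python
# Sketch: normal from the normalized SDF gradient (via autograd) and the viewing
# direction mirrored about that normal. Assumes x_c: (N, 3), view_dir: (N, 3) unit vectors.
import torch
import torch.nn.functional as F

def normal_and_reflection(sdf_net, x_c, view_dir):
    x_c = x_c.detach().requires_grad_(True)
    sdf = sdf_net(x_c).sum()
    grad = torch.autograd.grad(sdf, x_c, create_graph=True)[0]
    normal = F.normalize(grad, dim=-1)
    # Reflect the incident viewing direction about the surface normal.
    reflected = view_dir - 2.0 * (view_dir * normal).sum(-1, keepdim=True) * normal
    return normal, reflected
```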
The material property network $f_{\mathrm{mat}}$ predicts
$$f_{\mathrm{mat}}(x_c, \mathbf{f}, z_{\mathrm{app}}) = (\mathbf{a}, r, m),$$
where $\mathbf{a}$ is the spatially varying, per-channel albedo, $r$ the roughness scalar, and $m$ the metallicity.
For both geometry and appearance, each subject receives distinct latent codes $z_{\mathrm{geo}}$ (shape) and $z_{\mathrm{app}}$ (appearance), learned end-to-end.
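As one concrete (and assumed) realization of the material branch, a small PyTorch head conditioned on the geometry feature and the appearance code could look like the following; layer widths follow the 128–256-channel MLPs mentioned in Section 3 but are otherwise guesses.

```python
# Illustrative material head (not the released architecture): maps geometry features
# and the subject appearance code to albedo, roughness, and metallicity.
import torch
import torch.nn as nn

class MaterialHead(nn.Module):
    def __init__(self, feat_dim=256, app_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + app_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 5),                      # 3 albedo channels + roughness + metallicity
        )

    def forward(self, geom_feat, z_app):
        out = self.mlp(torch.cat([geom_feat, z_app], dim=-1))
        albedo = torch.sigmoid(out[..., :3])           # per-channel albedo in [0, 1]
        roughness = torch.sigmoid(out[..., 3:4])       # scalar roughness in [0, 1]
        metallic = torch.sigmoid(out[..., 4:5])        # scalar metallicity in [0, 1]
        return albedo, roughness, metallic
```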
3. Network Architecture
Each network ($f_{\mathrm{geo}}$, $f_{\mathrm{rad}}$, $f_{\mathrm{mat}}$) is implemented as an MLP, leveraging hash-grid positional encodings [Müller et al. TOG'22] of the canonical point $x_c$. Typical configurations use 4–6 layers of 128–256 channels per layer. Conditioning is achieved by concatenating the hash-encoded $x_c$, subject latent codes, MANO pose/shape embeddings, and, where appropriate, geometry features and view/normal directions. The overall connectivity is as follows:
| Network | Inputs | Outputs |
|---|---|---|
| $f_{\mathrm{geo}}$ | $x_c$ (hash-encoded), $z_{\mathrm{geo}}$, MANO $(\theta, \beta)$ | $\sigma$, $\mathbf{f}$ |
| $f_{\mathrm{rad}}$ | $x_c$, $\mathbf{f}$, $\mathbf{n}$, $\omega_r$, $z_{\mathrm{app}}$ | RGB radiance |
| $f_{\mathrm{mat}}$ | $x_c$, $\mathbf{f}$, $z_{\mathrm{app}}$ | $\mathbf{a}$, $r$, $m$ |
This architecture enables the model to capture both global inter-subject variation and personalized detail through subject-specific codes while maintaining parameter efficiency via shared weights across all subjects.
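For reference, the hash-grid positional encoding cited above can be sketched in a few dozen lines. This is a simplified didactic version (summed rather than XOR-ed hashing, arbitrary table sizes and level counts), not the optimized implementation of Müller et al.

```python
# Simplified multi-resolution hash-grid encoding (after Müller et al., TOG 2022).
# Each level hashes voxel corners into a learnable table and trilinearly interpolates.
import torch
import torch.nn as nn

PRIMES = torch.tensor([1, 2654435761, 805459861], dtype=torch.long)

class HashGridEncoding(nn.Module):
    def __init__(self, n_levels=8, table_size=2**14, feat_dim=2, base_res=16, growth=1.5):
        super().__init__()
        self.resolutions = [int(base_res * growth**i) for i in range(n_levels)]
        self.tables = nn.ParameterList(
            [nn.Parameter(1e-4 * torch.randn(table_size, feat_dim)) for _ in range(n_levels)]
        )
        self.table_size = table_size

    def _hash(self, coords):  # coords: (N, 3) integer voxel corners
        # The reference method XORs the prime-multiplied coordinates; summing keeps this sketch simple.
        h = (coords * PRIMES.to(coords.device)).sum(-1)
        return torch.remainder(h, self.table_size)

    def forward(self, x):     # x: (N, 3) points normalized to [0, 1]^3
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            pos = x * res
            lo = pos.floor().long()
            frac = pos - lo.float()
            acc = 0.0
            for corner in range(8):                    # trilinear interpolation over 8 corners
                offset = torch.tensor([(corner >> d) & 1 for d in range(3)], device=x.device)
                w = torch.prod(torch.where(offset.bool(), frac, 1.0 - frac), dim=-1, keepdim=True)
                acc = acc + w * table[self._hash(lo + offset)]
            feats.append(acc)
        return torch.cat(feats, dim=-1)                # (N, n_levels * feat_dim)
```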
4. Physically Based Inverse Rendering
4.1 Differentiable Volume Rendering
PALM-Net utilizes NeRF-style differentiable volume rendering:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, \mathrm{d}t, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, \mathrm{d}s\right),$$
where the transmittance $T(t)$ encodes the accumulated opacity along the ray $\mathbf{r}$. In practice, this integral is approximated by standard quadrature.
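A minimal quadrature sketch of this integral (standard alpha compositing along one ray; not PALM-Net's exact implementation):

```python
# Discrete approximation of the volume-rendering integral along one ray.
import torch

def composite(densities, radiances, deltas):
    """densities: (N,) sigma_i, radiances: (N, 3) c_i, deltas: (N,) segment lengths.
    Returns the composited pixel color and the per-sample weights w_i."""
    alphas = 1.0 - torch.exp(-densities * deltas)                                # per-segment opacity
    ones = torch.ones(1, device=densities.device)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10])[:-1], dim=0)  # transmittance T_i
    weights = trans * alphas
    color = (weights.unsqueeze(-1) * radiances).sum(dim=0)
    return color, weights
```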
4.2 Physically-Based Rendering with Disney-BRDF
During training, the radiance in the integral above is replaced by a physically based shading term using the Disney BRDF:
$$C(\mathbf{r}) = \sum_i w_i\, L_o(x_i, \omega_o), \qquad L_o(x_i, \omega_o) = \int_{\Omega} f_{\mathrm{Disney}}(x_i, \omega_o, \omega_j)\, L_{\mathrm{env}}(\omega_j)\, (\mathbf{n}_i \cdot \omega_j)\, \mathrm{d}\omega_j,$$
where the weights $w_i$ are computed from opacity, $f_{\mathrm{Disney}}$ applies the Disney model, $L_o$ and $L_{\mathrm{env}}$ are the outgoing and environment radiance, and $\omega_j$ are the sampled incoming directions. The environment map $L_{\mathrm{env}}$ is modeled as a sum of learnable Spherical Gaussians. All integrals are differentiable, allowing backpropagation through the illumination, geometry, and material networks.
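The Spherical-Gaussian environment model admits a compact closed form. The sketch below evaluates incoming radiance for a batch of query directions (lobe count and parameter layout are assumptions); in the full model this radiance is multiplied by the Disney BRDF and the cosine foreshortening term and summed over sampled directions, as in the expression above.

```python
# Sketch of a Spherical-Gaussian environment map:
# L_env(w) = sum_k mu_k * exp(lambda_k * (dot(w, xi_k) - 1)).
import torch

def eval_sg_envmap(dirs, xi, sharpness, amplitude):
    """dirs: (N, 3) unit query directions; xi: (K, 3) unit lobe axes;
    sharpness: (K,) lambda_k >= 0; amplitude: (K, 3) RGB mu_k. Returns (N, 3) radiance."""
    cos = dirs @ xi.t()                            # (N, K) cosines between queries and lobe axes
    lobes = torch.exp(sharpness * (cos - 1.0))     # (N, K) lobe responses
    return lobes @ amplitude                       # (N, 3) summed RGB radiance
```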
5. Loss Functions and Training Objectives
Supervision is provided by registered 3dMD scan renderings, including ground-truth color, normals, and segmentation masks. The total loss is a weighted sum of the following terms:
- $\mathcal{L}_{\mathrm{rgb}}$: photometric loss on rendered color.
- $\mathcal{L}_{\mathrm{normal}}$: supervision of predicted vs. scan normals.
- $\mathcal{L}_{\mathrm{mask}}$: foreground binary cross-entropy loss.
- $\mathcal{L}_{\mathrm{eik}}$: Eikonal regularizer $(\lVert \nabla_x s \rVert - 1)^2$, enforcing SDF validity.
- $\mathcal{L}_{\mathrm{lap}}$: Laplacian smoothness penalty encouraging local surface consistency.
- $\mathcal{L}_{\mathrm{reg}}$: regularization of subject latent code norms.
- $\mathcal{L}_{\mathrm{perc}}$: patch-based perceptual loss for high-frequency appearance details.
The loss coefficients are determined empirically, with selected weights decayed progressively during training.
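A hedged sketch of how such terms might be combined is given below; the exact formulations are illustrative rather than the paper's, and the Laplacian and perceptual terms are omitted for brevity.

```python
# Illustrative combination of photometric, normal, mask, Eikonal, and latent-norm terms.
import torch
import torch.nn.functional as F

def total_loss(pred_rgb, gt_rgb, pred_normal, gt_normal, pred_mask, gt_mask,
               sdf_grad, z_geo, z_app, w):
    l_rgb = F.l1_loss(pred_rgb, gt_rgb)                                          # photometric
    l_normal = (1.0 - F.cosine_similarity(pred_normal, gt_normal, dim=-1)).mean()
    l_mask = F.binary_cross_entropy(pred_mask.clamp(1e-4, 1 - 1e-4), gt_mask)    # foreground BCE
    l_eik = ((sdf_grad.norm(dim=-1) - 1.0) ** 2).mean()                          # Eikonal
    l_reg = z_geo.pow(2).mean() + z_app.pow(2).mean()                            # latent code norms
    return (w["rgb"] * l_rgb + w["normal"] * l_normal + w["mask"] * l_mask
            + w["eik"] * l_eik + w["reg"] * l_reg)
```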
6. Training Protocol and Dataset
Training uses the PALM dataset, which provides 13,000 hand scans from 263 subjects, with 8–12 synchronized high-resolution (2448×2048) RGB images per instance, supporting multi-view supervision. A single multi-subject model is trained across all subjects. Latent codes ($z_{\mathrm{geo}}$, $z_{\mathrm{app}}$) are initialized at zero and learned jointly with the model weights. The global scene environment is parameterized by Spherical Gaussians. Optimization uses Adam, with the learning rate decayed exponentially every 50k iterations, for approximately 200k iterations in total. Each mini-batch samples roughly 1024 rays from randomly chosen images/cameras, ensuring subject and pose diversity. Data augmentations include random gamma, color jitter, and patch cropping for the perceptual loss.
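The optimizer and schedule portion of this protocol can be sketched as follows; the learning rate, decay factor, and stand-in parameter are assumptions (the source does not state them), and the per-iteration ray sampling and rendering are left as a comment.

```python
# Sketch of the optimization schedule: Adam with exponential decay every 50k steps over ~200k steps.
import torch

params = [torch.nn.Parameter(torch.zeros(64))]                  # stand-in for network weights + latent codes
optimizer = torch.optim.Adam(params, lr=5e-4)                   # learning rate value assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.5)  # decay factor assumed

for step in range(200_000):
    # ... sample ~1024 rays from random images/cameras, render, compute the loss, backprop ...
    optimizer.step()
    scheduler.step()
```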
7. Evaluation, Results, and Ablation Analysis
Single-image personalization is evaluated by fitting subject codes and scene environment parameters to a target image (fixing network weights). Performance is reported in peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and LPIPS:
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| **InterHand2.6M** | | | |
| Handy | 7.50 | 0.69 | 0.24 |
| HARP | 9.89 | 0.78 | 0.16 |
| UHM | 10.08 | 0.76 | 0.19 |
| Ours (PALM-Net) | 12.01 | 0.84 | 0.15 |
| **HARP relit** | | | |
| Handy | 12.48 | 0.76 | 0.32 |
| HARP | 11.93 | 0.69 | 0.37 |
| UHM | 12.30 | 0.74 | 0.31 |
| Ours (PALM-Net) | 13.39 | 0.78 | 0.35 |
Qualitative results indicate faithful geometry and albedo estimation, as well as accurate relighting under both real-world and synthetic illumination variations.
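The personalization procedure (frozen network weights, optimized subject codes and environment) can be sketched as follows; `render` is a placeholder for the frozen PALM-Net renderer, and all dimensions, step counts, and learning rates are assumptions.

```python
# Hedged sketch of single-image personalization: only z_geo, z_app, and the SG
# environment parameters are optimized against the target image.
import torch
import torch.nn.functional as F

def personalize(render, target_rgb, n_steps=500, lr=1e-2, code_dim=64, n_sg=16):
    z_geo = torch.zeros(code_dim, requires_grad=True)
    z_app = torch.zeros(code_dim, requires_grad=True)
    sg_env = torch.zeros(n_sg, 7, requires_grad=True)    # axis (3) + sharpness (1) + RGB amplitude (3)
    opt = torch.optim.Adam([z_geo, z_app, sg_env], lr=lr)
    for _ in range(n_steps):
        pred = render(z_geo, z_app, sg_env)               # network weights stay frozen inside `render`
        loss = F.l1_loss(pred, target_rgb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_geo.detach(), z_app.detach(), sg_env.detach()
```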
Ablation studies assess the impact of normal supervision and adaptive environment fitting:
- Training with normal supervision (3dMD ground-truth normals) reduces "pepper" and "floater" artifacts and improves fidelity (e.g., PSNR rises from 11.97 without normals to 12.01 with normals).
- Optimizing environment Spherical Gaussian parameters during personalization (as opposed to holding them fixed) leads to better fit and reconstruction accuracy (Table 6).
8. Significance and Applications
PALM-Net establishes a strong multi-subject hand prior that synthesizes geometry, reflectance, and illumination in a unified, physically based neural rendering framework. This supports a range of applications such as creation of high-quality, relightable hand avatars from single images, forensic hand modeling, gesture analysis, and personalization of hand models for virtual/augmented reality. The method demonstrates robust generalization across real and synthetic conditions, validated through both numerical and qualitative results. The PALM dataset, together with PALM-Net, addresses prior limitations in subject diversity and physical accuracy in hand avatar reconstruction, providing a scalable resource for further research in photorealistic hand modeling and human digital twin approaches.