PALM-Net: Physically Based Hand Avatar Reconstruction
- The paper's primary contribution is an integrated framework that combines inverse rendering, continuous implicit 3D surface parameterization, and Disney-style BRDFs to achieve personalized hand avatars.
- It leverages a diverse dataset of 13,000 hand scans and 90,000 multi-view images from 263 subjects to capture rich intra- and inter-subject variability in geometry, skin tone, and appearance.
- The method integrates differentiable volume rendering with specialized neural networks to deliver robust novel view synthesis and improved performance on metrics such as PSNR, SSIM, and LPIPS.
PALM-Net is a multi-subject, physically based hand prior, learned via inverse rendering, that enables the construction of personalized, relightable 3D hand avatars from single RGB images. Developed as the baseline method in the PALM dataset initiative, PALM-Net is trained on 13,000 high-quality hand scans and 90,000 multi-view images from 263 subjects, capturing rich intra- and inter-subject variability in geometry, skin tone, and appearance. Its primary contribution is an integrated framework that combines continuous implicit 3D surface parameterization, spatially varying reflectance, and a global scene illumination model, optimized end-to-end via differentiable volume rendering and physically based Disney-style BRDFs.
1. Problem Formulation and Objectives
PALM-Net addresses the task of image-based hand avatar personalization under the following problem setting: the input is a single RGB image of a right hand, captured in an arbitrary pose and under complex, largely unknown illumination. The method assumes access to a predicted 3D hand pose and shape for the frame, typically provided by off-the-shelf 3D keypoint/pose estimators, such as those used in InterHand2.6M.
The outputs of PALM-Net are:
- A relightable, personalized 3D hand avatar, comprising:
- Geometry: a continuous implicit surface representation (SDF-to-opacity field) supporting articulation to arbitrary hand poses,
- Material properties: spatially varying albedo, roughness, and metallicity,
- Scene illumination: an explicit environment map, parameterized by Spherical Gaussians for novel-view relighting.
- Novel renderings of the avatar under new poses and/or environment maps.
This enables application to single-image hand avatar personalization, including relighting and free-form gesture animation.
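As a rough illustration of this output interface, the predicted quantities can be grouped in a small container; the sketch below is purely illustrative (field names, shapes, and the relighting helper are assumptions, not the released API).

```python
# Illustrative container for PALM-Net's per-subject outputs (names/shapes assumed).
from dataclasses import dataclass
import numpy as np

@dataclass
class HandAvatar:
    z_geo: np.ndarray    # subject latent code conditioning the implicit SDF geometry
    z_app: np.ndarray    # subject latent code conditioning albedo/roughness/metallicity
    sg_env: np.ndarray   # (K, 7) Spherical Gaussians: axis (3) + sharpness (1) + RGB amplitude (3)

    def relit(self, new_sg_env: np.ndarray) -> "HandAvatar":
        """Relighting amounts to swapping the environment map while keeping the subject codes."""
        return HandAvatar(self.z_geo, self.z_app, new_sg_env)
```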
2. Hand Geometry and Material Parameterization
2.1 Canonicalization via MANO and SNARF
Hand geometry is represented using the MANO parametric model $M(\theta, \beta, \tau)$, where $\theta$ denotes the joint-angle pose, $\beta$ the shape PCA coefficients, and $\tau$ the global rigid transformation. SNARF [Chen et al. ICCV'21] is used to invert linear blend skinning, mapping each deformed point $x_d$ to a common canonical space by solving
$$x_d = \sum_i w_i(x_c)\, B_i\, x_c$$
for the canonical point $x_c$, where $w_i$ are skinning weights and $B_i$ bone transforms.
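The canonicalization step can be sketched as a simple iterative inversion of linear blend skinning. SNARF itself uses Broyden's method with multiple initializations, so the following is only a didactic approximation with assumed helper names.

```python
# Didactic sketch of SNARF-style canonicalization (not the official SNARF code):
# given a deformed-space point x_d and per-bone 4x4 transforms B_i, find the canonical
# point x_c such that forward linear blend skinning maps it back to x_d.
# `skin_weights` stands in for a learned weight field or nearest-MANO-vertex lookup.
import numpy as np

def lbs_forward(x_c, weights, bone_transforms):
    """Forward LBS: x_d = sum_i w_i(x_c) * (B_i @ [x_c, 1])."""
    x_h = np.append(x_c, 1.0)                                  # homogeneous coordinates
    blended = sum(w * (B @ x_h) for w, B in zip(weights, bone_transforms))
    return blended[:3]

def canonicalize(x_d, bone_transforms, skin_weights, n_iters=50, step=0.5):
    """Approximate the canonical correspondence by damped fixed-point iteration."""
    x_c = np.array(x_d, dtype=float)                           # initialize in deformed space
    for _ in range(n_iters):
        w = skin_weights(x_c)
        residual = lbs_forward(x_c, w, bone_transforms) - x_d
        x_c = x_c - step * residual                            # push the residual toward zero
    return x_c
```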
2.2 Implicit Geometry Network
A continuous, canonical surface representation is learned via an SDF-to-opacity MLP $f_{\mathrm{geo}}$:
$$f_{\mathrm{geo}}(x_c, z_{\mathrm{geo}}) = (s, \mathbf{f}), \qquad \sigma = \alpha\, \Psi_\beta(-s).$$
Here, $x_c$ is the SNARF-canonicalized point, $z_{\mathrm{geo}}$ is a learned subject-specific latent geometry code, $s$ is the signed distance mapped to the local volume density (opacity) $\sigma$ via the Laplace-CDF mapping $\Psi_\beta$, and $\mathbf{f}$ is a learned geometry feature vector forwarded to the appearance sub-networks.
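The SDF-to-opacity conversion mentioned above is commonly implemented with a Laplace-CDF mapping (as in VolSDF); a minimal sketch follows, where the amplitude and scale hyperparameters are assumptions rather than the paper's values.

```python
# Minimal sketch of a Laplace-CDF SDF-to-density mapping (VolSDF-style); the exact
# constants used by PALM-Net are not restated here, so alpha and beta are illustrative.
import torch

def sdf_to_density(sdf: torch.Tensor, alpha: float = 100.0, beta: float = 0.01) -> torch.Tensor:
    """sigma(x) = alpha * Psi_beta(-sdf(x)), where Psi_beta is the CDF of a
    zero-mean Laplace distribution with scale beta."""
    s = -sdf
    cdf = torch.where(
        s <= 0,
        0.5 * torch.exp(s / beta),
        1.0 - 0.5 * torch.exp(-s / beta),
    )
    return alpha * cdf
```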
2.3 Neural Radiance and Material Fields
The radiance field network $f_{\mathrm{rad}}$ models view-dependent surface emission given the geometry feature, surface normal, reflected direction, and an appearance code:
- Surface normal $\mathbf{n} = \nabla_x s / \lVert \nabla_x s \rVert$, computed from the SDF gradient,
- Reflected direction $\omega_r$, obtained by mirroring the viewing direction about $\mathbf{n}$,
- Appearance code $z_{\mathrm{app}}$, a learned subject-specific latent (see the sketch after this list).
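A minimal sketch of how the normal and reflected direction can be derived from the implicit geometry follows; `sdf_net` is a placeholder for the geometry MLP, not a released function.

```python
# Sketch: normal from the normalized SDF gradient (via autograd) and the viewing
# direction mirrored about that normal. Assumes x_c: (N, 3), view_dir: (N, 3) unit vectors.
import torch
import torch.nn.functional as F

def normal_and_reflection(sdf_net, x_c, view_dir):
    x_c = x_c.detach().requires_grad_(True)
    sdf = sdf_net(x_c).sum()
    grad = torch.autograd.grad(sdf, x_c, create_graph=True)[0]
    normal = F.normalize(grad, dim=-1)
    # Reflect the incident viewing direction about the surface normal.
    reflected = view_dir - 2.0 * (view_dir * normal).sum(-1, keepdim=True) * normal
    return normal, reflected
```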
The material property network $f_{\mathrm{mat}}$ predicts
$$f_{\mathrm{mat}}(x_c, \mathbf{f}, z_{\mathrm{app}}) = (\mathbf{a}, r, m),$$
where $\mathbf{a}$ is the spatially varying, per-channel albedo, $r$ the roughness scalar, and $m$ the metallicity.
For both geometry and appearance, each subject receives distinct latent codes $z_{\mathrm{geo}}$ (shape) and $z_{\mathrm{app}}$ (appearance), learned end-to-end.
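As one concrete (and assumed) realization of the material branch, a small PyTorch head conditioned on the geometry feature and the appearance code could look like the following; layer widths follow the 128–256-channel MLPs mentioned in Section 3 but are otherwise guesses.

```python
# Illustrative material head (not the released architecture): maps geometry features
# and the subject appearance code to albedo, roughness, and metallicity.
import torch
import torch.nn as nn

class MaterialHead(nn.Module):
    def __init__(self, feat_dim=256, app_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + app_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 5),                      # 3 albedo channels + roughness + metallicity
        )

    def forward(self, geom_feat, z_app):
        out = self.mlp(torch.cat([geom_feat, z_app], dim=-1))
        albedo = torch.sigmoid(out[..., :3])           # per-channel albedo in [0, 1]
        roughness = torch.sigmoid(out[..., 3:4])       # scalar roughness in [0, 1]
        metallic = torch.sigmoid(out[..., 4:5])        # scalar metallicity in [0, 1]
        return albedo, roughness, metallic
```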
3. Network Architecture
Each network ($f_{\mathrm{geo}}$, $f_{\mathrm{rad}}$, $f_{\mathrm{mat}}$) is implemented as an MLP, leveraging hash-grid positional encodings [Müller et al. TOG'22] of the canonical point $x_c$. Typical configurations use 4–6 layers of 128–256 channels per layer. Conditioning is achieved by concatenating the hash-encoded $x_c$, subject latent codes, MANO pose/shape embeddings, and, where appropriate, geometry features and view/normal directions. The overall connectivity is as follows:
| Network | Inputs | Outputs |
|---|---|---|
| $f_{\mathrm{geo}}$ | $x_c$ (hash-encoded), $z_{\mathrm{geo}}$, MANO $(\theta, \beta)$ | $\sigma$, $\mathbf{f}$ |
| $f_{\mathrm{rad}}$ | $x_c$, $\mathbf{f}$, $\mathbf{n}$, $\omega_r$, $z_{\mathrm{app}}$ | RGB radiance |
| $f_{\mathrm{mat}}$ | $x_c$, $\mathbf{f}$, $z_{\mathrm{app}}$ | $\mathbf{a}$, $r$, $m$ |
This architecture enables the model to capture both global inter-subject variation and personalized detail through subject-specific codes while maintaining parameter efficiency via shared weights across all subjects.
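For reference, the hash-grid positional encoding cited above can be sketched in a few dozen lines. This is a simplified didactic version (summed rather than XOR-ed hashing, arbitrary table sizes and level counts), not the optimized implementation of Müller et al.

```python
# Simplified multi-resolution hash-grid encoding (after Müller et al., TOG 2022).
# Each level hashes voxel corners into a learnable table and trilinearly interpolates.
import torch
import torch.nn as nn

PRIMES = torch.tensor([1, 2654435761, 805459861], dtype=torch.long)

class HashGridEncoding(nn.Module):
    def __init__(self, n_levels=8, table_size=2**14, feat_dim=2, base_res=16, growth=1.5):
        super().__init__()
        self.resolutions = [int(base_res * growth**i) for i in range(n_levels)]
        self.tables = nn.ParameterList(
            [nn.Parameter(1e-4 * torch.randn(table_size, feat_dim)) for _ in range(n_levels)]
        )
        self.table_size = table_size

    def _hash(self, coords):  # coords: (N, 3) integer voxel corners
        # The reference method XORs the prime-multiplied coordinates; summing keeps this sketch simple.
        h = (coords * PRIMES.to(coords.device)).sum(-1)
        return torch.remainder(h, self.table_size)

    def forward(self, x):     # x: (N, 3) points normalized to [0, 1]^3
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            pos = x * res
            lo = pos.floor().long()
            frac = pos - lo.float()
            acc = 0.0
            for corner in range(8):                    # trilinear interpolation over 8 corners
                offset = torch.tensor([(corner >> d) & 1 for d in range(3)], device=x.device)
                w = torch.prod(torch.where(offset.bool(), frac, 1.0 - frac), dim=-1, keepdim=True)
                acc = acc + w * table[self._hash(lo + offset)]
            feats.append(acc)
        return torch.cat(feats, dim=-1)                # (N, n_levels * feat_dim)
```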
4. Physically Based Inverse Rendering
4.1 Differentiable Volume Rendering
PALM-Net utilizes NeRF-style differentiable volume rendering:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, \mathrm{d}t, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, \mathrm{d}s\right),$$
where the transmittance $T(t)$ encodes the accumulated opacity along the ray $\mathbf{r}$. In practice, this integral is approximated by standard quadrature.
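A minimal quadrature sketch of this integral (standard alpha compositing along one ray; not PALM-Net's exact implementation):

```python
# Discrete approximation of the volume-rendering integral along one ray.
import torch

def composite(densities, radiances, deltas):
    """densities: (N,) sigma_i, radiances: (N, 3) c_i, deltas: (N,) segment lengths.
    Returns the composited pixel color and the per-sample weights w_i."""
    alphas = 1.0 - torch.exp(-densities * deltas)                                # per-segment opacity
    ones = torch.ones(1, device=densities.device)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10])[:-1], dim=0)  # transmittance T_i
    weights = trans * alphas
    color = (weights.unsqueeze(-1) * radiances).sum(dim=0)
    return color, weights
```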
4.2 Physically-Based Rendering with Disney-BRDF
During training, the radiance in the integral above is replaced by a physically based shading term using the Disney BRDF:
$$C(\mathbf{r}) = \sum_i w_i\, L_o(x_i, \omega_o), \qquad L_o(x_i, \omega_o) = \int_{\Omega} f_{\mathrm{Disney}}(x_i, \omega_o, \omega_j)\, L_{\mathrm{env}}(\omega_j)\, (\mathbf{n}_i \cdot \omega_j)\, \mathrm{d}\omega_j,$$
where the weights $w_i$ are computed from opacity, $f_{\mathrm{Disney}}$ applies the Disney model, $L_o$ and $L_{\mathrm{env}}$ are the outgoing and environment radiance, and $\omega_j$ are the sampled incoming directions. The environment map $L_{\mathrm{env}}$ is modeled as a sum of learnable Spherical Gaussians. All integrals are differentiable, allowing backpropagation through the illumination, geometry, and material networks.
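The Spherical-Gaussian environment model admits a compact closed form. The sketch below evaluates incoming radiance for a batch of query directions (lobe count and parameter layout are assumptions); in the full model this radiance is multiplied by the Disney BRDF and the cosine foreshortening term and summed over sampled directions, as in the expression above.

```python
# Sketch of a Spherical-Gaussian environment map:
# L_env(w) = sum_k mu_k * exp(lambda_k * (dot(w, xi_k) - 1)).
import torch

def eval_sg_envmap(dirs, xi, sharpness, amplitude):
    """dirs: (N, 3) unit query directions; xi: (K, 3) unit lobe axes;
    sharpness: (K,) lambda_k >= 0; amplitude: (K, 3) RGB mu_k. Returns (N, 3) radiance."""
    cos = dirs @ xi.t()                            # (N, K) cosines between queries and lobe axes
    lobes = torch.exp(sharpness * (cos - 1.0))     # (N, K) lobe responses
    return lobes @ amplitude                       # (N, 3) summed RGB radiance
```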
5. Loss Functions and Training Objectives
Supervision is provided by registered 3dMD scan renderings, including ground-truth color, normals, and segmentation masks. The total loss is a weighted sum of the following terms:
- $\mathcal{L}_{\mathrm{rgb}}$: photometric loss on rendered color.
- $\mathcal{L}_{\mathrm{normal}}$: supervision of predicted vs. scan normals.
- $\mathcal{L}_{\mathrm{mask}}$: foreground binary cross-entropy loss.
- $\mathcal{L}_{\mathrm{eik}}$: Eikonal regularizer $(\lVert \nabla_x s \rVert - 1)^2$, enforcing SDF validity.
- $\mathcal{L}_{\mathrm{lap}}$: Laplacian smoothness penalty encouraging local surface consistency.
- $\mathcal{L}_{\mathrm{reg}}$: regularization of subject latent code norms.
- $\mathcal{L}_{\mathrm{perc}}$: patch-based perceptual loss for high-frequency appearance details.
The loss coefficients are determined empirically, with selected weights decayed progressively during training.
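A hedged sketch of how such terms might be combined is given below; the exact formulations are illustrative rather than the paper's, and the Laplacian and perceptual terms are omitted for brevity.

```python
# Illustrative combination of photometric, normal, mask, Eikonal, and latent-norm terms.
import torch
import torch.nn.functional as F

def total_loss(pred_rgb, gt_rgb, pred_normal, gt_normal, pred_mask, gt_mask,
               sdf_grad, z_geo, z_app, w):
    l_rgb = F.l1_loss(pred_rgb, gt_rgb)                                          # photometric
    l_normal = (1.0 - F.cosine_similarity(pred_normal, gt_normal, dim=-1)).mean()
    l_mask = F.binary_cross_entropy(pred_mask.clamp(1e-4, 1 - 1e-4), gt_mask)    # foreground BCE
    l_eik = ((sdf_grad.norm(dim=-1) - 1.0) ** 2).mean()                          # Eikonal
    l_reg = z_geo.pow(2).mean() + z_app.pow(2).mean()                            # latent code norms
    return (w["rgb"] * l_rgb + w["normal"] * l_normal + w["mask"] * l_mask
            + w["eik"] * l_eik + w["reg"] * l_reg)
```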
6. Training Protocol and Dataset
Training uses the PALM dataset, which provides 13,000 hand scans from 263 subjects, with 8–12 synchronized high-resolution (2448×2048) RGB images per instance, supporting multi-view supervision. A single multi-subject model is trained across all subjects. Latent codes ($z_{\mathrm{geo}}$, $z_{\mathrm{app}}$) are initialized at zero and learned jointly with the model weights. The global scene environment is parameterized by Spherical Gaussians. Optimization uses Adam, with the learning rate decayed exponentially every 50k iterations, for approximately 200k iterations in total. Each mini-batch samples roughly 1024 rays from randomly chosen images/cameras, ensuring subject and pose diversity. Data augmentations include random gamma, color jitter, and patch cropping for the perceptual loss.
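The optimizer and schedule portion of this protocol can be sketched as follows; the learning rate, decay factor, and stand-in parameter are assumptions (the source does not state them), and the per-iteration ray sampling and rendering are left as a comment.

```python
# Sketch of the optimization schedule: Adam with exponential decay every 50k steps over ~200k steps.
import torch

params = [torch.nn.Parameter(torch.zeros(64))]                  # stand-in for network weights + latent codes
optimizer = torch.optim.Adam(params, lr=5e-4)                   # learning rate value assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.5)  # decay factor assumed

for step in range(200_000):
    # ... sample ~1024 rays from random images/cameras, render, compute the loss, backprop ...
    optimizer.step()
    scheduler.step()
```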
7. Evaluation, Results, and Ablation Analysis
Single-image personalization is evaluated by fitting subject codes and scene environment parameters to a target image (fixing network weights). Performance is reported in peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and LPIPS:
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| **InterHand2.6M** | | | |
| Handy | 7.50 | 0.69 | 0.24 |
| HARP | 9.89 | 0.78 | 0.16 |
| UHM | 10.08 | 0.76 | 0.19 |
| Ours (PALM-Net) | 12.01 | 0.84 | 0.15 |
| **HARP relit** | | | |
| Handy | 12.48 | 0.76 | 0.32 |
| HARP | 11.93 | 0.69 | 0.37 |
| UHM | 12.30 | 0.74 | 0.31 |
| Ours (PALM-Net) | 13.39 | 0.78 | 0.35 |
Qualitative results indicate faithful geometry and albedo estimation, as well as accurate relighting under both real-world and synthetic illumination variations.
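The personalization procedure (frozen network weights, optimized subject codes and environment) can be sketched as follows; `render` is a placeholder for the frozen PALM-Net renderer, and all dimensions, step counts, and learning rates are assumptions.

```python
# Hedged sketch of single-image personalization: only z_geo, z_app, and the SG
# environment parameters are optimized against the target image.
import torch
import torch.nn.functional as F

def personalize(render, target_rgb, n_steps=500, lr=1e-2, code_dim=64, n_sg=16):
    z_geo = torch.zeros(code_dim, requires_grad=True)
    z_app = torch.zeros(code_dim, requires_grad=True)
    sg_env = torch.zeros(n_sg, 7, requires_grad=True)    # axis (3) + sharpness (1) + RGB amplitude (3)
    opt = torch.optim.Adam([z_geo, z_app, sg_env], lr=lr)
    for _ in range(n_steps):
        pred = render(z_geo, z_app, sg_env)               # network weights stay frozen inside `render`
        loss = F.l1_loss(pred, target_rgb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_geo.detach(), z_app.detach(), sg_env.detach()
```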
Ablation studies assess the impact of normal supervision and adaptive environment fitting:
- Training with normal supervision (3dMD ground-truth normals) reduces "pepper" and "floater" artifacts and improves fidelity (e.g., PSNR rises from 11.97 without normals to 12.01 with normals).
- Optimizing environment Spherical Gaussian parameters during personalization (as opposed to holding them fixed) leads to better fit and reconstruction accuracy (Table 6).
8. Significance and Applications
PALM-Net establishes a strong multi-subject hand prior that synthesizes geometry, reflectance, and illumination in a unified, physically based neural rendering framework. This supports a range of applications such as creation of high-quality, relightable hand avatars from single images, forensic hand modeling, gesture analysis, and personalization of hand models for virtual/augmented reality. The method demonstrates robust generalization across real and synthetic conditions, validated through both numerical and qualitative results. The PALM dataset, together with PALM-Net, addresses prior limitations in subject diversity and physical accuracy in hand avatar reconstruction, providing a scalable resource for further research in photorealistic hand modeling and human digital twin approaches.