Identity-Style Normalization with 3D Priors
- The paper introduces a method that uses per-vertex target normals and a differentiable ARAP layer to balance expressive style transfer with robust identity preservation.
- It employs SVD-based rotation estimation and Poisson solvers for smooth mesh deformation, integrating 3D priors into the style and attribute mixing process.
- Empirical results demonstrate superior trade-offs in identity fidelity and stylization quality for both mesh stylization and video face swapping compared to baseline methods.
Identity-style normalization with 3D priors is an emerging paradigm in geometric deep learning and cross-modal generative modeling where style-driven deformation or transfer is regularized by priors derived from 3D structure. The core objective is to enable expressive stylization or identity transfer (e.g., in mesh stylization or face swapping), while rigorously preserving the geometric identity of the underlying shape or subject. Two state-of-the-art frameworks, Geometry in Style (mesh stylization) and DynamicFace (video face swapping), exemplify this principle by parameterizing deformation or attribute mixing through 3D priors, such as per-vertex surface normals or morphable face model coefficients, and enforcing identity preservation via explicit or implicit geometric constraints (Dinh et al., 29 Mar 2025, Wang et al., 15 Jan 2025).
1. Geometric Deformation with 3D Priors
Geometry in Style formulates stylization as a deformation of a triangle mesh $\mathcal{M} = (V, F)$, where identity preservation is governed by surface normal priors rather than unconstrained vertex displacements. Each vertex $i$ is assigned a per-vertex target normal $\hat{n}_i$ (unit-norm), which governs the stylization degree at each local patch. The local deformation is solved as a rotation $R_i \in SO(3)$ that best aligns both the original normal $n_i$ and the edge vectors $e_{ij}$ in its 1-ring neighborhood to their stylized counterparts, formalized by a local Procrustes energy:

$$E_i(R_i) = \sum_{j \in \mathcal{N}(i)} w_{ij} \left\| R_i e_{ij} - \tilde{e}_{ij} \right\|^2 + \lambda\, a_i \left\| R_i n_i - \hat{n}_i \right\|^2,$$

where $w_{ij}$ are cotangent weights, $a_i$ are Voronoi masses, and $\lambda$ modulates the identity-vs-style tradeoff. The best-fit rotation $R_i^*$ is obtained via orthogonal Procrustes (SVD-based).
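The rotation solve is a weighted orthogonal Procrustes problem. A minimal NumPy sketch follows; the function name and toy data are illustrative, and the reflection guard keeps the result in SO(3):

```python
import numpy as np

def best_fit_rotation(P, Q, weights):
    """Weighted orthogonal Procrustes: the R in SO(3) minimizing
    sum_j w_j ||R p_j - q_j||^2 (the local SVD-based rotation solve)."""
    S = (P * weights[:, None]).T @ Q      # weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    # Reflection guard: force det(R) = +1 so R is a proper rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Toy check: recover a known rotation of a 1-ring edge fan
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))               # rest edge vectors + normal
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T                          # stylized counterparts
R = best_fit_rotation(P, Q, np.ones(6))
assert np.allclose(R, R_true)
```

Appending the normal as one extra row of `P`/`Q` with weight $\lambda a_i$ recovers the full local energy above.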
DynamicFace harnesses 3D morphable face models (3DMMs) to explicitly parameterize facial attributes into disentangled coefficient groups: identity (shape), expression, pose, and albedo. This enables mixing source and target attributes in a physically grounded manner (Wang et al., 15 Jan 2025).
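As a sketch, the mixing step amounts to recombining disentangled coefficient groups. Which groups come from the source versus the target is chosen here purely for illustration (identity and albedo from the source, expression and pose from the target):

```python
from dataclasses import dataclass, replace

# Hypothetical coefficient container following the 3DMM split above.
@dataclass(frozen=True)
class FaceCoeffs:
    identity: tuple    # shape coefficients
    expression: tuple
    pose: tuple
    albedo: tuple

def swap_identity(source: FaceCoeffs, target: FaceCoeffs) -> FaceCoeffs:
    """Take identity/albedo from the source, keep the target's
    expression and pose -- one possible face-swap mixing rule."""
    return replace(target, identity=source.identity, albedo=source.albedo)

src = FaceCoeffs(identity=(1.0,), expression=(0.2,), pose=(0.0,), albedo=(0.9,))
tgt = FaceCoeffs(identity=(5.0,), expression=(0.7,), pose=(1.3,), albedo=(0.1,))
mixed = swap_identity(src, tgt)
assert mixed.identity == (1.0,) and mixed.expression == (0.7,)
```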
2. Differentiable As-Rigid-As-Possible (dARAP) Layer
Geometry in Style introduces the differentiable As-Rigid-As-Possible (dARAP) layer, adapting the classical ARAP formulation into a smooth, closed-form deformation compatible with gradient-based optimization. For a mesh deformation $V \to V'$, the global ARAP energy,

$$E(V', \{R_i\}) = \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij} \left\| (v'_i - v'_j) - R_i (v_i - v_j) \right\|^2,$$
is iteratively minimized by alternating between (i) local SVD-based rotation solves (controlled by target normals), and (ii) a global sparse Poisson solve for vertex positions (via the cotangent Laplacian). The entire pipeline is differentiable; gradients propagate through both SVD-based rotation estimation and the linear system solve, enabling end-to-end training under extrinsic supervision (e.g., rendered image losses) (Dinh et al., 29 Mar 2025).
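The global step alone can be sketched on a toy mesh. This minimal version assumes uniform edge weights and a single pinned vertex in place of the cotangent-weighted, prefactored sparse solve used in practice:

```python
import numpy as np

def arap_global_step(V, edges, R, pin=0):
    """Solve L V' = b with b_i = sum_j (w_ij/2)(R_i + R_j)(v_i - v_j),
    using a dense uniform-weight Laplacian for illustration."""
    n = len(V)
    L = np.zeros((n, n))
    b = np.zeros((n, 3))
    for i, j in edges:
        L[i, i] += 1.0; L[j, j] += 1.0
        L[i, j] -= 1.0; L[j, i] -= 1.0
        d = 0.5 * (R[i] + R[j]) @ (V[i] - V[j])
        b[i] += d
        b[j] -= d
    # Pin one vertex to remove the Laplacian's translational null space
    L[pin, :] = 0.0; L[pin, pin] = 1.0
    b[pin] = R[pin] @ V[pin]
    return np.linalg.solve(L, b)

# Tetrahedron: if every local frame carries the same rotation, the
# global solve must reproduce that rigid rotation exactly.
V = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1.0]])
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
theta = 0.5
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
V_new = arap_global_step(V, edges, [Rz] * 4)
assert np.allclose(V_new, V @ Rz.T)
```

When all local rotations agree, the solve returns the rigid motion itself, which is the sense in which rigidity is the energy's fixed point.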
3. Identity Preservation via Rigidity Priors
The dARAP layer itself acts as a strong geometric prior: each local patch is encouraged to deform via rotation only, prohibiting scaling, shear, or degenerate folding. This ensures that extrinsic stylization (e.g., adding ripples or blocky artifacts) does not compromise global part correspondences, silhouette, or structural integrity. Unlike purely Jacobian-regularized methods, dARAP avoids the need for additional L₂ identity losses, as the rigidity term is implicit in the energy. The $\lambda$ parameter explicitly controls the balance: excessive $\lambda$ overfits to the target normals at the risk of self-intersection, whereas low $\lambda$ underrepresents style changes (Dinh et al., 29 Mar 2025).
In DynamicFace, identity–style normalization is enforced by explicitly separating identity, pose, expression, and lighting through 3D priors. Four disentangled image-based conditions are generated for each frame: (1) shape-aware pose (normal map), (2) background-preservation mask, (3) expression (semantic map), and (4) illumination (blurred UV texture). Each condition guides a different UNet pathway, ensuring that high-level semantics and fine appearance (via Face Former and ReferenceNet modules) are injected without compromising subject identity (Wang et al., 15 Jan 2025).
4. Integration with High-level Generative Models
Geometry in Style integrates its 3D deformation pipeline with a text-to-image model by rendering the stylized mesh from multiple views and passing the renders into a pre-trained 2D diffusion model, guided by a semantic visual loss (Cascaded Score Distillation). Written in standard score-distillation form for a single cascade stage, the gradient with respect to the target normals is

$$\nabla_{\hat{n}} \mathcal{L}_{\text{CSD}} = \mathbb{E}_{t, \epsilon}\left[ w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \hat{n}} \right],$$

where $x$ is the rendered image, $y$ the text prompt, and $\epsilon_\phi$ the diffusion model's noise prediction; the gradients flow through the rasterizer and the dARAP solver to optimize the target normals $\hat{n}_i$. The overall stylization loop alternates between updating target normals via Adam, local/global dARAP solves, and feedback from the visual loss (Dinh et al., 29 Mar 2025).
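The alternation can be sketched with a toy differentiable loss standing in for the rendered visual loss; the Adam update is written out by hand, and the per-vertex normals are re-projected to unit length after each step (all names and hyperparameters are illustrative):

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.02, b1=0.9, b2=0.999, eps=1e-8):
    """One hand-written, bias-corrected Adam update."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    x = x - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return x, m, v

style_dir = np.array([0.0, 0.0, 1.0])       # stand-in for the visual-loss target
normals = np.tile([1.0, 0.0, 0.0], (4, 1))  # initial per-vertex target normals
m = np.zeros_like(normals)
v = np.zeros_like(normals)
for t in range(1, 501):
    grad = normals - style_dir              # gradient of 0.5 * ||n - s||^2
    normals, m, v = adam_step(normals, grad, m, v, t)
    # Re-project onto the unit sphere (target normals are unit-norm)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
assert (normals @ style_dir).min() > 0.95   # all normals pulled toward the style
```

In the real pipeline the toy loss is replaced by the rendered CSD objective, and each Adam step is followed by the local/global dARAP solves described above.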
DynamicFace attaches zero-initialized guider heads to each 2D “condition” and fuses them into UNet backbone features, initializing from pretrained weights and allowing fine-grained guidance with minimal catastrophic forgetting. Identity injection is performed through Face Former (high-level tokens, ArcFace) and ReferenceNet (spatial-attention), with training losses encompassing reconstruction, identity, expression, pose, semantic keypoints, and, for video, temporal consistency (CLIP-based and warping error) (Wang et al., 15 Jan 2025).
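The zero-initialization trick can be illustrated in a few lines: because the guider's projection starts at zero, the fused feature equals the pretrained backbone feature at initialization, so the conditions perturb the UNet only as training progresses (shapes and names are illustrative):

```python
import numpy as np

class ZeroGuider:
    """Toy zero-initialized guider head: a linear projection whose
    weights start at zero, injected residually into backbone features."""
    def __init__(self, c_in, c_out):
        self.W = np.zeros((c_out, c_in))   # zero-init projection
        self.b = np.zeros(c_out)

    def __call__(self, cond):              # cond: (c_in,) condition feature
        return self.W @ cond + self.b

def fuse(backbone_feat, cond, guider):
    return backbone_feat + guider(cond)    # residual injection

guider = ZeroGuider(c_in=8, c_out=8)
feat = np.arange(8.0)
cond = np.ones(8)
assert np.allclose(fuse(feat, cond, guider), feat)  # identity at init
```

This is the same design rationale as zero-convolutions in conditioned diffusion backbones: the pretrained weights are untouched at step 0, which limits catastrophic forgetting.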
5. Empirical Results and Comparison to Baselines
Geometry in Style demonstrates a superior trade-off in area preservation (mean triangle area ratio ≈1.08, stdev ≈0.23 across 20 shapes) compared to TextDeformer (0.83±0.36) and MeshUp (1.29±0.36). CLIP similarity to the text prompt is on par or slightly better (0.655 vs. 0.653 for MeshUp and 0.650 for TextDeformer). Qualitatively, the method supports expressive style (e.g., Lego blockiness, armor effects) while maintaining pose and part identity. Control over $\lambda$ enables users to adjust the fidelity-vs-style trade-off at inference. Bump-map approaches are limited to small surface effects, and Jacobian-deformation baselines tend to degrade limb fidelity and silhouette (Dinh et al., 29 Mar 2025).
DynamicFace achieves state-of-the-art results for video face swapping: on FaceForensics++, identity retrieval is 99.20% (vs. ∼98.7% prior SOTA), mouth-L2 error is 1.69px, and eye-L2 error is 0.16px. Ablation studies confirm the necessity of each 3D condition: exclusion of any degrades pose, background, or expression fidelity. Temporal consistency with plug-and-play layers raises frame consistency from 95.78% (without) to 99.02% (with), while warping error nearly doubles without temporal modeling. Removing either Face Former or ReferenceNet results in a 4–5% drop in identity similarity (Wang et al., 15 Jan 2025).
6. Limitations and Extensions
Geometry in Style is constrained to manifold meshes with moderate aspect ratios; topology modifications and collision/self-intersection handling are not supported. Overly strong stylization parameters may cause mesh collapse in thin geometries or fail to capture high-frequency details beyond surface normal bandwidth. Extensions may include coupling with segmentation for spatially localized stylization, differentiable collision/physics, support for topology change, or exploration of conformal and higher-order geometric priors (Dinh et al., 29 Mar 2025).
DynamicFace depends on high-quality 3DMM fits. Disentanglement errors in the 3D prior estimation propagate directly into condition quality. Potential extensions include learned or unsupervised segmentation priors for more granular region control, advanced attention/fusion mechanisms, and joint optimization of 3D priors with diffusion model weights (Wang et al., 15 Jan 2025).
7. Schematic Workflow Overview
| Framework | 3D Priors | Stylization/Transfer Mechanism | Identity Preservation Mechanism |
|---|---|---|---|
| Geometry in Style | Per-vertex target normals | dARAP (SVD+Poisson) + diffusion model visual loss | Local rigidity + global smoothness via ARAP |
| DynamicFace | 3DMM: identity, pose, expression, albedo | Mixture-of-Guider, UNet fusion, FaceFormer/ReferenceNet | Disentangled 3D conditions + facial feature tokens |
Identity-style normalization with 3D priors has established itself as a principled approach for stylized deformation and attribute transfer, providing explicit disentanglement between style and object identity by leveraging differentiable geometric constraints and physically grounded parameterizations (Dinh et al., 29 Mar 2025, Wang et al., 15 Jan 2025).