GroomCap: High-Fidelity Prior-Free Hair Capture

Published 1 Sep 2024 in cs.GR | (2409.00831v4)

Abstract: Despite recent advances in multi-view hair reconstruction, achieving strand-level precision remains a significant challenge due to inherent limitations in existing capture pipelines. We introduce GroomCap, a novel multi-view hair capture method that reconstructs faithful and high-fidelity hair geometry without relying on external data priors. To address the limitations of conventional reconstruction algorithms, we propose a neural implicit representation for hair volume that encodes high-resolution 3D orientation and occupancy from input views. This implicit hair volume is trained with a new volumetric 3D orientation rendering algorithm, coupled with 2D orientation distribution supervision, to effectively prevent the loss of structural information caused by undesired orientation blending. We further propose a Gaussian-based hair optimization strategy to refine the traced hair strands with a novel chained Gaussian representation, utilizing direct photometric supervision from images. Our results demonstrate that GroomCap is able to capture high-quality hair geometries that are not only more precise and detailed than existing methods but also versatile enough for a range of applications.


Summary

The paper introduces a prior-free method using neural implicit representations to capture detailed 3D hair geometry from multi-view images.
It employs a three-stage pipeline with volumetric hair tracing and Gaussian-based optimization to refine approximately 150K hair strands.
The method demonstrates robust performance across diverse hairstyles, outperforming existing techniques in realistic hair reconstruction.

High-Fidelity Prior-Free Hair Capture

The paper "GroomCap: High-Fidelity Prior-Free Hair Capture" (2409.00831) introduces a novel multi-view hair capture method, GroomCap, that reconstructs faithful and high-fidelity hair geometry without relying on external data priors. This method addresses limitations of conventional reconstruction algorithms by proposing a neural implicit representation for hair volume that encodes high-resolution 3D orientation and occupancy from input views. Furthermore, the paper introduces a Gaussian-based hair optimization strategy to refine the traced hair strands with a novel chained Gaussian representation, utilizing direct photometric supervision from images.

Method Overview

GroomCap's pipeline comprises three stages. The first stage establishes an implicit hair volume that encodes spatial occupancy and orientation from the multi-view image captures; Figure 1 shows the inputs to the pipeline. In the second stage, initial hair strands are grown within the hair volume based on heuristics. The final stage refines the hair geometry through Gaussian-based hair optimization, using differentiable rendering with chained hair Gaussians. The pipeline outputs approximately 150K hair strands, each explicitly represented as a polyline with 100 points.

Figure 1: The input to our pipeline includes calibrated multi-view images (left), semantic segmentations of hair and foreground (middle), reconstructed inner and outer meshes with the hair bounding box (right), and optional hair partline annotation on one image (middle column, first row).

Data Acquisition and Preparation

The method uses a multi-camera system with 64 cameras at 4K resolution under uniform illumination. Semantic segmentation masks categorize each pixel as background, hair, or body. A rough surface reconstruction of the subject is achieved using the technique described in [guo2019relightables], dilated to encompass all hairs, serving as the outer mesh. A fitted parametric head mesh model provides the inner mesh, approximating the subject's bald surface for locating the hair scalp. A loose 3D bounding box of the hairs is derived by projecting per-view hair segmentation onto the outer mesh. An optional 2D annotation of the parting line can be provided from a top-down view.

Neural Hair Volume

The implicit hair volume is formulated as an MLP network $\mathcal{V}$. The input to $\mathcal{V}$ is a 3D position $\mathbf{x} \in \mathbb{R}^3$, and the output includes volume density $\sigma \in [0, 1]$, hair occupancy $\rho_h \in [0, 1]$, body occupancy $\rho_b \in [0, 1]$, and 3D hair orientation in polar angles $(\theta \in (0, \pi], \phi \in (0, \pi])$. During training, $\mathcal{V}$ is additionally fed the view direction vector $\mathbf{d} \in \mathbb{R}^3$ and predicts the view-dependent radiance color $\mathbf{c} \in \mathbb{R}^3$, similar to NeRF. The model architecture (Figure 2) includes a shared feature network, an appearance network, and a structure network.

Figure 2: The implicit hair volume network comprises three sub-modules: the feature network and appearance network are used to estimate view-independent volume density $\sigma$ and view-dependent radiance $\mathbf{c}$ from input position $\mathbf{x}$ and view direction $\mathbf{d}$, similar to NeRF; an additional structure network is devised to estimate hair occupancy $\rho_h$ and body occupancy $\rho_b$ as well as 3D orientation $(\theta, \phi)$ in polar angles.
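The three sub-modules can be illustrated with a small NumPy forward pass. This is only a sketch under stated assumptions: the layer widths, the random weights, the absence of positional encoding, and the choice to feed the structure network from the shared features are all hypothetical, and the names `hair_volume`, `feature_net`, `appearance_net`, and `structure_net` do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT = 64  # hypothetical width of the shared feature vector


def init_mlp(sizes):
    """Create randomly initialized (weight, bias) pairs for a small MLP."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """Apply the layers with ReLU between them (no activation on the last)."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


feature_net = init_mlp([3, 64, FEAT + 1])      # position -> features + density logit
appearance_net = init_mlp([FEAT + 3, 64, 3])   # features + view direction -> color
structure_net = init_mlp([FEAT, 64, 4])        # features -> rho_h, rho_b, theta, phi


def hair_volume(x, view_dir=None):
    """Query the sketch network at positions x of shape (N, 3)."""
    h = mlp(feature_net, x)
    feat, sigma = h[..., :FEAT], sigmoid(h[..., FEAT])
    s = sigmoid(mlp(structure_net, feat))
    out = {
        "sigma": sigma,             # view-independent density
        "rho_h": s[..., 0],         # hair occupancy
        "rho_b": s[..., 1],         # body occupancy
        "theta": np.pi * s[..., 2], # polar angles scaled to (0, pi)
        "phi": np.pi * s[..., 3],
    }
    if view_dir is not None:        # radiance is only needed during training
        out["rgb"] = sigmoid(mlp(appearance_net,
                                 np.concatenate([feat, view_dir], axis=-1)))
    return out
```

Calling `hair_volume(x)` without a view direction returns only the view-independent quantities, which mirrors the text's statement that the view direction is an additional training-time input.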

Neural Orientation Field

The volumetric orientation plays a crucial role in defining the 3D hair structure. The method optimizes a neural orientation field that estimates 3D orientations without explicit resolution limitations. To construct this field, a new formulation "renders" 3D orientations within the volume rendering paradigm. A single 3D orientation is expanded into a distribution using a predefined kernel function as its PDF. Based on the per-position distribution $\mathcal{H}_{\mathbf{x}}$, the accumulated 3D distribution $\mathcal{G}_r$ along an arbitrary ray $r$ is computed using volume rendering.
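A minimal NumPy sketch of accumulating per-sample orientation distributions along one ray. It assumes NeRF-style compositing weights $w_i = T_i (1 - e^{-\sigma_i \Delta_i})$ and PDFs discretized on a shared angle grid; the function name and array shapes are illustrative and not taken from the paper.

```python
import numpy as np


def accumulate_orientation_pdf(sigma, deltas, sample_pdfs):
    """Alpha-blend per-sample orientation PDFs along one ray.

    sigma:       (S,) densities at the S samples along the ray
    deltas:      (S,) distances between consecutive samples
    sample_pdfs: (S, A, B) PDF of each sample on a shared (theta, phi) grid
    Returns the (A, B) accumulated distribution and the (S,) blending weights.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)  # per-sample opacity
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha
    accumulated = np.tensordot(weights, sample_pdfs, axes=(0, 0))
    return accumulated, weights
```

Because whole distributions are blended rather than angle values, two hair layers crossed by the same ray remain visible as two separate peaks in the accumulated result.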

Supervision with 2D Orientations

The method supervises accumulated 3D orientation distributions $\mathcal{G}$ using multi-view images. Instead of representing a pixel's 2D orientation with a single value, the responses of all filters are maintained, forming a distribution of 2D orientations. The 3D orientation distribution $\mathcal{G}$ of each ray is projected into a distribution of 2D orientations $\mathcal{F}$ . The loss function for the neural orientation field is defined as the integral of the squared difference between the projected 2D orientation distribution and the normalized response of the orientation filter at angle $\eta$ .
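The loss described above can be written as a discretized integral. The sketch below assumes the filter bank samples $K$ evenly spaced angles $\eta$ over $[0, \pi)$ and that "normalized" means the responses are rescaled to integrate to one; both are assumptions, as is the function name.

```python
import numpy as np


def orientation_loss(projected_pdf, filter_response):
    """Squared difference between two 2D orientation distributions.

    projected_pdf:   (K,) projected distribution F sampled at K angles
    filter_response: (K,) non-negative raw filter responses at the same angles
    """
    k = projected_pdf.shape[0]
    d_eta = np.pi / k  # angular bin width
    # Rescale the raw responses so that they integrate to one over [0, pi).
    target = filter_response / (filter_response.sum() * d_eta)
    return float(np.sum((projected_pdf - target) ** 2) * d_eta)
```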

Volume Rendering of 3D Orientations

The paper introduces a novel approach to volume rendering of 3D orientations by expanding single 3D orientations, represented as polar angles, into distributions and performing alpha-blending on these distributions. For a 3D position $\mathbf{x}$ with polar angles $(\theta_{\mathbf{x}}, \phi_{\mathbf{x}})$, its distribution of 3D orientations $\mathcal{H}_{\mathbf{x}}$ is constructed using a predefined kernel function as its PDF $h_{\mathbf{x}}$:

$$ h_{\mathbf{x}}(\theta, \phi) = \frac{1}{C_{\mathbf{x}}} h'_{\mathbf{x}}(\theta, \phi), $$

$$ h'_{\mathbf{x}}(\theta, \phi) = \frac{1}{\beta(\|\theta - \theta_{\mathbf{x}}\|^2 + \|\phi - \phi_{\mathbf{x}}\|^2) + \delta}, $$

$$ C_{\mathbf{x}} = \iint_{0}^{\pi} h'_{\mathbf{x}}(\theta, \phi)\, \mathrm{d}\theta\, \mathrm{d}\phi. $$
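As a numerical illustration, the kernel can be evaluated on a regular angle grid and normalized with a Riemann sum. The values of `beta` and `delta`, the grid resolution, and the squared-difference form of the kernel are assumptions made for this sketch.

```python
import numpy as np


def orientation_pdf(theta_x, phi_x, beta=100.0, delta=1.0, n=181):
    """Discretized kernel PDF centered at the polar angles (theta_x, phi_x).

    Returns the theta grid, the phi grid, and an (n, n) array whose Riemann
    sum over the grid equals one.
    """
    theta = np.linspace(0.0, np.pi, n)
    phi = np.linspace(0.0, np.pi, n)
    t, p = np.meshgrid(theta, phi, indexing="ij")
    # Unnormalized kernel h': peaks at the center and decays with distance.
    h_unnorm = 1.0 / (beta * ((t - theta_x) ** 2 + (p - phi_x) ** 2) + delta)
    cell = (theta[1] - theta[0]) * (phi[1] - phi[0])
    c = h_unnorm.sum() * cell  # discrete stand-in for the normalizer C
    return theta, phi, h_unnorm / c
```

A larger `beta` concentrates the distribution around the predicted orientation, while `delta` keeps the kernel finite at its center.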

Neural Occupancy Field

The implicit hair volume further establishes neural occupancy fields by predicting the hair occupancy value $\rho_h$ and the body occupancy value $\rho_b$ at any given position. The continuous values of hair occupancy naturally align with the fact that hairs are semi-transparent in images. The occupancy values $\rho_*$ are accumulated using the standard volume rendering formula to give per-pixel labels $\psi_*$, which are supervised by pseudo ground truth (GT) segmentation labels $\bar{\psi}_*$:

$$ \mathcal{L}_\mathrm{occ} = \|\psi_h - \bar{\psi}_h\|^2 + \|\psi_b - \bar{\psi}_b\|^2. $$
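A short NumPy sketch of this supervision, assuming the compositing weights of each ray are already available and that the loss is summed over a batch of rays. The function names and the batching are illustrative.

```python
import numpy as np


def render_labels(weights, rho):
    """Accumulate per-sample occupancy into per-pixel labels.

    weights: (R, S) compositing weights for R rays with S samples each
    rho:     (R, S) occupancy predicted at every sample
    """
    return np.sum(weights * rho, axis=1)


def occupancy_loss(weights, rho_h, rho_b, psi_h_gt, psi_b_gt):
    """Squared error between rendered and pseudo-GT hair and body labels."""
    psi_h = render_labels(weights, rho_h)
    psi_b = render_labels(weights, rho_b)
    return float(np.sum((psi_h - psi_h_gt) ** 2) + np.sum((psi_b - psi_b_gt) ** 2))
```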

The GT masks do not need to be perfect; the segmentations estimated by the method can be more accurate than the pseudo-GT masks because multi-view information is implicitly integrated.

Training Strategy

The model undergoes a two-phase training process. Initially, only the feature and appearance networks are trained with the conventional L2 photometric loss. In the subsequent phase, the structure network is trained alone with the loss $100 \mathcal{L}_\mathrm{ori} + 0.02 \mathcal{L}_\mathrm{occ}$, while the other two modules are frozen. The reconstructed outer mesh is used to decide the depth sampling range of the rays, ensuring the model focuses exclusively on the hair volume.

Volumetric Hair Tracing

Once the hair volume is established, hair strands are extracted by tracing within the volume using the inferred volumetric orientation and occupancy with the forward Newton method. At timestep $k$, each strand is extended by a fixed length $l = 3\,\mathrm{mm}$ to a new point $\mathbf{p}_k = \mathbf{p}_{k-1} + l \cdot \mathrm{norm}(\mathbf{d}_k)$, where the growing direction $\mathbf{d}_k$ before normalization is calculated as:

$$ \mathbf{d}_k = \gamma \cdot \mathbf{d}_{k-1} + (1 - \gamma) \cdot \big(\mathrm{sign}(\mathbf{o} \cdot \mathbf{d}_{k-1}) \cdot \mathbf{o} + \lambda \min(\mathbf{n} \cdot \mathbf{d}_{k-1}, 0) \cdot \mathbf{n}\big). $$

Here $\mathbf{o}$ is the undirected orientation predicted by the hair volume, which the sign term flips to agree with the previous direction $\mathbf{d}_{k-1}$; the $\lambda$-weighted term along $\mathbf{n}$ is active only when $\mathbf{n} \cdot \mathbf{d}_{k-1} < 0$.

Tracing is initialized from seed points uniformly sampled within the bounding box volume between the inner and outer meshes. The seeds are organized into a priority queue weighted by the product of volume density and hair occupancy, $\sigma \cdot \rho_h$. A health value is monitored for each strand, and tracing stops when this value drops to 0. Volume hairs, traced from seeds in this step, are connected to the scalp by tracing additional scalp hairs from seeds sampled on the scalp region of the inner mesh. If a parting line is annotated for the hairstyle, all hairs crossing it are removed as a refinement step. The final output is a collection of $N_s$ strands $\mathcal{S} = \{\mathbf{s}_1, \mathbf{s}_2, ..., \mathbf{s}_{N_s}\}$, each resampled to $N_k = 100$ vertices, $\mathbf{s}_i \in \mathbb{R}^{N_k \times 3}$.

Gaussian-Based Strand Optimization

This stage uses direct supervision from the original images to recover lost fine details, ensuring a match to the captured imagery. The image-based differentiable rendering framework of 3D Gaussian splatting (3DGS) is used to optimize the reconstructed hairs with photometric losses. The method introduces a novel chained hair Gaussian formulation that constrains the relationships among Gaussians along each strand, aligning with the inherent geometric nature of hair.

Formulation of Chained Hair Gaussians

The optimization targets are the parameters of hair geometry, rather than the shape and appearance parameters of individual Gaussians. The elementary units of a strand are defined as line segments. For a strand of $N_k$ vertices, the segment between vertices $\mathbf{v}_i$ and $\mathbf{v}_{i+1}$ is characterized by the following parameters: head vertex $\mathbf{v}_i$, tail vertex $\mathbf{v}_{i+1}$, diameter $d$, opacity $o$, and spherical harmonics (SH) coefficients.
In the chained Gaussian representation, each segment is approximated by a Gaussian centered at the midpoint $(\mathbf{v}_i + \mathbf{v}_{i+1}) / 2$. The covariance matrix $C$ of this Gaussian is expressed as:

$C = E D D^T E^T.$

Here, $E = [\mathbf{e}_i, \mathbf{e}'_i, \mathbf{e}''_i]^T$ represents the principal axes of the Gaussian, where $\mathbf{e}_i$ is the unit direction vector of the segment $\mathbf{v}_{i+1} - \mathbf{v}_i$, and $\mathbf{e}'_i$ and $\mathbf{e}''_i$ are two unit vectors orthogonal to $\mathbf{e}_i$. The matrix $D = \mathrm{diag}[\tau_l, \tau_d, \tau_d]$ contains the scales of the axes, with $\tau_l = \|\mathbf{v}_{i+1} - \mathbf{v}_i\| / 2$ and $\tau_d = d / 2$ being the axial and radial scales, respectively. Auxiliary body Gaussians, anchored at the vertices of the inner mesh and modeled as discs with optimizable radii $w$, are incorporated to model the non-hair foreground, serving as proxies for occlusion.
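The mapping from a segment to its Gaussian can be sketched in a few lines of NumPy. The sketch assumes the three principal axes are stored as the columns of the matrix multiplied as $E D D^T E^T$, and it builds the two orthogonal axes with cross products against an arbitrary helper vector; the function name and these construction details are illustrative.

```python
import numpy as np


def segment_gaussian(v_head, v_tail, diameter):
    """Return the mean and 3x3 covariance of the Gaussian for one segment."""
    v_head = np.asarray(v_head, dtype=float)
    v_tail = np.asarray(v_tail, dtype=float)
    axis = v_tail - v_head
    length = np.linalg.norm(axis)
    e = axis / length  # unit direction of the segment

    # Pick a helper vector that is not parallel to e, then build two
    # unit vectors orthogonal to e and to each other.
    helper = np.array([1.0, 0.0, 0.0]) if abs(e[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(e, helper)
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(e, e1)

    basis = np.stack([e, e1, e2], axis=1)  # columns are the principal axes
    scales = np.diag([length / 2.0, diameter / 2.0, diameter / 2.0])
    cov = basis @ scales @ scales.T @ basis.T
    mean = (v_head + v_tail) / 2.0
    return mean, cov
```

Because the mean and covariance are functions of the two end vertices and the diameter only, gradients from the rendered image flow to the strand geometry rather than to free per-Gaussian parameters.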

Geometry Parameters

Instead of directly optimizing the positions of strand vertices, a low-dimensional latent vector is optimized for each strand. For each subject, a strand variational autoencoder (strand-VAE) is trained that encodes a latent code $\mathbf{z} \in \mathbb{R}^{128}$ from root-relative vertex positions $\mathbf{s}' \in \mathbb{R}^{(N_k - 1) \times 3}$.

Appearance Parameters

To limit the per-strand appearance degrees of freedom (DoF), the hair appearance parameters are simplified in four ways. View-dependent color variation is eliminated by reducing the SH degree to zero. Spatial color variation is limited by optimizing the color at only 8 segments (anchors) uniformly distributed along the strand, with the color of the other segments derived via piecewise linear interpolation. Segment diameters are likewise parameterized using 8 anchors. Each strand is restricted to 2 opacity values: $o_1$ for the first $N_k - N_t - 1$ segments starting from the root, and $o_2$ for the final $N_t = 8$ segments.
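The reduced parameterization can be expanded to per-segment values as sketched below. The anchor placement (first and last segment included, evenly spaced in between) is an assumption, and the function names are hypothetical.

```python
import numpy as np

N_K = 100  # vertices per strand, as stated in the text
N_T = 8    # number of tip segments that use the second opacity value


def expand_anchors(anchor_values, n_segments=N_K - 1):
    """Piecewise-linearly interpolate anchor values to every segment.

    anchor_values: (A,) scalars such as diameters, or (A, C) vectors such as colors
    """
    anchor_values = np.asarray(anchor_values, dtype=float)
    anchor_pos = np.linspace(0, n_segments - 1, len(anchor_values))
    seg_pos = np.arange(n_segments)
    if anchor_values.ndim == 1:
        return np.interp(seg_pos, anchor_pos, anchor_values)
    channels = [np.interp(seg_pos, anchor_pos, anchor_values[:, c])
                for c in range(anchor_values.shape[1])]
    return np.stack(channels, axis=1)


def segment_opacities(o1, o2, n_segments=N_K - 1, n_tip=N_T):
    """Assign o1 to the root-side segments and o2 to the last n_tip segments."""
    out = np.full(n_segments, o1, dtype=float)
    out[n_segments - n_tip:] = o2
    return out
```

With 8 color anchors, 8 diameter anchors, and 2 opacities, each strand carries 34 appearance scalars instead of independent values for all 99 segments.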

Adaptive Control of Hair Gaussians

During optimization, the strand distribution is adaptively controlled by periodically applying heuristic actions, including splitting and pruning. For each strand $\mathbf{s}_i$ with $N_k - 1$ segments, given the diameter $d_{i, j}$ and opacity $o_{i, j}$ of its $j$-th segment, the per-strand split score $\omega_i$ is computed as:

$\omega_i = \frac{\hat{\omega}_i}{\frac{1}{N_s} \sum_{i' = 1}^{N_s}\hat{\omega}_{i'}}, \quad \hat{\omega}_i = \sum_{j=1}^{N_k - 1} d_{i,j} \cdot o_{i,j} \mathrm{.}$

Strands are split into $\lceil \omega_i \rceil$ new strands, whose vertices are generated by randomly displacing the original positions within its diameter. Invisible strands are identified and pruned based on their opacity and color. Strands whose average vertex opacity falls below a threshold of 0.1 are removed, and strands whose average color is closer to the background color than to the average hair color are pruned.
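The two heuristics can be sketched as follows. The Euclidean color distance and the use of per-segment opacities for the average are assumptions; the 0.1 threshold comes from the text, and the function names are illustrative.

```python
import numpy as np


def split_counts(diameters, opacities):
    """Per-strand split score and number of strands to create.

    diameters, opacities: (N_s, N_k - 1) per-segment values
    """
    raw = np.sum(diameters * opacities, axis=1)  # unnormalized score
    score = raw / raw.mean()                     # normalize by the mean over strands
    return score, np.ceil(score).astype(int)


def prune_mask(opacities, colors, hair_color, background_color, threshold=0.1):
    """Boolean mask of strands to remove.

    opacities: (N_s, N_k - 1), colors: (N_s, N_k - 1, 3)
    """
    low_opacity = opacities.mean(axis=1) < threshold
    mean_color = colors.mean(axis=1)
    dist_bg = np.linalg.norm(mean_color - background_color, axis=1)
    dist_hair = np.linalg.norm(mean_color - hair_color, axis=1)
    return low_opacity | (dist_bg < dist_hair)
```

Thick, opaque strands get a score above one and are split into several thinner strands, while strands that have faded or drifted toward the background color are dropped.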

Training Objectives

The primary loss during optimization is the L2 photometric distance between rendered images and reference images, denoted as $\mathcal{L}_\mathrm{i}$ . Additional regularization terms include:

Volume Guidance Term:

$\mathcal{L}_\mathrm{n} = \frac{1}{N_k - 1} \sum_{i=1}^{N_k-1} \min(\|\mathbf{e}_i - \mathbf{o}_i\|, \|\mathbf{e}_i + \mathbf{o}_i\|),$

where $\mathbf{e}_i$ is the direction of the hair segment, and $\mathbf{o}_i$ is the undirectional 3D orientation prediction at $(\mathbf{v}_{i + 1} + \mathbf{v}_{i}) / 2$.

Penetration Prevention Term:

$\mathcal{L}_\mathrm{p} = \frac{1}{N_k} \sum_{i=1}^{N_k} \|\mathbf{v}_i - \tilde{\mathbf{v}}_i\|^2,$

where $\tilde{\mathbf{v}}_i$ is the nearest point on the inner mesh surface to $\mathbf{v}_i$.

Heuristic Terms:

Diameter term: $\mathcal{L}_d = \sum_{i=1}^{N_k - 1} |d_i| / (N_k - 1)$
Latent regularization term: $\mathcal{L}_z = \| \mathbf{z} - \hat{\mathbf{z}} \|$
Body radius term: $\mathcal{L}_b = \sum_{i=1}^{N_b} ||w_i - \hat{w}_i ||^2/N_b$

The overall training objective is:

$\mathcal{L} = \lambda_\mathrm{i}\mathcal{L}_\mathrm{i} + \lambda_\mathrm{n}\mathcal{L}_\mathrm{n} + \lambda_\mathrm{p}\mathcal{L}_\mathrm{p} + \lambda_\mathrm{d}\mathcal{L}_\mathrm{d} + \lambda_\mathrm{z}\mathcal{L}_\mathrm{z} + \lambda_\mathrm{b}\mathcal{L}_\mathrm{b},$

where the weights are set as described in the paper.
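The two geometric regularizers and the weighted sum can be sketched in NumPy for a single strand. The nearest surface points are assumed to be precomputed, the weights passed to `total_loss` are placeholders rather than the paper's values, and all function names are illustrative.

```python
import numpy as np


def volume_guidance(segment_dirs, predicted_orients):
    """Sign-invariant distance between segment directions and predicted orientations.

    Both inputs are (N_k - 1, 3) arrays of unit vectors.
    """
    d_minus = np.linalg.norm(segment_dirs - predicted_orients, axis=1)
    d_plus = np.linalg.norm(segment_dirs + predicted_orients, axis=1)
    return float(np.mean(np.minimum(d_minus, d_plus)))


def penetration(vertices, nearest_surface_points):
    """Mean squared distance from each vertex to its nearest inner-mesh point."""
    return float(np.mean(np.sum((vertices - nearest_surface_points) ** 2, axis=1)))


def total_loss(terms, weights):
    """Weighted sum of named loss terms; both arguments are dicts with the same keys."""
    return sum(weights[name] * value for name, value in terms.items())
```

Taking the minimum over the two signs in `volume_guidance` reflects that the predicted orientation is undirected, so a segment pointing the opposite way along the same line is not penalized.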

Experimental Results

Results on hairstyles captured in the studio (Figure 3) show that the method reconstructs diverse hairstyles beyond the coverage of any existing dataset, capturing personal details such as hairlines, fringes, and clusters. GroomCap handles short hairs and long ponytails using the same pipeline.

Figure 3: Reconstruction results on diverse hairstyles from short hairs to long ponytails, where personal features such as fringe, hairline, and clusters are faithfully captured. We use the same predefined material to better show geometric details.

GroomCap is compared with the state-of-the-art multi-view hair reconstruction methods MonoHair and Neural Haircut on the in-the-wild NHC dataset. Although the reconstructions on this dataset are inferior to those in the primary studio setting because of the imperfect inputs, they remain comparable to the concurrent work MonoHair and outperform the earlier Neural Haircut. Figure 4 shows the comparison.

Figure 4: Comparisons with existing hair reconstruction methods. We compare GroomCap with MonoHair [wu24monohair] and Neural Haircut [sklyarova23neural] on the in-the-wild NHC dataset, rendered with the same renderer. The rendering camera of the Neural Haircut results is manually adjusted to match the input.

Ablation Studies

Ablation studies for the implicit hair volume (Figure 5) show that supervising only the maximum angles results in locally over-smooth strands because non-maximum orientations are discarded, while blending 3D orientations by directly summing polar angles yields even worse results, as it is mathematically flawed. The estimated volume densities and 3D orientations match different hair layers and lead to correct hair intersections, which is crucial for avoiding over-smoothness during tracing.

Figure 5: Ablation studies for implicit hair volume. We show strands traced from different hair volumes, including full method (second column), 2D supervision of maximum orientations without keeping the distribution (third column), and directly alpha-blending 3D orientation angles without our rendering algorithm (fourth column). The results are either overly smoothed (third column) or contain incomplete and sparser strands (fourth column).

Ablation of the Gaussian-based optimization stage demonstrates that the optimization leads to improved hair boundaries, more uniform hair density, and more natural strand geometry.

Figure 6 visualizes 3D orientation predictions. Figure 7 shows ablation studies for Gaussian-based hair optimization. Figures 8 and 9 show additional ablations for Gaussian-based hair optimization and the strand latent space, respectively.

Figure 6: Visualization of 3D orientation predictions. On an example subject, we show a reference view (top-left) and the corresponding hair reconstruction (top-middle). In the reference view, we highlight a sample patch (the white square) where two intersecting wisps are accurately captured in the output. In the lower part of this figure, we plot voxel densities along the ray path at the center of the patch using a line chart. For each density peak, we visualize the corresponding predicted 3D orientation by drawing an arrow over the patch. The first peak represents the front hair wisp with a 3D orientation in camera space of [-0.59, -0.77, 0.23] and a 2D projection of $52.5^\circ$. Beginning at depth $1.75$ m (the fourth peak), the ray intersects the back layer of hair, with 2D projections ranging from $107^\circ$ to $130^\circ$. At the top-right, we visualize the accumulated 2D orientation distribution along the same ray at the patch center, identifying two peaks. The first peak at $48^\circ$ correlates to the front hairs, while the second peak at $143^\circ$ corresponds to the hair at the back.

Figure 7: Ablation studies for Gaussian-based hair optimization. In the second and third columns, we show hair models before and after optimization, respectively. The optimization effectively consolidates the hair boundary and enhances overall smoothness. In the fourth column, we show an initial hair model that is intentionally smoothed from the traced hairs to better highlight the difference brought by optimization. The fifth column demonstrates that, even from this smoothed initial hair, the optimization is capable of faithfully recovering detailed features. However, as shown in the sixth column, keeping the high degree-of-freedom parameters of the vanilla 3DGS leads to flattened strands, which underscores the importance of our tailored Gaussian parameters.

Figure 8: Additional ablations for Gaussian-based hair optimization. Each triplet shows the reference view (left), the result of our full method (middle), and the result of the ablated baseline (right). Top left: the hair without adaptive splitting suffers from worse coverage and wisp structures. Top right: optimization without adaptive pruning leads to excessively long strands. Bottom left: using a pre-trained prior strand-VAE leads to overly smoothed strands due to poor coverage of the synthetic data. Bottom right: regularization with the implicit hair volume helps enhance the hair structure.

Figure 9: Ablation study for the strand latent space. Optimization within the strand latent space of strand-VAE achieves globally consistent strand deformations that are smoothly regularized (top-middle and bottom-left). In contrast, replacing this latent space regularization with a strong smoothing term fails to prevent sharp turns in the strands (bottom-right), even when the overall hair is already overly smoothed (top-right).

Applications

The method reconstructs explicit hair geometry as a dense set of polyline curves. Compared to implicit representations, the reconstruction can be easily used in other applications, such as physically-based rendering (Figure 10), simulation (Figure 11), and hair editing (Figure 12).

Figure 10: Hair re-rendering application. We render the reconstructed hair geometry using different materials (row 1) and environment lightings (row 2) with a physically-based renderer.

Figure 11: Hair simulation application. In each image pair, we demonstrate the original captured hairs (left), and the hairs deformed with quasi-static simulation at a given head pose. The simulation is performed using the industrial software Houdini.

Figure 12: Hair editing application. We perform a haircut by keeping 80% (60%) of the original vertices for each hair strand.
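Because the output is an explicit set of polylines, the haircut edit reduces to array slicing. The sketch below assumes vertices are ordered from root to tip and that strands are stored as one array; the function name is hypothetical.

```python
import numpy as np


def cut_strands(strands, keep_ratio):
    """Keep the root-side fraction of vertices of every strand.

    strands:    (N_s, N_k, 3) strand vertices, ordered from root to tip
    keep_ratio: fraction of vertices to keep, e.g. 0.8 or 0.6
    """
    n_keep = max(2, int(round(strands.shape[1] * keep_ratio)))  # keep at least one segment
    return strands[:, :n_keep]
```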

Limitations

Dark hair appearances and extremely curly strands cause difficulties across all stages: retrieved orientations are noisy, traced strands appear messy, and optimization struggles to effectively enhance the hair quality. Integrating prior knowledge with the flexible prior-free capture pipeline represents a promising avenue for future research. Figure 13 shows failure cases for the method.

Figure 13: Failure cases. For the extremely complicated hairstyles, our method fails to capture all the small curls.

Conclusion

GroomCap presents a prior-free approach for capturing hair geometry from multi-view inputs. It bridges the gap between high-fidelity hair modeling and practical application needs. The success of GroomCap highlights its potential as a transformative tool in various scenarios where high-quality hair is desired.
