MVCHead: 3D Gaussian Head Avatar Generation
- MVCHead is a multi-view consistent 3D Gaussian head avatar generation method that constructs detailed head models from randomly sampled 2D images, without requiring multi-view data.
- It employs a hierarchical state space (HiSS) block with a novel HiBiSS mechanism and an SE(3) multi-view critic to enforce view consistency and refine anisotropic 3D Gaussian parameters.
- The model combines adversarial and critic losses under efficient, unpaired 2D supervision to achieve high-fidelity 3D head rendering, and is released together with the large-scale FaceGS-10K dataset.
MVCHead is a multi-view consistent 3D Gaussian head avatar generation architecture designed to produce high-fidelity 3D head models using only randomly sampled 2D images, explicitly eliminating the need for multi-view data, 3D supervision, or intermediate view synthesis. Leveraging a single-shot state space model, MVCHead directly enforces multi-view consistency within the 3D representation by regressing a set of anisotropic 3D Gaussians under strict structural constraints. Its central innovations include the hierarchical refinement of Gaussians through a Hierarchical State Space (HiSS) block—featuring the novel Hierarchical Bi-directional State Scan (HiBiSS) mechanism—and a dedicated SE(3) Multi-view Critic that scores view consistency in the absence of real multi-view pairs (Chharia et al., 24 May 2026).
1. Pipeline and Representation
The core pipeline of MVCHead takes as input a 512-dimensional Gaussian latent code $z$ and, during training, a set of camera poses sampled from a canonical front-hemisphere rig. The model outputs a set of anisotropic 3D Gaussians $\mathcal{G} = \{(\mu_i, s_i, q_i, \alpha_i, c_i)\}_{i=1}^{N}$, where $\mu_i$ is the center, $s_i$ the scales, $q_i$ a rotation quaternion, $\alpha_i$ the opacity, and $c_i$ the color. A differentiable renderer $\mathcal{R}$ maps this set and a given pose $\pi$ to an RGB image $I = \mathcal{R}(\mathcal{G}, \pi)$.
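To make the anisotropic parameterization concrete, the following minimal sketch builds the covariance of a single Gaussian from its scales and rotation quaternion, using the standard 3D Gaussian splatting relation $\Sigma = R S S^\top R^\top$. The class name `Gaussian3D`, the field names, and the `(w, x, y, z)` quaternion ordering are assumptions made for illustration and are not taken from the MVCHead code.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    """One anisotropic Gaussian (hypothetical container, not the authors' type)."""
    center: tuple   # (x, y, z)
    scales: tuple   # per-axis standard deviations (sx, sy, sz)
    quat: tuple     # rotation quaternion, assumed order (w, x, y, z)
    opacity: float
    color: tuple    # RGB


def quat_to_rotmat(q):
    """Convert a quaternion (normalized here) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g):
    """Return Sigma = (R S)(R S)^T, where S = diag(scales)."""
    R = quat_to_rotmat(g.quat)
    M = [[R[i][j] * g.scales[j] for j in range(3)] for i in range(3)]
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]
```

With an identity rotation the covariance is diagonal with the squared scales on the diagonal; a 90-degree rotation about the z-axis swaps the x and y variances.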
Processing follows a hierarchical sequence:
- Mapping Network: The latent code $z$ is mapped to an intermediate latent $w$ via an MLP, following a StyleGAN-like paradigm.
- Initial Token Scaffold: Learnable tokens 5 (with 6, 7) are lifted into a dense 8 feature grid 9 using multi-frequency positional encoding.
- Appearance Conditioning: At each resolution level $\ell$, the feature grid $F_\ell$ is modulated by AdaIN parameters $(\gamma_\ell, \beta_\ell)$ derived from $w$, facilitating disentanglement of geometry and appearance.
- Hierarchical State Space (HiSS) Blocks: A cascade of $L$ levels, each of which upsamples the Gaussian population and refines its parameters via dual mixing (self-attention and HiBiSS scans) and anchor-based per-attribute MLP heads.
- 3DGS Renderer: The concatenated set of all-level Gaussians $\mathcal{G}$ is rendered into $K$ image views under the camera poses $\{\pi_k\}_{k=1}^{K}$.
- Critics: Training uses an adversarial texture discriminator $D$ (camera-conditioned GAN loss) and an SE(3)-equivariant Multi-view Critic $C$ to ensure cross-view consistency.
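The pose-sampling step at the start of the pipeline can be illustrated with a short sketch that draws camera positions on the front hemisphere around a head placed at the origin and facing the +z axis. The function name `sample_front_hemisphere`, the yaw and pitch limits, and the uniform sampling scheme are all assumptions chosen for illustration; the source does not specify the rig's ranges, and camera orientation (look-at) is omitted here.

```python
import math
import random


def sample_front_hemisphere(k, radius=1.0, max_yaw_deg=90.0,
                            max_pitch_deg=30.0, seed=None):
    """Sample k camera positions in front of a head at the origin.

    Yaw and pitch limits are hypothetical defaults, not values from MVCHead.
    """
    rng = random.Random(seed)
    cams = []
    for _ in range(k):
        yaw = math.radians(rng.uniform(-max_yaw_deg, max_yaw_deg))
        pitch = math.radians(rng.uniform(-max_pitch_deg, max_pitch_deg))
        # Spherical-to-Cartesian conversion; +z is the frontal direction.
        x = radius * math.cos(pitch) * math.sin(yaw)
        y = radius * math.sin(pitch)
        z = radius * math.cos(pitch) * math.cos(yaw)
        cams.append((x, y, z))
    return cams
```

Every sampled position lies on a sphere of the given radius and has a non-negative z-coordinate, so all views see the face from the front.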
2. Hierarchical State Space (HiSS) and Parameter Regression
Each HiSS block processes the feature grid and doubles or quadruples the number of Gaussians at each level (upsample ratio $2\times$ to $4\times$). The dual-mixer structure operates as follows:
- Branch 1: Multi-Head Self-Attention (with MLP and LayerNorm), designed to aggregate global, non-axis-aligned dependencies for both shape and identity cues.
- Branch 2: The HiBiSS state-space mixer, enforcing local, axis-aligned consistency via dedicated recurrent mechanisms.
After mixing, the fused features are dispatched to per-attribute MLP heads $\{\mathrm{MLP}_\mu, \mathrm{MLP}_s, \mathrm{MLP}_q, \mathrm{MLP}_\alpha, \mathrm{MLP}_c\}$, each regressing offsets relative to a parent anchor:
0
where 1 denotes the parent anchor index and 2 is quaternion composition in tangent space, re-normalized. The generic 2D state-space recurrence within HiSS is:
3
with 4.
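The anchor-relative regression described in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulation: the child center is taken to be the parent center plus an additive offset, and the child rotation is the parent quaternion composed (Hamilton product) with a small rotation predicted as a tangent-space vector, followed by re-normalization. The function names and the `(w, x, y, z)` ordering are hypothetical.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def normalize(q):
    """Rescale a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def exp_map(v):
    """Map a tangent-space rotation vector (axis * angle) to a unit quaternion."""
    angle = math.sqrt(sum(c * c for c in v))
    if angle < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(angle / 2) / angle
    return (math.cos(angle / 2), v[0] * s, v[1] * s, v[2] * s)


def refine_child(parent_center, parent_quat, d_center, d_rot):
    """Apply regressed offsets to a parent anchor (assumed update rule)."""
    center = tuple(p + d for p, d in zip(parent_center, d_center))
    quat = normalize(quat_mul(parent_quat, exp_map(d_rot)))
    return center, quat
```

A zero rotation offset leaves the parent orientation unchanged, so a child Gaussian starts as a copy of its anchor and only departs from it as far as the heads predict.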
3. Hierarchical Bi-directional State Scan (HiBiSS)
HiBiSS extends the Mamba 1D scan into four axis-aligned 2D recurrences (left-to-right, right-to-left, top-to-bottom, and bottom-to-top), targeting the axes most affected by yaw- and pitch-induced view drift. For a feature grid $X$, the horizontal forward scan operates as:
7
Analogous recurrences apply for the other three directions, each with learned weights. After all four scans, outputs are fused as:
8
where 9 are learned fusion matrices. LayerNorm and FFN are applied before the outcome is merged with the attention branch.
This multi-directional, non-causal scheme aligns local 3D features efficiently and reduces multi-view inconsistencies along view-sensitive axes.
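The four-direction scan and fusion can be illustrated with a deliberately simplified sketch. It assumes a scalar state, fixed coefficients `a` and `b` shared by all directions, and scalar fusion weights; Mamba-style scans typically use input-dependent parameters and vector-valued states, and the source describes learned fusion matrices, so this is a toy instance of the idea rather than the HiBiSS implementation. The function names are hypothetical.

```python
def scan_1d(xs, a, b):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t, starting from h = 0."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def four_way_scan(grid, a=0.5, b=1.0, weights=(0.25, 0.25, 0.25, 0.25)):
    """Scan a 2D grid along four axis-aligned directions and fuse the results."""
    H, W = len(grid), len(grid[0])
    # Horizontal scans: left-to-right, then right-to-left via reversal.
    right = [scan_1d(row, a, b) for row in grid]
    left = [scan_1d(row[::-1], a, b)[::-1] for row in grid]
    # Vertical scans operate on columns, then are transposed back.
    cols = [[grid[i][j] for i in range(H)] for j in range(W)]
    down_c = [scan_1d(c, a, b) for c in cols]
    up_c = [scan_1d(c[::-1], a, b)[::-1] for c in cols]
    down = [[down_c[j][i] for j in range(W)] for i in range(H)]
    up = [[up_c[j][i] for j in range(W)] for i in range(H)]
    w_r, w_l, w_d, w_u = weights
    # Fusion: a weighted sum of the four directional outputs.
    return [[w_r * right[i][j] + w_l * left[i][j]
             + w_d * down[i][j] + w_u * up[i][j]
             for j in range(W)] for i in range(H)]
```

Because each cell receives state from both directions along both axes, its output depends on every cell in its row and column, which is what makes the combined scheme non-causal.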
4. SE(3) Multi-view Critic and Losses
The SE(3) Multi-view Critic $C$ evaluates a set of $K$ images and their corresponding poses for 3D consistency, returning a scalar logit:
$$C\big(\{(I_k, \pi_k)\}_{k=1}^{K}\big) \in \mathbb{R}$$
Its architecture comprises a ViT-style encoder (patch tokenization followed by self-attention) and a Geometric Transform Attention (GTA) module. In GTA, queries and keys are pre-rotated by relative extrinsic matrices in $SE(3)$ to ensure equivariance under rigid transformations and invariance to camera intrinsics. Output tokens are pooled by a small MLP into a scalar logit.
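The effect of pre-rotating queries and keys can be shown with a small rotation-only sketch. It uses 3D rotation matrices acting on 3-vector tokens as a stand-in for the full SE(3) extrinsics, so it is a simplification of GTA rather than the critic's implementation; all function names are hypothetical. Each token is rotated by the inverse of its own camera rotation before the dot product, which makes the attention score depend only on the relative rotation between the two cameras.

```python
import math


def rot_x(t):
    """Rotation by angle t about the x-axis."""
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_z(t):
    """Rotation by angle t about the z-axis."""
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply_transpose(R, v):
    """Compute R^T v, i.e. the inverse rotation applied to v."""
    return [sum(R[k][i] * v[k] for k in range(3)) for i in range(3)]


def gta_score(q, k, R_q, R_k):
    """Attention logit q^T (R_q R_k^T) k, computed by pre-rotating q and k."""
    q_rot = apply_transpose(R_q, q)
    k_rot = apply_transpose(R_k, k)
    return sum(a * b for a, b in zip(q_rot, k_rot))
```

Multiplying both camera rotations on the right by the same rotation `G` (a change of world frame) leaves `gta_score` unchanged, since `G` cancels in the product `R_q G G^T R_k^T`.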
The critic is pre-trained to distinguish positive sets (all $K$ renders from the same latent $z$) from negative sets (distinct latents, same poses) via the cross-entropy loss:
6
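The critic's pre-training objective can be sketched as a binary cross-entropy over scalar logits, with positive sets labeled 1 and negative sets labeled 0. The averaging over both groups and the function names are assumptions for illustration; the source states only that a cross-entropy loss is used.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def critic_bce(pos_logits, neg_logits):
    """Mean binary cross-entropy for consistent (pos) and inconsistent (neg) sets.

    Uses -log(sigmoid(s)) = softplus(-s) and -log(1 - sigmoid(s)) = softplus(s).
    """
    losses = [softplus(-s) for s in pos_logits]
    losses += [softplus(s) for s in neg_logits]
    return sum(losses) / len(losses)
```

An uninformative critic that outputs a logit of 0 for every set incurs a loss of `log(2)`, while confident and correct logits drive the loss toward zero.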
After convergence, the critic is frozen and used to formulate a differentiable reward for the generator:
7
GAN loss for per-view adversarial supervision is also imposed:
8
Regularization terms 9 (penalizing sparse regions) and 0 (penalizing centroid drift from parent anchors) further stabilize learning.
The total objective combines these components:
1
with typical values 2.
5. Training, Hyperparameters, and Inference
MVCHead uses the following hyperparameters and training schedule:
- Number of Gaussians: 3 in the initial block, upsampled by 4 per block over 5 levels, resulting in 6.
- Token grid: 7 at all levels, 8.
- Learning rates: Generator 9: 0, discriminator 1: 2, critic 3: 4, all with Adam (5).
- Training steps: 6 iterations on 7 NVIDIA H100 GPUs (83 days).
- Rendering resolution: 9.
- Critic views: 0 per set.
The high-level training and inference procedure is codified by the following pseudocode:
2
6. FaceGS-10K Dataset and Downstream Applications
The FaceGS-10K dataset, released with MVCHead, comprises 10,000 high-fidelity 3D Gaussian head assets, each with 240,000 Gaussians and 24 precomputed 1 renders covering the frontal hemisphere. These assets are intended for scalable supervision of future 3D head modeling research and for privacy-preserving digital avatar generation.
A plausible implication is that such a dataset, being produced fully automatically and without 3D or multi-view supervision, could catalyze further research in consistent 3D generation from unconstrained 2D sources. The dataset directly reflects the performance ceiling of the MVCHead architecture under unpaired 2D supervision alone (Chharia et al., 24 May 2026).
For implementation details, see the project page and supplementary code at https://humansensinglab.github.io/MVCHead/. The formulation, equations, and pipeline described above follow the architecture as documented by Chharia et al. (24 May 2026).