
MVCHead: 3D Gaussian Head Avatar Generation

Updated 31 May 2026
  • MVCHead is a multi-view consistent 3D Gaussian head avatar generation method that constructs detailed head models from randomly sampled 2D images without needing multi-view data.
  • It employs a hierarchical state space (HiSS) block with a novel HiBiSS mechanism and an SE(3) multi-view critic to enforce view consistency and refine anisotropic 3D Gaussian parameters.
  • The model integrates adversarial and critic losses on a large-scale FaceGS-10K dataset, achieving high-fidelity 3D head rendering via efficient, unpaired 2D supervision.

MVCHead is a multi-view consistent 3D Gaussian head avatar generation architecture designed to produce high-fidelity 3D head models using only randomly sampled 2D images, explicitly eliminating the need for multi-view data, 3D supervision, or intermediate view synthesis. Leveraging a single-shot state space model, MVCHead directly enforces multi-view consistency within the 3D representation by regressing a set of anisotropic 3D Gaussians under strict structural constraints. Its central innovations include the hierarchical refinement of Gaussians through a Hierarchical State Space (HiSS) block—featuring the novel Hierarchical Bi-directional State Scan (HiBiSS) mechanism—and a dedicated SE(3) Multi-view Critic that scores view consistency in the absence of real multi-view pairs (Chharia et al., 24 May 2026).

1. Pipeline and Representation

The core pipeline of MVCHead takes as input a 512-dimensional Gaussian latent code $z \sim N(0, I)$ and, during training, a set of $K$ camera poses $\{T_k\}_{k=1}^{K}$ sampled from a canonical front-hemisphere rig. The model outputs a set of $N = 240{,}000$ anisotropic 3D Gaussians, $S_\theta(z) = \{g_i = (\mu_i, s_i, q_i, \alpha_i, c_i)\}_{i=1}^N$, where $\mu_i \in \mathbb{R}^3$ is the center, $s_i \in \mathbb{R}_+^3$ the scales, $q_i \in \mathbb{H}$ a rotation quaternion, $\alpha_i \in (0,1)$ the opacity, and $c_i \in [0,1]^3$ the color. A differentiable renderer maps this set and a given pose $T_k$ to an RGB image.
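The value ranges listed above can be illustrated with a minimal sketch that maps unconstrained network outputs to one valid Gaussian. The 14-value layout, the `Gaussian3D` and `constrain` names, and the choice of softplus, sigmoid, and quaternion normalization are assumptions made for illustration; the source does not state which activations MVCHead uses.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    mu: tuple      # center in R^3
    scale: tuple   # strictly positive per-axis scales
    quat: tuple    # unit rotation quaternion (w, x, y, z)
    alpha: float   # opacity in (0, 1)
    color: tuple   # RGB in [0, 1]^3


def softplus(x):
    # Numerically stable softplus; maps any real number to a positive value.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def constrain(raw):
    """Map 14 unconstrained reals to one valid Gaussian (hypothetical layout)."""
    mu = tuple(raw[0:3])
    scale = tuple(softplus(v) for v in raw[3:6])
    q = raw[6:10]
    norm = math.sqrt(sum(v * v for v in q))
    # Fall back to the identity rotation if the raw quaternion is all zeros.
    quat = tuple(v / norm for v in q) if norm > 0.0 else (1.0, 0.0, 0.0, 0.0)
    alpha = sigmoid(raw[10])
    color = tuple(sigmoid(v) for v in raw[11:14])
    return Gaussian3D(mu, scale, quat, alpha, color)
```

Whatever activations are actually used, the point of such a mapping is that every regressed Gaussian stays inside the domains given in the definition, so the renderer never receives negative scales or non-unit rotations.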

Processing follows a hierarchical sequence:

  1. Mapping Network: The latent code $z$ is mapped to an intermediate latent via an MLP, following a StyleGAN-like paradigm.
  2. Initial Token Scaffold: Learnable tokens are lifted into a dense feature grid using multi-frequency positional encoding.
  3. Appearance Conditioning: At each resolution level, the features are modulated by AdaIN parameters derived from the mapped latent, facilitating disentanglement of geometry and appearance.
  4. Hierarchical State Space (HiSS) Blocks: A cascade of levels, each of which upsamples the Gaussian population and refines parameters via dual mixing (self-attention and HiBiSS scans) and anchor-based per-attribute MLP heads.
  5. 3DGS Renderer: The concatenated set of Gaussians from all levels is rendered into $K$ image views under the camera poses $\{T_k\}_{k=1}^{K}$.
  6. Critics: Training uses an adversarial texture discriminator (camera-conditioned GAN loss) and an SE(3)-equivariant Multi-view Critic to ensure cross-view consistency.
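The appearance-conditioning step (step 3) relies on AdaIN, which can be sketched generically as follows. This is the standard adaptive instance normalization operation, not MVCHead's specific layer; the function name `adain` and the flat per-channel layout are assumptions, and the scale and shift values are passed in directly rather than predicted from a latent.

```python
import math


def adain(features, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over a list of channels.

    features: list of channels, each a flat list of values over the grid.
    gamma, beta: per-channel scale and shift (in the pipeline these would be
    derived from the mapped latent; here they are plain inputs).
    """
    out = []
    for chan, g, b in zip(features, gamma, beta):
        mean = sum(chan) / len(chan)
        var = sum((v - mean) ** 2 for v in chan) / len(chan)
        inv_std = 1.0 / math.sqrt(var + eps)
        # Normalize each channel, then apply the style-dependent affine map.
        out.append([g * (v - mean) * inv_std + b for v in chan])
    return out
```

After the call, each channel has mean `beta` and standard deviation close to `gamma`, so the latent controls feature statistics without touching the spatial layout of the grid.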

2. Hierarchical State Space (HiSS) and Parameter Regression

Each HiSS block processes the feature grid, doubling or quadrupling the number of Gaussians at each level. The dual-mixer structure operates as follows:

  • Branch 1: Multi-Head Self-Attention (with MLP and LayerNorm), designed to aggregate global, non-axis-aligned dependencies for both shape and identity cues.
  • Branch 2: The HiBiSS state-space mixer, enforcing local, axis-aligned consistency via dedicated recurrent mechanisms.

After mixing, the fused features are dispatched to per-attribute MLP heads ($\mathrm{MLP}_\mu$, $\mathrm{MLP}_s$, $\mathrm{MLP}_q$, $\mathrm{MLP}_\alpha$, $\mathrm{MLP}_c$), each regressing an offset for its attribute relative to the Gaussian's parent anchor. Rotation offsets are applied by quaternion composition in tangent space, followed by re-normalization. The state-space branch within HiSS is built on a generic 2D state-space recurrence over the feature grid.
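A minimal sketch of an anchor-relative update is given below for the center and rotation attributes. It assumes that the rotation offset is a 3-vector in tangent space (axis times angle), converted to a unit quaternion by the exponential map and composed with the parent rotation; this reading, the additive center update, and the names `exp_map` and `refine_child` are assumptions, not the paper's stated formulas.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def exp_map(v):
    """Map a tangent-space rotation vector (axis * angle) to a unit quaternion."""
    angle = math.sqrt(sum(c * c for c in v))
    if angle < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(angle / 2.0) / angle
    return (math.cos(angle / 2.0), v[0] * s, v[1] * s, v[2] * s)


def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def refine_child(parent_mu, parent_quat, d_mu, d_rot):
    """Child attributes = parent anchor plus regressed offsets (assumed rule)."""
    mu = tuple(p + d for p, d in zip(parent_mu, d_mu))
    # Compose the offset rotation with the parent, then re-normalize to
    # remove accumulated floating-point drift from the unit sphere.
    quat = normalize(quat_mul(exp_map(d_rot), parent_quat))
    return mu, quat
```

Regressing small offsets from a parent anchor keeps child Gaussians near the coarse structure established at earlier levels, which is consistent with the centroid-drift regularizer described in the losses section.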

3. Hierarchical Bi-directional State Scan (HiBiSS)

HiBiSS extends the Mamba 1D scan into four axis-aligned 2D recurrences (horizontal forward and backward, vertical forward and backward), targeting the axes most affected by yaw/pitch-induced view drift. The horizontal forward scan runs a state-space recurrence along each row of the feature grid. Analogous recurrences apply for the other three directions, each with its own learned weights. After all four scans, the outputs are combined through learned fusion matrices. LayerNorm and an FFN are then applied before the result is merged with the attention branch.

This multi-directional, non-causal scheme aligns local 3D features efficiently, reducing multi-view inconsistencies along view-sensitive axes.
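The idea of four axis-aligned scans can be sketched on a single-channel grid as follows. This is a deliberately simplified illustration: the state is a scalar, the recurrence is the generic linear form $h_t = a\,h_{t-1} + b\,x_t$, $y_t = c\,h_t$, and scalar weights stand in for the learned fusion matrices. All names and the parameterization are assumptions, and the input-dependent (selective) parameters of Mamba are omitted.

```python
def scan_1d(xs, a, b, c):
    """Linear state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def scan_rows(grid, params, reverse=False):
    """Scan every row left-to-right, or right-to-left when reverse=True."""
    out = []
    for row in grid:
        seq = row[::-1] if reverse else row
        ys = scan_1d(seq, *params)
        out.append(ys[::-1] if reverse else ys)
    return out


def four_direction_scan(grid, params, weights):
    """params and weights are dicts keyed by 'right', 'left', 'down', 'up'."""
    gt = transpose(grid)  # columns become rows for the vertical scans
    outs = {
        "right": scan_rows(grid, params["right"]),
        "left": scan_rows(grid, params["left"], reverse=True),
        "down": transpose(scan_rows(gt, params["down"])),
        "up": transpose(scan_rows(gt, params["up"], reverse=True)),
    }
    h, w = len(grid), len(grid[0])
    # Weighted sum stands in for the learned fusion of the four outputs.
    return [[sum(weights[k] * outs[k][i][j] for k in outs) for j in range(w)]
            for i in range(h)]
```

Feeding an impulse at the center of a grid shows the non-causal behavior: a single forward scan only influences cells after the impulse, whereas the fused output spreads symmetrically along both the row and the column.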

4. SE(3) Multi-view Critic and Losses

The SE(3) Multi-view Critic evaluates a $K$-tuple of images together with their corresponding camera poses and scores the set for 3D consistency.

Its architecture comprises a ViT-style encoder (patch tokenization and self-attention) and a Geometric Transform Attention (GTA) module. In GTA, queries and keys are pre-rotated by the relative extrinsic matrices between views, which are elements of SE(3), to ensure equivariance under rigid transformations and invariance to camera intrinsics. Output tokens are pooled by a small MLP into a scalar logit.
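The reason relative extrinsics are a natural input for such a module can be checked numerically: applying the same rigid transform to every camera pose leaves all relative transforms unchanged. The sketch below assumes the relative transform between views $i$ and $j$ is formed as $T_i^{-1} T_j$ on 4x4 pose matrices; that convention and the helper names are assumptions for illustration, not taken from the source.

```python
import math


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def se3_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def yaw_pose(angle, translation):
    """Rigid transform: rotation about the y axis plus a translation."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [[c, 0.0, s, tx],
            [0.0, 1.0, 0.0, ty],
            [-s, 0.0, c, tz],
            [0.0, 0.0, 0.0, 1.0]]


def relative_pose(t_i, t_j):
    """Relative transform between two views (assumed convention)."""
    return matmul(se3_inverse(t_i), t_j)
```

For a global transform `g`, `relative_pose(matmul(g, t_i), matmul(g, t_j))` equals `relative_pose(t_i, t_j)` up to floating-point error, because the two copies of `g` cancel. A critic that only sees relative transforms therefore cannot depend on the arbitrary choice of world frame.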

The critic is pre-trained to distinguish positive sets (all $K$ renders from the same latent $z$) from negative sets (renders from distinct latents under the same poses) via a cross-entropy loss.
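The construction of positive and negative sets, together with a binary cross-entropy on the critic's logit, can be sketched as follows. The `render` callable, the `make_set` helper, the integer latent identifiers, and the choice to make all latents in a negative set pairwise distinct are hypothetical stand-ins for illustration; the source does not specify these details.

```python
import math
import random


def bce_with_logit(logit, label):
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def make_set(render, poses, rng, num_latents, positive):
    """Build one K-view training set and its label.

    render(latent_id, pose) is a hypothetical stand-in for generating and
    rendering a head; poses are shared between positive and negative sets.
    """
    if positive:
        # Same latent for every view: a consistent multi-view set.
        ids = [rng.randrange(num_latents)] * len(poses)
    else:
        # A different latent per view under the same poses (assumed rule).
        ids = rng.sample(range(num_latents), len(poses))
    views = [render(i, p) for i, p in zip(ids, poses)]
    return views, (1.0 if positive else 0.0)
```

Because the poses are identical in both kinds of set, the critic cannot separate them by camera layout alone and has to rely on whether the views depict one coherent 3D head.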

After convergence, the critic is frozen and used to formulate a differentiable reward for the generator.

A GAN loss is also imposed for per-view adversarial supervision.

Two regularization terms, one penalizing sparse regions and one penalizing centroid drift from parent anchors, further stabilize learning.

The total objective is a weighted combination of these components.

5. Training, Hyperparameters, and Inference

The configuration reported by Chharia et al. (24 May 2026) covers the following hyperparameters and training schedule:

  • Number of Gaussians: an initial population in the first block, upsampled at each block over the hierarchy levels, resulting in $N = 240{,}000$.
  • Token grid: the same grid resolution at all levels.
  • Learning rates: separate rates for the generator, the discriminator, and the critic, all optimized with Adam.
  • Training steps: a fixed iteration budget on NVIDIA H100 GPUs (approximately 3 days).
  • Rendering resolution: a single fixed resolution.
  • Critic views: $K$ per set.

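At a high level, training proceeds in two stages. The critic is first pre-trained on positive and negative render sets and then frozen. The generator is subsequently trained with the per-view GAN loss, the critic reward, and the regularization terms. At inference, a latent code $z$ is sampled and mapped to a set of 3D Gaussians, which can then be rendered under a given camera pose.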

6. FaceGS-10K Dataset and Downstream Applications

The FaceGS-10K dataset, released with MVCHead, comprises 10,000 high-fidelity 3D Gaussian head assets, each with 240,000 Gaussians and 24 precomputed renders covering the frontal hemisphere. These assets are intended for scalable supervision of future 3D head modeling research and for privacy-preserving digital avatar generation.
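One way to lay out 24 views over a frontal hemisphere is a regular yaw/pitch grid of cameras looking at the head center, sketched below. The 6 by 4 layout, the radius, the angular limits, and the function name `hemisphere_rig` are all hypothetical; the source states only that there are 24 renders covering the frontal hemisphere.

```python
import math


def hemisphere_rig(n_yaw=6, n_pitch=4, radius=2.5,
                   max_yaw=math.radians(75), max_pitch=math.radians(30)):
    """Camera centers on the front (+z) hemisphere, all looking at the origin.

    Every default value here is an illustrative assumption.
    """
    cams = []
    for i in range(n_pitch):
        pitch = -max_pitch + 2.0 * max_pitch * i / (n_pitch - 1)
        for j in range(n_yaw):
            yaw = -max_yaw + 2.0 * max_yaw * j / (n_yaw - 1)
            pos = (radius * math.cos(pitch) * math.sin(yaw),
                   radius * math.sin(pitch),
                   radius * math.cos(pitch) * math.cos(yaw))
            # Unit viewing direction from the camera toward the origin.
            forward = tuple(-c / radius for c in pos)
            cams.append({"yaw": yaw, "pitch": pitch,
                         "position": pos, "forward": forward})
    return cams
```

With the defaults, `hemisphere_rig()` returns 24 cameras, all with positive z, at equal distance from the origin.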

A plausible implication is that such a dataset, being produced fully automatically and without 3D or multi-view supervision, could catalyze further research in consistent 3D generation from unconstrained 2D sources. The dataset directly reflects the performance ceiling of the MVCHead architecture under unpaired 2D supervision alone (Chharia et al., 24 May 2026).


For detailed implementation, refer to the project repository and supplementary code provided at https://humansensinglab.github.io/MVCHead/. The complete methodological formulation, equations, and pipeline described above reflect the architecture as documented in (Chharia et al., 24 May 2026).
