Hierarchical Bi-directional State Scan (HiBiSS)
- The paper’s main contribution is the introduction of HiBiSS, a bi-directional SSM scan that overcomes unidirectional limitations by enforcing axis-aligned smoothing.
- HiBiSS employs four coupled directional recurrences within each HiSS block to fuse self-attention outputs with state-space mixers, ensuring multi-view consistency.
- In ablations on 3D head regression, HiBiSS improves both FID (3.94 vs. 4.78) and the MEt3R multi-view error (0.2620 vs. 0.2873) relative to a unidirectional scan variant.
Hierarchical Bi-directional State Scan (HiBiSS) is a specialized State Space Model (SSM) scan architecture introduced in the context of single-shot 3D Gaussian head avatar regression, as developed in the MVCHead framework for multi-view-consistent 3D generative modeling without the use of multi-view supervision or intermediate view synthesis (Chharia et al., 24 May 2026). HiBiSS constitutes the principal architectural innovation within each Hierarchical State Space (HiSS) block, systematically addressing both the spatial anisotropies and directional dependencies associated with multi-view consistency.
1. Motivation and Core Principles
HiBiSS was designed to resolve the limitations of unidirectional recurrent scans, such as those originally adopted in the Mamba SSM architecture, which are restricted to causal left-to-right propagation. In the domain of 3D head regression, this restriction inhibits the communication of information along the vertical axis, resulting in insufficient integration of global context and suboptimal handling of multi-view inconsistencies, particularly yaw-induced horizontal drift and pitch-induced vertical drift. By introducing coupled, bi-directional 2D recurrences explicitly aligned with these axes (rightward, leftward, downward, upward), HiBiSS enforces axis-aligned smoothing and cross-row/column coherence to directly attenuate the principal directions of view-dependent drift.
2. HiSS Block Architecture and HiBiSS Integration
Each HiSS block operates at a specific resolution level in a coarse-to-fine hierarchy, processing an H×W×d feature grid. Two parallel feature mixers, Self-Attention + MLP and the State-Space Mixer (HiBiSS), process the same input. Their outputs are then fused, typically via summation or via concatenation followed by a linear transformation. Per-attribute MLP heads then regress Gaussian parameter offsets relative to coarser-level anchors. The pipeline for each HiSS block is therefore parallel mixing, fusion, and per-attribute offset regression.
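The block-level pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MVCHead implementation: the function `hiss_block`, its arguments, and the toy mixers are hypothetical names; fusion uses plain summation (one of the two options mentioned); and the refined attribute is assumed to be the coarse anchor plus the predicted offset.

```python
def hiss_block(tokens, attention_mixer, state_space_mixer, heads, anchors):
    """Hypothetical sketch of one HiSS block.

    tokens: list of feature vectors (each a list of floats).
    attention_mixer, state_space_mixer: callables mapping tokens to tokens.
    heads: dict of attribute name -> callable mapping a feature vector to an offset.
    anchors: dict of attribute name -> list of coarse-level anchor values.
    """
    # Both mixers see the same input and run independently.
    attn_out = attention_mixer(tokens)
    ssm_out = state_space_mixer(tokens)

    # Fusion by elementwise summation (assumed; concatenation + linear is the alternative).
    fused = [[a + s for a, s in zip(ta, ts)] for ta, ts in zip(attn_out, ssm_out)]

    # Each per-attribute head predicts an offset that refines the coarse anchor (assumed additive).
    refined = {}
    for name, head in heads.items():
        refined[name] = [anchor + head(feat) for anchor, feat in zip(anchors[name], fused)]
    return refined


# Toy usage with stand-in mixers and a single hypothetical "opacity" head.
tokens = [[1.0, 2.0], [3.0, 4.0]]
identity = lambda toks: [list(t) for t in toks]
doubling = lambda toks: [[2.0 * v for v in t] for t in toks]
out = hiss_block(
    tokens,
    attention_mixer=identity,
    state_space_mixer=doubling,
    heads={"opacity": lambda feat: sum(feat)},
    anchors={"opacity": [0.5, -1.0]},
)
```

In a real model the mixers would be learned modules and the heads small MLPs; the point here is only the order of operations.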
3. Hierarchical Bi-directional Scan Algorithm and Mathematical Formulation
HiBiSS executes four coupled SSM scans per block, one for each direction (→, ←, ↓, ↑), and maintains a separate hidden-state tensor for each. The update and output equations are defined per direction:
Horizontal Forward (→):
- Hidden state:
- Output:
Vertical Forward (↓):
- Hidden state:
- Output:
Directional Output Fusion:
- Aggregate the outputs:
All directions are scanned and fused for each spatial location, ensuring full axis-aligned context propagation.
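One plausible form of these recurrences is given below, assuming a standard discrete linear state-space update applied along each axis. The symbols $x_{i,j}$ (input feature at row $i$, column $j$), $h$ (hidden state), $y$ (output), the shared matrices $A$, $B$, $C$, and the fusion weights $w_k$ are all assumptions made for illustration; they are not notation confirmed by the source, and any additional coupling between directions is not modeled here.

```latex
\begin{align}
h^{\rightarrow}_{i,j} &= A\,h^{\rightarrow}_{i,j-1} + B\,x_{i,j}, & y^{\rightarrow}_{i,j} &= C\,h^{\rightarrow}_{i,j} \\
h^{\leftarrow}_{i,j}  &= A\,h^{\leftarrow}_{i,j+1}  + B\,x_{i,j}, & y^{\leftarrow}_{i,j}  &= C\,h^{\leftarrow}_{i,j} \\
h^{\downarrow}_{i,j}  &= A\,h^{\downarrow}_{i-1,j}  + B\,x_{i,j}, & y^{\downarrow}_{i,j}  &= C\,h^{\downarrow}_{i,j} \\
h^{\uparrow}_{i,j}    &= A\,h^{\uparrow}_{i+1,j}    + B\,x_{i,j}, & y^{\uparrow}_{i,j}    &= C\,h^{\uparrow}_{i,j} \\
y_{i,j} &= \sum_{k \in \{\rightarrow,\leftarrow,\downarrow,\uparrow\}} w_k\, y^{k}_{i,j}
\end{align}
```

Under this reading, the horizontal pair propagates context along each row and the vertical pair along each column, so every location receives information from its full row and column after fusion.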
4. Data-Flow and Implementation Workflow
The data flow through each HiBiSS block is structured as follows:
- Grid Projection: A linear projection maps the input token matrix to the H×W×d feature grid.
- Directional SSM Scans: The four directional scans (→, ←, ↓, ↑) are performed on this grid, producing one output map per direction.
- Directional Fusion: The four outputs are fused pointwise into a single feature map.
- Token Layout Restoration: The fused map is projected back to the token layout, residual-added to the block input, and layer-normalized.
- Feed-forward Output: The normalized result is fed to downstream MLP, attention heads, or subsequent HiSS blocks.
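The workflow above can be sketched in pure Python on a single-channel grid. This is a simplified illustration, not the paper's implementation: it assumes a scalar state with hypothetical parameters `a`, `b`, `c` shared by all directions, runs the four scans independently (no cross-direction coupling), fuses by averaging, and replaces the learned grid projections with a plain reshape. All function names are invented for this example.

```python
import math


def scan_1d(xs, a, b, c):
    """Causal scalar recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def scan_right(grid, a, b, c):
    return [scan_1d(row, a, b, c) for row in grid]


def scan_left(grid, a, b, c):
    # Reverse each row, scan causally, then restore the original order.
    return [scan_1d(row[::-1], a, b, c)[::-1] for row in grid]


def scan_down(grid, a, b, c):
    return transpose(scan_right(transpose(grid), a, b, c))


def scan_up(grid, a, b, c):
    return transpose(scan_left(transpose(grid), a, b, c))


def four_way_scan(grid, a=0.5, b=1.0, c=1.0):
    """Run all four directional scans and fuse them pointwise (mean, assumed)."""
    outs = [f(grid, a, b, c) for f in (scan_right, scan_left, scan_down, scan_up)]
    rows, cols = len(grid), len(grid[0])
    return [[sum(o[i][j] for o in outs) / 4.0 for j in range(cols)] for i in range(rows)]


def layer_norm(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def hibiss_block(tokens, rows, cols):
    """tokens: flat list of rows*cols scalars (one channel, for illustration)."""
    grid = [tokens[i * cols:(i + 1) * cols] for i in range(rows)]  # token layout -> grid
    fused = four_way_scan(grid)                                    # scans + fusion
    flat = [v for row in fused for v in row]                       # grid -> token layout
    return layer_norm([t + f for t, f in zip(tokens, flat)])       # residual + layer norm


# An impulse at the centre of a 3x3 grid shows where each scan carries information.
impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
fused = four_way_scan(impulse)
right_only = scan_right(impulse, 0.5, 1.0, 1.0)
normalized = hibiss_block([v for row in impulse for v in row], 3, 3)
```

With the impulse input, the rightward scan alone leaves the cell to the left of the centre at zero, whereas the fused four-way output is non-zero at all four axis-aligned neighbours. This mirrors the motivation given earlier: a single causal scan cannot carry information backward along a row or along the vertical axis.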
5. Training Paradigm and Interoperation with MVCHead
HiBiSS operates at every resolution level of the HiSS hierarchy on fixed spatial grids (e.g., 32×32 or 64×64). The hidden dimension typically ranges from 256 to 512. The structured SSM kernels employ diagonal-plus-low-rank “HiPPO” parameterizations for computational efficiency. Running all four directional scans approximately quadruples the cost relative to a single SSM scan, resulting in a reported runtime of 1–2 ms per block on an H100 GPU.
HiBiSS is jointly trained with the SE(3) Multi-view Critic, which evaluates the rendered consistency of the regressed 3D Gaussians across sampled SE(3) transformations and supplies a multi-view consistency loss.
The overall objective blends this multi-view consistency loss with an adversarial loss, a KNN shape loss, and additional contrastive regularization.
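As a hedged illustration of how such a blended objective is commonly written, the four terms can be combined as a weighted sum. The weights $\lambda_{\mathrm{mv}}$, $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{knn}}$, $\lambda_{\mathrm{con}}$ and the subscript names are assumptions introduced here; the source does not give the exact form or the weight values.

```latex
\begin{equation}
\mathcal{L}_{\mathrm{total}}
  = \lambda_{\mathrm{mv}}\,\mathcal{L}_{\mathrm{mv}}
  + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}
  + \lambda_{\mathrm{knn}}\,\mathcal{L}_{\mathrm{knn}}
  + \lambda_{\mathrm{con}}\,\mathcal{L}_{\mathrm{con}}
\end{equation}
```

Here $\mathcal{L}_{\mathrm{mv}}$ stands for the critic-derived multi-view consistency term, and the remaining terms stand for the adversarial, KNN shape, and contrastive losses respectively.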
During inference, only HiBiSS forward passes are executed, ensuring real-time single-shot 3D Gaussian regression.
6. Comparative Evaluation and Empirical Results
Ablation studies underscore HiBiSS’s efficacy in enforcing multi-view geometric consistency and image realism. On the FFHQ-C 512×512 benchmark, the following Fréchet Inception Distance (FID) and MEt3R multi-view consistency error values were reported (lower is better for both):
| Model Variant | FID | MEt3R |
|---|---|---|
| MVCHead (HiBiSS) | 3.94 | 0.2620 |
| – w/o HiBiSS (unidirectional scan) | 4.78 | 0.2873 |
| – w/o entire HiSS state-space (no SS2D) | 5.28 | 0.2948 |
These results indicate that the axis-aligned, bidirectional recurrence imposed by HiBiSS confers measurable advantages in both visual fidelity and multi-view geometric consistency over strictly unidirectional or SSM-absent variants.
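To make the size of these gaps concrete, the relative reductions can be computed directly from the table. The snippet below is plain arithmetic on the reported numbers; the helper name `relative_reduction` and the dictionary keys are chosen here for illustration.

```python
# Reported (FID, MEt3R) per variant, copied from the ablation table.
results = {
    "MVCHead (HiBiSS)": (3.94, 0.2620),
    "w/o HiBiSS (unidirectional scan)": (4.78, 0.2873),
    "w/o HiSS state-space (no SS2D)": (5.28, 0.2948),
}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


full_fid, full_met3r = results["MVCHead (HiBiSS)"]

# Reduction achieved by the full model relative to each ablated variant.
reductions = {
    name: (relative_reduction(fid, full_fid), relative_reduction(met3r, full_met3r))
    for name, (fid, met3r) in results.items()
    if name != "MVCHead (HiBiSS)"
}

for name, (fid_pct, met3r_pct) in reductions.items():
    print(f"{name}: FID -{fid_pct:.1f}%, MEt3R -{met3r_pct:.1f}%")
```

Relative to the unidirectional scan, the full model lowers FID by about 17.6% and MEt3R by about 8.8%; relative to the variant without the state-space mixer, the reductions are about 25.4% and 11.1%.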
7. Architectural Significance and Interactions
HiBiSS distinguishes itself from the original Mamba SSM scan—which is limited to left-to-right (causal) recurrence—by guaranteeing both horizontal and vertical propagation through four coupled recurrences with shared parameter structures. The state-space machinery coexists with parallel self-attention mixers within the HiSS block, each targeting complementary modeling objectives: HiBiSS enforces axis-aligned smoothness and pose-aware feature fusion, whereas self-attention captures global facial semantics. Gradients from the SE(3) Multi-view Critic propagate through the renderer, regressed Gaussian parameters, and the HiBiSS blocks, biasing SSM kernel parameters toward axis-aligned smoothness that reduces cross-view inconsistencies. A plausible implication is that this synergy yields robust pose-aware anisotropic smoothing without requiring explicit multi-view supervision or intermediate 2D view generation.
For implementation details, explicit pseudocode, ablation protocols, and released datasets, see the original MVCHead publication (Chharia et al., 24 May 2026).