
SRFlowNet: Splatting Rasterization Guided Flow

Updated 17 January 2026
  • The paper introduces a novel training guidance mechanism using Gaussian splatting rasterization to generate pixel-accurate, view-dependent facial optical flow supervision.
  • The paper employs the SKFlow backbone integrated with facial-specific regularization losses, improving flow estimation by suppressing noise and reducing large-scale errors.
  • The paper demonstrates state-of-the-art performance in both optical flow accuracy and micro-expression recognition, validated on a high-resolution, multi-view SRFlow dataset.

Splatting Rasterization Guided FlowNet (SRFlowNet) is a facial optical flow model designed to estimate high-resolution, fine-grained facial motion from video frames by leveraging supervision derived from 3D Gaussian splatting rasterization. Built upon the SKFlow backbone, SRFlowNet introduces a suite of facial-specific regularization losses that effectively suppress high-frequency noise and large-scale errors, particularly in texture-less or repetitive-pattern facial regions. It is trained on the SRFlow dataset, which provides pixel-accurate, high-resolution optical flow ground truth generated by projecting 3D Gaussian splats and compositing their motion contributions via depth-sorted alpha blending. The approach enables SRFlowNet to achieve state-of-the-art accuracy in both optical flow estimation and downstream micro-expression recognition tasks, particularly in scenarios requiring the capture of subtle, high-resolution facial dynamics (Zhang et al., 10 Jan 2026).

1. Gaussian Splatting Rasterization: Generation of Optical Flow Supervision

SRFlowNet is "guided" during training by pixel-level flows derived from a process termed Gaussian splatting rasterization. A "Gaussian splat" is a 3D volumetric primitive parameterized by color $c_i$, opacity $\alpha_i$, and a placement frame tied to a surface mesh triangle. In the reconstruction pipeline (e.g., GaussianAvatar), a human head is modeled as a dense cloud of such Gaussians, which deform with facial expressions.

The novel Flow Rasterizer extends standard splatting rendering by tracking the displacement of each splat's 3D center across two frames. The centers are projected into image space using computed extrinsic and perspective matrices, resulting in a per-splat pixel displacement $(\Delta u_i, \Delta v_i)$. The motion field for each pixel is then composited by weighted alpha blending:

$$O_\mathrm{optical}(u,v) = \sum_{i=1}^{n} [\Delta u_i, \Delta v_i]^T \cdot \alpha_i' \cdot \prod_{j<i}(1 - \alpha_j')$$

This produces dense, high-fidelity, and view-dependent ground-truth facial flow fields, which form the SRFlow dataset’s optical flow supervision.
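The depth-sorted alpha-blending rule above can be sketched in a few lines. The sketch below is a simplified, hypothetical rasterizer (the function name and the single-pixel splat footprint are assumptions; a real rasterizer splats each Gaussian's projected 2D footprint over many pixels):

```python
import numpy as np

def rasterize_flow(centers_uv, deltas_uv, alphas, depths, H, W):
    """Composite per-splat 2D displacements into a dense flow map.

    centers_uv: (n, 2) integer pixel coordinates of projected splat centers
    deltas_uv:  (n, 2) per-splat displacements (du, dv) between two frames
    alphas:     (n,)   per-splat opacities after projection
    depths:     (n,)   camera-space depths, used for front-to-back sorting
    """
    flow = np.zeros((H, W, 2))
    # Remaining transmittance per pixel: prod_{j<i} (1 - alpha_j)
    transmittance = np.ones((H, W))
    # Depth-sort the splats front-to-back so nearer splats blend first.
    for i in np.argsort(depths):
        u, v = centers_uv[i]
        if 0 <= v < H and 0 <= u < W:
            w = alphas[i] * transmittance[v, u]
            flow[v, u] += w * deltas_uv[i]
            transmittance[v, u] *= 1.0 - alphas[i]
    return flow
```

Each splat contributes its displacement weighted by its opacity and the accumulated transmittance of the splats in front of it, exactly as in the compositing sum above.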

2. Network Architecture and Integration of Splatting Guidance

SRFlowNet adopts the SKFlow backbone without architectural modification. The SKFlow design consists of:

  • Encoder: A feature pyramid is constructed from input frames $I_1, I_2 \in \mathbb{R}^{3 \times H \times W}$ using strided convolutions (an initial $7 \times 7$ followed by successive $3 \times 3$ kernels) and skip connections. The result is a 6-level spatial pyramid with fixed channel sizes (typically 256 per level, at half-resolution downscaling).
  • Correlation Module: At each level, global all-pair feature correlations are computed using learned super-kernels, facilitating robust pixelwise matching.
  • Update Operator: An iterative GRU-like update module refines the flow estimate $f^i$ across $n$ recurrent stages (commonly $n = 6$), leveraging the current correlation, context features, and a residual flow head.
  • Output Decoder: Flow predictions $(u,v) \in \mathbb{R}^{2 \times H \times W}$ are produced via bilinear upsampling and convolutional refinement.

Importantly, no special rasterization module is present within SRFlowNet itself. "Guidance" refers to supervision with SRFlow’s Gaussian-splatting-derived ground truth during training rather than architectural integration.
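The recurrent residual-update pattern of the update operator can be illustrated schematically. The toy rule below replaces the actual GRU, correlation lookup, and context features with an assumed stand-in, purely to show the loop structure $f^{i+1} = f^i + \Delta f^i$ and why every stage's estimate is retained for supervision:

```python
import numpy as np

def iterative_refinement(target_flow, n_iters=6):
    """Schematic of SKFlow-style recurrent refinement (not the real network).

    A real update operator feeds correlation and context features through a
    GRU to predict a residual; here the 'network' is a toy rule that moves a
    fixed fraction toward target_flow, only to illustrate the loop.
    """
    flow = np.zeros_like(target_flow)            # f^0 = 0
    estimates = []
    for _ in range(n_iters):
        residual = 0.5 * (target_flow - flow)    # stand-in for the GRU head
        flow = flow + residual                   # f^{i+1} = f^i + Δf^i
        estimates.append(flow.copy())            # every stage is supervised
    return estimates
```

Returning all intermediate estimates matters because the training losses (next section) are applied at every update stage with an exponential decay factor.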

3. Facial-Specific Regularization Losses

Beyond the standard endpoint error loss $L_\mathrm{EPE}$, SRFlowNet introduces four specialized regularization losses to address the unique challenges of facial flow, most notably oversmoothing and edge artifacts in low-texture facial regions. All regularizers rely on per-pixel face masks $M_{bg}$ and are applied at each update stage with a decay factor $\gamma$ and weight $\lambda_N = 0.05$.

  • Total Variation Regularization (TVR): Imposes Sobel-based spatial smoothness on the flow channels $u, v$:

$$R(c) = \frac{1}{HW} \sum_{x,y} \left( |\nabla_x c(x,y)| + |\nabla_y c(x,y)| \right)$$

The total TV loss $L_\mathrm{TVR}$ is aggregated across update stages.

  • Flow Difference Regularization (FDR): Enforces axis-aligned forward-difference smoothing within the facial mask, favoring conservative correction of abrupt flow transitions:

$$L_\mathrm{FDR} = \lambda_N \sum_{i=0}^{n-1} \gamma^{n-i-1} \left[ D_x(u^i) \odot M_{bg}^x + D_y(v^i) \odot M_{bg}^y \right]$$

  • Mean Image Gradient Activation Regularization (MIGAR): Reweights the TV loss by the local image gradient, computed via averaged Sobel norms over $I_1$. This adaptively penalizes flow complexity in low-texture regions, encouraging spatially adaptive regularization.
  • Image Gradient Variance Activation Regularization (IGVAR): Similar to MIGAR, but the base reweighting value is linked to the variance of image gradients within the face mask, enhancing adaptability to different facial appearances.

A comparative summary of the loss terms appears below:

| Regularization | Smoothness Basis | Local Adaptation Mechanism |
| --- | --- | --- |
| TVR | Sobel, isotropic | None |
| FDR | Axis differences | Facial mask only |
| MIGAR | Sobel, isotropic | Image gradient magnitude reweighting |
| IGVAR | Sobel, isotropic | Masked image gradient variance |
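A minimal numpy sketch of the TVR family follows. It is an illustration under stated assumptions, not the paper's implementation: forward differences stand in for the Sobel operator, and the optional per-pixel `weight` is a hypothetical hook for MIGAR/IGVAR-style reweighting:

```python
import numpy as np

def grad_xy(a):
    """Forward-difference spatial gradients of a 2D array (zero-padded)."""
    gx = np.zeros_like(a)
    gx[:, :-1] = a[:, 1:] - a[:, :-1]
    gy = np.zeros_like(a)
    gy[:-1, :] = a[1:, :] - a[:-1, :]
    return gx, gy

def tv_loss(flow_u, flow_v, weight=None):
    """Total-variation penalty over both flow channels.

    weight=None gives plain TVR; a per-pixel weight derived from image
    gradients (large where texture is low) yields an adaptive variant in
    the spirit of MIGAR/IGVAR.
    """
    total = np.zeros_like(flow_u)
    for c in (flow_u, flow_v):
        gx, gy = grad_xy(c)
        total += np.abs(gx) + np.abs(gy)
    if weight is not None:
        total = total * weight   # relax the penalty where texture is rich
    return total.mean()
```

A constant flow field incurs zero penalty, while any spatial variation in $u$ or $v$ is charged proportionally to its first differences; the weight map then decides where that charge is applied.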

4. Dataset Construction and Annotation Protocol

The SRFlow dataset comprises 11,161 high-resolution frame pairs ($2200 \times 3208$ px) captured using multi-view setups (16 synchronized cameras) from 27 subjects sourced from NeRSemble. Subject sequences involve diverse facial actions, speech, and pose changes.

3D mesh alignments use VHAP, followed by mesh-to-Gaussian-cloud reconstruction with GaussianAvatar. Synthetic camera perturbations yield a mixture of front-facing and oblique views. The Flow Rasterizer generates dense, view-dependent, two-channel optical flow, pixel-level binary face-region masks, and all relevant camera parameters.

Dataset splits: 6,791 training pairs, 1,212 validation, 3,158 test.

5. Training Procedures and Implementation Details

All optical flow models (SRFlowNet and baselines) are trained using the following protocol:

  • Hardware: Dual RTX A6000 Ada GPUs (total 96 GB VRAM)
  • Batch size: 8
  • Input: Random $800 \times 512$ crops
  • Data augmentation: Random cropping only
  • Optimizer: Adam-style (SKFlow defaults)
  • Learning rate: $1.25 \times 10^{-4}$, 45 epochs without decay

For downstream micro-expression recognition (Off-TANet), composite datasets (SAMM, CASME II, SMIC) are used. Optical-strain maps computed between onset and apex frames are input at a $112 \times 112$ crop size. Training runs for up to 200 epochs, and the model with the highest cross-validation average is reported.
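Optical strain is derived from the spatial derivatives of the flow field. The sketch below uses one common definition (the norm of the symmetric part of the flow's Jacobian); the paper's exact formulation may differ:

```python
import numpy as np

def optical_strain(u, v):
    """Optical-strain magnitude from a dense flow field (u, v).

    Computes the symmetric strain tensor from the flow Jacobian and
    returns its per-pixel magnitude. np.gradient returns derivatives
    along axis 0 (y) first, then axis 1 (x).
    """
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                      # normal strain in x
    e_yy = dv_dy                      # normal strain in y
    e_xy = 0.5 * (du_dy + dv_dx)      # shear strain
    return np.sqrt(e_xx**2 + e_yy**2 + 2.0 * e_xy**2)
```

A rigid translation (constant flow) produces zero strain everywhere, which is why strain maps isolate the subtle non-rigid deformations that micro-expressions consist of.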

6. Quantitative Evaluation and Performance

Optical Flow Evaluation: Metrics include end-point error (EPE, px), px1/px3/px5 accuracy (fraction of pixels with flow error $<$ 1/3/5 px), weighted F1 (F1-ALL, lower is better), and WAUC (area under the EPE-threshold curve).

| Model | EPE | F1-ALL | WAUC |
| --- | --- | --- | --- |
| Pretrained MemFlow | 0.5081 | 3.0071 | 80.39% |
| Pretrained SKFlow | 0.5361 | 3.2159 | 80.35% |
| SKFlow+SRFlow | 0.3998 | 0.6722 | 83.83% |
| MemFlow+SRFlow | 0.2953 | 0.3502 | 86.97% |
| DPFlow+SRFlow | 0.3348 | 0.3961 | 88.80% |
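The EPE and pxN numbers in the table above can be computed directly from dense flow maps; a minimal sketch (F1-ALL and WAUC require outlier weighting and threshold integration, omitted here):

```python
import numpy as np

def flow_metrics(pred, gt):
    """EPE and pxN accuracies for flow maps of shape (H, W, 2)."""
    epe = np.linalg.norm(pred - gt, axis=-1)   # per-pixel Euclidean error
    return {
        "EPE": epe.mean(),                     # mean end-point error (px)
        "px1": (epe < 1.0).mean(),             # fraction of pixels < 1 px off
        "px3": (epe < 3.0).mean(),
        "px5": (epe < 5.0).mean(),
    }
```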

Micro-Expression Recognition (Off-TANet, Composite):

| Model | $F_{1\mu}$ | $G_M$ |
| --- | --- | --- |
| Baseline SKFlow | 0.5906 | 0.3181 |
| SKFlow+SRFlow | 0.7736 | 0.5083 |
| SRFlowNet-TVR | 0.7781 | 0.5402 |

Relative gains: Up to 42% reduction in EPE and 48% increase in $F_{1\mu}$ demonstrate the impact of Gaussian splatting supervision and facial regularization.

7. Ablation Analysis and Empirical Observations

Ablation studies compare each regularizer in isolation (on top of SKFlow+SRFlow):

  • FDR achieves lowest EPE and best WAUC but may suppress subtle facial motion, slightly degrading micro-expression recognition.
  • MIGAR exhibits the lowest F1-ALL, indicative of strong detail preservation; however, it can over-smooth in low-texture regions.
  • TVR and IGVAR achieve the best balance of $F_{1\mu}$ and $G_M$.
  • Qualitative: Retrained models, especially SRFlowNet-FDR and SRFlowNet-MIGAR, yield visually faithful flow fields, capturing coherent movement in lips, eyes, and eyebrows with minimal spurious edges.

A plausible implication is that spatially adaptive regularization, conditioned on structural image features, enables more reliable estimation in regions with low or repetitive texture while avoiding the introduction of artificial detail.

SRFlowNet, trained on SRFlow supervisory signals, consistently outperforms standard SKFlow and other pretrained benchmarks in fine-grained facial flow estimation and micro-expression analysis, establishing a new reference point for high-resolution, mask- and gradient-aware optical flow modeling in facial analysis (Zhang et al., 10 Jan 2026).
