SRFlowNet: Splatting Rasterization Guided Flow
- The paper introduces a novel training guidance mechanism using Gaussian splatting rasterization to generate pixel-accurate, view-dependent facial optical flow supervision.
- The paper employs the SKFlow backbone integrated with facial-specific regularization losses, improving flow estimation by suppressing noise and reducing large-scale errors.
- The paper demonstrates state-of-the-art performance in both optical flow accuracy and micro-expression recognition, validated on a high-resolution, multi-view SRFlow dataset.
Splatting Rasterization Guided FlowNet (SRFlowNet) is a facial optical flow model designed to estimate high-resolution, fine-grained facial motion from video frames by leveraging supervision derived from 3D Gaussian splatting rasterization. Built upon the SKFlow backbone, SRFlowNet introduces a suite of facial-specific regularization losses that effectively suppress high-frequency noise and large-scale errors, particularly in texture-less or repetitive-pattern facial regions. It is trained on the SRFlow dataset, which provides pixel-accurate, high-resolution optical flow ground truth generated by projecting 3D Gaussian splats and compositing their motion contributions via depth-sorted alpha blending. The approach enables SRFlowNet to achieve state-of-the-art accuracy in both optical flow estimation and downstream micro-expression recognition tasks, particularly in scenarios requiring the capture of subtle, high-resolution facial dynamics (Zhang et al., 10 Jan 2026).
1. Gaussian Splatting Rasterization: Generation of Optical Flow Supervision
SRFlowNet is "guided" during training by pixel-level flows derived from a process termed Gaussian splatting rasterization. A "Gaussian splat" is a 3D volumetric primitive parameterized by a color $c_i$, an opacity $\alpha_i$, and a placement frame tied to a surface mesh triangle. In the reconstruction pipeline (e.g., GaussianAvatar), a human head is modeled as a dense cloud of such Gaussians, which deform with facial expressions.
The novel Flow Rasterizer extends standard splatting rendering by tracking the displacement of each splat's 3D center across two frames. The centers are projected into image space using the computed extrinsic and perspective matrices, yielding a per-splat pixel displacement $\Delta p_i$. The motion field at each pixel $p$ is then composited by depth-sorted weighted alpha blending:

$$f(p) = \sum_{i \in \mathcal{N}(p)} \Delta p_i \, \alpha_i \prod_{j<i} \left(1 - \alpha_j\right),$$

where $\mathcal{N}(p)$ denotes the splats covering $p$, ordered front to back. This produces dense, high-fidelity, and view-dependent ground-truth facial flow fields, which form the SRFlow dataset's optical flow supervision.
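The per-pixel compositing step can be sketched in a few lines. This is a toy NumPy sketch, not the paper's implementation; `composite_flow` and its front-to-back input ordering are illustrative assumptions:

```python
import numpy as np

def composite_flow(pixel_splats):
    """Composite a per-pixel flow vector from depth-sorted splat contributions.

    pixel_splats: list of (displacement, opacity) pairs ordered front-to-back,
    where displacement is a 2-vector (du, dv) and opacity is the splat's
    alpha after evaluating its 2D Gaussian footprint at this pixel.
    """
    flow = np.zeros(2)
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for displacement, alpha in pixel_splats:
        weight = alpha * transmittance
        flow += weight * np.asarray(displacement, dtype=float)
        transmittance *= (1.0 - alpha)
    return flow

# A fully opaque front splat dominates the composited flow:
front = ((2.0, 0.0), 1.0)
back = ((-5.0, -5.0), 0.8)
print(composite_flow([front, back]))  # -> [2. 0.]
```

Note that once the accumulated transmittance reaches zero, occluded splats contribute nothing, which is what makes the resulting flow field view-dependent.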
2. Network Architecture and Integration of Splatting Guidance
SRFlowNet adopts the SKFlow backbone without architectural modification. The SKFlow design consists of:
- Encoder: A feature pyramid is constructed from the input frames $I_1$ and $I_2$ using strided convolutions and skip connections. The result is a 6-level spatial pyramid with fixed channel sizes (typically 256 per level, halving resolution at each level).
- Correlation Module: At each level, global all-pair feature correlations are computed using learned super-kernels, facilitating robust pixelwise matching.
- Update Operator: An iterative GRU-like update module refines the flow estimate across recurrent stages, leveraging the current correlation, context features, and a residual flow head.
- Output Decoder: Flow predictions are produced via bilinear upsampling and convolutional refinement.
Importantly, no special rasterization module is present within SRFlowNet itself. "Guidance" refers to supervision with SRFlow’s Gaussian-splatting-derived ground truth during training rather than architectural integration.
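The recurrent refinement pattern behind the update operator can be illustrated with a toy sketch. The `correlation_fn` and `update_fn` stand-ins below are hypothetical placeholders; in SKFlow both are learned networks:

```python
import numpy as np

def iterative_refinement(correlation_fn, update_fn, init_flow, num_stages=12):
    """RAFT/SKFlow-style recurrence: each stage looks up correlation features
    at the current flow estimate and adds a predicted residual."""
    flow = np.asarray(init_flow, dtype=float)
    predictions = []
    for _ in range(num_stages):
        corr = correlation_fn(flow)        # stand-in for the super-kernel lookup
        flow = flow + update_fn(flow, corr)  # GRU-like residual flow head
        predictions.append(flow)
    return predictions  # every stage's estimate is supervised during training

# Toy stand-ins: the "update operator" nudges the flow halfway toward a target.
target = np.array([3.0, 4.0])
preds = iterative_refinement(lambda f: f,
                             lambda f, c: 0.5 * (target - f),
                             np.zeros(2))
```

Because every intermediate prediction is returned, a training loss can be applied at each stage with decayed weights, as the regularizers in the next section do.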
3. Facial-Specific Regularization Losses
Beyond the standard endpoint-error (EPE) loss, SRFlowNet introduces four specialized regularization losses to address the unique challenges of facial flow, most notably oversmoothing and edge artifacts in low-texture facial regions. All regularizers rely on per-pixel face masks and are applied at each update stage with a per-stage decay factor $\gamma$ and a loss weight $\lambda$.
- Total Variation Regularization (TVR): Imposes Sobel-based spatial smoothness on the flow channels $u$ and $v$. The per-stage terms are aggregated across the $N$ update stages with exponentially decayed weights: $\mathcal{L}_{TV} = \sum_{k=1}^{N} \gamma^{N-k} \mathcal{L}_{TV}^{(k)}$.
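A minimal NumPy sketch of a masked, Sobel-based TV term with stage decay follows. The exact weighting and normalization used in the paper may differ; `gamma` and the division by the mask area are assumptions:

```python
import numpy as np

def sobel_grads(channel):
    """Sobel x/y gradients of a 2D array (zero-padded, 'same' size)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(channel, 1)
    h, w = channel.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(window * kx)
            gy[i, j] = np.sum(window * ky)
    return gx, gy

def tv_loss(flow, mask):
    """Sobel-based total variation of a 2xHxW flow, restricted to the face mask."""
    total = 0.0
    for c in range(flow.shape[0]):          # u and v channels
        gx, gy = sobel_grads(flow[c])
        total += np.sum(mask * (np.abs(gx) + np.abs(gy)))
    return total / max(mask.sum(), 1)

def staged_tv_loss(flows_per_stage, mask, gamma=0.8):
    """Aggregate across update stages with exponentially decayed weights."""
    n = len(flows_per_stage)
    return sum(gamma ** (n - 1 - k) * tv_loss(f, mask)
               for k, f in enumerate(flows_per_stage))
```

The decay gives the final stage the largest weight, matching the convention of supervising every recurrent update while emphasizing the last one.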
- Flow Difference Regularization (FDR): Enforces axis-aligned forward-difference smoothing within the facial mask, favoring conservative correction of abrupt flow transitions.
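A masked forward-difference penalty of this kind might look like the following. This is an illustrative sketch; the paper's exact norm and normalization are assumptions:

```python
import numpy as np

def fdr_loss(flow, mask):
    """Axis-aligned forward-difference smoothness inside the face mask:
    penalizes abrupt flow transitions between horizontally/vertically
    adjacent pixels that both lie inside the mask."""
    loss = 0.0
    for c in range(flow.shape[0]):               # u and v channels
        dx = flow[c, :, 1:] - flow[c, :, :-1]    # horizontal forward diff
        dy = flow[c, 1:, :] - flow[c, :-1, :]    # vertical forward diff
        mx = mask[:, 1:] * mask[:, :-1]          # both neighbors in mask
        my = mask[1:, :] * mask[:-1, :]
        loss += np.sum(mx * np.abs(dx)) + np.sum(my * np.abs(dy))
    return loss / max(mask.sum(), 1)
```

Restricting the pairwise terms to neighbors that both lie inside the mask keeps the penalty from blurring flow across the face boundary.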
- Mean Image Gradient Activation Regularization (MIGAR): Reweights the TV loss by the local image gradient, computed via averaged Sobel gradient norms. This adaptively penalizes flow complexity in low-texture regions, encouraging spatially adaptive regularization.
- Image Gradient Variance Activation Regularization (IGVAR): Similar to MIGAR, but the base reweighting value is linked to the variance of image gradients within the face mask, enhancing adaptability to different facial appearances.
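One plausible form of the MIGAR-style reweighting is sketched below, assuming an exponential inversion of the gradient magnitude normalized by its masked mean; the exact mapping is not specified here:

```python
import numpy as np

def migar_weights(image, mask, eps=1e-6):
    """MIGAR-style reweighting sketch (the exact formula is an assumption):
    texture-less (low image-gradient) pixels receive a larger smoothness
    weight, so the TV penalty bears down hardest where matching cues are
    weakest."""
    gy, gx = np.gradient(image.astype(float))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    mean_grad = (grad_mag * mask).sum() / max(mask.sum(), 1)
    # Normalize by the masked mean gradient, then invert with an exponential.
    return np.exp(-grad_mag / (mean_grad + eps)) * mask
```

An IGVAR-style variant would replace `mean_grad` with a statistic derived from the variance of the masked gradients, adapting the weighting to each face's overall texture contrast.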
A comparative summary of the loss terms appears below:
| Regularization | Smoothness Basis | Local Adaptation Mechanism |
|---|---|---|
| TVR | Sobel, isotropic | None |
| FDR | Axis differences | Facial mask only |
| MIGAR | Sobel, isotropic | Image gradient magnitude reweighting |
| IGVAR | Sobel, isotropic | Masked image gradient variance |
4. Dataset Construction and Annotation Protocol
The SRFlow dataset comprises 11,161 high-resolution frame pairs captured with a multi-view rig of 16 synchronized cameras from 27 subjects sourced from NeRSemble. Subject sequences involve diverse facial actions, speech, and pose changes.
3D mesh alignment uses VHAP, followed by mesh-to-Gaussian-cloud reconstruction with GaussianAvatar. Synthetic camera perturbations yield a mixture of front-facing and oblique views. The Flow Rasterizer then generates dense, view-dependent, two-channel optical flow, pixel-level binary face masks, and all relevant camera parameters.
Dataset splits: 6,791 training pairs, 1,212 validation, 3,158 test.
5. Training Procedures and Implementation Details
All optical flow models (SRFlowNet and baselines) are trained using the following protocol:
- Hardware: Dual RTX A6000 Ada GPUs (total 96 GB VRAM)
- Batch size: 8
- Input: Random crops
- Data augmentation: Random cropping only
- Optimizer: Adam-style (SKFlow defaults)
- Learning rate: fixed over 45 epochs (no decay schedule)
For downstream micro-expression recognition (Off-TANet), composite datasets (SAMM, CASME II, SMIC) are used. Optical-strain maps computed between onset and apex frames serve as input. Training runs for up to 200 epochs, selecting the checkpoint with the highest cross-validation average.
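Optical strain is derived from the spatial derivatives of the onset-to-apex flow. A common formulation is sketched below; the exact variant used with Off-TANet is an assumption:

```python
import numpy as np

def optical_strain(flow):
    """Optical-strain magnitude from a 2xHxW flow field (u, v), using the
    standard infinitesimal-strain tensor of the displacement field."""
    u, v = flow[0].astype(float), flow[1].astype(float)
    du_dy, du_dx = np.gradient(u)   # np.gradient returns (d/drow, d/dcol)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                    # normal strain components
    e_yy = dv_dy
    e_xy = 0.5 * (du_dy + dv_dx)    # shear component
    return np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)
```

Because strain measures local deformation rather than absolute displacement, it highlights the subtle muscle movements that characterize micro-expressions while discounting rigid head motion.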
6. Quantitative Evaluation and Performance
Optical Flow Evaluation: Metrics include end-point error (EPE, in px), px1/px3/px5 accuracy (fraction of pixels with flow error below 1/3/5 px), the weighted F1-ALL outlier measure (lower is better), and WAUC (weighted area under the EPE-threshold curve; higher is better).
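EPE and the px-N accuracies can be computed directly from predicted and ground-truth flow fields; F1-ALL and WAUC follow analogous per-pixel thresholding (not shown):

```python
import numpy as np

def flow_metrics(pred, gt, thresholds=(1.0, 3.0, 5.0)):
    """End-point error and px-N accuracy for 2xHxW flow fields."""
    epe = np.sqrt(((pred - gt) ** 2).sum(axis=0))   # per-pixel error, HxW
    out = {"EPE": float(epe.mean())}
    for t in thresholds:
        out[f"px{int(t)}"] = float((epe < t).mean())  # fraction below threshold
    return out

gt = np.zeros((2, 4, 4))
pred = np.zeros((2, 4, 4))
pred[0, 0, 0] = 2.0        # one pixel off by 2 px
m = flow_metrics(pred, gt)
print(m["EPE"])            # 2/16 = 0.125
```

In practice these metrics would be restricted to the face mask, consistent with the dataset's per-pixel mask annotations.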
| Model | EPE | F1-ALL | WAUC |
|---|---|---|---|
| Pretrained MemFlow | 0.5081 | 3.0071 | 80.39% |
| Pretrained SKFlow | 0.5361 | 3.2159 | 80.35% |
| SKFlow+SRFlow | 0.3998 | 0.6722 | 83.83% |
| MemFlow+SRFlow | 0.2953 | 0.3502 | 86.97% |
| DPFlow+SRFlow | 0.3348 | 0.3961 | 88.80% |
Micro-Expression Recognition (Off-TANet, Composite):
| Model | F | G |
|---|---|---|
| Baseline SKFlow | 0.5906 | 0.3181 |
| SKFlow+SRFlow | 0.7736 | 0.5083 |
| SRFlowNet-TVR | 0.7781 | 0.5402 |
Relative gains: a reduction of up to 42% in EPE and an increase of up to 48% in recognition scores demonstrate the impact of Gaussian splatting supervision and facial regularization.
7. Ablation Analysis and Empirical Observations
Ablation studies compare each regularizer in isolation (on top of SKFlow+SRFlow):
- FDR achieves lowest EPE and best WAUC but may suppress subtle facial motion, slightly degrading micro-expression recognition.
- MIGAR exhibits the lowest F1-ALL, indicative of strong detail preservation; however, it can over-smooth in low-texture regions.
- TVR and IGVAR achieve the best balanced F1 and G.
- Qualitative: Retrained models, especially SRFlowNet-FDR and SRFlowNet-MIGAR, yield visually faithful flow fields, capturing coherent movement in lips, eyes, and eyebrows with minimal spurious edges.
A plausible implication is that spatially adaptive regularization, conditioned on structural image features, enables more reliable estimation in regions with low or repetitive texture while avoiding the introduction of artificial detail.
SRFlowNet, trained on SRFlow supervisory signals, consistently outperforms standard SKFlow and other pretrained benchmarks in fine-grained facial flow estimation and micro-expression analysis, establishing a new reference point for high-resolution, mask- and gradient-aware optical flow modeling in facial analysis (Zhang et al., 10 Jan 2026).