- The paper introduces a Dirichlet-constrained variational framework that models latent representations as continuous convex combinations to enhance video face restoration.
- It leverages a spatio-temporal Transformer to predict Dirichlet parameters, ensuring smooth transitions and mitigating flicker artifacts across frames.
- Experiments on VFHQ demonstrate significant improvements in temporal consistency and restoration quality compared to traditional frame-by-frame methods.
Video face restoration aims to recover high-quality facial details in video sequences that have been degraded by various factors. A major challenge in this task is maintaining temporal consistency across frames; simply applying image-based restoration methods frame-by-frame often leads to flickering artifacts. The paper "DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration" (2506.13355) addresses this by extending the Vector-Quantized Variational Autoencoder (VQ-VAE) framework, traditionally used for image generation and restoration, to video by introducing a novel Dirichlet-constrained variational latent space.
The core idea is to move away from the discrete codebook lookup used in traditional VQ-VAEs (like VQ-GAN and CodeFormer) when processing videos. Instead of assigning a single discrete codebook entry to each spatial location in the latent feature map, DicFace models the latent representation at each location as a continuous convex combination of the codebook entries. The weights for this combination are treated as a probabilistic variable drawn from a Dirichlet distribution. This continuous, probabilistic formulation allows for smoother transitions between latent codes across adjacent frames, which in turn helps mitigate temporal flicker in the restored video.
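To make the idea concrete, here is a minimal PyTorch sketch of this soft, Dirichlet-weighted codebook lookup. The module name `SoftCodebook`, the tensor shapes, and the softplus used to keep the concentrations positive are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: continuous convex combination of codebook entries with
# weights sampled from a Dirichlet distribution (shapes are hypothetical).
import torch
import torch.nn as nn

class SoftCodebook(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        # Learned codebook: N code vectors of dimension `dim`.
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, alpha: torch.Tensor) -> torch.Tensor:
        """alpha: positive Dirichlet concentrations, shape (B, H, W, N)."""
        # Sample simplex weights w ~ Dir(alpha); rsample keeps gradients
        # flowing via the pathwise (Gamma-based) reparameterization.
        w = torch.distributions.Dirichlet(alpha).rsample()  # (B, H, W, N)
        # Convex combination of codebook entries at every spatial location.
        return w @ self.codebook                             # (B, H, W, dim)

# Usage: concentrations predicted by some network, made strictly positive.
alpha = torch.nn.functional.softplus(torch.randn(2, 16, 16, 1024)) + 1e-4
latent = SoftCodebook()(alpha)  # continuous latent, ready for the decoder
```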
Here's a breakdown of the methodology and its implementation aspects:
- VQ-VAE Foundation: The method builds upon a VQ-VAE architecture. This typically involves:
- An encoder $E_L$ that maps the input low-quality frame $x$ to a latent feature map $z$.
- A decoder $D_H$ that reconstructs a high-quality image $y$ from a quantized or processed latent representation.
- A learned codebook $c = [c_k]_{k=1}^{N}$ containing $N$ code vectors.
- Continuous Latent Space Formulation:
- Instead of hard quantization, where $z_{i,j}$ is replaced by the nearest $c_k$, DicFace represents the latent code $\hat{v}_{i,j}$ at location $(i,j)$ as a weighted sum of codebook vectors: $\hat{v}_{i,j} = \sum_{k=1}^{N} \hat{w}_{i,j,k}\, c_k$.
- The weights $\hat{w}_{i,j} = [\hat{w}_{i,j,1}, \ldots, \hat{w}_{i,j,N}]$ must sum to 1 and be non-negative, naturally forming a point on the $(N-1)$-dimensional simplex.
- This weight vector is modeled as being sampled from a Dirichlet distribution: $\hat{w}_{i,j} \sim \mathrm{Dir}(\hat{\alpha}_{i,j})$, where $\hat{\alpha}_{i,j}$ are the concentration parameters for location $(i,j)$.
- Spatio-Temporal Transformer for Parameter Prediction:
- A key component is a spatio-temporal Transformer network $H$. This Transformer takes the sequence of latent feature maps $\{z_r\}_{r=1}^{R}$ from $R$ consecutive input frames.
- It uses alternating spatial and temporal self-attention blocks to capture dependencies within frames and across frames.
- The output of the Transformer is then linearly projected to predict the Dirichlet parameters $\{\hat{\alpha}_r\}_{r=1}^{R}$ for each frame $r$ and each spatial location $(i,j)$.
- Variational Inference and ELBO Loss:
- The model is trained variationally to maximize the Evidence Lower Bound (ELBO). The ELBO objective encourages the predicted posterior $q_\theta(\hat{w} \mid x)$ (the Dirichlet distribution parameterized by $\hat{\alpha}$) to stay close to a prior $p_\theta(\hat{w})$ (also a Dirichlet, with hyper-parameter $\alpha$) while maximizing the expected log-likelihood $\mathbb{E}_{q_\theta(\hat{w} \mid x)}[\log p_\theta(y \mid x, \hat{w})]$, i.e., minimizing the expected reconstruction error.
- The reconstruction term is modeled under a Laplacian likelihood assumption, leading to an L1-like loss.
- The overall training loss combines the ELBO term with the LPIPS perceptual loss for better visual quality: $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{ELBO}} + \lambda_2 \mathcal{L}_{\text{LPIPS}}$.
- Sampling from the Dirichlet distribution for the expected reconstruction term is handled with the reparameterization trick (or Monte Carlo sampling) so that gradients can flow through the sampled weights; a loss sketch is given after this list.
- Training Strategy:
- The model is trained on video face datasets like VFHQ.
- Training involves progressively unfreezing model components: encoder/decoder first, then the Transformer, and potentially fine-tuning the codebook.
- The Dirichlet prior hyper-parameter $\alpha$ can be tuned: small values ($\alpha \to 0$) encourage sparse, near one-hot weights (resembling discrete assignment), while larger values lead to smoother weight distributions (closer to uniform averaging of codebook entries). The ablation studies show that an intermediate value such as $\alpha = 1.0$ works well, balancing sparsity and smoothness.
- Inference:
- During inference, a sliding window approach is used. The network processes a fixed number of frames (e.g., 5) at a time.
- Padding (e.g., repeating start/end frames) is used for sequences shorter than the window size.
- The prediction for the central frame of the window is typically taken as the output, with a stride of 1 frame to ensure temporal overlap and consistency in the final video.
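The pieces of the variational objective described in the training bullets above map onto standard library calls: PyTorch's `torch.distributions.Dirichlet` supports reparameterized sampling (via Gamma variates), and `kl_divergence` provides the closed-form Dirichlet KL. The following is a hedged sketch under assumed tensor shapes and a placeholder KL weight; it is not the paper's exact loss code, and the LPIPS term would be added on top.

```python
# Sketch of an ELBO-style loss: L1 (Laplacian) reconstruction term plus a
# closed-form KL between the predicted Dirichlet posterior and a symmetric
# Dirichlet prior. Names and weightings are illustrative placeholders.
import torch
from torch.distributions import Dirichlet, kl_divergence

def elbo_loss(alpha_hat, restored, target, prior_alpha=1.0, kl_weight=1e-3):
    """alpha_hat: (B, H, W, N) predicted concentrations;
    restored / target: decoder output and ground-truth frames."""
    posterior = Dirichlet(alpha_hat)
    prior = Dirichlet(torch.full_like(alpha_hat, prior_alpha))
    # Closed-form KL for Dirichlet distributions (uses digamma internally).
    kl = kl_divergence(posterior, prior).mean()
    # Laplacian likelihood assumption -> L1 reconstruction loss.
    recon = torch.nn.functional.l1_loss(restored, target)
    return recon + kl_weight * kl
```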
Practical Implementation Considerations:
- Architecture: Implementing the spatio-temporal Transformer requires handling multi-dimensional tensors representing sequences of image feature maps. Alternating spatial and temporal attention can be implemented by reshaping the input tensor appropriately for each attention type (see the sketch after this list).
- Dirichlet Distribution and Sampling: Libraries like PyTorch or TensorFlow provide implementations for the Dirichlet distribution and sampling. The reparameterization trick for the Dirichlet distribution involves sampling from Gamma distributions and normalizing, which is differentiable.
- Loss Function: The KL divergence term for Dirichlet distributions has a closed-form solution involving the digamma function (ψ), which is available in deep learning libraries. The expected reconstruction term requires sampling from the Dirichlet distribution and computing the reconstruction loss.
- Codebook Management: The codebook is a learned parameter matrix. Its size N impacts the model's capacity and computational cost. The paper explores sizes like 256, 512, 1024.
- Computational Resources: The spatio-temporal Transformer can be computationally intensive, especially for longer sequences or higher feature dimensions. The sliding window inference helps manage this for long videos but introduces overhead.
- Data Requirements: Training requires a large dataset of high-quality and degraded video face pairs, like VFHQ. Generating degraded versions requires a realistic degradation pipeline.
- Deployment: The model can be deployed using standard deep learning inference frameworks. The sliding window approach means that real-time processing might require optimizing the model and inference pipeline.
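As a concrete illustration of the reshaping mentioned in the architecture bullet above, the block below alternates spatial and temporal self-attention over a `(batch, time, channels, height, width)` tensor using standard `nn.MultiheadAttention`. It is a simplified sketch (no layer norms, MLPs, or positional encodings) and does not reproduce the authors' exact architecture.

```python
# Alternating spatial and temporal self-attention via tensor reshaping.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Spatial attention: tokens are the H*W positions within each frame.
        s = x.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)
        s = s + self.spatial_attn(s, s, s, need_weights=False)[0]
        # Temporal attention: tokens are the T frames at each spatial position.
        u = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        u = u + self.temporal_attn(u, u, u, need_weights=False)[0]
        # Restore the original (B, T, C, H, W) layout.
        return u.reshape(b, h * w, t, c).permute(0, 2, 3, 1).reshape(b, t, c, h, w)

# Usage: a 5-frame window of 16x16 latent maps with 256 channels.
y = SpatioTemporalBlock()(torch.randn(1, 5, 256, 16, 16))
```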
Concrete Examples and Applications:
- Blind Video Face Restoration: Recovering details from videos with unknown and complex degradations (compression artifacts, noise, blur, low resolution). DicFace demonstrates state-of-the-art quantitative and qualitative results on the VFHQ dataset for this task, significantly improving temporal stability (lower TLME and competitive FVD) compared to previous methods.
- Video Face Inpainting: Filling in missing regions (e.g., occlusions) in face videos. The continuous latent space helps generate coherent textures and structures that smoothly integrate with surrounding frames.
- Video Face Colorization: Adding color to grayscale face videos. The model learns to predict realistic and temporally consistent colors.
The paper demonstrates that reformulating the latent space of VQ-VAEs with a Dirichlet constraint and variational inference is a powerful strategy for adapting image priors to video tasks while effectively addressing the critical issue of temporal consistency. The open-sourced code further facilitates the practical application and exploration of this approach.