DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration (2506.13355v1)

Published 16 Jun 2025 in cs.CV

Abstract: Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting intensive image priors, pretrained on high-quality images, to video restoration while addressing the critical challenge of flicker artifacts. The source code has been open-sourced and is available at https://github.com/fudan-generative-vision/DicFace.

Summary

  • The paper introduces a Dirichlet-constrained variational framework that models latent representations as continuous convex combinations to enhance video face restoration.
  • It leverages a spatio-temporal Transformer to predict Dirichlet parameters, ensuring smooth transitions and mitigating flicker artifacts across frames.
  • Experiments on VFHQ demonstrate significant improvements in temporal consistency and restoration quality compared to traditional frame-by-frame methods.

Video face restoration aims to recover high-quality facial details in video sequences that have been degraded by various factors. A major challenge in this task is maintaining temporal consistency across frames; simply applying image-based restoration methods frame-by-frame often leads to flickering artifacts. The paper "DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration" (2506.13355) addresses this by extending the Vector-Quantized Variational Autoencoder (VQ-VAE) framework, traditionally used for image generation and restoration, to video by introducing a novel Dirichlet-constrained variational latent space.

The core idea is to move away from the discrete codebook lookup used in traditional VQ-VAEs (like VQ-GAN and CodeFormer) when processing videos. Instead of assigning a single discrete codebook entry to each spatial location in the latent feature map, DicFace models the latent representation at each location as a continuous convex combination of the codebook entries. The weights for this combination are treated as a probabilistic variable drawn from a Dirichlet distribution. This continuous, probabilistic formulation allows for smoother transitions between latent codes across adjacent frames, which in turn helps mitigate temporal flicker in the restored video.

Here's a breakdown of the methodology and its implementation aspects:

  1. VQ-VAE Foundation: The method builds upon a VQ-VAE architecture. This typically involves:
    • An encoder $\mathcal{E}_L$ that maps the input low-quality frame $\mathbf{x}$ to a latent feature map $\mathbf{z}$.
    • A decoder $\mathcal{D}_H$ that reconstructs a high-quality image $\mathbf{y}$ from a quantized or otherwise processed latent representation.
    • A learned codebook $\mathbf{c} = [c_k]_{k=1}^{N}$ containing $N$ code vectors.
  2. Continuous Latent Space Formulation:
    • Instead of hard quantization, where $\mathbf{z}_{i,j}$ is replaced by the nearest $c_k$, DicFace represents the latent code $\hat{v}_{i,j}$ at location $(i,j)$ as a weighted sum of codebook vectors: $\hat{v}_{i,j} = \sum_{k=1}^{N} \hat{w}_{i,j,k}\, c_k$.
    • The weights $\hat{\mathbf{w}}_{i,j} = [\hat{w}_{i,j,1}, \dots, \hat{w}_{i,j,N}]$ are non-negative and sum to 1, so each weight vector lies on the $(N-1)$-dimensional simplex.
    • The weight vector $\hat{\mathbf{w}}_{i,j}$ is modeled as a sample from a Dirichlet distribution, $\hat{\mathbf{w}}_{i,j} \sim \mathrm{Dir}(\hat{\boldsymbol{\alpha}}_{i,j})$, where $\hat{\boldsymbol{\alpha}}_{i,j}$ are the concentration parameters for location $(i,j)$.
  3. Spatio-Temporal Transformer for Parameter Prediction:
    • A key component is a spatio-temporal Transformer network $\mathcal{H}$ that takes the sequence of latent feature maps $\{\mathbf{z}^r\}_{r=1}^{R}$ from $R$ consecutive input frames.
    • It uses alternating spatial and temporal self-attention blocks to capture dependencies within frames and across frames.
    • The Transformer output is linearly projected to predict the Dirichlet parameters $\{\hat{\boldsymbol{\alpha}}^r\}_{r=1}^{R}$ for each frame $r$ and each spatial location $(i,j)$.
  4. Variational Inference and ELBO Loss:
    • The model is trained variationally to maximize the Evidence Lower Bound (ELBO). The ELBO encourages the predicted posterior $q_{\theta}(\hat{\mathbf{w}} \mid \mathbf{x})$ (the Dirichlet distribution parameterized by $\hat{\boldsymbol{\alpha}}$) to stay close to a prior $p_{\theta}(\hat{\mathbf{w}})$ (also a Dirichlet distribution, with hyper-parameter $\alpha$) while maximizing the expected log-likelihood $\mathbb{E}_{q_{\theta}(\hat{\mathbf{w}} \mid \mathbf{x})}[\log p_{\theta}(\mathbf{y} \mid \mathbf{x}, \hat{\mathbf{w}})]$, i.e., minimizing the expected reconstruction error.
    • The reconstruction error term is modeled using a Laplacian distribution assumption, leading to an L1-like loss.
    • The overall training loss combines the ELBO term with an LPIPS perceptual loss for better visual quality: $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\mathrm{ELBO}} + \lambda_2 \mathcal{L}_{\mathrm{LPIPS}}$.
    • Sampling from the Dirichlet distribution for the expected reconstruction term is handled with the reparameterization trick (or Monte Carlo sampling) so that gradients can flow; a minimal sketch of the soft lookup and loss terms follows this list.
  5. Training Strategy:
    • The model is trained on video face datasets like VFHQ.
    • Training involves progressively unfreezing model components: encoder/decoder first, then the Transformer, and potentially fine-tuning the codebook.
    • The Dirichlet prior hyper-parameter $\alpha$ can be tuned: small values ($\alpha \to 0$) push the weights toward sparse, nearly one-hot vectors (resembling discrete assignment), while larger values yield smoother, more uniform weight distributions. The ablation studies show that an intermediate value such as $\alpha = 1.0$ works well, balancing sparsity and smoothness.
  6. Inference:
    • During inference, a sliding window approach is used. The network processes a fixed number of frames (e.g., 5) at a time.
    • Padding (e.g., repeating start/end frames) is used for sequences shorter than the window size.
    • The prediction for the central frame of the window is typically taken as the output, with a stride of 1 frame so that windows overlap and the final video stays temporally consistent (a sliding-window sketch also follows this list).
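
To make steps 2-4 concrete, the following is a minimal PyTorch sketch of the Dirichlet-constrained soft codebook lookup and the associated loss terms. The tensor layout, the softplus parameterization of the concentration parameters, and the loss weights are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Dirichlet, kl_divergence

def soft_codebook_lookup(alpha_logits, codebook, prior_alpha=1.0):
    """Dirichlet-constrained soft quantization (illustrative sketch).

    alpha_logits: (B, T, H, W, N) raw Transformer outputs (assumed layout)
    codebook:     (N, D) learned code vectors
    Returns the combined latents (B, T, H, W, D) and the KL term of the ELBO.
    """
    # Concentration parameters must be positive; softplus is one common choice.
    alpha = F.softplus(alpha_logits) + 1e-4

    q = Dirichlet(alpha)          # posterior q(w | x)
    w = q.rsample()               # reparameterized sample on the simplex

    # Convex combination of codebook entries: v_ij = sum_k w_ijk * c_k
    v = torch.einsum("bthwn,nd->bthwd", w, codebook)

    # KL to a symmetric Dirichlet prior Dir(prior_alpha, ..., prior_alpha)
    p = Dirichlet(torch.full_like(alpha, prior_alpha))
    kl = kl_divergence(q, p).mean()
    return v, kl

def training_loss(restored, target, kl, lam_rec=1.0, lam_kl=1e-3):
    # A Laplacian likelihood on pixels corresponds to an L1 reconstruction term;
    # an LPIPS perceptual term would be added on top, as in the paper's total loss.
    return lam_rec * F.l1_loss(restored, target) + lam_kl * kl
```

The decoder $\mathcal{D}_H$ would map the combined latents back to pixels, and the LPIPS term from step 4 would then be computed on those decoded frames.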
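
The sliding-window procedure in step 6 can be organized as below; the window length, replicate padding, and the model's call signature are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def restore_video(model, frames, window=5):
    """Sliding-window inference sketch: stride 1, keep only the central prediction.

    frames: list of (C, H, W) tensors for one video
    model:  callable mapping a (1, window, C, H, W) clip to restored frames of
            the same shape (hypothetical interface)
    """
    half = window // 2
    # Replicate the first/last frame so every frame can sit at the window center.
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half

    outputs = []
    for t in range(len(frames)):
        clip = torch.stack(padded[t:t + window]).unsqueeze(0)  # (1, window, C, H, W)
        restored = model(clip)
        outputs.append(restored[0, half])                      # central frame only
    return outputs
```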

Practical Implementation Considerations:

  • Architecture: Implementing the spatio-temporal Transformer requires handling multi-dimensional tensors that represent sequences of image feature maps. Alternating spatial and temporal attention can be implemented by reshaping the input tensor appropriately for each attention type (see the block sketch after this list).
  • Dirichlet Distribution and Sampling: Libraries like PyTorch or TensorFlow provide implementations for the Dirichlet distribution and sampling. The reparameterization trick for the Dirichlet distribution involves sampling from Gamma distributions and normalizing, which is differentiable.
  • Loss Function: The KL divergence between two Dirichlet distributions has a closed-form solution involving the digamma function ($\psi$), which is available in deep learning libraries (a worked-out version appears after this list). The expected reconstruction term requires sampling from the Dirichlet distribution and computing the reconstruction loss on the decoded output.
  • Codebook Management: The codebook is a learned parameter matrix. Its size $N$ affects the model's capacity and computational cost; the paper explores sizes such as 256, 512, and 1024.
  • Computational Resources: The spatio-temporal Transformer can be computationally intensive, especially for longer sequences or higher feature dimensions. The sliding window inference helps manage this for long videos but introduces overhead.
  • Data Requirements: Training requires a large dataset of high-quality and degraded video face pairs, like VFHQ. Generating degraded versions requires a realistic degradation pipeline.
  • Deployment: The model can be deployed using standard deep learning inference frameworks. The sliding window approach means that real-time processing might require optimizing the model and inference pipeline.
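
For the Architecture bullet above, one way to realize alternating spatial and temporal self-attention is the reshaping pattern sketched below with standard `nn.MultiheadAttention`; the block layout, dimensions, and pre-norm residual structure are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One alternating spatial/temporal self-attention block (sketch)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z):
        # z: (B, T, HW, C) -- a sequence of flattened latent feature maps.
        B, T, HW, C = z.shape

        # Spatial attention: attend over the HW tokens within each frame.
        s = z.reshape(B * T, HW, C)
        s = s + self.spatial_attn(self.norm1(s), self.norm1(s), self.norm1(s))[0]
        z = s.reshape(B, T, HW, C)

        # Temporal attention: attend over the T frames at each spatial location.
        t = z.permute(0, 2, 1, 3).reshape(B * HW, T, C)
        t = t + self.temporal_attn(self.norm2(t), self.norm2(t), self.norm2(t))[0]
        return t.reshape(B, HW, T, C).permute(0, 2, 1, 3)
```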
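
For the Loss Function bullet, this is the closed-form KL divergence between two Dirichlet distributions written out with the digamma function; `torch.distributions.kl_divergence` computes the same quantity, so the manual version is mainly a reference.

```python
import torch

def dirichlet_kl(alpha, beta):
    """KL( Dir(alpha) || Dir(beta) ) in closed form over the last dimension.

    alpha, beta: (..., N) positive concentration parameters.
    """
    a0 = alpha.sum(dim=-1, keepdim=True)
    b0 = beta.sum(dim=-1, keepdim=True)
    return (torch.lgamma(a0).squeeze(-1) - torch.lgamma(alpha).sum(dim=-1)
            - torch.lgamma(b0).squeeze(-1) + torch.lgamma(beta).sum(dim=-1)
            + ((alpha - beta) * (torch.digamma(alpha) - torch.digamma(a0))).sum(dim=-1))

# Sanity check against the built-in implementation:
# from torch.distributions import Dirichlet, kl_divergence
# a, b = torch.rand(3, 8) + 0.5, torch.rand(3, 8) + 0.5
# assert torch.allclose(dirichlet_kl(a, b), kl_divergence(Dirichlet(a), Dirichlet(b)), atol=1e-5)
```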

Concrete Examples and Applications:

  • Blind Video Face Restoration: Recovering details from videos with unknown and complex degradations (compression artifacts, noise, blur, low resolution). DicFace demonstrates state-of-the-art quantitative and qualitative results on the VFHQ dataset for this task, significantly improving temporal stability (lower TLME and competitive FVD) compared to previous methods.
  • Video Face Inpainting: Filling in missing regions (e.g., occlusions) in face videos. The continuous latent space helps generate coherent textures and structures that smoothly integrate with surrounding frames.
  • Video Face Colorization: Adding color to grayscale face videos. The model learns to predict realistic and temporally consistent colors.

The paper demonstrates that reformulating the latent space of VQ-VAEs with a Dirichlet constraint and variational inference is a powerful strategy for adapting image priors to video tasks while effectively addressing the critical issue of temporal consistency. The open-sourced code further facilitates the practical application and exploration of this approach.
