- The paper introduces a Dirichlet-constrained variational framework that models latent representations as continuous convex combinations to enhance video face restoration.
- It leverages a spatio-temporal Transformer to predict Dirichlet parameters, ensuring smooth transitions and mitigating flicker artifacts across frames.
- Experiments on VFHQ demonstrate significant improvements in temporal consistency and restoration quality compared to traditional frame-by-frame methods.
Video face restoration aims to recover high-quality facial details in video sequences that have been degraded by various factors. A major challenge in this task is maintaining temporal consistency across frames; simply applying image-based restoration methods frame-by-frame often leads to flickering artifacts. The paper "DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration" (2506.13355) addresses this by extending the Vector-Quantized Variational Autoencoder (VQ-VAE) framework, traditionally used for image generation and restoration, to video by introducing a novel Dirichlet-constrained variational latent space.
The core idea is to move away from the discrete codebook lookup used in traditional VQ-VAEs (like VQ-GAN and CodeFormer) when processing videos. Instead of assigning a single discrete codebook entry to each spatial location in the latent feature map, DicFace models the latent representation at each location as a continuous convex combination of the codebook entries. The weights for this combination are treated as a probabilistic variable drawn from a Dirichlet distribution. This continuous, probabilistic formulation allows for smoother transitions between latent codes across adjacent frames, which in turn helps mitigate temporal flicker in the restored video.
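To make the idea concrete, here is a minimal PyTorch sketch of this soft, Dirichlet-weighted codebook lookup. The module name `SoftCodebook`, the tensor shapes, and the softplus used to keep the concentrations positive are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: continuous convex combination of codebook entries with
# weights sampled from a Dirichlet distribution (shapes are hypothetical).
import torch
import torch.nn as nn

class SoftCodebook(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        # Learned codebook: N code vectors of dimension `dim`.
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, alpha: torch.Tensor) -> torch.Tensor:
        """alpha: positive Dirichlet concentrations, shape (B, H, W, N)."""
        # Sample simplex weights w ~ Dir(alpha); rsample keeps gradients
        # flowing via the pathwise (Gamma-based) reparameterization.
        w = torch.distributions.Dirichlet(alpha).rsample()  # (B, H, W, N)
        # Convex combination of codebook entries at every spatial location.
        return w @ self.codebook                             # (B, H, W, dim)

# Usage: concentrations predicted by some network, made strictly positive.
alpha = torch.nn.functional.softplus(torch.randn(2, 16, 16, 1024)) + 1e-4
latent = SoftCodebook()(alpha)  # continuous latent, ready for the decoder
```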
Here's a breakdown of the methodology and its implementation aspects:
- VQ-VAE Foundation: The method builds upon a VQ-VAE architecture. This typically involves:
- An encoder $E_L$ that maps the input low-quality frame $x$ to a latent feature map $z$.
- A decoder $D_H$ that reconstructs a high-quality image $y$ from a quantized or processed latent representation.
- A learned codebook $c = [c_k]_{k=1}^{N}$ containing $N$ code vectors.
- Continuous Latent Space Formulation:
- Instead of hard quantization, where $z_{i,j}$ is replaced by the nearest $c_k$, DicFace represents the latent code $\hat{v}_{i,j}$ at location $(i,j)$ as a weighted sum of codebook vectors: $\hat{v}_{i,j} = \sum_{k=1}^{N} \hat{w}_{i,j,k}\, c_k$.
- The weights $\hat{w}_{i,j} = [\hat{w}_{i,j,1}, \ldots, \hat{w}_{i,j,N}]$ must sum to 1 and be non-negative, naturally forming a point on the $(N-1)$-dimensional simplex.
- This weight vector is modeled as being sampled from a Dirichlet distribution: $\hat{w}_{i,j} \sim \mathrm{Dir}(\hat{\alpha}_{i,j})$, where $\hat{\alpha}_{i,j}$ are the concentration parameters for location $(i,j)$.
- Spatio-Temporal Transformer for Parameter Prediction:
- A key component is a spatio-temporal Transformer network $H$. This Transformer takes the sequence of latent feature maps $\{z_r\}_{r=1}^{R}$ from $R$ consecutive input frames.
- It uses alternating spatial and temporal self-attention blocks to capture dependencies within frames and across frames.
- The output of the Transformer is then linearly projected to predict the Dirichlet parameters $\{\hat{\alpha}_r\}_{r=1}^{R}$ for each frame $r$ and each spatial location $(i,j)$.
- Variational Inference and ELBO Loss:
- The model is trained variationally to maximize the Evidence Lower Bound (ELBO). The ELBO objective encourages the predicted posterior $q_\theta(\hat{w} \mid x)$ (the Dirichlet distribution parameterized by $\hat{\alpha}$) to stay close to a prior $p_\theta(\hat{w})$ (also a Dirichlet, with hyper-parameter $\alpha$) while maximizing the expected log-likelihood $\mathbb{E}_{q_\theta(\hat{w} \mid x)}[\log p_\theta(y \mid x, \hat{w})]$, i.e., minimizing the expected reconstruction error.
- The reconstruction term is modeled under a Laplacian likelihood assumption, leading to an L1-like loss.
- The overall training loss combines the ELBO term with the LPIPS perceptual loss for better visual quality: $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{ELBO}} + \lambda_2 \mathcal{L}_{\text{LPIPS}}$.
- Sampling from the Dirichlet distribution for the expected reconstruction term is handled with the reparameterization trick (or Monte Carlo sampling) so that gradients can flow through the sampled weights; a loss sketch is given after this list.
- Training Strategy:
- The model is trained on video face datasets like VFHQ.
- Training involves progressively unfreezing model components: encoder/decoder first, then the Transformer, and potentially fine-tuning the codebook.
- The Dirichlet prior hyper-parameter $\alpha$ can be tuned: small values ($\alpha \to 0$) encourage sparse, near one-hot weights (resembling discrete assignment), while larger values lead to smoother weight distributions (closer to uniform averaging of codebook entries). The ablation studies show that an intermediate value such as $\alpha = 1.0$ works well, balancing sparsity and smoothness.
- Inference:
- During inference, a sliding window approach is used. The network processes a fixed number of frames (e.g., 5) at a time.
- Padding (e.g., repeating start/end frames) is used for sequences shorter than the window size.
- The prediction for the central frame of the window is typically taken as the output, with a stride of 1 frame to ensure temporal overlap and consistency in the final video.
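The pieces of the variational objective described in the training bullets above map onto standard library calls: PyTorch's `torch.distributions.Dirichlet` supports reparameterized sampling (via Gamma variates), and `kl_divergence` provides the closed-form Dirichlet KL. The following is a hedged sketch under assumed tensor shapes and a placeholder KL weight; it is not the paper's exact loss code, and the LPIPS term would be added on top.

```python
# Sketch of an ELBO-style loss: L1 (Laplacian) reconstruction term plus a
# closed-form KL between the predicted Dirichlet posterior and a symmetric
# Dirichlet prior. Names and weightings are illustrative placeholders.
import torch
from torch.distributions import Dirichlet, kl_divergence

def elbo_loss(alpha_hat, restored, target, prior_alpha=1.0, kl_weight=1e-3):
    """alpha_hat: (B, H, W, N) predicted concentrations;
    restored / target: decoder output and ground-truth frames."""
    posterior = Dirichlet(alpha_hat)
    prior = Dirichlet(torch.full_like(alpha_hat, prior_alpha))
    # Closed-form KL for Dirichlet distributions (uses digamma internally).
    kl = kl_divergence(posterior, prior).mean()
    # Laplacian likelihood assumption -> L1 reconstruction loss.
    recon = torch.nn.functional.l1_loss(restored, target)
    return recon + kl_weight * kl
```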
Practical Implementation Considerations:
- Architecture: Implementing the spatio-temporal Transformer requires handling multi-dimensional tensors representing sequences of image feature maps. Alternating spatial and temporal attention can be implemented by reshaping the input tensor appropriately for each attention type (see the sketch after this list).
- Dirichlet Distribution and Sampling: Libraries like PyTorch or TensorFlow provide implementations for the Dirichlet distribution and sampling. The reparameterization trick for the Dirichlet distribution involves sampling from Gamma distributions and normalizing, which is differentiable.
- Loss Function: The KL divergence term for Dirichlet distributions has a closed-form solution involving the digamma function (ψ), which is available in deep learning libraries. The expected reconstruction term requires sampling from the Dirichlet distribution and computing the reconstruction loss.
- Codebook Management: The codebook is a learned parameter matrix. Its size N impacts the model's capacity and computational cost. The paper explores sizes like 256, 512, 1024.
- Computational Resources: The spatio-temporal Transformer can be computationally intensive, especially for longer sequences or higher feature dimensions. The sliding window inference helps manage this for long videos but introduces overhead.
- Data Requirements: Training requires a large dataset of high-quality and degraded video face pairs, like VFHQ. Generating degraded versions requires a realistic degradation pipeline.
- Deployment: The model can be deployed using standard deep learning inference frameworks. The sliding window approach means that real-time processing might require optimizing the model and inference pipeline.
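As a concrete illustration of the reshaping mentioned in the architecture bullet above, the block below alternates spatial and temporal self-attention over a `(batch, time, channels, height, width)` tensor using standard `nn.MultiheadAttention`. It is a simplified sketch (no layer norms, MLPs, or positional encodings) and does not reproduce the authors' exact architecture.

```python
# Alternating spatial and temporal self-attention via tensor reshaping.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Spatial attention: tokens are the H*W positions within each frame.
        s = x.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)
        s = s + self.spatial_attn(s, s, s, need_weights=False)[0]
        # Temporal attention: tokens are the T frames at each spatial position.
        u = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        u = u + self.temporal_attn(u, u, u, need_weights=False)[0]
        # Restore the original (B, T, C, H, W) layout.
        return u.reshape(b, h * w, t, c).permute(0, 2, 3, 1).reshape(b, t, c, h, w)

# Usage: a 5-frame window of 16x16 latent maps with 256 channels.
y = SpatioTemporalBlock()(torch.randn(1, 5, 256, 16, 16))
```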
Concrete Examples and Applications:
- Blind Video Face Restoration: Recovering details from videos with unknown and complex degradations (compression artifacts, noise, blur, low resolution). DicFace demonstrates state-of-the-art quantitative and qualitative results on the VFHQ dataset for this task, significantly improving temporal stability (lower TLME and competitive FVD) compared to previous methods.
- Video Face Inpainting: Filling in missing regions (e.g., occlusions) in face videos. The continuous latent space helps generate coherent textures and structures that smoothly integrate with surrounding frames.
- Video Face Colorization: Adding color to grayscale face videos. The model learns to predict realistic and temporally consistent colors.
The paper demonstrates that reformulating the latent space of VQ-VAEs with a Dirichlet constraint and variational inference is a powerful strategy for adapting image priors to video tasks while effectively addressing the critical issue of temporal consistency. The open-sourced code further facilitates the practical application and exploration of this approach.