
Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery (2510.01662v1)

Published 2 Oct 2025 in cs.CV

Abstract: Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative: a compact and interpretable dictionary of facial expressions learned from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.

Summary

Discrete Facial Encoding: A Framework for Data-driven Facial Display Discovery

Introduction

The paper introduces Discrete Facial Encoding (DFE), a novel unsupervised framework for facial expression analysis that overcomes the limitations of traditional methodologies like the Facial Action Coding System (FACS). Utilizing a Residual Vector Quantized Variational Autoencoder (RVQ-VAE), the model learns a data-driven, compact, and interpretable dictionary of facial expressions directly from 3D mesh sequences. This approach not only enhances the scope of facial display encoding beyond predefined AU combinations but also eliminates the dependence on extensive manual annotations by operating in an unsupervised manner.

Framework Overview

The proposed DFE framework extracts identity-invariant expression features using a 3D Morphable Model (3DMM), employing these features as inputs to an RVQ-VAE that encodes images into sequences of discrete tokens. Each token in the shared codebook encapsulates a reusable facial deformation pattern, contributing to a holistic expression representation. An overview of the expression coding framework illustrates the conversion of expression vectors into discrete tokens, showcasing the interpretability and additive nature of the representation (Figure 1).

Figure 1: Overview of our proposed expression coding framework. Expression vectors extracted from a 3DMM model are encoded into discrete tokens using RVQ-VAE.
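To make the additive, residual nature of the tokenization concrete, the following is a minimal NumPy sketch of residual vector quantization. It is an illustration only: the codebook sizes, feature dimensionality, and the plain nearest-neighbor lookup are assumptions for this toy example, whereas the paper's actual model learns its codebooks jointly with a Transformer encoder inside an RVQ-VAE.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left over by the previous stage, so the selected code vectors sum
    to an increasingly precise reconstruction of x."""
    residual = x.copy()
    tokens, recon = [], np.zeros_like(x)
    for cb in codebooks:                        # one codebook per stage
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest code vector
        tokens.append(idx)
        recon += cb[idx]                        # additive contribution
        residual -= cb[idx]                     # pass residual onward
    return tokens, recon

# Toy setup: 3 quantization stages, 8 codes per stage, 4-D features.
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
x = rng.normal(size=4)
tokens, recon = rvq_encode(x, codebooks)
```

The key design property this sketch demonstrates is additivity: each token contributes one deformation pattern, and later stages refine what earlier stages left unexplained, which is what makes the resulting expression code both compact and interpretable.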

Implementation Details

The implementation leverages EMOCA and the RVQ-VAE framework. The 3DMM features are reshaped into a series of tokens, mapped into a latent space by a Transformer encoder, and discretized through residual vector quantization to produce a sequence of tokens. This enables visualization and interpretation of facial expressions in a structured manner, facilitating scalable learning across diverse datasets. Regularization terms such as orthogonality and sparsity encourage the learned codebook to capture detailed, non-redundant deformation patterns.
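The summary mentions an orthogonality regularizer without giving its exact form; a common formulation, sketched here as an assumption, penalizes overlap between code vectors by pushing the Gram matrix of the row-normalized codebook toward the identity.

```python
import numpy as np

def orthogonality_penalty(codebook):
    """Penalize redundancy among code vectors: after row-normalizing
    the codebook, its Gram matrix should be close to the identity,
    i.e. distinct codes should be mutually (near-)orthogonal."""
    norms = np.linalg.norm(codebook, axis=1, keepdims=True)
    c = codebook / np.maximum(norms, 1e-8)      # unit-norm rows
    gram = c @ c.T                              # pairwise cosine similarities
    off_diag = gram - np.eye(len(codebook))
    return float((off_diag ** 2).mean())

# An orthogonal codebook incurs zero penalty; a fully redundant one
# (all codes identical) incurs a large penalty.
p_ortho = orthogonality_penalty(np.eye(4))
p_redundant = orthogonality_penalty(np.ones((4, 4)))
```

Added to the reconstruction and quantization losses, a term like this discourages multiple codebook entries from encoding the same facial deformation, which supports the paper's claim that each token captures a distinct, reusable pattern.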

Experimental Validation

The effectiveness of DFE is validated through experiments on high-level psychological tasks including stress detection, personality prediction, and depression detection. A simple Bag-of-Words model constructed over the learned tokens demonstrated superior performance against both FACS-based pipelines and contemporary deep learning models such as Masked Autoencoders. Expression retrieval metrics indicated that DFE outperforms AU-based systems in both expression accuracy and diversity (Figure 2).

Figure 2: Qualitative retrieval examples comparing our token-based representation (DFE) with AU-based encoding.
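The Bag-of-Words step described above can be sketched in a few lines. This is a hypothetical, self-contained illustration: the function name, the normalization choice, and the toy token sequences are assumptions, not the paper's code. Each video's token sequence is collapsed into a length-normalized histogram over the codebook vocabulary, and those histograms serve as fixed-size features for a downstream classifier.

```python
import numpy as np

def bag_of_words(token_seqs, vocab_size):
    """Build one normalized token histogram per video: how often each
    codebook entry fires, regardless of when in the sequence it fires."""
    feats = np.zeros((len(token_seqs), vocab_size))
    for i, seq in enumerate(token_seqs):
        for t in seq:
            feats[i, t] += 1
        if seq:
            feats[i] /= len(seq)    # normalize by sequence length
    return feats

# Two toy "videos", each a sequence of DFE tokens from a vocab of 8.
videos = [[0, 0, 2, 5], [1, 1, 1, 3]]
X = bag_of_words(videos, vocab_size=8)
# X[0] puts weight 0.5 on token 0; X[1] puts weight 0.75 on token 1.
```

Discarding temporal order is a deliberate simplicity trade-off: it yields a fixed-size, interpretable feature (which tokens a person displays, and how often) that any standard classifier can consume for tasks like stress or depression detection.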

Results and Implications

DFE's quantitative and qualitative results highlight its efficacy in capturing intricate and diverse facial behaviors. The learned representations show enhanced capability for downstream psychological inference, encompassing a wider range of expressions. This opens new avenues for scalable applications in domains requiring robust, interpretable facial analysis, beyond the reach of conventional FACS methods. The discrete tokens learned through the RVQ-VAE hold potential for broader adoption in affective computing and behavioral health diagnostics (Figure 3).

Figure 3: Visualization of the learned facial templates.

Conclusion

The DFE framework marks a significant step toward automating facial display analysis. By eschewing supervised annotation, it offers a scalable, precise, and interpretable way to capture facial expressions across diverse applications. The empirical results affirm its potential as a comprehensive alternative to FACS, capable of reshaping affective computing and emotion recognition practice. Future research will explore integrating multimodal signals to further enhance behavioral modeling capabilities.
