- The paper introduces MOL, a novel multitask framework that jointly estimates micro-expressions, optical flow, and facial landmarks to capture subtle facial movements.
- It employs a unique F5C block that combines fully-connected and channel correspondence convolutions for enhanced global and local feature extraction.
- Empirical results on CASME II, SAMM, and SMIC datasets demonstrate MOL’s superior accuracy and weighted F1 scores without relying on key frame extraction.
The paper "MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution" addresses the complexities of Facial Micro-Expression Recognition (MER), a niche yet pivotal problem in computer vision and affective computing. The authors propose a novel multitask framework to recognize micro-expressions by jointly estimating optical flow and facial landmarks, thus enhancing feature extraction capabilities.
The complexity of MER stems from the transient nature of micro-expressions, which are brief and subtle facial actions that often fall below the limits of conscious perception. Traditional approaches have relied heavily on hand-crafted features, key frames such as onset and apex, or deep neural networks, but the latter are constrained by small-scale datasets, restricting their efficacy. This work innovates by developing an end-to-end deep learning framework built around a unique feature extraction block, F5C, which blends the principles of transformers and graph convolutions with vanilla convolutions.
The F5C block introduces two critical components: Fully-Connected Convolution (FCC) and Channel Correspondence Convolution (CCC). FCC is designed to maintain global receptive fields while extracting local features, reminiscent of transformer architectures. CCC, on the other hand, seeks to model the correlations between feature map patterns, similar to operations performed in graph-based convolutions. Through these components, the authors aim to alleviate the prevalent issues faced in capturing the nuanced, subtle movements that characterize micro-expressions.
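To make the two components concrete, here is a minimal numpy sketch of the ideas as described above: an FCC-style operation that sums a local 3x3 depthwise convolution with a fully-connected mixing over all spatial positions (global receptive field), and a CCC-style operation that builds a channel-affinity matrix and propagates features over it, graph-convolution style. The function names, weight shapes, and normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fcc(x, w_local, w_global):
    """FCC-style sketch: local 3x3 depthwise conv + fully-connected spatial mixing.

    x: (C, H, W) feature map; w_local: (C, 3, 3); w_global: (H*W, H*W).
    The FC term gives every output position a global receptive field,
    while the conv term preserves local feature extraction.
    """
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    local = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            local += w_local[:, i, j, None, None] * padded[:, i:i + H, j:j + W]
    # Fully-connected mixing across all H*W positions (global context).
    global_mix = (x.reshape(C, H * W) @ w_global).reshape(C, H, W)
    return local + global_mix

def ccc(x):
    """CCC-style sketch: propagate features over a channel-affinity graph.

    Builds a (C, C) affinity matrix from flattened feature maps, normalizes
    it row-wise with a softmax, and mixes channels through it, analogous to
    one step of graph convolution over channel nodes.
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    affinity = flat @ flat.T                              # (C, C) channel correlations
    affinity = np.exp(affinity - affinity.max(axis=1, keepdims=True))
    affinity /= affinity.sum(axis=1, keepdims=True)       # softmax-normalized adjacency
    return (affinity @ flat).reshape(C, H, W)
```

Both operations preserve the feature map shape, so they can be stacked with ordinary convolutions inside a block, which is the structural role F5C plays in the paper.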
Empirical evaluations were conducted on established datasets, namely CASME II, SAMM, and SMIC, showing MOL's superiority over existing methods in terms of accuracy and weighted F1 scores. Notably, MOL stands apart from other methods by not requiring prior knowledge of key frames, a constraint many prior systems carry. This flexibility owes much to the framework's comprehensive design and capable multitask learning, which uses optical flow estimation and facial landmark detection as auxiliary tasks to boost the main MER task. Interestingly, the results indicated that both auxiliary tasks significantly aided MER, with optical flow estimation proving slightly more critical of the two.
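The auxiliary-task setup can be sketched as a weighted sum of three losses: cross-entropy for micro-expression classification, endpoint error for optical flow, and L2 distance for landmarks. The specific loss choices and the weights `w_flow` and `w_lmk` below are illustrative assumptions; the paper's exact weighting scheme is not reproduced here.

```python
import numpy as np

def cross_entropy(logits, label):
    # Standard softmax cross-entropy for the main MER classification task.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def endpoint_error(flow_pred, flow_gt):
    # Mean per-pixel endpoint error for the optical flow auxiliary task.
    # flow_pred, flow_gt: (H, W, 2) displacement fields.
    return np.sqrt(((flow_pred - flow_gt) ** 2).sum(axis=-1)).mean()

def landmark_l2(lmk_pred, lmk_gt):
    # Mean L2 distance for the facial landmark auxiliary task.
    # lmk_pred, lmk_gt: (N, 2) landmark coordinates.
    return np.sqrt(((lmk_pred - lmk_gt) ** 2).sum(axis=-1)).mean()

def total_loss(logits, label, flow_pred, flow_gt, lmk_pred, lmk_gt,
               w_flow=0.5, w_lmk=0.5):
    # Hypothetical weighting: auxiliary losses regularize the shared
    # feature extractor rather than dominate the MER objective.
    return (cross_entropy(logits, label)
            + w_flow * endpoint_error(flow_pred, flow_gt)
            + w_lmk * landmark_l2(lmk_pred, lmk_gt))
```

With perfect auxiliary predictions the extra terms vanish and the objective reduces to the MER loss alone, which makes explicit that optical flow and landmarks serve only to shape the shared representation.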
The incorporation of known datasets, alongside the decision to handle raw frame images directly, indicates MOL's design robustness and improved generalization capabilities across different expression categories and unseen samples. Importantly, the proposal answers a fundamental challenge in deep learning-powered solutions for MER: handling small-scale datasets through auxiliary task sharing and a robust feature extraction mechanism within F5C blocks.
Theoretical contributions align well with the implementation, potentially guiding future developments in deep learning frameworks tackling MER and similar computer vision tasks. The framework leverages and innovates on multitask learning paradigms, suggesting broader applications beyond micro-expression recognition as it adapts multiple subtasks to inform a more accurate detection and classification pipeline.
In summary, MOL surpasses many existing methods, charting a promising trajectory for comprehensive facial action recognition systems. Its design reflects a mature appreciation of the intrinsic difficulties of micro-expression recognition and offers a robust alternative ready for further exploration. This line of research opens pathways for subsequent AI systems to seamlessly integrate multiple tasks, promising more accurate representations and analyses of subtle, often subconscious, human emotions captured through facial expressions.