Overview of X2CNet and the X2C Dataset
This paper presents a comprehensive resource for advancing the ability of humanoid robots to imitate nuanced facial expressions, a critical component for effective affective human-robot interaction. The introduced dataset, X2C, along with the imitation framework, X2CNet, provides substantial improvements over previous approaches in the field.
X2C Dataset
The dataset, labeled X2C (Anything to Control), comprises 100,000 image-control value pairs. Each image captures a humanoid robot displaying a facial expression, while the corresponding control values encode that expression's configuration across expression-relevant units. These units include the brows, lids, gaze, nose, mouth, head, and neck.
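One X2C pair can be pictured as an image plus a vector of control values grouped by expression unit. The record layout below is a hypothetical sketch based only on the description above; the field names, per-unit grouping, and toy values are assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Expression-relevant units named in the dataset description.
UNITS = ("brows", "lids", "gaze", "nose", "mouth", "head", "neck")

@dataclass
class X2CPair:
    """One image-control value pair (hypothetical layout)."""
    image_path: str
    controls: dict  # unit name -> list of control values

    def flat_controls(self):
        """Concatenate per-unit values into a single control vector."""
        return [v for unit in UNITS for v in self.controls.get(unit, [])]

# Example pair with made-up control values for two of the seven units.
pair = X2CPair(
    image_path="frames/000042.png",
    controls={"brows": [0.2, 0.8], "mouth": [0.5, 0.1, 0.9]},
)
print(pair.flat_controls())  # [0.2, 0.8, 0.5, 0.1, 0.9]
```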
Key attributes of the dataset are:
- Scale and Diversity: X2C is distinguished by its large scale and diversity, surpassing existing datasets in size and annotation dimensionality. Its inclusion of asymmetric facial expressions enhances the robot's capacity to reproduce human-like behavior.
- Annotation Precision: Unlike datasets reliant on facial landmark estimations, X2C utilizes analytically calculated control values derived from interpolation functions, ensuring high accuracy and temporal alignment between images and control values.
- Volunteer Curation and Bias Mitigation: Facial expression animations for data collection were curated by volunteers from diverse cultural backgrounds and genders, mitigating bias and ensuring broad expression coverage.
The dataset's quality metrics, such as annotation accuracy and data diversity, emphasize its potential as a valuable resource for humanoid facial expression imitation.
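The annotation approach noted above — control values computed analytically rather than estimated from facial landmarks — can be illustrated with simple keyframe interpolation. This is a generic sketch (the paper's actual interpolation functions are not specified here): a control value is interpolated between two animation keyframes, so every rendered frame receives an exact, temporally aligned label from the same curve that drives the animation.

```python
def interpolate_control(t, keyframes):
    """Linearly interpolate a control value at time t from (time, value) keyframes.

    Because the label is computed from the same analytic curve that drives the
    animation, it is exact for the frame rendered at time t.
    """
    keyframes = sorted(keyframes)
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    if t >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            alpha = (t - t0) / (t1 - t0)
            return v0 + alpha * (v1 - v0)

# Brow control rising from 0.0 to 1.0 over one second.
keys = [(0.0, 0.0), (1.0, 1.0)]
print(interpolate_control(0.25, keys))  # 0.25
```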
X2CNet: Framework for Expression Imitation
X2CNet is designed to enable humanoid robots to imitate human facial expressions realistically by learning from the X2C dataset. It decomposes the expression imitation task into two modules:
- Motion Transfer Module: Captures subtle expression dynamics from human faces.
- Mapping Network: Learns the correspondence between humanoid facial expressions and their underlying control values.
The framework outputs 30 continuous control values, enabling fine-grained control over the humanoid's expressive units. Such detailed modeling is crucial for realistic expression imitation.
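The two-module decomposition can be sketched as a pipeline: a motion transfer step extracts an expression feature from a human face, and a mapping network regresses that feature to 30 control values. Everything below — the identity-style motion transfer stub, the feature dimension, and the single linear layer with a sigmoid — is a deliberately minimal stand-in, not the paper's actual architecture:

```python
import math
import random

N_CONTROLS = 30  # number of continuous control values predicted by X2CNet

def motion_transfer(face_pixels):
    # Stand-in for the motion transfer module: the real module captures
    # subtle expression dynamics; here we merely mean-center the input.
    mean = sum(face_pixels) / len(face_pixels)
    return [p - mean for p in face_pixels]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MappingNetwork:
    """Toy linear map from expression features to control values in (0, 1)."""
    def __init__(self, in_dim, out_dim=N_CONTROLS, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]

    def __call__(self, feat):
        return [sigmoid(sum(wi * fi for wi, fi in zip(row, feat)))
                for row in self.w]

# End-to-end sketch: face input -> expression feature -> 30 controls.
face = [0.1, 0.4, 0.9, 0.3]
controls = MappingNetwork(in_dim=len(face))(motion_transfer(face))
print(len(controls))  # 30
```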
Experimental Findings
The paper reports significant findings from applying X2CNet to the task of predicting control values, where the model demonstrated superior performance with a mean absolute error (MAE) of 0.0114 on the test set, outperforming baseline approaches. The experiments underscored the robustness and efficacy of the proposed framework.
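The reported MAE of 0.0114 is an average absolute deviation between predicted and ground-truth control vectors. A minimal sketch of that metric follows; the exact averaging scheme over dimensions and samples is an assumption:

```python
def mean_absolute_error(preds, targets):
    """Average |prediction - target| over all samples and control dimensions."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(preds, targets):
        for p, t in zip(p_vec, t_vec):
            total += abs(p - t)
            count += 1
    return total / count

# Two toy samples with two control values each.
preds = [[0.50, 0.20], [0.80, 0.10]]
targets = [[0.52, 0.18], [0.79, 0.11]]
print(mean_absolute_error(preds, targets))  # ≈ 0.015
```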
Statistical analyses further validated the reliability of these results. Ablation studies compared alternative feature-extractor architectures, with VGG16 and ViT-B/16 offering favorable trade-offs between accuracy and computational efficiency.
Real-World Demonstrations and Implications
To illustrate the practical applicability of X2CNet, the paper includes real-world demonstrations of humanoid robots imitating diverse human facial expressions that capture a range of emotional nuances. These experiments involved human performers from multiple countries, showcasing X2CNet's ability to handle expression imitation across varied conditions and demographics.
The practical implications of this work are significant, not only advancing the fidelity of humanoid robots in social interactions but also opening avenues for applying such robots in fields like healthcare, education, and support for individuals with special needs. Future research directions could explore expanding the dataset to include emotion labels and adapting the framework to different humanoid platforms.
Conclusion
In summary, this paper presents the X2C dataset and X2CNet as key contributions to the field of humanoid facial expression imitation. By providing a high-quality, diverse dataset and a robust imitation framework, it sets the stage for further advancements in the development of emotionally intelligent robots capable of engaging and interacting with humans in more meaningful and nuanced ways.