REACT2024 Benchmark Dataset
- REACT2024 is a benchmark dataset offering segmented, multimodal dyadic video conference interactions with comprehensive facial annotations for improved facial reaction synthesis.
- The dataset underpins two primary tasks—offline and online facial reaction generation—by providing pairwise audiovisual recordings with fine-grained frame-level facial attributes and standardized evaluation metrics.
- Baseline models like REGNN, BeLFusion, and Trans-VAE demonstrate varied trade-offs in achieving appropriateness, diversity, realism, and synchrony, offering actionable benchmarks for future research.
The REACT2024 dataset is a benchmark resource comprising segmented, multimodal dyadic video-conference interactions, used for the systematic study and evaluation of machine learning models that generate multiple appropriate, realistic, diverse, and temporally synchronized human facial reactions in response to a conversational partner's behavior. Originating from the REACT2023/2024 challenges, the dataset supports both offline and online facial reaction generation, providing pairwise annotated audiovisual recordings, fine-grained frame-level facial attributes, and standardized evaluation metrics for reaction appropriateness, diversity, realism, and synchrony.
1. Dataset Origin and Composition
The REACT2024 dataset is composed of 30-second interaction clips sourced from the NoXi and RECOLA video-conference corpora. The NoXi corpus contributes 5,870 clips (about 49 hours), while RECOLA adds 54 clips (~0.4 hours). Each segment is extracted to ensure that both participants’ faces are fully visible and behavior is naturalistic in the context of video conferencing. The dataset encompasses a variety of conversational scenarios, ranging from information exchange to affectively charged interactions, capturing variability in both verbal and nonverbal communicative dynamics.
For large-scale training and evaluation, the ReactDiff paper utilizes 2,962 sessions with a partition of 1,594 for training, 562 for validation, and 806 for testing. Each session consists of synchronized speaker and listener tracks, facilitating the modeling of the dyadic “action–reaction” cycle fundamental to natural conversation.
2. Multimodal Annotations and Feature Extraction
Every frame in the dataset is annotated with a 25-dimensional attribute vector (an illustrative indexing is sketched after this list). This includes:
- 15 facial action unit (AU) occurrences, predicted using the GraphAU model,
- 2 continuous affect measures: valence and arousal,
- 8 frame-level probabilities for canonical expressions (e.g., happy, sad).
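As an illustrative sketch, the per-frame annotation can be handled as a plain NumPy vector. The index layout below (AU occurrences first, then valence/arousal, then expression probabilities) is an assumption chosen for illustration, not the dataset's documented ordering.

```python
import numpy as np

# Assumed index layout of the 25-D per-frame attribute vector
# (the actual ordering in the released annotations may differ).
AU_SLICE = slice(0, 15)        # 15 facial action unit occurrences (GraphAU)
AFFECT_SLICE = slice(15, 17)   # valence, arousal
EXPR_SLICE = slice(17, 25)     # 8 canonical expression probabilities

def split_frame_attributes(frame_vec: np.ndarray) -> dict:
    """Split one 25-D frame annotation into its three attribute groups."""
    assert frame_vec.shape == (25,), "expected a single 25-D frame vector"
    return {
        "action_units": frame_vec[AU_SLICE],
        "valence_arousal": frame_vec[AFFECT_SLICE],
        "expression_probs": frame_vec[EXPR_SLICE],
    }
```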
In ReactDiff, additional 3D Morphable Model (3DMM) coefficients are computed per frame using the FaceVerse framework. These coefficients cover expression, pose, and translation parameters (e.g., “BrowInnerUp,” “JawOpen,” head orientation) and serve as interpretable soft annotations of the facial configuration. Speaker audio is encoded either as Mel-frequency cepstral coefficients (MFCCs, 78-D, extracted via Torchaudio) or with self-supervised models such as wav2vec 2.0 to capture prosody and acoustic context.
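A minimal Torchaudio sketch of the MFCC pathway follows. The 26-coefficient plus delta and delta-delta layout (26 × 3 = 78) and the STFT parameters are assumptions chosen to reach 78 dimensions; the challenge's actual configuration is not reproduced here.

```python
import torch
import torchaudio

def extract_mfcc_78d(wav_path: str, sample_rate: int = 16000) -> torch.Tensor:
    """Return a (num_frames, 78) MFCC feature matrix for one speaker track.

    26 MFCCs + deltas + delta-deltas = 78 dims; this layout is an assumption
    and may differ from the features used by the challenge baselines.
    """
    waveform, sr = torchaudio.load(wav_path)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=sample_rate,
        n_mfcc=26,
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 64},
    )(waveform.mean(dim=0))                               # mono input: (26, T)
    delta = torchaudio.functional.compute_deltas(mfcc)    # first-order deltas
    delta2 = torchaudio.functional.compute_deltas(delta)  # second-order deltas
    feats = torch.cat([mfcc, delta, delta2], dim=0)       # (78, T)
    return feats.transpose(0, 1)                          # (T, 78)
```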
This comprehensive annotation enables joint modeling of expressive variation, head movement, affective state, and inter-personal synchrony, with sufficient granularity for state-of-the-art generation and evaluation approaches.
3. Challenge Structure and Guidelines
The REACT2024 challenge establishes two primary sub-tasks (contrasted in the sketch after this list):
- Offline Multiple Appropriate Facial Reaction Generation: Models generate the full listener facial reaction sequence – both facial attribute time series and 2D renderings – conditioned on the entire speaker behavior for a segment.
- Online Multiple Appropriate Facial Reaction Generation: Models produce the reaction sequence in a streaming (causal) fashion, synchronously with the progressing speaker behavior, simulating real-time scenarios.
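The practical difference between the two sub-tasks is the information a model may condition on. The sketch below contrasts the two interfaces; the class and method names are hypothetical and only illustrate the causal constraint, not the challenge's actual API.

```python
import torch

class OfflineReactionGenerator:
    """Offline sub-task: the entire speaker segment is visible before generation."""

    def generate(self, speaker_feats: torch.Tensor) -> torch.Tensor:
        # speaker_feats: (T, D) -- the model may attend over all T frames at once.
        T, _ = speaker_feats.shape
        return torch.zeros(T, 25)  # placeholder full listener reaction sequence


class OnlineReactionGenerator:
    """Online sub-task: speaker frames arrive causally; only past context is usable."""

    def __init__(self) -> None:
        self.history: list[torch.Tensor] = []  # accumulated past speaker frames

    def step(self, speaker_frame: torch.Tensor) -> torch.Tensor:
        # speaker_frame: (D,) -- no access to future frames is permitted.
        self.history.append(speaker_frame)
        return torch.zeros(25)  # placeholder reaction for the current time step
```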
Participants must submit not only model checkpoints and generated results but also source code and technical reports, fostering reproducibility and methodological transparency. Model evaluation focuses on four criteria: appropriateness, diversity, realism, and synchrony, each quantified through established and challenge-specific metrics.
4. Baseline Methods and Performance Metrics
Three baseline approaches provide foundational benchmarks (the one-to-many sampling pattern they share is sketched after this list):
- Trans-VAE: A CNN extracts visual speaker features, a transformer fuses these with audio, and a generative decoder predicts the distribution over 3DMM coefficients and facial attribute time series. This baseline implements a direct temporal mapping, using an architecture inspired by TEACH.
- BeLFusion: A two-step approach in which a VAE first learns a latent representation of the visual reaction features, and a latent diffusion model (LDM) then predicts these representations conditioned on a sliding window of past speaker features. Trade-offs between reaction diversity and synchrony are modulated via AU binarization.
- REGNN: The Reversible Graph Neural Network encodes the distribution of appropriate reactions as a Gaussian mixture graph, then samples individual facial reactions with a reversible GNN-based “motor processor.” This formulation directly supports the one-to-many response paradigm inherent to naturalistic reactions.
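All three baselines are generative, so multiple appropriate reactions to one speaker input are obtained by repeated stochastic sampling. The loop below is a generic sketch of that one-to-many principle only; `encode_condition`, `decode`, and `latent_dim` are hypothetical names and do not correspond to the baselines' actual interfaces.

```python
import torch

@torch.no_grad()
def sample_multiple_reactions(model, speaker_feats: torch.Tensor,
                              num_samples: int = 10) -> torch.Tensor:
    """Draw several listener reactions for the same speaker input.

    The method and attribute names on `model` are hypothetical; they stand in
    for whatever conditioning and decoding interface a given baseline exposes.
    """
    cond = model.encode_condition(speaker_feats)          # speaker context
    reactions = []
    for _ in range(num_samples):
        z = torch.randn(cond.shape[0], model.latent_dim)  # fresh stochastic latent
        reactions.append(model.decode(z, cond))           # one (T, 25) reaction
    return torch.stack(reactions, dim=0)                  # (num_samples, T, 25)
```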
Evaluation employs the following metrics:
- Appropriateness: Facial Reaction Correlation (FRCorr), based on the Concordance Correlation Coefficient (CCC), and Facial Reaction Distance (FRDist), computed with Dynamic Time Warping against appropriate real reactions (a minimal CCC implementation is sketched after this list).
- Realism: Fréchet Inception/Video Distance (FID/FVD).
- Diversity: FRDiv, FRDvs, FRVar.
- Synchrony: FRSyn, based on Time Lagged Cross-Correlation between speaker behavior and the generated reaction.
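As referenced in the appropriateness bullet above, FRCorr builds on the Concordance Correlation Coefficient. A minimal NumPy version of CCC for a single attribute time series is sketched below; the challenge's official scoring additionally aggregates over attributes, samples, and the set of appropriate real reactions, which is omitted here.

```python
import numpy as np

def concordance_correlation(pred: np.ndarray, target: np.ndarray) -> float:
    """CCC between two 1-D attribute time series (e.g., one AU over a clip)."""
    p_mean, t_mean = pred.mean(), target.mean()
    covariance = ((pred - p_mean) * (target - t_mean)).mean()
    return float(2.0 * covariance /
                 (pred.var() + target.var() + (p_mean - t_mean) ** 2))
```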
Baseline results indicate that REGNN achieves superior appropriateness and synchrony, while BeLFusion yields higher diversity; Trans-VAE performs robustly across both offline and online modalities.
| Baseline Model | FRCorr (Appropriateness, ↑) | FRDiv (Diversity, ↑) | FRSyn (Synchrony, ↓) |
|---|---|---|---|
| REGNN | Highest | Moderate | Lowest |
| BeLFusion | Moderate | Highest | Moderate |
| Trans-VAE | Reference-level | Reference-level | Reference-level |
↑ = higher is better; ↓ = lower is better. Entries are qualitative rankings; precise metric definitions follow the challenge guidelines.
5. Data Preprocessing and Utilization Protocols
For each session, raw video is preprocessed to extract 3DMM coefficients for both speaker and listener via FaceVerse, together with synchronized MFCCs from the audio track. The data are then segmented into short, fixed-length windows of frames, both to reflect the cognitive delays characteristic of human listeners and to facilitate temporally coherent sequence prediction in the modeling pipeline.
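A simple windowing helper in the spirit of this preprocessing step is sketched below; `window_len` and `stride` are unspecified hyperparameters here, since the exact values used in the actual pipeline are not given in this section.

```python
import numpy as np

def segment_into_windows(features: np.ndarray, window_len: int,
                         stride: int) -> np.ndarray:
    """Cut a (T, D) per-frame feature sequence into overlapping windows.

    window_len and stride are illustrative hyperparameters; the values used in
    the actual preprocessing pipeline are not reproduced here.
    """
    T = features.shape[0]
    if T < window_len:
        raise ValueError("sequence shorter than one window")
    starts = range(0, T - window_len + 1, stride)
    return np.stack([features[s:s + window_len] for s in starts])  # (N, window_len, D)
```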
The paired structure (speaker input, listener reaction) supports conditional generation and enables direct quantitative assessment of contextual appropriateness, diversity, and synchrony. Each generated sequence can be rendered into 2D imagery using a neural renderer such as PIRender for visual inspection and further automated evaluation.
6. Applications and Research Impact
The REACT2024 dataset catalyzes research in several directions:
- Conditional generation of listener facial reactions given variable speaker input in dyadic settings.
- Temporal modeling of human facial kinematics under conversational constraints.
- Evaluation methodologies for open-ended, multimodal generation tasks, including one-to-many mappings.
- Systems optimization for online human-computer interaction environments, where diversity, realism, and synchrony are essential.
The dataset’s comprehensiveness and standardized protocols enable comparative benchmarking of new model architectures and training paradigms, as demonstrated in recent work such as ReactDiff, where diffusion models incorporating prior knowledge of spatio-temporal facial dynamics set new standards on appropriateness, diversity, and realism (Cheng et al., 6 Oct 2025).
7. Access and Resources
The dataset and benchmark code for all referenced baselines are openly available to the research community at https://github.com/reactmultimodalchallenge/baseline_react2024. This repository includes scripts for data handling, model instantiation, training, and inference, as well as implementations for all primary evaluation metrics, facilitating transparency and fostering further developments within the field.
A plausible implication is that the continued adoption and development of protocols based on REACT2024 will advance the state of multimodal generative modeling, especially for applications in affective computing and human-computer interaction requiring sensitivity to conversational context, diversity of appropriate responses, and multimodal temporal realism.