
Human-Human Interaction Anomaly Detection

Updated 22 December 2025
  • Human-Human Interaction Anomaly Detection is the study of identifying abnormal patterns in coordinated, multi-person engagements using 3D joint data.
  • Recent models, like IADNet, integrate temporal attention and distance-based relational encoding to markedly improve AUROC scores over single-person detectors.
  • H2IAD research informs applications in collaborative robotics, behavioral health, and executive coaching while addressing challenges like temporal localization and data noise.

Human-Human Interaction Anomaly Detection (H2IAD) is the identification of abnormal or rare behaviors that arise specifically in interpersonal interactions rather than in isolated individual actions. Because human behavior is often tightly coordinated between partners, H2IAD must capture the temporal, spatial, and relational structure that emerges when two or more individuals interact. Recent advances in 3D human pose estimation and machine learning architectures, coupled with new benchmark datasets, have enabled systematic analysis of H2IAD as a research problem distinct from single-person anomaly detection, with particular relevance to collaborative robotics, social signal processing, behavioral health monitoring, and high-stakes assessment settings such as executive coaching (Maeda et al., 15 Dec 2025, Arakawa et al., 2022).

1. Problem Definition and Task Formulation

H2IAD is formally cast as a one-class anomaly detection problem, where the model is trained on normal interactions between pairs (or groups) of humans, denoted as $I^{(c)} = \{X^{(c)}, Y^{(c)}\}$ with $X^{(c)} = [x_1^{(c)}, \ldots, x_T^{(c)}]$ and $Y^{(c)} = [y_1^{(c)}, \ldots, y_T^{(c)}]$, where $x_i, y_i \in \mathbb{R}^{3D}$ represent the $D$ anatomical joints in 3D space at each frame $i$. Only normal-class samples are observed during training. At inference, the task is to detect clips or moments that deviate from the learned distribution, signaling the presence of anomalous interaction patterns (Maeda et al., 15 Dec 2025). Anomaly scoring is operationalized via functions $S(I)$ that return low values for normal-class inputs and high values for anomalous interactions. This open-set recognition challenge is further complicated by the need to handle rare, context-dependent outliers and the complex, asymmetric, and often synchronized dynamics of multiple agents.
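The one-class setup can be illustrated with a deliberately simple stand-in scorer. The sketch below is not IADNet: it assumes a hypothetical hand-crafted feature (the mean distance between the two people's root joints), fits a Gaussian to that feature on normal clips only, and uses the negative log-likelihood as $S(I)$. The names `mean_root_distance`, `GaussianScorer`, and `make_clip`, and the synthetic data, are illustrative assumptions.

```python
import math
import random
from statistics import mean, pstdev

# A clip is a pair (X, Y); each is a list of T frames, and each frame is a
# flat list of 3*D floats (x, y, z per joint).

def mean_root_distance(clip, root=0):
    """Hypothetical feature: average distance between the two root joints."""
    X, Y = clip
    dists = []
    for x_t, y_t in zip(X, Y):
        px = x_t[3 * root: 3 * root + 3]
        py = y_t[3 * root: 3 * root + 3]
        dists.append(math.dist(px, py))
    return mean(dists)

class GaussianScorer:
    """Toy S(I): Gaussian negative log-likelihood of one scalar feature."""

    def fit(self, normal_clips):
        # Only normal-class clips are seen during fitting.
        feats = [mean_root_distance(c) for c in normal_clips]
        self.mu = mean(feats)
        self.sigma = max(pstdev(feats), 1e-6)
        return self

    def score(self, clip):
        z = (mean_root_distance(clip) - self.mu) / self.sigma
        return 0.5 * z * z + math.log(self.sigma) + 0.5 * math.log(2 * math.pi)

def make_clip(rng, gap, T=8, D=2):
    """Synthetic clip: two people standing `gap` apart along x, with jitter."""
    def person(offset):
        return [[(offset if k % 3 == 0 else 0.0) + rng.gauss(0, 0.01)
                 for k in range(3 * D)] for _ in range(T)]
    return person(0.0), person(gap)

rng = random.Random(0)
normal = [make_clip(rng, gap=rng.uniform(0.9, 1.1)) for _ in range(50)]
scorer = GaussianScorer().fit(normal)
s_normal = scorer.score(make_clip(rng, gap=1.0))  # resembles training data
s_anom = scorer.score(make_clip(rng, gap=3.0))    # unusually far apart
```

Under these assumptions `s_anom` comes out far larger than `s_normal`, which is the only property the task formulation requires of a scoring function.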

2. Model Architectures and Methodological Advances

Earlier anomaly detection frameworks, such as the REsCUE system, used Gaussian Mixture Models (GMMs) fitted separately to each session to model multimodal nonverbal feature vectors (e.g., head pose, body posture, movement energy) (Arakawa et al., 2022). Recent work has introduced architectures explicitly designed for two-person 3D interactions. The Interaction Anomaly Detection Network (IADNet) exemplifies this progression (Maeda et al., 15 Dec 2025). IADNet comprises:

  • Temporal Attention Sharing Module (TASM): Stacked units of synchronized, parameter-sharing Transformers processing each person’s motion, complemented by intra-stream self-attention and cross-stream motion attention to capture collaborative temporal dynamics.
  • Distance-Based Relational Encoding Module (DREM): Injects pairwise joint-to-joint distance matrices as relational tokens, enabling the model to encode explicit social spatial cues not available to purely temporal models.
  • Normalizing Flow-based Scoring: Final likelihood estimates are computed via a bijective mapping $\phi$, yielding exact anomaly scores as $-\log p(f(I))$.
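The scoring step in the last item rests on the change-of-variables formula: for a bijection mapping features to a latent $z$, $\log p(f) = \log p_Z(z) + \log |\det \partial z / \partial f|$. The sketch below shows this with the simplest possible flow, an elementwise affine map with hand-set parameters and a standard-normal base. It is an assumption-laden illustration, not the flow used in IADNet, whose layers and learned parameters are not described here.

```python
import math

class AffineFlow:
    """Elementwise bijection z = (f - shift) / scale, standard-normal base.

    shift and scale are hand-set here; a real flow would learn its
    parameters by maximizing likelihood on normal interactions.
    """

    def __init__(self, shift, scale):
        self.shift, self.scale = shift, scale

    def forward(self, f):
        z = [(fi - m) / s for fi, m, s in zip(f, self.shift, self.scale)]
        log_det = -sum(math.log(s) for s in self.scale)  # log|det dz/df|
        return z, log_det

    def inverse(self, z):
        return [zi * s + m for zi, m, s in zip(z, self.shift, self.scale)]

    def nll(self, f):
        """Exact anomaly score -log p(f) via change of variables."""
        z, log_det = self.forward(f)
        log_base = sum(-0.5 * zi * zi - 0.5 * math.log(2 * math.pi) for zi in z)
        return -(log_base + log_det)

flow = AffineFlow(shift=[1.0, -2.0], scale=[0.5, 2.0])
typical = flow.nll([1.0, -2.0])  # feature at the mode of the density
unusual = flow.nll([4.0, 6.0])   # feature far from the mode
```

Because the map is invertible and its Jacobian determinant is tractable, the score is an exact negative log-likelihood rather than a reconstruction-error proxy.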

Key architectural features include learnable synchronized positional encodings (critical for aligning temporal tokens across people), parameter sharing between streams (forcing a unified representation of interaction), and explicit modeling of joint proximity.
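The joint-proximity input described for DREM can be sketched as a per-frame distance matrix. The code below assumes the matrix is taken across the two people (joint $i$ of person X against joint $j$ of person Y) and that each frame's matrix is flattened into one token; both choices, and the function names, are assumptions made for illustration rather than details confirmed by the source.

```python
import math

def split_joints(frame):
    """Reshape a flat frame of 3*D floats into D (x, y, z) tuples."""
    return [tuple(frame[i:i + 3]) for i in range(0, len(frame), 3)]

def cross_distance_matrix(x_t, y_t):
    """D x D matrix: entry (i, j) is the Euclidean distance between
    joint i of person X and joint j of person Y at a single frame."""
    jx, jy = split_joints(x_t), split_joints(y_t)
    return [[math.dist(a, b) for b in jy] for a in jx]

def relational_tokens(X, Y):
    """One flattened distance matrix per frame: T tokens of length D*D."""
    return [[d for row in cross_distance_matrix(x_t, y_t) for d in row]
            for x_t, y_t in zip(X, Y)]

# Two joints per person, one frame.
x_t = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
y_t = [3.0, 0.0, 0.0, 3.0, 4.0, 0.0]
M = cross_distance_matrix(x_t, y_t)
tokens = relational_tokens([x_t], [y_t])
```

Distances are invariant to global rotation and translation of the pair, which is one reason they are a convenient relational cue alongside raw joint coordinates.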

3. Datasets, Training, and Evaluation Protocols

H2IAD research employs large-scale supervised datasets capturing normal and anomalous human-human interactions in 3D skeleton form. Typical datasets include:

| Dataset | #Classes | #Joints | Splits |
|---|---|---|---|
| Inter-X | 40 | 64 | 9,110 train, 2,278 test clips |
| NTU RGB+D 120 | 26 | 56 | Cross-subject train/test splits |

Model training is conducted per interaction category for 50 epochs, using Adam with learning rate linearly decayed from $10^{-3}$ to $10^{-5}$. No explicit dropout is reported; regularization emerges from parameter sharing in TASM. AUROC is the primary metric, averaged over all one-vs-all anomaly detection sub-tasks. Baselines include single-person action AD models such as STG-NF, MoCoDAD, and ML-AAD, typically adapted by selecting the better of two independently trained single-person detectors. IADNet achieves substantial improvements in AUROC (e.g., 0.707 on Inter-X vs. 0.626 for ML-AAD), with especially large gains on interaction classes characterized by spatial synchrony and body contact (Maeda et al., 15 Dec 2025).
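The evaluation protocol and learning-rate schedule can be sketched in a few lines. The AUROC below uses the rank-based definition (the probability that an anomalous clip outscores a normal one, with ties counted as one half). The schedule assumes per-epoch linear interpolation that reaches the final value at the last epoch; the exact stepping granularity is not stated in the source, and the scores are made-up toy values.

```python
def auroc(normal_scores, anomalous_scores):
    """Probability that a random anomalous clip scores higher than a random
    normal clip (ties count one half); equals the area under the ROC curve."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))

def mean_auroc(tasks):
    """Average AUROC over sub-tasks; each task is (normal, anomalous) scores."""
    return sum(auroc(n, a) for n, a in tasks) / len(tasks)

def linear_lr(epoch, num_epochs=50, lr_start=1e-3, lr_end=1e-5):
    """Linear interpolation from lr_start (epoch 0) to lr_end (last epoch)."""
    frac = epoch / (num_epochs - 1)
    return lr_start + frac * (lr_end - lr_start)

# Toy scores for two one-vs-all sub-tasks (illustrative values only).
tasks = [
    ([0.1, 0.2, 0.3], [0.25, 0.9]),  # 5 of 6 pairs ranked correctly
    ([0.0, 0.5], [1.0, 2.0]),        # perfectly separated
]
per_task = [auroc(n, a) for n, a in tasks]
average = mean_auroc(tasks)
```

The pairwise loop is quadratic in the number of clips, which is fine for a sketch; a sort-based rank computation gives the same value for larger test sets.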

4. Interpretability and Human-AI Collaborative Systems

Interpretability in H2IAD systems is recognized as paramount, particularly in applied domains such as executive coaching, where unsupervised anomaly detection serves as an observation interface decoupled from interpretation and intervention. The REsCUE and INWARD systems (Arakawa et al., 2022) illustrate a workflow in which detected outliers (frames with the highest $a(x_t) = -\log p(x_t \mid \Theta)$) and exemplars (frames with the highest $r(x_t) = +\log p(x_t \mid \Theta)$) are presented side by side to human experts, who contextualize, annotate, and reflect upon these machine-surfaced cues. No attempt is made to extract explicit rules or feature attributions; interpretive nuance is deferred to the human domain expert. This pipeline is found to support both expert insight articulation and novice coach education via meta-reflection tools.
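The outlier and exemplar selection can be sketched as a ranking by per-frame log-likelihood under a model fitted to the session itself. To stay dependency-free, the code below swaps the GMM for a single diagonal Gaussian, which is a simplification and not the REsCUE implementation; the feature vectors and function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_diag_gaussian(frames):
    """Per-dimension mean and standard deviation over one session's frames
    (a one-component, diagonal stand-in for a GMM)."""
    dims = list(zip(*frames))
    mu = [mean(d) for d in dims]
    sigma = [max(pstdev(d), 1e-6) for d in dims]
    return mu, sigma

def log_likelihood(x, mu, sigma):
    """log p(x | Theta) under the diagonal Gaussian."""
    return sum(-0.5 * ((xi - m) / s) ** 2 - math.log(s)
               - 0.5 * math.log(2 * math.pi)
               for xi, m, s in zip(x, mu, sigma))

def outliers_and_exemplars(frames, k=1):
    """Indices of the k frames with highest a(x_t) = -log p (outliers)
    and the k frames with highest r(x_t) = +log p (exemplars)."""
    mu, sigma = fit_diag_gaussian(frames)
    ll = [log_likelihood(x, mu, sigma) for x in frames]
    order = sorted(range(len(frames)), key=lambda t: ll[t])
    return order[:k], order[::-1][:k]

# Toy 2-D nonverbal features for five frames; the last frame is unusual.
session = [[0.0, 1.0], [0.1, 1.1], [-0.1, 0.9], [0.05, 1.0], [2.0, 3.0]]
outliers, exemplars = outliers_and_exemplars(session, k=1)
```

Both lists are returned as frame indices so that the corresponding video moments can be shown to an expert, who supplies the interpretation.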

5. Empirical Performance and Ablation Studies

Empirical evaluation demonstrates that models tailored to H2IAD, particularly those integrating temporal synchronization and inter-personal spatial encoding, outperform single-person baselines by wide margins on contemporary benchmarks. Ablations on IADNet confirm the critical contribution of:

  • Synchronized positional encodings (improving Inter-X AUROC from 0.633/0.657 to 0.707)
  • DREM augmentation (AUROC increase from 0.641 to 0.707)
  • Parameter sharing in TASM (AUROC increase from 0.624 to 0.707)

Further analysis indicates that categories with extensive joint-to-joint displacement (e.g., Massaging leg, Dance) gain most from explicit distance modeling, whereas minimal-distance interactions see smaller gains. Anomalies based purely on object interaction (e.g., hits with objects) remain challenging due to the absence of object cues in skeleton-only representations (Maeda et al., 15 Dec 2025).

6. Limitations and Directions for Future Research

Current H2IAD methods assume per-clip anomaly labels and do not temporally localize anomalies within clips. Addressing temporally finer-grained detection is an open research problem. Accurate 3D pose estimation is a prerequisite; noisy joint estimations can adversely impact spatial encoding modules like DREM. Datasets are presently limited to skeletons; modeling human-object and multi-human interactions jointly remains an unsolved challenge. All models to date are evaluated in unsupervised one-class settings, with extensions to semi-supervised or weakly supervised frameworks (using a small number of anomaly exemplars) proposed as plausible next steps. Expanding to higher-order interactions (beyond dyads), as well as adapting architectures for real-time or online anomaly flagging (with behavior drift accommodation), are currently underexplored areas (Maeda et al., 15 Dec 2025, Arakawa et al., 2022).
