Multimodal Interactive Deception Assessment

Updated 27 November 2025

MIDA is a benchmark framework for detecting deception in complex, interactive social settings using synchronized multimodal data.
It employs advanced fusion strategies and epistemic reasoning to integrate audio, video, and textual cues, addressing challenges like neutral bias.
Empirical results demonstrate improved accuracy and explainability, paving the way for innovative real-time deception analysis.

Multimodal Interactive Deception Assessment (MIDA) is a benchmark framework and methodology for automatic detection and analysis of deceptive behaviors in complex, interactive social settings using synchronized multimodal data. MIDA tasks the model with inferring the veracity of utterances within dynamic, multi-party or dyadic conversations by leveraging coordinated video, audio, and textual streams, and with modeling the social-epistemic context required for effective “reading of the room” (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025, Miah et al., 11 Jun 2025, Rugolon et al., 26 Jun 2025).

1. Formal Problem Definition

MIDA operationalizes deception assessment as follows. Let $V=\{v_1,\dots,v_N\}$ be a set of $N$ multi-party conversational videos (e.g., from social deduction games), and $T=\{t_1,\dots,t_N\}$ be the corresponding dialogue transcripts. Each transcript $t_i$ is decomposed into a sequence of $M_i$ utterances $S_i=\{s_{i,1},\dots,s_{i,M_i}\}$ . The ground-truth veracity label is captured by a function

$L\colon\bigcup_i S_i\to\{\mathrm{TRUE},\mathrm{FALSE},\mathrm{NEUTRAL}\}$

where each utterance is labeled as factually correct, a lie, or non-verifiable. The model’s objective is, for each $s_{i,j}$ , to predict

$\hat y_{i,j} = f(v_i, t_i, h_{i,j-1})$

where $h_{i,j-1}$ is the preceding utterance history, and compare $\hat y_{i,j}$ against $y_{i,j}=L(s_{i,j})$ (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025). MIDA further incorporates, in dyadic or group settings, joint feature extraction from all participants, potentially fusing sender and receiver signals, and can be implemented in both multi-turn and real-time interactive regimes (Rugolon et al., 26 Jun 2025).

2. Datasets and Annotation Protocols

The core MIDA datasets reflect ecologically valid deception contexts:

MIDA-Ego4D: 40 game-based sessions, 5–8 players, 819 utterances/session, comprehensively annotated (Kang et al., 20 Nov 2025).
MIDA-YouTube: 151 publicly uploaded game videos, 1,541 labeled utterances (Kang et al., 20 Nov 2025).
MU3D: 1,200 face-to-face interviews, video/audio/pose annotations, 5 crowd labels per clip (Cohen’s $\kappa = 0.72$ ) (Miah et al., 11 Jun 2025).
RLTD: 600 courtroom clips with expert-verified binary deception labels (Miah et al., 11 Jun 2025).
Dyadic Cohorts (e.g., Swedish): 22 dyads, 44 participants, sender/receiver recording, with post-hoc trustworthiness/interaction ratings (Rugolon et al., 26 Jun 2025).

Annotation steps (for high-stakes games):

Manual extraction of latent game state (who did what), enabling inference of available private knowledge.
LLM pipelines (Gemini-2.5-Pro) generate preliminary veracity labels using game rules and dialogue context.
Human validation audits yield $>$ 95% agreement with LLM for ground-truthing (Kang et al., 20 Nov 2025).

All video and transcript data are tightly synchronized via forced alignment. Facial crops and body bounding boxes are extracted (MTCNN, OpenPose). Audio is segmented per utterance, with low-level spectral/prosodic features (pitch, energy, formants) obtained (Parselmouth, Librosa, GeMAPS) (Kang et al., 20 Nov 2025, Rugolon et al., 26 Jun 2025).

3. Model Architectures and Fusion Strategies

MIDA benchmarks span a broad spectrum of architectures:

Multimodal LLMs (MLLMs): e.g., GPT-4o, Gemini-2.5-pro, Llama-3-8B, InternVL3.5-8B (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025).
Vision-LLMs (LMMs): CLIP, BLIP-2; fine-tuned on frame-text and full game interactions (Miah et al., 11 Jun 2025).
Hybrid Mixture-of-Experts: Modality Interactive Mixture-of-Experts (MIMoE) for explicit gating based on unimodal agreement and semantic alignment, dynamically activating expert fusion blocks (Liu et al., 21 Jan 2025).
Late Fusion: Per-modality classifiers (audio, video, text) feeding a meta-classifier (decision tree) (Rugolon et al., 26 Jun 2025).
Early Fusion: Frame-aligned concatenation of audio/video features (Rugolon et al., 26 Jun 2025).
Interactive Components: Modality routers adapt channel weighting by real-time signal quality; user-feedback loops enable continuous adaptation via LoRA or online fine-tuning (Miah et al., 11 Jun 2025).

SoCoT (Social Chain-of-Thought) pipeline decomposes inference into:

Low-level perception: extraction of face, body, and voice primitives ( $z_{i,j}^{(\mathrm{face})}, z_{i,j}^{(\mathrm{body})}, z_{i,j}^{(\mathrm{voice})}$ ).
High-level social inference: theory-of-mind simulation $m_{i,j}$ .
Decision and rationale: aggregation and explanation $\hat y_{i,j}$ .

$z_{i,j}=\left[ z_{i,j}^{(\mathrm{face})}, z_{i,j}^{(\mathrm{body})}, z_{i,j}^{(\mathrm{voice})} \right], \quad m_{i,j} = g_\mathrm{ToM}(z_{i,j}, h_{i,j-1}), \quad \hat y_{i,j} = h_\mathrm{dec}(m_{i,j}, h_{i,j-1})$

(Kang et al., 20 Nov 2025).

Dynamic Social Epistemic Memory (DSEM) module maintains, for each player, a persistent board $M_p^t$ storing observed/felt/known states, updated each turn via multimodal signals (Kang et al., 20 Nov 2025):

$M_p^{t+1} = f_\mathrm{DSEM}(M_p^t, E_{t+1}, O_{t+1})$

4. Performance Benchmarks and Metrics

MIDA utilizes strict classification metrics reflecting the multi-class nature (TRUE, FALSE, NEUTRAL) and strong class imbalance:

Let $TP_c$ , $FP_c$ , $FN_c$ be the per-class (c) true positives, false positives, and false negatives.

Precision: $\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}$
Recall: $\mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}$
F1: $F1_c = 2 \cdot \frac{ \mathrm{Precision}_c \cdot \mathrm{Recall}_c }{ \mathrm{Precision}_c + \mathrm{Recall}_c }$

Macro-averaged:

$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{3} \sum_c F1_c$

Overall accuracy:

$\mathrm{Acc} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}$

Binary accuracy (restricting to TRUE/FALSE in the denominator) is also commonly reported (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025).

Empirical highlights (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025, Rugolon et al., 26 Jun 2025):

GPT-4o achieves Acc $\approx 74\%$ , Macro-F1 $\approx 51\%$ on Ego4D.
GPT-4o-mini achieves best Binary Acc (39.4%) and F1 $_\mathrm{TRUE}$ (36.1%) on Ego4D-MIDA.
Late fusion in dyadic settings combining both modalities/participants yields $\mathrm{Acc}=0.71$ (Rugolon et al., 26 Jun 2025).
Open-source models lag behind: Qwen2.5-VL Macro-F1 $=48.8\%$ ; DeepSeek-R1-8B Macro-F1 $=32\%$ (Kang et al., 20 Nov 2025).

5. Failure Analyses and Cognitive Bottlenecks

MIDA experiments consistently expose core limitations in current multimodal models:

Conservative NEUTRAL Bias: Models overpredict NEUTRAL ( $>$ 85% F1) at the expense of FALSE, due to risk-averse alignment and class imbalance in training distributions (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025).
Lack of Theory-of-Mind: Models lack explicit epistemic modeling of what each participant knows or believes, critical for distinguishing deliberate lies from honest mistakes or unverifiable statements (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025).
Weak Multimodal Grounding: Increasing the number of visual frames, or ablating text/temporal context, does not confer improvement; models generally fail to exploit facial or gestural information reliably—visual descriptions often do not impact final deception verdicts (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025).
Limited Integration of Receiver Cues (Dyadic): Inclusion of synchrony and receiver features substantially improves accuracy, highlighting the inadequacy of sender-only approaches (Rugolon et al., 26 Jun 2025).

6. Enhanced Reasoning Pipelines and Theoretical Extensions

To address these bottlenecks, MIDA research proposes:

Social Chain-of-Thought (SoCoT): Decomposition of multimodal reasoning into symbolic perception, high-level epistemic inference, and explanatory decision-making. SoCoT yields measurable gains over direct MLLM prompting (Acc +8%, Macro-F1 +2.6 points) (Kang et al., 20 Nov 2025).
Dynamic Social Epistemic Memory (DSEM): Persistent, player-specific state tracking for belief modeling; shown to improve Macro-F1 and binary accuracy by 1.6–3.3 points in open-source and closed-source models (Kang et al., 20 Nov 2025).
Mixture-of-Experts Fusion: Gating ensembles based on measured modality agreement/alignment route examples to specialized fusion experts, effectively reconciling concordant and discordant modality signals (Liu et al., 21 Jan 2025).
Chain-of-Thought Prompting: Step-by-step inference over linguistic, prosodic, and gestural evidence, shown to increase explainability (but not always faithfulness) of verdicts (Miah et al., 11 Jun 2025).
Dyadic Synchrony and Interpersonal Features: Quantifying audio-visual synchrony (e.g., $\rho_m(w)$ over feature windows), decision-tree meta-fusion, and inclusion of both sender and receiver modalities for robust deception cue extraction (Rugolon et al., 26 Jun 2025).

7. Open Challenges and Future Directions

Key directions for advancing MIDA’s efficacy include:

Innate Theory-of-Mind Integration: Moving beyond post-hoc prompts to architectures with built-in ToM modules or graph-structured epistemic reasoning (Kang et al., 20 Nov 2025, Kang et al., 31 Oct 2025).
Context-Adaptive Alignment: Allowing calibrated, context-sensitive risk-taking in model outputs, enabling more accurate judgments under uncertainty (Kang et al., 31 Oct 2025).
Robust Multimodal Embedding: Developing representation learning that filters noise and isolates salient behavioral cues for deception inference (Kang et al., 20 Nov 2025).
Cross-Domain, Cross-Cultural Transfer: Addressing domain shift and cultural variation in gesture/speech cues; supporting multilingual, multicultural adaptation with Bayesian/uncertainty-aware fusion (Miah et al., 11 Jun 2025).
Human-in-the-Loop Adaptation: Leveraging active learning, user feedback, and online fine-tuning for continual improvement (Miah et al., 11 Jun 2025).
Deployment: Efficient streaming architectures, on-device model distillation, and interface dashboards presenting verdicts with explanatory rationales and modifiable feedback (Miah et al., 11 Jun 2025).

A plausible implication is that genuine “reading the room” for deception in open domains will require models capable of real-time, contextually grounded, and cognitively inspired multimodal social reasoning, integrating both sender and receiver, and robust to shifts in domain, channel, and social context.

References: