MIT Interview Dataset Overview
- MIT Interview Dataset is a multimodal collection of annotated mock interviews capturing synchronized audio, video, and text data for in-depth behavioral analysis.
- It uses high-quality recordings and crowdsourced ratings to assess 16 behavioral traits through regression models like SVR and Lasso.
- The dataset supports automated interview coaching and predictive analytics, driving advancements in AI-driven feedback and social signal processing.
The MIT Interview Dataset refers to a multimodal, richly annotated collection of mock job interviews and associated behavioral ratings, developed to support automated analysis and prediction of interview performance traits. It is extensively used as a public benchmark in computational behavioral research, multimodal analytics, and machine learning for social interaction contexts.
1. Dataset Composition and Collection Protocol
The MIT Interview Dataset consists of 138 audio-visual recordings of mock interviews with 69 MIT undergraduate students seeking internships (each candidate participated in two interview rounds). Each interview averages 4.7 minutes, resulting in over 10.5 hours of video data. The capture protocol uses two synchronized cameras, one recording the interviewee and one the interviewer, alongside high-quality audio. All interviews are professionally transcribed, and filler words and disfluencies are explicitly annotated via Amazon Mechanical Turk.
Ground truth for behavioral analysis is established from crowdsourced ratings: 9 independent annotators rated each video on 16 behavioral traits using a 7-point Likert scale. These traits encompass social/interpersonal qualities (engagement, friendliness, excitement), performance judgments (overall performance, hiring recommendation), and micro-level behaviors (smile, eye contact, speaking rate, use of fillers/pauses, authenticity, stress, awkwardness, answer structure). The labels are aggregated using an Expectation-Maximization (EM)-style algorithm that estimates both the true trait score and each rater's reliability; a small aggregation sketch follows the modality table below.
| Modality | Data Type | Behavioral Labels (examples) |
|---|---|---|
| Video | Interview recordings | Smile intensity, head gestures, eye contact |
| Audio | Speech signal | Pitch, intonation, prosody, speaking rate |
| Text | Transcripts, LIWC codes | Word frequency, filler count, pronoun usage |
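The EM-style label aggregation described above can be sketched as a simple alternating procedure: estimate each video's latent score as a reliability-weighted mean, then re-estimate each rater's reliability from their residuals. The sketch below is a minimal, illustrative variant of such a scheme (the paper's exact algorithm may differ); array shapes and variable names are assumptions, not part of the dataset release.

```python
import numpy as np

def aggregate_ratings(ratings, n_iter=50, eps=1e-6):
    """EM-style aggregation of crowd ratings.

    ratings: (n_videos, n_raters) array on a 7-point scale, NaN = missing.
    Alternates between estimating each video's latent trait score as a
    reliability-weighted mean (E-step analogue) and re-estimating each
    rater's reliability as the inverse variance of their residuals (M-step).
    """
    mask = ~np.isnan(ratings)
    reliability = np.ones(ratings.shape[1])  # start by trusting all raters equally

    for _ in range(n_iter):
        # Reliability-weighted consensus score per video
        w = np.where(mask, reliability, 0.0)
        scores = (np.nan_to_num(ratings) * w).sum(axis=1) / (w.sum(axis=1) + eps)

        # A rater is reliable if their ratings track the consensus
        resid = np.where(mask, ratings - scores[:, None], np.nan)
        reliability = 1.0 / (np.nanvar(resid, axis=0) + eps)

    return scores, reliability

# Example: 5 videos rated by 9 annotators on a 7-point Likert scale
rng = np.random.default_rng(0)
fake = np.clip(rng.normal(4, 1, size=(5, 9)).round(), 1, 7)
consensus, rater_weight = aggregate_ratings(fake)
```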
2. Multimodal Feature Extraction and Analysis Framework
The computational analysis framework extracts and concatenates features from three principal modalities (a prosodic-extraction sketch follows the list):
- Prosodic: Fundamental frequency (F0), intensity, formants, pause statistics, jitter, shimmer.
- Lexical: Words/second, unique words, filler/second, LIWC-derived categories (e.g., "I" vs. "we"), topic frequencies assessed by Latent Dirichlet Allocation (LDA).
- Facial: Smile intensity scaled from 0–100 (AdaBoost classifier), head nods/shakes, facial landmarks (eyebrow/lip metrics via Constrained Local Model tracking).
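As referenced above, a small prosodic-extraction sketch: it covers only a few of the listed descriptors (F0 statistics, intensity, a crude pause ratio) using librosa. The original framework used its own audio pipeline, jitter/shimmer typically require a dedicated tool such as Praat, and the energy threshold here is illustrative.

```python
import numpy as np
import librosa

def prosodic_features(wav_path):
    """A few of the prosodic descriptors listed above: F0 mean/std,
    mean intensity (RMS energy), and a simple pause ratio."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency via probabilistic YIN; keep voiced frames only
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    f0 = f0[~np.isnan(f0)]

    # Frame-level intensity
    rms = librosa.feature.rms(y=y)[0]

    # Pause ratio: fraction of low-energy frames (threshold is illustrative)
    pause_ratio = float(np.mean(rms < 0.1 * rms.max()))

    return np.array([f0.mean(), f0.std(), rms.mean(), pause_ratio])
```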
All features are zero-mean/unit-variance normalized and concatenated into a single multimodal vector per interview. The framework uses regression models, Support Vector Regression (SVR) and Lasso (with an $\ell_1$ penalty), to predict behavioral trait scores from the extracted features; a minimal pipeline sketch follows the equations below.
- SVR: $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$, minimizing $\tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i (\xi_i + \xi_i^*)$ subject to the $\epsilon$-insensitive constraints with slack variables $\xi_i, \xi_i^* \ge 0$.
- Lasso: $\min_{\mathbf{w}} \tfrac{1}{2n}\|\mathbf{y} - X\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_1$ (the $\ell_1$ penalty enforces feature sparsity).
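A minimal sketch of this regression step, using scikit-learn's SVR and Lasso with standardization and leave-one-out prediction; file names, hyperparameters, and the cross-validation protocol are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict, LeaveOneOut

# X: (n_interviews, n_features) concatenated multimodal vectors
# y: aggregated scores for one trait (e.g., overall performance)
X = np.load("features.npy")   # hypothetical file names
y = np.load("overall.npy")

models = {
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}

for name, model in models.items():
    # Held-out prediction for each interview, then correlation with human ratings
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    r, _ = pearsonr(y, pred)
    print(f"{name}: r = {r:.2f}")
```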
3. Behavioral Metrics, Trait Prediction, and Feature Importance
Quantitative metrics are derived for both verbal and nonverbal behaviors. Verbal features include quantity metrics (words/sec, unique words/sec), filler and pause counts, and LIWC-based linguistic categories. Nonverbal prediction relies on detailed prosodic contours and high-resolution facial dynamics.
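The verbal quantity metrics lend themselves to direct computation from a transcript. The sketch below assumes plain-text transcripts and an illustrative filler list; in the dataset itself, fillers are explicitly annotated and could be counted directly.

```python
import re

# Illustrative filler list; the dataset's transcripts mark fillers explicitly
FILLERS = {"um", "uh", "umm", "uhh", "er", "hmm"}

def lexical_features(transcript: str, duration_sec: float) -> dict:
    """Quantity and fluency metrics: words/sec, unique words/sec, fillers/sec,
    plus a crude we-vs-I balance in the spirit of the LIWC categories."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return {
        "words_per_sec": len(tokens) / duration_sec,
        "unique_per_sec": len(set(tokens)) / duration_sec,
        "fillers_per_sec": sum(t in FILLERS for t in tokens) / duration_sec,
        "we_minus_i": tokens.count("we") - tokens.count("i"),
    }

print(lexical_features("um I think we built the project together", duration_sec=4.0))
```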
Correlational analysis with human judgments reveals:
- Engagement, friendliness, and excitement are among the most predictable traits, showing the strongest correlations between model predictions and human judgments.
- Overall performance and hiring recommendation are also predicted with correlations substantially above chance.
- Binary classification of high versus low performers achieves ROC-AUC scores well above the 0.5 chance baseline.
Feature weight inspection demonstrates that prosodic variables most strongly predict engagement/excitement, lexical markers like "we" and unique word use are decisive for overall recommendation, and facial cues (primarily smile intensity) govern predictions for friendliness.
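One hedged way to reproduce this kind of inspection is to standardize the features (so coefficient magnitudes are comparable across modalities) and rank the Lasso weights; the alpha value and helper function below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def top_features(X, y, feature_names, alpha=0.05, k=10):
    """Fit a sparse Lasso on standardized features and return the k
    largest-magnitude coefficients as (name, weight) pairs."""
    Xs = StandardScaler().fit_transform(X)
    coef = Lasso(alpha=alpha).fit(Xs, y).coef_
    order = np.argsort(-np.abs(coef))[:k]
    return [(feature_names[i], float(coef[i])) for i in order]
```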
4. Automated Feedback and Interview Coaching Recommendations
Regression-derived feature weights are operationalized as performance feedback. The model advises interviewees to:
- Increase fluency (more words/sec, more unique words, fewer fillers/pauses)
- Prefer collective language ("we" over "I")
- Maintain friendly affect (higher genuine smile intensity)
- Use positive emotional/quantitative language; avoid negative emotion categories
These recommendations align with established career guidance and are numerically substantiated by feature importance analysis.
5. Temporal Effects and First Impression Analysis
Temporal segmentation of interviews enables the study of impression dynamics. Performance on the initial question, "Tell me about yourself," exhibits the highest correlation with overall ratings. Thereafter, temporal correlations generally decline, although closing questions may induce a minor rebound for select traits. This underscores the measurable importance of first impression formation during interviews.
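A per-question correlation curve makes this effect concrete. The sketch below assumes interviews have already been segmented by question and scored per segment; names and array shapes are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def per_question_correlations(segment_scores, overall_ratings):
    """Correlate scores for each question segment with the final overall rating.

    segment_scores: (n_interviews, n_questions) array, one score per question
    overall_ratings: (n_interviews,) aggregated overall-performance labels
    Returns one Pearson r per question; a high first value followed by a
    decline would reproduce the first-impression effect described above.
    """
    return [pearsonr(segment_scores[:, q], overall_ratings)[0]
            for q in range(segment_scores.shape[1])]
```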
6. Applications, Extensions, and Dataset Availability
The MIT Interview Dataset is made available for academic research to validate and expand upon automated social signal analysis frameworks. It serves as the backbone for multimodal analytics systems assessing interpersonal and behavioral traits (Naim et al., 2015), for classifier-driven behavioral feedback (Agrawal et al., 2020), and for studies in computational social science and text mining (Karlgren et al., 2020). The methodological approach directly informs AI interview coaching systems, human-computer interaction studies, and annotation ethics across computational psychology domains.
A plausible implication is that future research may further augment the dataset with additional modalities (e.g., hand gestures, posture) or integrate it into interactive AI testing and benchmarking systems.
7. Impact and Future Directions
The MIT Interview Dataset is foundational for multimodal behavioral prediction in interview settings, facilitating rigorous, reproducible research into automated assessment, AI-driven feedback, and social signal processing. Its transparent labeling protocol and comprehensive multimodal feature set position it as a reference for subsequent methodological innovations and extensions to richer, more interactive datasets. The practical impact extends to designing intelligent feedback tools, objective behavioral analytics, and informing the next generation of AI-assisted interview practice platforms.