Leveraging Multimodal Behavioral Analytics for Automated Job Interview Performance Assessment and Feedback
(2006.07909v2)
Published 14 Jun 2020 in cs.LG, cs.CL, cs.CV, and stat.ML
Abstract: Behavioral cues play a significant part in human communication and cognitive perception. In most professional domains, employee recruitment policies are framed such that both professional skills and personality traits are adequately assessed. Hiring interviews are structured to evaluate expansively a potential employee's suitability for the position - their professional qualifications, interpersonal skills, ability to perform in critical and stressful situations, in the presence of time and resource constraints, etc. Therefore, candidates need to be aware of their positive and negative attributes and be mindful of behavioral cues that might have adverse effects on their success. We propose a multimodal analytical framework that analyzes the candidate in an interview scenario and provides feedback for predefined labels such as engagement, speaking rate, eye contact, etc. We perform a comprehensive analysis that includes the interviewee's facial expressions, speech, and prosodic information, using the video, audio, and text transcripts obtained from the recorded interview. We use these multimodal data sources to construct a composite representation, which is used for training machine learning classifiers to predict the class labels. Such analysis is then used to provide constructive feedback to the interviewee for their behavioral cues and body language. Experimental validation showed that the proposed methodology achieved promising results.
The paper introduces a multimodal analytical framework using facial expressions, speech, and text to automatically assess job interview performance.
The methodology extracts features from video, audio, and transcripts to train machine learning classifiers, with the Random Forest model showing high accuracy in predicting specific behaviors like speaking rate.
Results demonstrate varying performance based on modality combinations, highlighting the framework's potential while acknowledging dataset size limitations and outlining future work on real-time feedback systems.
The paper introduces a multimodal analytical framework designed for automated job interview performance assessment and feedback. This system analyzes candidates based on predefined labels such as engagement, speaking rate, and eye contact. The methodology integrates facial expressions, speech, and prosodic information derived from video, audio, and text transcripts to construct a composite representation, which is then used to train machine learning classifiers.
The paper leverages the MIT interview dataset, which includes recordings of 138 mock job interviews from 69 candidates, taken pre- and post-intervention. The dataset also incorporates Amazon Mechanical Turk worker ratings for each video, which are averaged to derive the ground-truth labels.
Here's a breakdown of the approach:
Prosodic Features: Time-domain features are extracted directly from the raw audio signal, frequency-domain features are computed from the magnitude of the Discrete Fourier Transform (DFT), and cepstral-domain features are obtained by applying the inverse DFT to the logarithmic spectrum. A short-term window splits the audio signal into frames, and a feature vector of 40 elements is generated for each frame using the pyAudioAnalysis library.
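A minimal sketch of this frame-level extraction with pyAudioAnalysis is shown below; the 50 ms window and 25 ms step are illustrative choices rather than values reported in the paper, and module names may differ slightly across library versions.

```python
# Sketch: short-term audio feature extraction with pyAudioAnalysis.
# Window/step sizes are illustrative; the paper does not specify them.
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

sampling_rate, signal = audioBasicIO.read_audio_file("interview.wav")  # hypothetical file name
signal = audioBasicIO.stereo_to_mono(signal)

# Split the signal into short frames and compute one feature vector per frame
# (zero-crossing rate, energy, spectral features, MFCCs, chroma, ...).
features, feature_names = ShortTermFeatures.feature_extraction(
    signal, sampling_rate, 0.050 * sampling_rate, 0.025 * sampling_rate)

# Aggregate frame-level features into a clip-level descriptor, e.g. mean and std.
clip_vector = np.concatenate([features.mean(axis=1), features.std(axis=1)])
```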
Facial Expressions: OpenCV is used to extract facial landmarks, including the nose, mouth corners, chin, and eye corners, from each video frame; local changes in these tracked interest points carry information about facial expressions. Head pose features (pitch, roll, and yaw) are derived from the rotation matrix R. Additionally, a pre-trained LeNet convolutional neural network, trained on the SMILES dataset of 13,165 grayscale face images, is used to detect smiling.
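The paper does not detail how R is obtained; one common approach, sketched below, fits a generic 3D face model to the detected 2D landmarks with OpenCV's solvePnP and decomposes the resulting rotation into pitch, yaw, and roll. The 3D model points and camera intrinsics here are generic placeholders, not values from the paper.

```python
# Sketch: head pose (pitch, yaw, roll) from 2D facial landmarks with OpenCV.
# The 3D reference model and camera matrix below are generic placeholders.
import cv2
import numpy as np

# Generic 3D model points (mm): nose tip, chin, eye corners, mouth corners.
model_points = np.array([
    (0.0, 0.0, 0.0),          # nose tip
    (0.0, -330.0, -65.0),     # chin
    (-225.0, 170.0, -135.0),  # left eye, left corner
    (225.0, 170.0, -135.0),   # right eye, right corner
    (-150.0, -150.0, -125.0), # left mouth corner
    (150.0, -150.0, -125.0),  # right mouth corner
])

def head_pose(image_points, frame_width, frame_height):
    """image_points: 6x2 array of the corresponding landmark pixel coordinates."""
    image_points = np.asarray(image_points, dtype="double")
    focal = frame_width  # rough approximation of the focal length in pixels
    camera_matrix = np.array([[focal, 0, frame_width / 2],
                              [0, focal, frame_height / 2],
                              [0, 0, 1]], dtype="double")
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                                  camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix R
    # Decompose [R | t] to recover Euler angles (degrees): pitch, yaw, roll.
    euler = cv2.decomposeProjectionMatrix(np.hstack([R, tvec]))[-1]
    pitch, yaw, roll = euler.flatten()
    return pitch, yaw, roll
```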
Lexical Features: Text transcripts are obtained using the Google Cloud Speech-to-Text API and then cleaned: lowercasing, punctuation removal, and tokenization with the Natural Language Toolkit (NLTK). Speaking-style features, such as the average number of words spoken per minute, unique words per minute, and the number of filler words, are extracted. Emotion scores are derived with the Tone Analyzer, which reports the proportion of each tone (Joy, Sadness, Tentative, Analytical, Fear, and Anger) per sentence, and the Stanford Named Entity Recognizer (NER) is employed to count nouns, adjectives, and verbs.
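A minimal sketch of the speaking-style features with NLTK follows; the filler-word list and the transcript/duration inputs are illustrative assumptions.

```python
# Sketch: speaking-style features from a transcript (illustrative filler-word list).
import string
from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' tokenizer data

FILLERS = {"um", "uh", "like", "basically", "actually", "literally"}  # assumed list

def speaking_style_features(transcript: str, duration_minutes: float) -> dict:
    # Lowercase, strip punctuation, and tokenize the transcript.
    text = transcript.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    return {
        "words_per_minute": len(tokens) / duration_minutes,
        "unique_words_per_minute": len(set(tokens)) / duration_minutes,
        "filler_word_count": sum(token in FILLERS for token in tokens),
    }
```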
The paper experimented with four machine learning algorithms: Random Forest, Support Vector Classifier (SVC), Multi-task Lasso, and Multilayer Perceptron (MLP). The optimization objective for the Multi-task Lasso is defined as:
\[
\frac{1}{2\,n_{\text{samples}}}\,\lVert Y - XW \rVert_{\text{Fro}}^{2} \;+\; \alpha\,\lVert W \rVert_{21}
\]

where

\[
\lVert W \rVert_{21} = \sum_i \sqrt{\sum_j w_{ij}^{2}}
\]
Where:
n_samples is the number of training samples.
Y is the matrix of target values.
X is the training data matrix.
W is the matrix of coefficients (weights).
α is the regularization constant that scales the ℓ21 (mixed ℓ1/ℓ2) norm of the coefficient matrix.
The features extracted from text, audio, and video are concatenated into a composite feature vector, which is then passed to the classifiers to predict the class labels. The audio features include power, intensity, duration, pitch, zero-crossing rate, energy, entropy of energy, spectral centroid, spectral flux, spectral spread, spectral roll-off, Mel-Frequency Cepstral Coefficients (MFCCs), Chroma vector, and Chroma deviation; the video features include facial landmarks and head pose; and the lexical features include speaking rate, proficiency, fluency, total words spoken, unique words spoken, emotion scores of the text, and counts of nouns, verbs, and adjectives.
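A minimal sketch of fusing the modality features and fitting the four model families, assuming scikit-learn implementations and synthetic placeholder data; feature dimensions and hyperparameters are illustrative defaults, not the paper's settings.

```python
# Sketch: fuse per-modality features and train the four model families (scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import MultiTaskLasso
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 138                                   # number of interview videos in the dataset
audio_feats = rng.normal(size=(n, 80))    # placeholder per-modality feature arrays
video_feats = rng.normal(size=(n, 20))
text_feats = rng.normal(size=(n, 15))
y = rng.integers(0, 2, size=n)            # one behavioral label, e.g. "Engaged"
Y = rng.normal(size=(n, 6))               # all label scores, for the multi-task model

X = np.hstack([audio_feats, video_feats, text_feats])  # composite feature vector

for clf in (RandomForestClassifier(), SVC(), MLPClassifier(max_iter=1000)):
    clf.fit(X, y)

# MultiTaskLasso fits all label scores jointly; its continuous predictions
# would then be thresholded to obtain class labels.
MultiTaskLasso(alpha=1.0).fit(X, Y)
```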
The data is standardized: each feature is centered on the training mean and scaled to unit variance. The standard score z of a sample x is calculated as:
z=(x−u)/s
Where:
u is the mean of the training samples.
s is the standard deviation of the training samples.
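This is the standardization implemented by scikit-learn's StandardScaler, sketched here on dummy data (whether the paper used this exact class is an assumption).

```python
# Sketch: z = (x - u) / s per feature, using training-set statistics only.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 10))  # dummy data
scaler = StandardScaler().fit(X_train)   # learns u (mean) and s (std) per feature
X_scaled = scaler.transform(X_train)     # equivalent to (X_train - u) / s
```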
Automatic feature selection eliminates redundant features. K-best feature selection calculates a score matrix and retains the k features with the highest scores, and the Benjamini-Hochberg procedure is applied to control the false discovery rate. A correlation threshold of 0.6 is used so that only one feature from each highly correlated pair is retained. Three-fold cross-validation is employed for evaluation.
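A sketch of this selection step, under the assumption that scikit-learn's SelectKBest and SelectFdr (which implements the Benjamini-Hochberg procedure) are the concrete selectors, with a simple pairwise-correlation filter at the 0.6 threshold and synthetic data; the paper's exact configuration may differ.

```python
# Sketch: correlation filter, k-best selection, Benjamini-Hochberg FDR, 3-fold CV.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectFdr, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=138, n_features=60,
                           n_informative=10, random_state=0)  # dummy data

# Drop one feature from every pair whose absolute correlation exceeds 0.6.
corr = pd.DataFrame(X).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
keep = [i for i in range(X.shape[1]) if not (upper[i] > 0.6).any()]
X_filtered = X[:, keep]

# k-best univariate selection (ANOVA F-score) inside a 3-fold CV pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10),
                     RandomForestClassifier(random_state=0))
scores = cross_val_score(pipe, X_filtered, y, cv=3)

# Alternatively, SelectFdr applies the Benjamini-Hochberg procedure to control
# the false discovery rate of the univariate tests.
X_fdr = SelectFdr(f_classif, alpha=0.05).fit_transform(X_filtered, y)
```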
Experimental results showed that the Random Forest classifier generally outperformed the others, notably achieving 96.43% accuracy in predicting speaking rate. The Support Vector Classifier (SVC) achieved 75% accuracy for the "Engaged" label, while the Multilayer Perceptron (MLP) underperformed on most labels. The paper also examined the impact of different modality combinations, revealing that performance varies significantly depending on the modalities used; the Random Forest classifier trained on the full three-modality feature vector generally outperformed the other variants.
The paper concludes by noting that the limited size of the dataset constrained the maximum achievable accuracy. Future work will focus on expanding the dataset and incorporating additional behavioral cues, such as hand movements and body posture. The goal is to integrate the model into a web application that provides real-time feedback to candidates, enhancing their interview performance.