
Non-Verbal Vocalisations (NVVs)

Updated 9 August 2025
  • Non-Verbal Vocalisations (NVVs) are brief, non-word utterances that convey affective and paralinguistic information and structure conversational flow.
  • They encompass a diverse range of sounds—from laughter to coughs—classified into non-verbals, semi-verbals, and lexical interjections based on form and function.
  • Advances in deep learning and rich corpora have enabled robust NVV detection and classification, enhancing applications in accessible voice interfaces, expressive synthesis, and emotion recognition.

Non-Verbal Vocalisations (NVVs) are short non-word utterances produced by humans that convey affective or paralinguistic information and lack explicit linguistic (semantic) meaning. NVVs comprise a wide spectrum of acoustic events such as laughter, sighs, moans, cries, coughs, and mouth sounds (e.g., “pop”, “click”, “eh”). This category includes spontaneous emotional expressions and culturally conventional interjections, as well as cues used for conversational structure. NVVs are pervasive in spoken communication and are integral to affective computing, human–computer interaction, and the design of accessible voice input modalities.

1. Historical Context and Conceptualization

Early psychological and linguistic theories treated NVVs as “interjections”—utterances at the boundary between language and emotion. Darwin and James emphasized their role in the bodily expression of emotion, while linguists such as Sapir and Bloomfield debated their classification as “primary” (pre-linguistic affective bursts) versus “secondary” (linguistically conventional signals) (Batliner et al., 3 Aug 2025). In subsequent decades, NVVs were marginalized in linguistic theory and speech technology, typically dismissed as “uninteresting noise”. Renewed attention emerged with affect research, which recognized NVVs as “affect bursts”: efficient conveyors of emotional and paralinguistic states in both spontaneous and acted speech.

2. Taxonomy and Functions

NVVs can be classified along several axes (Batliner et al., 3 Aug 2025):

  • Formal types:
    • Non-verbals: Pure vocalizations unconstrained by language-specific phonotactics (e.g., “ah”).
    • Semi-verbals: Phonetically plausible sounds with no lexical meaning (e.g., “gee”, “ugh”).
    • Verbals: Lexicalized expressions carrying both conventional and affective meaning (“oh dear”).
  • Functional roles:
    • Affective: Express emotional or physiological states (laughter for joy, sobbing for sadness).
    • Conversational structuring: Signal turn-taking, backchannel feedback (e.g., “uh-huh”, “mm”).
    • Social/identity: Indicate group membership or speaker idiosyncrasy.
  • Contextual dependency:
    • NVVs’ interpretation is highly dependent on prosodic realization and situational context—similar acoustic forms may fulfill different pragmatic functions (Batliner et al., 3 Aug 2025).
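
For concreteness, these axes can be written down as an annotation schema that records form, function, and timing together. The Python sketch below is purely illustrative; the class and field names are hypothetical and not drawn from the cited work.

from dataclasses import dataclass
from enum import Enum

class FormalType(Enum):
    NON_VERBAL = "non-verbal"    # e.g., "ah": unconstrained by phonotactics
    SEMI_VERBAL = "semi-verbal"  # e.g., "gee", "ugh": plausible but non-lexical
    VERBAL = "verbal"            # e.g., "oh dear": lexicalized and affective

class FunctionalRole(Enum):
    AFFECTIVE = "affective"              # laughter for joy, sobbing for sadness
    CONVERSATIONAL = "conversational"    # turn-taking, backchannels ("uh-huh")
    SOCIAL_IDENTITY = "social-identity"  # group membership, idiosyncrasy

@dataclass
class NVVAnnotation:
    surface_form: str            # e.g., "laughter"
    formal_type: FormalType
    role: FunctionalRole
    start_s: float               # onset within the recording, seconds
    end_s: float                 # offset, seconds
    context: str = ""            # situational context needed for interpretation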

3. Modeling and Recognition Methodologies

Computational approaches to NVV detection and classification leverage deep learning architectures and robust feature extraction (Lea et al., 2022, Schuller et al., 2022, Koudounas et al., 22 Feb 2025). A representative model structure for NVV event detection uses:

  • Log-mel spectrogram extraction: 16 kHz audio is transformed into 64-dimensional log-mel features (25 ms window, 10 ms stride) (Lea et al., 2022).
  • Temporal convolutional networks (TCNs): Convolutional and grouped Conv1D layers with LeakyReLU and bottleneck-residual structures. Outputs framewise probabilities $p_{c,t}$ for each class $c$ at time $t$. The receptive field (~270 ms) supports detection of both short and long NVV events.
  • Post-processing event logic (a runnable sketch follows the table below):

for each frame t and class c:
    if p_{c,t} > θ_c for the last τ_c frames
       and max(p_{bg,t-50:t}, p_{speech,t-50:t}) < θ_bg
       and no other event in previous 50 frames:
           output event (c, t)

  • Binary cross-entropy objective: Framewise BCE loss with labels from energy-based segmentation (Lea et al., 2022).
  • False positive abatement: Incorporation of “aggressor” data—speech and environmental noise during training—reduces FP/hour by >98% (from 303.9 to 4.65 on speech) (Lea et al., 2022).
  • Speaker verification from NVVs: Embedding alignment between verbal speech and NVVs (e.g., laughter) using a two-stage teacher–student framework, trained with the loss $L_{TS} = \Vert E_v(x_v) - E_n(x_n) \Vert_2^2$, where $E_v$ and $E_n$ denote the verbal-speech and NVV encoders (Lin et al., 2023).
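
As a concrete reading of this objective, the following PyTorch sketch computes $L_{TS}$ for paired verbal/NVV clips; the frozen teacher and batch averaging are assumptions for illustration, not details from (Lin et al., 2023).

import torch

def teacher_student_loss(E_v, E_n, x_v, x_n):
    """L_TS = ||E_v(x_v) - E_n(x_n)||_2^2, summed over the embedding
    dimension; batch averaging and the frozen teacher are assumptions."""
    with torch.no_grad():        # teacher: pretrained verbal-speech encoder
        target = E_v(x_v)        # (batch, embed_dim)
    pred = E_n(x_n)              # student: NVV (e.g., laughter) encoder
    return ((pred - target) ** 2).sum(dim=-1).mean()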

Table: NVV Detection Model Structure (example from (Lea et al., 2022))

| Module             | Details                                    | Purpose                        |
|--------------------|--------------------------------------------|--------------------------------|
| Log-mel extraction | 16 kHz, 64-dim, 25 ms window, 10 ms step   | Captures spectral features     |
| TCN                | Grouped Conv1D, LeakyReLU, bottlenecks     | Learns temporal dynamics       |
| Output             | 1D Conv + sigmoid, 17 classes              | Framewise class probabilities  |
| Post-processing    | Threshold- and duration-based event logic  | Robust event detection         |
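
A runnable NumPy/librosa sketch of the feature front-end and the post-processing pseudocode above is given below. The window/stride and the 50-frame guard follow the description in (Lea et al., 2022); function names, defaults, and the omission of the TCN itself are assumptions for illustration.

import librosa
import numpy as np

def logmel_frontend(waveform, sr=16000):
    """64-dim log-mel features, 25 ms window / 10 ms stride, as in the table."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=400, win_length=400, hop_length=160, n_mels=64
    )
    return librosa.power_to_db(mel)  # (64, num_frames); fed to the TCN (omitted)

def detect_nvv_events(p, theta, tau, bg, speech, theta_bg=0.5, guard=50):
    """Threshold- and duration-based event logic over framewise TCN outputs.

    p:      (num_classes, num_frames) framewise sigmoid probabilities
    theta:  per-class thresholds θ_c; tau: per-class durations τ_c (frames)
    bg, speech: indices of the background and speech classes
    """
    events, last_event = [], -guard - 1
    num_classes, num_frames = p.shape
    for t in range(guard, num_frames):
        for c in range(num_classes):
            if c in (bg, speech):
                continue
            sustained = (p[c, t - tau[c] + 1 : t + 1] > theta[c]).all()
            quiet = max(p[bg, t - guard:t].max(),
                        p[speech, t - guard:t].max()) < theta_bg
            if sustained and quiet and t - last_event > guard:
                events.append((c, t))
                last_event = t
    return events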

4. Corpora and Data Diversity

NVV research relies on both acted and in-context corpora:

  • Large-scale datasets capture population-level diversity and facilitate robust model training; for instance, one corpus comprises 100,000+ mouth-sound clips from 710 adults, stratified by accent, age, gender, and recording device (Lea et al., 2022).
  • Cross-linguistic corpora (e.g., JNV and JVNV for Japanese; NVTTS and NonVerbalSpeech-38K for English and Mandarin) provide coverage of phrase, emotion, and phoneme diversity (Xin et al., 2023, Xin et al., 2023, Borisov et al., 17 Jul 2025, Ye et al., 7 Aug 2025).
  • Corpus-based approaches model NVVs in spontaneous, authentic interactions and real-world conversational structure. These methods enable detailed analysis of context, paradigmatic variability (substitutional properties), and sequential use, but contend with data sparsity and privacy concerns (Batliner et al., 3 Aug 2025).
  • Crowdsourcing, scenario-based emotional elicitation, and fine-grained annotation (including NVV durations and alignment) support more realistic data (Xin et al., 2023, Xin et al., 2023).

5. Applications: Accessibility, Affective Computing, and Expressive Synthesis

NVVs underpin a broad spectrum of practical systems:

  • Accessible voice interaction: Detection of NVVs such as mouth sounds (“pop”, “click”, “eh”) serves as an alternative input modality for users with speech disorders, enabling trigger actions on mobile devices and achieving high precision/recall (88.6%/88.4%) and low false positives (<0.31 FP/hour) (Lea et al., 2022).
  • Expressive TTS synthesis: Corpora such as NVTTS (Borisov et al., 17 Jul 2025), NonVerbalSpeech-38K (Ye et al., 7 Aug 2025), and NVSpeech (Liao et al., 6 Aug 2025) allow fine-tuning synthesis models to generate human-like speech with explicit NVVs, including context-aware insertion at arbitrary transcript positions. Models using variable vocabularies and discrete acoustic codes (HuBERT-based, k-means clustering) accommodate the high acoustic variability of NVVs (Xin et al., 2023); a sketch of the discrete-unit extraction step follows this list.
  • Speaker verification: Laughter and other NVVs can serve as discriminative biometric cues for speaker identification, with well-aligned embeddings improving verification accuracy in cross-modal scenarios (Lin et al., 2023).
  • Emotion recognition and affective analysis: NVVs encode salient emotional cues (e.g., pleasure, fear, pain, achievement) that are less ambiguous than verbal speech. Classification systems leveraging bag-of-audio-words (BoAW), DeepSpectrum, and auDeep features offer benchmarks for emotion inference, albeit with persistent confusion between overlapping classes (Schuller et al., 2022).
  • Cinematic audio source separation (CASS): Integrating NVVs in the speech stem of mixed audio leads to improved fidelity in separating speech from background effects and music in movies, with measurable gains in SDR and subjective preference (Hasumi et al., 3 Jun 2025).
  • Conversational engagement prediction: NVVs, coupled with multimodal behaviors (facial AUs, gestures, backchanneling), vary significantly across cultures and inform engagement models based on LSTM/RNN architectures (Funk et al., 9 Sep 2024).
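
To make the discrete acoustic codes referenced above concrete, the following Python sketch extracts framewise HuBERT features and quantizes them with k-means. The checkpoint, layer choice, and number of units are assumptions for illustration and may differ from (Xin et al., 2023).

import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoFeatureExtractor, HubertModel

# Assumed checkpoint and layer choice; the cited work may differ.
MODEL_ID = "facebook/hubert-base-ls960"
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
hubert = HubertModel.from_pretrained(MODEL_ID).eval()

def hubert_frames(waveform_16k):
    """Framewise HuBERT hidden states for a 16 kHz mono waveform."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return hubert(**inputs).last_hidden_state.squeeze(0).numpy()

def fit_unit_codebook(feature_arrays, n_units=100):
    """Cluster pooled frames; frames are later mapped to centroid IDs."""
    return KMeans(n_clusters=n_units, n_init=10).fit(np.concatenate(feature_arrays))

# units = fit_unit_codebook(train_frames).predict(hubert_frames(wav))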

6. Challenges: Privacy, Sparsity, Modeling Context

Despite their ubiquity, NVVs present unique computational and methodological challenges (Batliner et al., 3 Aug 2025):

  • Privacy restrictions: Authentic, affective NVVs are often produced in private scenarios; recording and publishing data is constrained by ethical limitations.
  • Data sparsity: In both theory-driven and corpus-based approaches, strongly affective NVVs comprise only a small fraction of turns, rendering the training of high-quality recognizers difficult.
  • Contextual ambiguity: NVVs’ multifunctionality means identical forms may serve disparate pragmatic roles; isolating events from their co-occurring linguistic context hampers both interpretability and model generalizability.
  • Model Malnutrition Disorder (MMD): Overreliance on staged or elicited NVV recordings may result in misleading generalizations to spontaneous real-world interactions.

7. Future Directions and Open Problems

Emerging research emphasizes:

  • Corpus-driven solutions: Realistic modeling necessitates capturing NVVs in open-microphone, context-rich conversation rather than solely relying on prompted exemplars. This direction is constrained by privacy and annotation scalability (Batliner et al., 3 Aug 2025).
  • Universal, NVV-specialized representation learning: Foundation models (e.g., voc2vec) pre-trained and fine-tuned on NVV-rich data outperform traditional speech/audio models in classification and clustering across diverse NVV benchmarks (Koudounas et al., 22 Feb 2025); a fine-tuning sketch follows this list.
  • Explicitly controllable ASR/TTS pipelines: Unified architectures that decode and synthesize NVVs as inline tokens (contextually embedded with lexical speech) facilitate scalable annotation, expressive generation, and context-aware human–machine interaction (Liao et al., 6 Aug 2025).
  • Cross-cultural and language-specific analysis: Ongoing work extends corpora and models to multiple languages and cultures, revealing differences in NVV production, perception, and emotional mapping (e.g., vowel type distribution, engagement dynamics) (Xin et al., 2023, Funk et al., 9 Sep 2024).
  • Real-world interface design: NVV-aware technologies are increasingly regarded as key to building emotionally intelligent, accessible, and culturally sensitive conversational agents and assistive devices.
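
As an illustration of the NVV-specialized representation learning direction, the sketch below loads a wav2vec 2.0-style encoder for NVV classification via the HuggingFace transformers API. The checkpoint ID and label set are placeholders rather than the actual voc2vec release.

import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# Placeholder checkpoint; substitute the released voc2vec weights if available.
MODEL_ID = "facebook/wav2vec2-base"
LABELS = ["laughter", "sigh", "cough", "cry", "other"]  # illustrative label set

extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS)
)

def classify_nvv(waveform_16k):
    """Predict an NVV label for a 16 kHz mono clip (after fine-tuning)."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# Fine-tuning would minimize cross-entropy over labelled NVV clips, e.g. with
# transformers.Trainer; only the inference path is sketched here.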

Conclusion

Non-Verbal Vocalisations are an essential, multifunctional dimension of human acoustic communication. They challenge computational analysis due to privacy constraints, context dependency, and data sparsity, yet advances in corpus building, feature engineering, deep learning, and representation modeling are enabling robust detection, classification, and expressive synthesis of NVVs across diverse application domains. Ongoing research seeks to reconcile corpus realism with annotation quality and to extend NVV-aware capabilities in speech technology, affective computing, and multimodal human–computer interaction.