Voice Activity Projection (VAP)

Updated 9 March 2026

Voice Activity Projection (VAP) is a self-supervised, frame-incremental model that predicts joint future voice activity patterns from past audio and VAD signals.
It employs a multi-stage encoder–transformer architecture with self- and cross-attention layers to capture fine-grained prosodic, segmental, and contextual cues in both dyadic and multi-party dialogues.
VAP enables real-time, low-latency turn-taking predictions validated by high F1 scores and robust performance across multiple languages and noisy environments.

Voice Activity Projection (VAP) is a self-supervised, frame-incremental predictive modeling paradigm for turn-taking in spoken dialogue. It formalizes turn-taking as the joint anticipation of who will be speaking—at what points in time—over a short future horizon, using only acoustic and low-level contextual signals as input. The VAP framework is designed to capture fine-grained trajectory patterns of voice activity, including turn-shifts, holds, backchannels, and overlapping speech, thereby enabling continuous, low-latency prediction of interactional timing in both dyadic and multi-party contexts. VAP models are theoretically grounded, architecturally modular, and extensively validated on diverse dialogue corpora and interactive systems.

1. Formal Definition and Mathematical Formulation

At each frame $t$, the VAP model receives as input all past audio $\mathbf{x}_{1:t}$ and voice activity indicators $\mathbf{v}_{1:t}$ (mixed or per-channel), and outputs a distribution over possible patterns of future voice activity for all interlocutors across a fixed projection window. Typically, this window $T$ is set to 2 s and discretized into bins (commonly four per speaker), each encoding a coarse binary “voiced/unvoiced” state (active if more than half of its frames are voiced). In the canonical dyadic case, this gives $K = 2 \times 4 = 8$ bins in total and therefore $C = 2^K = 256$ possible joint future patterns. The model learns $p(c_t \mid \mathbf{x}_{1:t}, \mathbf{v}_{1:t})$, where $c_t \in \{0, \ldots, 255\}$ indexes the future activity pattern or state class (Ekstedt et al., 2022, Ekstedt et al., 2022).
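The label construction described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function names, the equal-width bins, and the bit ordering (speaker 0, bin 0 as the most significant bit) are all assumptions made for the example.

```python
from typing import List, Sequence


def bin_is_active(frames: Sequence[int]) -> int:
    """Return 1 if more than half of the frames in the bin are voiced."""
    return int(sum(frames) * 2 > len(frames))


def encode_future_state(
    future_va: Sequence[Sequence[int]], bins_per_speaker: int = 4
) -> int:
    """Map per-speaker future VAD frames to a single joint class index.

    future_va[s] holds the binary VAD frames of speaker s over the
    projection window. Equal-width bins are an assumption of this sketch.
    """
    bits: List[int] = []
    for frames in future_va:
        width = len(frames) // bins_per_speaker
        for b in range(bins_per_speaker):
            bits.append(bin_is_active(frames[b * width:(b + 1) * width]))

    # Read the concatenated bits as one binary number (hypothetical ordering).
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


# Two speakers, 8 future frames each, 4 bins of 2 frames per speaker.
speaker_a = [1, 0, 1, 1, 0, 0, 0, 0]  # only the second bin is > 50% voiced
speaker_b = [0, 0, 0, 0, 0, 0, 1, 1]  # only the last bin is voiced
state = encode_future_state([speaker_a, speaker_b])
```

With two speakers and four bins each, the index always falls in the range 0 to 255, matching the 256-class dyadic state space.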

The primary training objective is self-supervised cross-entropy loss over the predicted state sequence:

$\mathcal{L} = - \log p(c_t^{*} \mid \mathbf{x}_{1:t}, \mathbf{v}_{1:t})$

where $c_t^{*}$ is derived from ground-truth voice activity extracted via VAD.
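As a rough illustration of this objective, the sketch below computes the negative log-likelihood of the ground-truth class from raw logits. The function names are hypothetical, and averaging the per-frame loss over a sequence is an assumption of this example rather than something the source specifies.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one frame's class logits."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


def vap_frame_loss(logits: Sequence[float], target_class: int) -> float:
    """Cross-entropy for one frame: -log p(c_t^*)."""
    return -log_softmax(logits)[target_class]


def vap_sequence_loss(
    frame_logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Mean per-frame loss over a sequence (averaging is an assumption)."""
    losses = [vap_frame_loss(l, c) for l, c in zip(frame_logits, targets)]
    return sum(losses) / len(losses)


# An untrained model with uniform logits over 256 classes scores log(256).
uniform_loss = vap_frame_loss([0.0] * 256, target_class=7)
```

A uniform prediction over 256 classes therefore gives a loss of about 5.55 nats, which is a useful sanity check at the start of training.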

2. Model Architectures and Incremental Processing

The VAP paradigm admits several architectural variants, most of which share a multi-stage encoder–transformer–head pipeline:

Audio Encoder: Pre-trained Contrastive Predictive Coding (CPC) network operating on 16 kHz waveforms, outputting frame-wise embeddings (commonly 256- or 512-dimensional) for each channel. These embeddings encapsulate low-level, prosodically rich features, and may be fine-tuned or frozen during VAP training (Inoue et al., 2024, Saga et al., 4 Jun 2025).
Contextual Feature Augmentation: Current and historical voice-activity indicators (including per-speaker ratios across windows) are projected into dense vectors and summed or concatenated with the CPC representations.
Self-Attention and Cross-Attention Layers: One or more transformer encoder layers per channel capture local temporal dependencies. Subsequently, multi-layer cross-attention transformers allow each participant’s sequence to attend to the context of all others, modeling interactional contingencies (overlap, silence triggers, etc.) (Inoue et al., 2024, Elmers et al., 10 Jul 2025).
Prediction Heads:
- VAP Head: Output is a 256-way (dyadic) or 64-way (triadic) softmax for the future-activity state, or per-bin Bernoulli outputs for independent binning.
- VAD Head: Auxiliary heads compute current voice activity per channel.
- Multimodal Heads (optional): Additional predictors may operate on facial, gaze, or gesture features (Saga et al., 4 Jun 2025).
Causal Incrementality: All operations are strictly forward in time, supporting real-time inference.
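The causal constraint in the pipeline above can be illustrated with a toy attention layer. This is a sketch under strong simplifications: it uses a single head with no learned projections, and the `context` parameter (a window measured in frames) is a hypothetical name. In a cross-attention layer, the other speaker's frames would supply the keys and values under the same mask.

```python
import math
from typing import List

Vector = List[float]


def causal_attention(frames: List[Vector], context: int) -> List[Vector]:
    """Dot-product attention where frame t sees only frames t-context+1 .. t."""
    dim = len(frames[0])
    outputs: List[Vector] = []
    for t, query in enumerate(frames):
        start = max(0, t - context + 1)
        window = frames[start:t + 1]  # never includes frames after t
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in window
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, window)) / total
             for d in range(dim)]
        )
    return outputs


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
original = causal_attention(frames, context=3)

# Changing the last frame must not alter the outputs for earlier frames.
perturbed = causal_attention(frames[:3] + [[9.0, 9.0]], context=3)
```

Because each output depends only on current and past frames, predictions for earlier frames stay fixed as new audio arrives, which is what makes frame-by-frame streaming inference possible.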

3. Prosodic, Segmental, and Multimodal Information Utilization

VAP models implicitly leverage multiple dimensions of conversational cues:

Prosody: Systematic perturbations isolating pitch (F0 flattening and shifting), intensity (energy flattening), and segmental detail (low-pass filtering) reveal that VAP models exploit both pitch and energy, and that low-level phonetic detail is critical for robust shift and backchannel prediction. Energy cues play a role equal to or exceeding that of pitch, particularly for long-form predictions; segmental degradation reduces performance to near-baseline levels, especially for shift/hold and backchannel tasks (Ekstedt et al., 2022).
Lexico-syntactic Ambiguity: VAP’s sensitivity to pitch is amplified when lexical content is ambiguous (e.g., short/long question pairs with identical phrasing but diverging prosodic completions).
Multimodal Inputs: Extensions incorporating facial expression encoders (e.g., Vision Transformers trained on dynamic facial datasets), head/gaze/body pose, and even action unit time series demonstrably boost performance in backchannel and shift-prediction metrics. Subtle facial cues—micro-smiles, eyebrow raises—are dynamically weighted via gating mechanisms, with learned fusion outperforming both unimodal and simple action-unit-based baselines (Saga et al., 4 Jun 2025).
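To make the idea of an intensity perturbation concrete, the sketch below rescales each short frame of a waveform to the same RMS level. It is a crude stand-in and not the procedure used in the cited studies, whose exact signal-processing pipeline is not described here; the function names, the frame length, and the target level are assumptions.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def flatten_intensity(
    samples: Sequence[float],
    frame_len: int,
    target_rms: float = 0.1,
    floor: float = 1e-8,
) -> List[float]:
    """Rescale every frame to target_rms, removing the energy contour."""
    flattened: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = frame_rms(frame)
        # Leave (near-)silent frames untouched instead of amplifying noise.
        gain = target_rms / level if level > floor else 1.0
        flattened.extend(s * gain for s in frame)
    return flattened


# One loud frame followed by one quiet frame.
signal = [0.5, -0.5, 0.5, -0.5, 0.01, -0.01, 0.01, -0.01]
flat = flatten_intensity(signal, frame_len=4)
```

After flattening, loud and quiet frames carry the same energy, so a model evaluated on such input can no longer rely on intensity dynamics.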

4. Evaluation, Metrics, and Comparative Performance

VAP is evaluated on a spectrum of turn-taking tasks:

Shift/Hold: Predict, during mutual silence, whether the turn will shift or hold. VAP yields high weighted F1 (e.g., 0.899 on Switchboard, dyadic English) and consistently outperforms independent and other comparison models, especially for backchannel and shift prediction, where joint modeling is vital (Ekstedt et al., 2022).
Shift Prediction: F1 scores of about 0.733 are typical, with low-pass and intensity perturbations causing the largest drops (Ekstedt et al., 2022).
Backchannel Prediction: VAP's joint pattern encoding is necessary to detect short, isolated backchannels. Ablations confirm intensity cues supersede pitch for backchannel timing/type (Inoue et al., 2024).
Real-time Inference: VAP models support sub-100 ms decision latency with a 1 s transformer context window, achieving 76.2% balanced accuracy on triadic Japanese test sets while running faster than real time on a CPU (Inoue et al., 2024).
Comparison with LLMs: In multi-modal LLM–VAP ensembles (e.g., Lla-VAP), VAP dominates audio timing sensitivity for turn-final detection, while LLMs add linguistic specificity. Fused (LSTM) ensembles reach up to 93.2% accuracy for transition-relevance places (TRPs) on complex U.S. English conversational datasets (Jeon et al., 2024).
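The two summary metrics quoted above, weighted F1 and balanced accuracy, can be computed as follows. This is a minimal sketch using the standard definitions; the helper names and the toy shift/hold labels are made up for illustration and are not drawn from any of the cited evaluations.

```python
from typing import List


def class_f1(y_true: List[str], y_pred: List[str], label: str) -> float:
    """F1 score for a single class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    labels = sorted(set(y_true))
    total = sum(class_f1(y_true, y_pred, l) * y_true.count(l) for l in labels)
    return total / len(y_true)


def balanced_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class recall."""
    labels = sorted(set(y_true))
    recalls = []
    for l in labels:
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == l and p == l)
        recalls.append(hits / y_true.count(l))
    return sum(recalls) / len(recalls)


y_true = ["shift", "shift", "hold", "hold", "hold", "hold"]
y_pred = ["shift", "hold", "hold", "hold", "hold", "shift"]
wf1 = weighted_f1(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred)
```

Weighted F1 favors the majority class (holds are usually more frequent than shifts), whereas balanced accuracy gives each class equal weight, so the two numbers are not directly comparable.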

5. Extensions: Multilingual, Multi-party, and Robustification

VAP generalizes to new settings and input modalities:

Multilingual Modeling: Training on pooled English, Mandarin, and Japanese dialogue corpora yields a single VAP model matching monolingual benchmarks and delivering robust performance in cross-lingual transfer. In tonal/intonational languages, pitch flattening has a substantially larger adverse effect than in English, confirming language-specific prosodic weighting (Inoue et al., 2024).
Triadic and Multi-party Projection: For $N$ speakers, the projection window is partitioned into $M$ bins per speaker, yielding $2^{NM}$ joint states. In triadic Japanese conversation ($C = 64$ classes), VAP achieves next-speaker prediction accuracy of up to 87.5%. Conversation style (spontaneous vs. attentive listening) modulates accuracy, with less overlap and more predictable backchannels in structured settings (Elmers et al., 10 Jul 2025).
Noise Robustness: Field deployment in a shopping mall demonstrates that VAP models, augmented with multi-condition training (CHiME4/DEMAND/MUSAN noise at various SNRs), consistently reduce response latencies (robot: 2.14→0.71 s) and improve subjective fluidity, even under adverse acoustic conditions (Inoue et al., 8 Mar 2025).
Human–Robot Integration: Fusion of VAP with pragmatic completion models (e.g., TurnGPT) in HRI systems reduces both delay and interruption rates, and a majority of users prefer the resulting system over silence-threshold baselines (Skantze et al., 15 Jan 2025).
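Multi-condition training of the kind mentioned under noise robustness typically relies on mixing clean speech with noise at a chosen signal-to-noise ratio. The sketch below shows that arithmetic only; it is not the augmentation pipeline of the cited work, and the function names and the equal-length requirement are assumptions.

```python
import math
from typing import List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Average power (mean squared amplitude) of a signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(
    speech: Sequence[float], noise: Sequence[float], snr_db: float
) -> List[float]:
    """Add noise to speech, scaled so the mixture has the requested SNR."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Drawing `snr_db` at random from a range during training exposes the model to a spread of acoustic conditions rather than a single noise level.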

6. Theoretical, Methodological, and Implementation Considerations

Joint vs. Independent Modeling: Modeling joint future patterns (via a discrete, enumerated state space) provides statistically principled handling of trajectory dependencies in overlapping, backchannel, or ambiguous contexts. Independent bin-wise prediction cannot capture non-factorizable conversational motifs (Ekstedt et al., 2022).
Incremental, Causal Inference: VAP is strictly causal, supporting frame-synchronous inference and integration into live dialogue systems.
Training Regimes: VAP is self-supervised by construction, relying on automatically obtained VAD signals and dispensing with explicit annotation of turn boundaries.
Hyperparameters and Context Windows: Short attention contexts (e.g., 1 s) suffice for high accuracy, with longer windows not improving performance and incurring quadratic resource growth.
Encoder Choices: CPC encoders pre-trained on LibriSpeech are favored over alternative encoders such as MMS, which underperform when frozen and overfit when fine-tuned (Inoue et al., 2024).
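The point about joint versus independent modeling can be checked on a toy case. The sketch below uses a hypothetical two-bin state (one bin per speaker) in which exactly one speaker is always active, and compares the joint distribution with the best independent, product-of-marginals approximation. The names and numbers are illustrative only.

```python
from itertools import product
from typing import Dict, List, Tuple

State = Tuple[int, ...]


def bin_marginals(joint: Dict[State, float], n_bins: int) -> List[float]:
    """P(bin i is active) for each bin, obtained by summing the joint."""
    return [
        sum(p for state, p in joint.items() if state[i] == 1)
        for i in range(n_bins)
    ]


def independent_approximation(
    joint: Dict[State, float], n_bins: int
) -> Dict[State, float]:
    """Distribution that treats every bin as an independent Bernoulli."""
    marg = bin_marginals(joint, n_bins)
    approx: Dict[State, float] = {}
    for state in product((0, 1), repeat=n_bins):
        prob = 1.0
        for bit, m in zip(state, marg):
            prob *= m if bit else 1.0 - m
        approx[state] = prob
    return approx


# Either speaker A talks or speaker B talks, never both and never neither.
joint = {(1, 0): 0.5, (0, 1): 0.5}
approx = independent_approximation(joint, n_bins=2)
```

The independent model assigns probability 0.25 to overlap and 0.25 to mutual silence even though the joint distribution rules both out, which is the kind of non-factorizable structure a joint state space can represent.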

7. Limitations, Controversies, and Future Research

While VAP sets a benchmark for conversational timing prediction, limitations remain:

Class Explosion: The joint state space grows exponentially, as $2^{NM}$ for $N$ speakers with $M$ bins each, which constrains extension to larger groups; factorization and state-space reduction strategies are needed for broad multiparty deployment.
Cross-domain Generalization: Field tests are predominantly in Japanese and English. The impact of sociophonetic and cultural variation, and transfer learning across domains, is not comprehensively addressed (Inoue et al., 8 Mar 2025).
Behavioral and Pragmatic Integration: VAP's effect on deep conversational adaptation (e.g., speaker strategy or engagement) is not fully reflected in behavioral endpoints, despite improved subjective ratings.
Multimodal Cues: Fine-grained integration of visual cues (gaze, facial expression, body pose) and lexical cues is promising but still under exploration (Saga et al., 4 Jun 2025).
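The class-explosion limitation is easy to quantify. Assuming one binary bin per speaker per time slot, so that the joint output has 2 to the power of (speakers times bins per speaker) classes, the sketch below compares that count with the number of outputs a fully factorized per-bin head would need. The function names are hypothetical.

```python
def joint_state_count(n_speakers: int, bins_per_speaker: int) -> int:
    """Number of classes in a joint softmax over all binary bins."""
    return 2 ** (n_speakers * bins_per_speaker)


def factorized_output_count(n_speakers: int, bins_per_speaker: int) -> int:
    """Number of Bernoulli outputs if every bin is predicted independently."""
    return n_speakers * bins_per_speaker


for speakers in (2, 3, 4, 5):
    joint = joint_state_count(speakers, bins_per_speaker=4)
    factorized = factorized_output_count(speakers, bins_per_speaker=4)
    print(f"{speakers} speakers: {joint} joint classes vs {factorized} outputs")
```

Under this formula, a 256-way dyadic output corresponds to four bins per speaker, and a 64-way triadic output is consistent with two bins per speaker; keeping four bins for three speakers would already require 4096 classes.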

Further work involves robust scaling to more parties, enriched multi-modal input, improved noise robustness, on-device quantization, and deeper integration with end-to-end dialogue management frameworks.