DeepSeek-VL2 and TRILLsson Combination
- The paper presents a novel DIVINE architecture that fuses DeepSeek-VL2 and TRILLsson embeddings through hierarchical VAEs and sparse gated fusion for joint neuro-facial disorder prediction.
- It employs a two-level VAE approach to disentangle local and utterance-level latent representations, yielding interpretable shared and modality-specific features.
- Empirical results show significant gains in classification accuracy and F1 scores, demonstrating robust performance under both full and partial modality inputs.
The combination of DeepSeek-VL2 (visual) and TRILLsson (audio) embeddings represents a state-of-the-art approach for multimodal neuro-facial disorder assessment within the DIVINE framework. This architecture leverages hierarchical disentanglement, adaptive fusion with sparse gating, and learnable clinical symptom representations. The approach performs joint prediction of disorder class and severity, handling synchronized speech and facial video inputs to achieve superior results, generalization to single-modality scenarios, and interpretable clinical representations (Akhtar et al., 11 Jan 2026).
1. Modal Embedding Extraction: DeepSeek-VL2 and TRILLsson
DIVINE utilizes frozen, pretrained foundation models for both visual and audio modalities to extract robust latent representations of neuro-facial data.
- DeepSeek-VL2: The "base" DeepSeek-VL2 vision encoder (weights frozen during DIVINE training) yields per-clip visual embeddings, taken from the final block of the vision transformer (prior to MoE gating).
- TRILLsson: The distilled TRILLsson encoder (also frozen) produces audio embeddings, sourced from its last hidden states.
These embeddings serve as the basis for subsequent disentanglement and fusion operations.
| Model | Extraction Layer |
|---|---|
| DeepSeek-VL2 | Final ViT block (pre-MoE gating) |
| TRILLsson | Last encoder hidden state |
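The extraction step can be sketched as follows. This is a hypothetical stand-in, not the real models: both backbones are frozen in DIVINE, so they are modeled here as fixed random projections, and the embedding dimensions (1024) and input feature sizes (768, 512) are illustrative placeholders not stated in this excerpt.

```python
import numpy as np

# Hypothetical sketch: frozen encoders modeled as fixed random projections.
# All dimensions are illustrative placeholders, not the actual model sizes.
rng = np.random.default_rng(0)
D_VIS, D_AUD = 1024, 1024                          # placeholder embedding sizes
W_VIS = rng.standard_normal((768, D_VIS)) * 0.01   # "frozen" visual weights
W_AUD = rng.standard_normal((512, D_AUD)) * 0.01   # "frozen" audio weights

def deepseek_vl2_embed(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the final ViT block output (pre-MoE gating)."""
    return frames @ W_VIS

def trillsson_embed(audio_feats: np.ndarray) -> np.ndarray:
    """Stand-in for TRILLsson's last hidden state."""
    return audio_feats @ W_AUD

clip_frames = rng.standard_normal((16, 768))   # 16 video frames, toy features
clip_audio = rng.standard_normal((50, 512))    # 50 audio frames, toy features
v, a = deepseek_vl2_embed(clip_frames), trillsson_embed(clip_audio)
print(v.shape, a.shape)                        # (16, 1024) (50, 1024)
```

The key property being modeled is that neither projection receives gradient updates; only the modules downstream of these embeddings are trained.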
2. Hierarchical Disentanglement via Two-Level VAEs
DIVINE introduces a dual-stage VAE disentanglement procedure, operating at both local and global levels for each modality.
- Local-window VAE: Each input sequence is divided into windows; for window $w$ and modality $m$:
  - Posterior parameters: $(\mu_{m,w}, \log\sigma^2_{m,w}) = \mathrm{Enc}_m(x_{m,w})$.
  - Latent sample: $z_{m,w} = \mu_{m,w} + \sigma_{m,w} \odot \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
  - Reconstruction: $\hat{x}_{m,w} = \mathrm{Dec}_m(z_{m,w})$.
  - Loss: $\mathcal{L}^{\mathrm{loc}}_{m,w} = \lVert x_{m,w} - \hat{x}_{m,w} \rVert^2 + \beta\, D_{\mathrm{KL}}\big(q(z_{m,w} \mid x_{m,w}) \,\Vert\, \mathcal{N}(0, I)\big)$.
  - Temporal pooling yields the global vector: $h_m = \frac{1}{W}\sum_{w=1}^{W} z_{m,w}$.
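The local-window step above can be sketched in a few lines of numpy. This is a minimal illustration of the standard VAE reparameterization and loss, with toy window counts and latent sizes chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): the VAE reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def local_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus beta-weighted KL, per window."""
    return np.sum((x - x_hat) ** 2) + beta * kl_to_standard_normal(mu, log_var)

# one modality, 8 local windows, 32-dim latent per window (toy sizes)
mu, log_var = np.zeros((8, 32)), np.zeros((8, 32))
z = reparameterize(mu, log_var)   # (8, 32): one latent sample per window
h = z.mean(axis=0)                # temporal pooling -> utterance-level vector
print(h.shape)                    # (32,)
```

At the prior ($\mu = 0$, $\log\sigma^2 = 0$) the KL term is exactly zero, which is a convenient sanity check for the loss implementation.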
- Utterance-level VAE: For each $h_m$, two parallel encoders decompose the latent into:
  - Shared latent $z^s_m$, with encoder parameters tied across modalities.
  - Private latent $z^p_m$, modality-specific.
  - The full loss combines reconstruction, a $\beta_s$-weighted KL for the shared latent, and a $\beta_p$-weighted KL for the private latent:
    $\mathcal{L}^{\mathrm{utt}}_m = \lVert h_m - \hat{h}_m \rVert^2 + \beta_s\, D_{\mathrm{KL}}\big(q(z^s_m \mid h_m) \,\Vert\, \mathcal{N}(0, I)\big) + \beta_p\, D_{\mathrm{KL}}\big(q(z^p_m \mid h_m) \,\Vert\, \mathcal{N}(0, I)\big)$,
    where $\beta_s$, $\beta_p$ are chosen by validation.
This structure disentangles shared and modality-specific sources at multiple temporal scales, enhancing interpretability and generalization for clinical assessments.
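A minimal sketch of the utterance-level decomposition, assuming Gaussian latents and a tied linear shared encoder (the weight-tying and the two separately weighted KL terms are from the text; all dimensions and the linear form are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_std_normal(mu, log_var):
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def utterance_vae_loss(h, h_hat, mu_s, lv_s, mu_p, lv_p, beta_s=1.0, beta_p=0.5):
    """Reconstruction plus separately weighted KL terms for shared/private latents."""
    recon = np.sum((h - h_hat) ** 2)
    return recon + beta_s * kl_std_normal(mu_s, lv_s) + beta_p * kl_std_normal(mu_p, lv_p)

# the shared encoder is TIED: one weight matrix serves both modalities,
# while each modality would get its own private encoder (not shown)
W_shared = rng.standard_normal((32, 16)) * 0.1
h_audio, h_video = rng.standard_normal(32), rng.standard_normal(32)
z_s_audio, z_s_video = h_audio @ W_shared, h_video @ W_shared

zeros = np.zeros(16)
print(utterance_vae_loss(zeros, zeros, zeros, zeros, zeros, zeros))  # 0.0
```

Tying `W_shared` across modalities is what forces the shared latent space to carry information expressible from either stream, while the private encoders absorb the remainder.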
3. Sparse Gated Fusion and Clinical Token Injection
Following disentanglement, the system adaptively fuses latent spaces and integrates clinical priors.
Sparse gated fusion: For private encodings $z^p_a$, $z^p_v$:
- Gates: $g_a = \sigma(W_a z^p_a)$, $g_v = \sigma(W_v z^p_v)$ (elementwise sigmoid).
- Fused representation: $f = g_a \odot z^p_a + g_v \odot z^p_v$.
- Sparsity is regularized with $\mathcal{L}_{\mathrm{gate}} = \lVert g_a \rVert_1 + \lVert g_v \rVert_1$.
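The gating arithmetic above is simple enough to state directly; this sketch uses identity gate weights purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_gated_fusion(z_a, z_v, W_a, W_v):
    """Elementwise gates decide, per dimension, how much each private latent
    contributes; the L1 penalty on the gates encourages sparse selection."""
    g_a, g_v = sigmoid(W_a @ z_a), sigmoid(W_v @ z_v)
    fused = g_a * z_a + g_v * z_v
    l1_penalty = np.abs(g_a).sum() + np.abs(g_v).sum()
    return fused, l1_penalty

rng = np.random.default_rng(0)
d = 16
z_a, z_v = rng.standard_normal(d), rng.standard_normal(d)
fused, pen = sparse_gated_fusion(z_a, z_v, np.eye(d), np.eye(d))
print(fused.shape)   # (16,)
```

Note that if one modality is absent and its latent is zeroed, its contribution to `fused` vanishes regardless of the gate value, which is one mechanism behind the missing-modality robustness claimed later.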
- Learnable symptom tokens: The fusion vector is prepended with $K$ learned "symptom tokens" $t_1, \dots, t_K$:
  - Sequence: $S = [t_1; \dots; t_K; f]$.
  - A transformer-like dense block processes $S$ into the final representation $u$, with a token-specialization penalty $\mathcal{L}_{\mathrm{tok}}$ encouraging distinct tokens.
This layered fusion enables interpretability (by relating features to clinical symptom axes) and provides robustness to missing modalities.
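The token-injection step can be sketched as below. The excerpt names a token-specialization penalty but not its exact form, so the pairwise-cosine-similarity penalty here is a hypothetical choice, as are the token count and dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 16                                   # hypothetical: 4 symptom tokens, 16-dim
tokens = rng.standard_normal((K, d)) * 0.02    # learnable symptom tokens
fused = rng.standard_normal(d)                 # output of sparse gated fusion

# prepend the tokens to the fused vector: a (K+1)-element "sequence"
seq = np.vstack([tokens, fused[None, :]])      # (K+1, d)

def token_specialization_penalty(tokens):
    """Hypothetical penalty: pairwise cosine similarity between tokens, so that
    each token specializes on a distinct clinical symptom axis."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = t @ t.T
    off_diag = sim - np.diag(np.diag(sim))
    return np.sum(off_diag ** 2)

print(seq.shape)   # (5, 16)
```

After the dense block, each token position can be inspected individually, which is what makes the symptom-axis interpretability claim testable.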
4. Multitask Prediction and Aggregate Loss
The architecture supports joint diagnosis and severity scoring through multitask output heads.
- Heads: Classification and severity, with softmax outputs:
  $\hat{y}_{\mathrm{cls}} = \mathrm{softmax}(W_{\mathrm{cls}}\, u)$, $\hat{y}_{\mathrm{sev}} = \mathrm{softmax}(W_{\mathrm{sev}}\, u)$,
  where $u$ is the output of the dense block applied to the fused representation.
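The two heads share the representation $u$ and differ only in their projection matrices. A minimal sketch (the class and severity-level counts are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
u = rng.standard_normal(16)             # fused representation after the dense block
W_cls = rng.standard_normal((5, 16))    # hypothetical: 5 disorder classes
W_sev = rng.standard_normal((3, 16))    # hypothetical: 3 severity levels

p_cls, p_sev = softmax(W_cls @ u), softmax(W_sev @ u)
print(p_cls.shape, p_sev.shape)         # (5,) (3,)
```

Because both heads read the same $u$, gradients from the severity task regularize the classification features and vice versa, which is the usual motivation for multitask heads.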
- Losses:
- Cross-entropy for classification ($\mathcal{L}_{\mathrm{cls}}$) and severity ($\mathcal{L}_{\mathrm{sev}}$).
- Cycle-consistency ($\mathcal{L}_{\mathrm{cyc}}$) aligns shared latents across modalities.
- Sparse gating and token penalties as above.
- Full VAE reconstruction and KL objectives.
The total loss is:
$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{sev}} + \lambda_1 \mathcal{L}_{\mathrm{VAE}} + \lambda_2 \mathcal{L}_{\mathrm{cyc}} + \lambda_3 \big(\mathcal{L}_{\mathrm{gate}} + \mathcal{L}_{\mathrm{tok}}\big),$
with fixed hyperparameters $\lambda_1$, $\lambda_2$, $\lambda_3$.
5. Training Protocol and Hyperparameterization
The model is trained and evaluated on the Toronto NeuroFace dataset using subject-wise five-fold cross-validation.
- Optimization: Adam optimizer with a fixed learning rate, batch size 32, up to 50 epochs with early stopping.
- Model size: Fusion models contain 3.5–6.5M trainable parameters; unimodal variants contain roughly 1M.
- CNN refinement: Each backbone includes two 1D convolutional blocks and fully connected layers for embedding refinement.
- Regularization: Dropout and weight regularization are applied to the output heads as required.
These choices are calibrated for both convergence and robust generalization under full or partial modality input regimes.
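Subject-wise cross-validation means that folds are split by subject rather than by clip, so no subject appears in both the training and test partition of a fold. A minimal pure-Python sketch (not the authors' code; subject IDs are invented for the example):

```python
import numpy as np

def subject_wise_folds(subject_ids, n_folds=5, seed=0):
    """Assign whole subjects (never individual clips) to folds, so no subject
    appears in both train and test sides of the same fold."""
    rng = np.random.default_rng(seed)
    subjects = np.array(sorted(set(subject_ids)))
    rng.shuffle(subjects)
    fold_of_subject = {s: i % n_folds for i, s in enumerate(subjects)}
    return np.array([fold_of_subject[s] for s in subject_ids])

# toy example: 5 subjects contributing different numbers of clips
clips = ["s1"] * 3 + ["s2"] * 2 + ["s3"] * 4 + ["s4"] * 1 + ["s5"] * 2
folds = subject_wise_folds(clips)

# every clip of a given subject lands in the same fold
assert len({f for c, f in zip(clips, folds) if c == "s1"}) == 1
print(sorted(int(f) for f in set(folds)))
```

This is the standard safeguard against identity leakage in clinical datasets, where clip-level random splits would let the model memorize subject-specific appearance or voice.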
6. Empirical Performance and Ablative Analyses
The DIVINE framework using DeepSeek-VL2 and TRILLsson embeddings achieves strong performance and demonstrates the impact of each architectural component through extensive ablation studies.
- Unimodal CNN accuracy (multitask):
- DeepSeek-VL2: 88.94%
- TRILLsson: 90.51%
- Naive DS+TR concatenation: 94.65% accuracy, 93.87% F1
- Full DIVINE (DS+TR): 98.26% accuracy, 97.51% F1
- Modality-Constrained Regimes (accuracy / F1):
- Audio only: 89.27% / 88.23%
- Video only: 84.34% / 83.20%
- Regularization Ablation (DS+TR, accuracy / F1):
- w/o cycle-consistency: 96.14% / 94.95%
- w/o sparse gate: 95.83% / 94.21%
- w/o token loss: 95.62% / 93.89%
- Bottleneck Ablation (accuracy / F1):
- Flat fusion: 93.87% / 92.10%
- Single-level VAE: 95.22% / 93.80%
- Two-level (full) DIVINE: 98.26% / 97.51%
This empirical evidence quantifies the contribution of hierarchical disentanglement, sparse gating, and symptom tokenization relative to baseline encoders and naive fusion.
7. Context, Implications, and Outlook
The DIVINE framework, as the first approach to integrate cross-modal disentanglement, adaptive sparse gating, and multitask predictive heads for oro-facial neurological assessment, establishes a new empirical standard for multimodal fusion using DeepSeek-VL2 and TRILLsson representations (Akhtar et al., 11 Jan 2026).
Its design enables:
- Clinical interpretability through explicit shared/private latent decomposition.
- Robustness to missing modalities via sparse gating and multitask heads.
- Superior accuracy and F1 compared to unimodal approaches and simple fusion baselines, particularly in challenging cross-modality clinical settings.
A plausible implication is that this paradigm of multimodal representation disentanglement, dense fusion informed by symptom priors, and multitask learning may extend beyond neuro-facial disorder diagnostics to other clinical, behavioral, or affective computing domains requiring joint modeling of speech and facial data.