Tri-modal Neighborhood Consistency (TNC)
- Tri-modal Neighborhood Consistency (TNC) is a criterion that jointly evaluates the geometric alignment across acoustics, EEG, and Audio LLM representations.
- It calculates squared Spearman RSA correlations among all modality pairs to mitigate modality-specific noise and confounding effects.
- TNC has been validated on naturalistic speech EEG data, effectively isolating genuine tri-modal correspondence in experimental settings.
Tri-modal Neighborhood Consistency (TNC) is a quantitative criterion for assessing whether the fine-grained representational geometry of an acoustic stimulus is jointly preserved across auditory presentation, neural response, and Audio Large Language Model (Audio LLM) hidden states. In contrast to traditional pairwise alignment methods, TNC requires simultaneous geometric agreement among all three modalities, which reduces the influence of modality-specific noise and shared confounds. The criterion addresses weaknesses in standard similarity metrics and enables principled evaluation of tri-modal alignment in naturalistic speech and EEG studies (Yang et al., 23 Jan 2026).
1. Conceptual Motivation
Standard model–EEG alignment, employing metrics such as Representational Similarity Analysis (RSA) and Centered Kernel Alignment (CKA), typically quantifies similarity between two modalities (e.g., model representations and neural data). However, pairwise scores can be artificially enhanced by shared sources of variance (for example, both model and EEG varying with low-level acoustics), or suppressed by noise unique to particular modalities. TNC was introduced to enforce a stricter standard: it only reports strong alignment when the within-sentence neighborhood structure is jointly preserved in (i) acoustic features, (ii) EEG signals, and (iii) Audio LLM hidden states. TNC penalizes situations where the neighborhood geometry matches in only a subset of modality pairs, remaining conservative when spurious correlations arise.
A plausible implication is that TNC more accurately isolates genuine tri-modal correspondence, filtering out effects that occur due to confounds or incomplete representational overlap.
2. Formal Definition and Mathematical Formulation
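Given a sentence $s$ segmented into $T$ time steps, three modalities are defined:
- $A$: sequence of acoustic features,
- $E$: EEG features aligned to the same time grid,
- $H^{(\ell)}$: model layer-$\ell$ embeddings.
For each modality $X \in \{A, E, H^{(\ell)}\}$, a representational dissimilarity matrix (RDM) over all pairs of time indices is computed:
$$D_X(i, j) = 1 - r(x_i, x_j), \qquad 1 \le i, j \le T,$$
where $r$ denotes the Pearson correlation over feature dimensions and $x_i$ is the feature vector of $X$ at time step $i$. The strictly upper-triangular entries are vectorized:
$$v_X = \big(D_X(i, j)\big)_{i < j}.$$
For every modality pair $(X, Y)$, the Spearman RSA correlation is calculated:
$$\rho_{XY} = \operatorname{Spearman}(v_X, v_Y).$$
The TNC at layer $\ell$ for sentence $s$ is:
$$\mathrm{TNC}^{(\ell)}(s) = \frac{1}{3}\left(\rho_{AE}^2 + \rho_{AH}^2 + \rho_{EH}^2\right).$$
Each $\rho_{XY}^2 \in [0, 1]$, ensuring $\mathrm{TNC}^{(\ell)}(s) \in [0, 1]$. A TNC near 1 indicates high Spearman RSA magnitudes for all three modality pairs; if only a single pair shows high similarity, TNC remains low.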
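The definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names (`rdm_vector`, `average_ranks`, `spearman`, `tnc`) and the toy arrays are assumptions introduced here, and the RDM is taken to be one minus the Pearson correlation between time steps.

```python
import numpy as np

def rdm_vector(features):
    """Strictly upper-triangular entries of the 1 - Pearson RDM.

    features: array of shape (T, d), one row per time step.
    """
    rdm = 1.0 - np.corrcoef(features)   # (T, T): rows are compared over feature dims
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    mean_rank = np.cumsum(counts) - (counts - 1) / 2.0
    return mean_rank[inverse]

def spearman(u, v):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(u), average_ranks(v))[0, 1])

def tnc(acoustic, eeg, hidden):
    """Mean of the three squared pairwise Spearman RSA correlations."""
    v_a, v_e, v_h = (rdm_vector(m) for m in (acoustic, eeg, hidden))
    rhos = np.array([spearman(v_a, v_e), spearman(v_a, v_h), spearman(v_e, v_h)])
    return float(np.mean(rhos ** 2))

# Toy data only (not from the study): T = 12 time steps.
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 5))        # stand-in acoustic features
E = 2.0 * A + 3.0                   # same geometry as A (Pearson ignores scale and shift)
H_same = A.copy()                   # "model" features that share the geometry
H_rand = rng.normal(size=(12, 8))   # unrelated "model" features

tnc_all = tnc(A, E, H_same)         # all three pairs agree
tnc_one = tnc(A, E, H_rand)         # only the acoustic-EEG pair agrees
```

In the second case one squared term equals 1 while the other two share a single value $\rho^2$, so the score is $(1 + 2\rho^2)/3$. It sits near 1/3 when the unrelated modality is uncorrelated with the others, which is the conservative behavior the criterion is designed for.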
3. Computational Workflow
Analysis proceeds as follows for a given set of sentences and a pretrained Audio LLM with $L$ layers:
- Acoustic feature extraction: Compute a time-aligned low-dimensional descriptor sequence (e.g., log-Mel features, PCA-reduced MFCC).
- EEG preprocessing and alignment: Segment raw EEG into sentence epochs, z-score the electrodes, and interpolate to the model’s token grid ($T$ steps) for each sentence.
- Model feature extraction: Input the audio into the Audio LLM, record hidden states from all layers, and optionally reduce each layer to a fixed number of PCA components.
- RDM computation: For each modality $X$, construct the RDM $D_X$ and vectorize its strictly upper-triangular entries to $v_X$.
- Pairwise RSA: Calculate the Spearman RSA correlations $\rho_{AE}$ (acoustics–EEG), $\rho_{AH}$ (acoustics–model), and $\rho_{EH}$ (EEG–model) for each sentence and layer.
- TNC aggregation: Square each $\rho$ and take the mean, as per the TNC formula.
- Permutation testing: Optionally, perform significance analysis via time-shuffle permutation in one modality.
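The preprocessing steps in the list above can be sketched as follows. This is an illustration under stated assumptions: the source says only that EEG is "interpolated" to the token grid, so linear interpolation is an assumption here, as are the helper names (`zscore_channels`, `resample_to_grid`, `pca_reduce`), the toy shapes, and the component count.

```python
import numpy as np

def zscore_channels(eeg):
    """Standardize each electrode over time. eeg: (n_samples, n_channels)."""
    mean = eeg.mean(axis=0, keepdims=True)
    std = eeg.std(axis=0, keepdims=True)
    std = np.where(std == 0, 1.0, std)   # flat channels stay at zero instead of dividing by 0
    return (eeg - mean) / std

def resample_to_grid(signal, n_steps):
    """Linearly interpolate every channel onto n_steps evenly spaced time points."""
    n_samples, n_channels = signal.shape
    src = np.linspace(0.0, 1.0, n_samples)
    dst = np.linspace(0.0, 1.0, n_steps)
    return np.column_stack([np.interp(dst, src, signal[:, c]) for c in range(n_channels)])

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy data only: one sentence epoch and one layer of hidden states.
rng = np.random.default_rng(0)
raw_epoch = rng.normal(size=(1000, 60))   # 1000 samples x 60 electrodes
eeg_z = zscore_channels(raw_epoch)
eeg_grid = resample_to_grid(eeg_z, n_steps=25)      # EEG on a 25-step token grid
hidden = rng.normal(size=(25, 64))                  # hidden states on the same grid
hidden_low = pca_reduce(hidden, n_components=10)    # component count is a placeholder
```

After these steps every modality has one row per time step, which is what the RDM computation requires.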
Hyperparameters for TNC experiments include the number of PCA components, the number of permutations, and the prosody clustering parameters (a valence threshold and the number of clusters).
4. Integration With Experimental Protocols
TNC has been applied to datasets including “Alice in Wonderland” EEG (84 sentences, 60 channels at 500 Hz) and naturalistic speech EEG (OpenNeuro ds004408: 736 sentences, 128 channels at 512 Hz), with computation across all transformer blocks and specialized prosody analyses using the final layer. For time-resolved analysis, sentences were subdivided into four 250 ms windows, with the token grid mapped uniformly across temporal intervals.
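One way to read the uniform mapping of the token grid onto the four windows is to split the time-step indices into four contiguous blocks. The sketch below does exactly that; the function name `window_slices`, the assumption that the grid spans the four windows evenly, and the 20-step toy grid are illustrative choices, not details confirmed by the source.

```python
import numpy as np

def window_slices(n_steps, n_windows=4, window_ms=250):
    """Split a uniform grid of n_steps into n_windows contiguous index blocks."""
    edges = np.linspace(0, n_steps, n_windows + 1).round().astype(int)
    return {
        f"{k * window_ms}-{(k + 1) * window_ms} ms": np.arange(edges[k], edges[k + 1])
        for k in range(n_windows)
    }

windows = window_slices(n_steps=20)
# Rows of any (T, d) feature matrix can then be selected per window,
# e.g. features[windows["250-500 ms"]], before building that window's RDM.
```

Each window needs several time steps for a rank correlation over RDM entries to be meaningful, so very short token grids are poorly suited to this split.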
Complementary metrics (Pearson/Kendall RSA, dCor, RV, MI, CKA-Lin/RBF) were evaluated for standard model–EEG alignment but do not contribute directly to TNC. For affective analyses, sentence prosody was quantified via eGeMAPS/openSMILE features (pitch $F_0$, Hammarberg index, and a spectral measure), which were z-scored and weighted to form a valence proxy. Sentences whose valence fell below the negative threshold or above the positive threshold were labeled negative or positive, respectively, and TNC statistics were computed for each affect segment.
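The valence labeling can be sketched as a weighted sum of z-scored prosodic features followed by thresholding. Everything numeric here is hypothetical: the weights, the threshold, the symmetric cut-offs, and the "neutral" label for in-between sentences are assumptions made for illustration, and the feature and score arrays are random toy data.

```python
import numpy as np

def valence_labels(features, weights, threshold):
    """Label sentences from a weighted sum of z-scored prosodic features.

    features: (n_sentences, n_features); weights: length n_features.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # constant features contribute zero
    z = (features - mean) / std
    valence = z @ np.asarray(weights, dtype=float)
    labels = np.full(valence.shape[0], "neutral", dtype=object)
    labels[valence < -threshold] = "negative"
    labels[valence > threshold] = "positive"
    return valence, labels

def mean_by_label(values, labels):
    """Average a per-sentence score (e.g. TNC) within each affect label."""
    values = np.asarray(values, dtype=float)
    return {lab: float(values[labels == lab].mean()) for lab in sorted(set(labels))}

# Toy data only: 8 sentences, 3 prosodic descriptors, random per-sentence scores.
rng = np.random.default_rng(0)
prosody = rng.normal(size=(8, 3))
valence, labels = valence_labels(prosody, weights=[0.5, 0.3, 0.2], threshold=0.5)
toy_scores = rng.uniform(size=8)
summary = mean_by_label(toy_scores, labels)
```

Labels with no member sentences are simply absent from the summary dictionary.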
5. Interpretative Results and Neurobiological Significance
Affective dissociation was observed: rank-based metrics (Spearman RSA, Kendall’s $\tau$) displayed reduced geometric similarity for negative prosody, signifying disrupted neighborhood structure in time-step representations. In contrast, dependence-based metrics (distance correlation, RV, CKA) increased under negative prosody, implying stronger global covariance despite noisier relative ordering.
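That rank-based and dependence-based scores can move apart is easy to see on toy data. The sketch below compares linear CKA with Spearman RSA on a feature matrix and a rotated copy of it. It does not reproduce the affective finding; it only illustrates that the two metric families measure different properties. The function names and the choice of a random orthogonal rotation are assumptions made for this example.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (T, d) feature matrices with matching rows."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))

def _ranks(v):
    """Ranks starting at 1, ties share their mean rank."""
    _, inverse, counts = np.unique(v, return_inverse=True, return_counts=True)
    return (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]

def spearman_rsa(x, y):
    """Spearman correlation between the 1 - Pearson RDMs of x and y."""
    iu = np.triu_indices(x.shape[0], k=1)
    v_x = (1.0 - np.corrcoef(x))[iu]
    v_y = (1.0 - np.corrcoef(y))[iu]
    return float(np.corrcoef(_ranks(v_x), _ranks(v_y))[0, 1])

# Toy data only: a feature matrix and a rotated copy of it.
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 6))
q, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # random orthogonal matrix
y = x @ q

cka_score = linear_cka(x, y)     # rotation-invariant, so it stays at 1 up to rounding
rsa_score = spearman_rsa(x, y)   # correlation-distance RDMs are not rotation-invariant
```

Linear CKA is unchanged by the rotation, whereas the Pearson-based neighborhood ordering generally is not, so the two scores need not agree even when the underlying data are linearly equivalent.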
Prosody-based clustering revealed the highest pairwise similarity for the acoustics–model pair, intermediate similarity for acoustics–EEG, and the lowest for EEG–model, with energy regimes affecting TNC dispersion. High-energy sentences yielded more variable tri-modal coherence; low-energy/longer-duration sentences showed tighter TNC.
The time-resolved analysis indicated model–EEG geometric alignment (Spearman RSA) peaks in the 250–500 ms interval, consistent with N400 semantic integration. The affective dissociation—geometry weakened but dependence strengthened under negative prosody—parallels literature findings that attentional and affective processing enhance broad neural coupling at the expense of fine-scaled neighborhood structure (Vuilleumier, 2005).
6. Methodological Implications and Extensions
TNC provides a robust criterion for tri-modal alignment, effectively guarding against chance pairwise effects and conservatively quantifying joint correspondence. Its operational simplicity—three Spearman RSA calls and mean squared aggregation—facilitates direct implementation. A plausible implication is that TNC’s generality admits further extension: alternative neighborhood metrics may replace RSA, or additional modalities (e.g., fMRI source estimates) may be incorporated for high-dimensional representational analysis.
This suggests TNC can serve as a template for multi-modal consistency assessment beyond audio–EEG–model trios, contingent on appropriately time-aligned feature extraction and RDM construction.
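The extension described in this section can be sketched as a function that averages a squared similarity over every pair of modalities and accepts the similarity measure as a parameter. This generalization is hypothetical and not taken from the source; the name `multimodal_consistency`, its signature, and the toy modalities are introduced here for illustration. With exactly three modalities and the default Spearman measure it reduces to the mean of three squared Spearman RSA correlations.

```python
from itertools import combinations

import numpy as np

def rdm_vector(features):
    """Strictly upper-triangular entries of the 1 - Pearson RDM of a (T, d) matrix."""
    rdm = 1.0 - np.corrcoef(features)
    return rdm[np.triu_indices(rdm.shape[0], k=1)]

def spearman(u, v):
    """Spearman correlation with average ranks for ties."""
    def ranks(x):
        _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
        return (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]
    return float(np.corrcoef(ranks(u), ranks(v))[0, 1])

def multimodal_consistency(modalities, similarity=spearman):
    """Mean squared pairwise similarity over all modality pairs.

    modalities: dict mapping a name to a (T, d) array on a shared time grid.
    similarity: callable taking two RDM vectors and returning a score in [-1, 1].
    """
    vectors = {name: rdm_vector(x) for name, x in modalities.items()}
    pair_scores = {
        (a, b): similarity(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }
    score = float(np.mean([s ** 2 for s in pair_scores.values()]))
    return score, pair_scores

# Toy data only: four modalities sharing one geometry on a 10-step grid.
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 5))
score, pairs = multimodal_consistency({
    "acoustic": base,
    "eeg": 0.5 * base - 1.0,
    "model": 3.0 * base,
    "extra": base + 2.0,   # hypothetical fourth modality
})
```

Returning the per-pair scores alongside the aggregate makes it possible to see which pair drives a low value, which the aggregate alone hides.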
7. Limitations and Prospective Directions
TNC’s dependence on Spearman RSA may limit sensitivity to certain nonlinear neighborhood relationships; future work could evaluate the criterion’s behavior with advanced metrics or transfer to domains such as vision or text. Extending TNC to other physiological modalities and testing its susceptibility to confounding variables remain important research avenues.
A plausible implication is that the TNC criterion could drive development of unified benchmarking suites for tri-modal and multimodal model–brain–stimulus alignment, fostering more interpretable and neurobiologically motivated machine learning models.